# ML on FPGAs - Exercise 1

Becoming familiar with InAccel tools and accelerating our first application. Naive Bayes classification acceleration and hyperparameter tuning on Logistic Regression Training.

# Introduction

In this lab you will create your first accelerated application and then use scikit-learn to measure the speedup you get when running the Naive Bayes algorithm with the FPGA implementation instead of the original (CPU) one. You will then use hyperparameter tuning to find the best model for a Logistic Regression application.

The lab uses a JupyterHub environment that is deployed on Kubernetes and has access to 2 Xilinx Alveo (U250 and U280) and 2 Intel PAC A10 FPGA boards. The first time only, each team signs up at https://labs.inaccel.com:8080/hub/signup using its credentials. From then on, use the following login link: https://labs.inaccel.com:8080/hub/login. You can also use InAccel's monitoring tool at any time to get an overview of the tasks running on the FPGAs: http://labs.inaccel.com:8081/.

As discussed in the theoretical part, creating the final bitstream that is used to program the FPGA is a very time-consuming procedure and requires broad knowledge of hardware programming and design. For now we skip that part and assume that a set of accelerators (libraries, from our point of view) has already been compiled; we will use them to implement our applications.

# Creating a Bitstream Artifact

First we are going to extract the metadata from a raw bitstream and create a bitstream artifact that can be installed to the InAccel Coral FPGA Resource Manager. This is in fact the last step of a hardware developer's workflow: the accelerator has already been compiled and linked, and the developer now packages the bitstream so that it can be deployed to any bitstream repository and any Coral instance. The developer is the one who knows the implemented design best, including any connections to memory interfaces.

• Your home directory is your persistent volume. There you can store any files that you want to save for later use.

• Open a Terminal.

• Copy shared/lab1 folder to your home directory (e.g. cp -r ~/shared/lab1 ~ ).

• Navigate to your own copy of lab1 folder.

• Under the bitstream folder, execute the following to download the vector_demo.aocx file, which is the actual accelerator bitstream file used to program the FPGA device:

• Use the following command to generate a bitstream.json file.

BITSTREAM="vector_demo.aocx" ; (aocl binedit ${BITSTREAM} get .acl.fpga.bin /tmp/fpga.bin && \
aocl binedit /tmp/fpga.bin print .acl.gbs.gz | gunzip | strings | sed -e 's/\\$//' -e '3!d' && \
rm /tmp/fpga.bin ; aocl binedit ${BITSTREAM} print .acl.kernel_arg_info.xml) | \
inaccel parse -o bitstream.json
• Edit bitstream.json file (e.g. nano bitstream.json) adding the following:

• name: vector_demo.aocx

• bitstreamId: com.inaccel.math.vector

• version: 1.0

• description (optional): Vector addition and subtraction
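The manual edit described above can also be scripted. The following is a minimal sketch that uses only Python's standard `json` module. It assumes that `name`, `bitstreamId`, `version` and `description` are top-level keys of the `bitstream.json` file produced by `inaccel parse`, which you should confirm by inspecting your own file. The helper name `annotate_bitstream` is hypothetical and not part of any InAccel tool.

```python
import json


def annotate_bitstream(path, name, bitstream_id, version, description=None):
    """Add the descriptive fields to an existing bitstream.json file.

    Assumption: the fields are top-level keys of the JSON object.
    All other content of the file is preserved unchanged.
    """
    with open(path) as f:
        metadata = json.load(f)

    metadata["name"] = name
    metadata["bitstreamId"] = bitstream_id
    metadata["version"] = version
    if description is not None:  # the description is optional
        metadata["description"] = description

    with open(path, "w") as f:
        json.dump(metadata, f, indent=2)
    return metadata
```

For this lab the call would look like `annotate_bitstream("bitstream.json", "vector_demo.aocx", "com.inaccel.math.vector", "1.0", "Vector addition and subtraction")`. Passing the version as a string is an assumption here.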

### List all the installed bitstreams in your system

inaccel bitstream list

### Install artifact to InAccel

inaccel bitstream install .

### List again the installed bitstreams

inaccel bitstream list

The bitstream installed by each team won't be visible to the others. So even if more than one user invokes an accelerator with the same name, each one will use their own instance of the accelerator and the FPGA will be reconfigured accordingly.

Later on, we will inspect a case where someone can install a bitstream visible to other users in order to enable the sharing of some accelerators.

### View the details of the specific bitstream you just installed adding its checksum to the previous command.

You can obtain the checksum from the output of the command above

inaccel bitstream list <checksum-from-above>

What do you get? Add the output to your report. How many kernels are there in this bitstream? Are there any replicas of the same kernel, and can you explain their purpose?

# Running the first FPGA accelerated application

Now that you have installed a function library, let's write a simple Python application to verify its correctness.

• Press Ctrl + Shift + L to open a "New Launcher".

• Create a new Python notebook.

• First of all, import the InAccel Coral API and NumPy:

import inaccel.coral as inaccel
import numpy as np
• Next you are going to create four vectors. Vectors a and b are the inputs that will be added and subtracted; c_add and c_sub will hold the results.

Note that inaccel array extends numpy ndarray using its custom allocator for buffers so the way to instantiate an array is the same one you would use with numpy.

# Make sure that you create the scalar with numpy
size = np.int32(1024 * 1024)
# Allocate four vectors & Initialize input vectors with random values
a = inaccel.array(np.random.rand(size), dtype = np.float32)
b = inaccel.array(np.random.rand(size), dtype = np.float32)
c_add = inaccel.ndarray(size, dtype = np.float32)
c_sub = inaccel.ndarray(size, dtype = np.float32)
• Now it is time to create a request for addition:

# Send a request for "addition" accelerator to the Coral FPGA Resource Manager
# Request arguments must comply with the accelerator's specific argument list
• And another one for subtraction:

# Send a request for "subtraction" accelerator to the Coral FPGA Resource Manager
# Request arguments must comply with the accelerator's specific argument list
vsub = inaccel.request("com.inaccel.math.vector.vsub-kernel")
vsub.arg(a).arg(b).arg(c_sub).arg(size)
inaccel.wait(inaccel.submit(vsub))
• Finally check the output vectors to validate that FPGA computed and returned the right results.

# Check output vectors
valid = True
if not np.array_equal(c_add, a + b):
    valid = False
if not np.array_equal(c_sub, a - b):
    valid = False
if valid:
    print('Results: RIGHT!')
else:
    print('Results: WRONG!')
• Did you get the right results? Can you modify the code to measure the time taken for each request?

You can use the %timeit magic command to measure the execution time of a statement in a Python notebook.

Alternatively, you can always use Python's time() function to measure the elapsed time.

• Example:

%timeit i = 0

from time import time
start_time = time()
...
elapsed_time = int((time() - start_time) * 1000) / 1000
print(elapsed_time)
• Modify the code to also measure the time for adding and subtracting the two vectors using the CPU. You can store the result in new arrays e.g. cpu_add = a + b

• Measure the speedup using the known formula $S = T / T_a$, where $T$ is the time of the CPU execution and $T_a$ is the time of the FPGA execution. Do you actually get a speedup for each operation (addition or subtraction)? If not, what could be the reason?

• Input data must be transferred to the DDR memory of the FPGA board before it can be processed, and the results must be transferred back to the host, since the host and the FPGA don't share the same memory.

• The first time you send a request for an accelerator, the FPGA needs to be reconfigured if it isn't already configured with that bitstream, so you may observe a much higher execution time.
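The measurements asked for in this section can be organised as in the following sketch. It uses only the standard library. `best_time` and `speedup` are hypothetical helper names, and the callables passed to them stand for your own CPU code (e.g. `lambda: a + b`) and your own FPGA request. An untimed warm-up call is made first, so that a possible one-off FPGA reconfiguration does not distort the result.

```python
from time import perf_counter


def best_time(func, repeats=5, warmup=1):
    """Return the shortest wall-clock time (in seconds) of func() over `repeats` runs.

    `warmup` untimed calls are made first, so one-off costs such as an
    FPGA reconfiguration on the first request are excluded.
    """
    for _ in range(warmup):
        func()
    timings = []
    for _ in range(repeats):
        start = perf_counter()
        func()
        timings.append(perf_counter() - start)
    return min(timings)


def speedup(cpu_time, fpga_time):
    """S = T / Ta: CPU time divided by the accelerated (FPGA) time."""
    return cpu_time / fpga_time
```

A value of `speedup(...)` above 1 means the FPGA run was faster. The timed FPGA callable should include both the submit and the wait, so the measured time also covers the data transfers to and from the board.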

# Scikit-Learn on FPGAs

## Naive Bayes Example

In this section you are going to compare the execution of the original Scikit-Learn Naive Bayes algorithm with the FPGA accelerated one. From the JupyterHub dashboard, navigate to the lab1/notebooks folder and open NaiveBayes.ipynb. The administrator of the lab has already installed the necessary bitstreams publicly, so that all teams have access to the same accelerator for running the scikit-learn examples. You can verify that by issuing an inaccel bitstream list command.

The accelerator has some limitations as described below:

1. Max number of classes = 64

2. Max number of features = 2048

Objectives:

1. Run all the cells and inspect the outputs.

2. Did the FPGA classifier return the same results as the original (CPU) one?

3. Run the cells again and calculate the speedup you get for all possible combinations of the following configurations (total configurations: 9):

    1. samples: 100000

    2. features: 500, 1000, 2000

    3. classes: 10, 35, 60

4. In which of the above cases did you get the highest speedup? Can you explain why?

5. In your report create two charts:

    1. one for the speedup you get depending on the number of features, for 100000 samples and 60 classes

    2. a second one for the speedup you get depending on the number of classes, for 100000 samples and 2000 features

6. Does it make more sense to use this specific FPGA accelerator on a dataset containing a lot of classes or a lot of features?
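The nine configurations and the data for the two charts can be kept track of as in the sketch below. It is only an illustration using the standard library: `chart_series` is a hypothetical helper, and the `speedups` dictionary it expects must be filled with the values you measure yourself.

```python
from itertools import product

samples = [100000]
features = [500, 1000, 2000]
classes = [10, 35, 60]

# Every (samples, features, classes) combination: 1 * 3 * 3 = 9 runs.
configurations = list(product(samples, features, classes))


def chart_series(speedups, fixed_index, fixed_value, x_index):
    """Extract sorted (x, speedup) pairs for one chart.

    `speedups` maps a (samples, features, classes) tuple to the speedup
    measured for it. Only entries whose field at `fixed_index` equals
    `fixed_value` are kept; the field at `x_index` becomes the x value.
    """
    points = [(cfg[x_index], s) for cfg, s in speedups.items()
              if cfg[fixed_index] == fixed_value]
    return sorted(points)
```

With this layout, `chart_series(speedups, 2, 60, 1)` gives the points for the speedup-versus-features chart at 60 classes, and `chart_series(speedups, 1, 2000, 2)` gives the points for the speedup-versus-classes chart at 2000 features.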

## Logistic Regression Example

In the previous section we accelerated the classification part of the Naive Bayes algorithm using scikit-learn and the InAccel Python API, and measured the observed speedup.

In this section we are going to use scikit-learn again, but this time we will focus on accelerating the training part of another widely used machine learning algorithm, Logistic Regression. We have created a notebook that trains and applies many accelerated sklearn models, with a k-fold cross validation and hyperparameter tuning step. In machine learning, hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is used to control the learning process.

From the jupyterhub dashboard open LogisticRegression.ipynb file. The administrator of the lab has already installed any necessary bitstreams publicly so issuing an inaccel bitstream list command will reveal any installed Logistic Regression bitstreams. Logistic Regression (LR) training takes several hyperparameters which can affect the accuracy of the learned model. There is no "best" configuration for all datasets. To get the optimal accuracy, we need to tune these hyperparameters based on our data.

The accelerator has some limitations as described below:

1. Max number of classes: 64

2. Max number of features: 2047

Objectives:

1. Run all the cells and observe the speedup you get compared to the software execution.

2. Focus only on the FPGA accelerated part. Change the parameters grid and add the following:

    1. max_iter: 50, 100

    2. l1_ratio: 0.3, 0.9

3. Re-run the cells related to the FPGA accelerated training.

4. Which combination of max_iter, l1_ratio provides the best model in terms of accuracy?
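To see why tuning benefits from accelerated training, it helps to count how many models a grid search has to train. The sketch below expands the grid from step 2 using only the standard library. `expand_grid` and `best_combination` are hypothetical helpers, and `k = 5` is an assumed number of folds; use the value set in the notebook. The notebook itself may rely on scikit-learn's own grid search utilities, which this sketch does not reproduce.

```python
from itertools import product


def expand_grid(param_grid):
    """Return every combination of a {name: [values]} grid as a list of dicts."""
    names = sorted(param_grid)
    return [dict(zip(names, values))
            for values in product(*(param_grid[n] for n in names))]


def best_combination(mean_accuracy):
    """Pick the combination with the highest mean cross-validation accuracy.

    `mean_accuracy` maps (max_iter, l1_ratio) tuples to accuracies that
    you measured yourself.
    """
    return max(mean_accuracy, key=mean_accuracy.get)


param_grid = {"max_iter": [50, 100], "l1_ratio": [0.3, 0.9]}
combinations = expand_grid(param_grid)  # 2 * 2 = 4 candidate models
k = 5  # assumed number of folds; use the value set in the notebook
total_fits = len(combinations) * k  # one training run per combination per fold
```

Every extra value added to the grid multiplies the number of training runs, so the per-run training time dominates the total tuning time.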