
Neural Hardware: FPGA-based Neural Networks

Darrin Willis (dswillis) and Bohan Li (bohanl)

FINAL REPORT

Summary

We will investigate an implementation of neural networks on a low-energy FPGA. Neural networks are a common machine learning algorithm with high potential for parallelization, which hardware can exploit. An energy-efficient neural network of this kind is well suited to mobile devices.

Background

An artificial neural network is a statistical learning algorithm involving layers of nodes, called perceptrons, which process information in a way that approximates an unknown function. Starting from an input layer, information is filtered, modified, and passed down through a series of hidden layers until reaching the final output layer. One of the major uses of neural networks is their ability to discern the underlying function connecting a set of input values to a set of output values.

Our goal will be to train a neural network with one hidden layer. Each input node will send its weighted value to all nodes in the hidden layer. Each hidden node will then apply the sigmoid function 1/(1+e^(-x)), where x is the sum of its weighted inputs. Finally, these values are sent to the single output node in the same fashion and modified to return a meaningful output.
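As a concrete software sketch of this forward pass (a minimal C++ illustration; the names and dimensions are ours, not from the final implementation):

#include <cmath>
#include <vector>

// Sigmoid activation: 1/(1+e^(-x))
double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// Forward pass through one hidden layer to a single output node.
double forward(const std::vector<double>& input,
               const std::vector<std::vector<double>>& hidden_weights,
               const std::vector<double>& output_weights) {
  std::vector<double> hidden(hidden_weights.size());
  for (size_t h = 0; h < hidden_weights.size(); ++h) {
    double sum = 0.0;
    for (size_t i = 0; i < input.size(); ++i)
      sum += hidden_weights[h][i] * input[i];  // weighted inputs
    hidden[h] = sigmoid(sum);
  }
  double sum = 0.0;
  for (size_t h = 0; h < hidden.size(); ++h)
    sum += output_weights[h] * hidden[h];
  return sigmoid(sum);
}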

The key to this training process lies in the set of weights given to the input values. We will configure these weights based on training data with known outputs for certain inputs. The weight calculations will be done using the backpropagation algorithm, which iterates through the training dataset and adjusts the weights until the error falls below a certain threshold.

Sequential implementation of the backpropagation algorithm:

error = infinity;
while(error > threshold) {
  error = 0;
  for(each training datapoint) {
    // forward pass: input layer -> hidden layer
    for(each hidden node) {
      sum = 0;
      for(each input node)
        sum += weight * input;
      hidden_value = 1/(1+Math.exp(-sum));
    }
    // forward pass: hidden layer -> output node
    sum = 0;
    for(each hidden node)
      sum += weight * hidden_value;
    output_value = 1/(1+Math.exp(-sum));

    // backward pass: accumulate squared error, then update weights
    error += (output_value - true_value)^2;
    output_weight_err = error caused by output weight;

    for(each hidden node) {
      hidden_weight_err = error caused by hidden weight;
      for(each input node)
        new_hidden_weight = old_weight + weight_change;
      new_output_weight = old_weight + weight_change;
    }
  }
}
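The "weight change" above is left abstract; a standard instantiation (the usual delta rule for sigmoid units under squared error, with constant factors absorbed into the learning rate, not specific to our code) is:

delta_output = (output_value - true_value) * output_value * (1 - output_value)
new_output_weight = old_weight - learning_rate * delta_output * hidden_value

delta_hidden = hidden_value * (1 - hidden_value) * output_weight * delta_output
new_hidden_weight = old_weight - learning_rate * delta_hidden * input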

As seen from the pseudocode above, there are a few clear avenues for parallelism in the algorithm. Specifically, the weighted sum and error update portions can be easily parallelized; a sketch follows below. Although parallelization across training datapoints does not immediately follow from this code, we plan to modify the algorithm to accommodate it.
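For example, the per-node weighted sums are independent of one another, so an OpenMP sketch of the hidden-layer computation might look like this (loop structure and names are illustrative):

#include <cmath>
#include <vector>

// Each hidden node's weighted sum is independent of the others,
// so the outer loop parallelizes cleanly across nodes.
void hidden_layer(const std::vector<double>& input,
                  const std::vector<std::vector<double>>& weights,
                  std::vector<double>& hidden) {
  #pragma omp parallel for
  for (int h = 0; h < (int)hidden.size(); ++h) {
    double sum = 0.0;
    for (size_t i = 0; i < input.size(); ++i)
      sum += weights[h][i] * input[i];
    hidden[h] = 1.0 / (1.0 + std::exp(-sum));
  }
}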

Challenge

Designing hardware is frequently a more challenging way to develop a solution to a computational problem than writing software. The main challenge in this space will be porting a neural network solver to the SystemVerilog hardware description language. The solver will likely employ interesting hardware techniques for pipelining the computation to make maximum use of the hardware.

Resources

We will be utilizing standard tools for live communication with a host machine, including FPGA-specific hardware modules and potentially some PC-side libraries for the communication. Along the way, we will likely also make use of some third-party IP, such as an Ethernet core for host-to-board communication.

5/8 Update

We have designed a neural network on the FPGA. One unforeseen challenge was devising a method to represent floating point values on the FPGA. Using standard floating point representation would introduce performance issues in the basic arithmetic operations. Instead, we use a custom fixed-point representation with 1 sign bit, 15 integer bits, and 16 fractional bits. Addition works the same way as integer addition, and multiplication is integer multiplication followed by a right shift.
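A software model of this fixed-point format, as a minimal sketch (helper names are ours; the FPGA implements the same operations in SystemVerilog):

#include <cstdint>

// Q15.16 fixed point: 1 sign bit, 15 integer bits, 16 fractional bits.
using fixed_t = int32_t;
const int FRAC_BITS = 16;

fixed_t from_double(double x) { return (fixed_t)(x * (1 << FRAC_BITS)); }
double  to_double(fixed_t x)  { return (double)x / (1 << FRAC_BITS); }

// Addition is plain integer addition.
fixed_t fx_add(fixed_t a, fixed_t b) { return a + b; }

// Multiplication widens to 64 bits, then right-shifts to rescale.
fixed_t fx_mul(fixed_t a, fixed_t b) {
  return (fixed_t)(((int64_t)a * (int64_t)b) >> FRAC_BITS);
}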

Our neural network is represented using 4 matrices: input, hidden_output, hidden_weights, and output_weights. Running the backpropagation algorithm for a single training example takes a total of 11 clock cycles, with one matrix operation computed per cycle. This significantly reduces the total cycle count compared with the PC-side implementation.

One application of this FPGA program is a matchmaking app designed for a smart watch; the energy and performance improvements fit the needs of an on-demand mobile application. When activated, the smart watch reads the tone, body language, proximity, and eye contact of the person you are speaking to. These 4 inputs are passed as numerical values into the FPGA, and the output is a yes or no answer to whether the person is interested in you. Our evaluation plan involves first generating a function that maps these 4 inputs to a yes or no output. Then, we randomize the values of each attribute and apply the function, giving us our training examples. Our test examples will be generated the same way, and we will measure how accurately our FPGA modeled the function.
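A sketch of that data-generation step (the ground-truth function below is an arbitrary stand-in; the attribute names follow the four inputs above):

#include <cstdlib>
#include <vector>

struct Example {
  double tone, body_language, proximity, eye_contact;
  bool interested;
};

// Arbitrary stand-in for the generated ground-truth function.
bool ground_truth(const Example& e) {
  return (e.tone + e.body_language + e.proximity + e.eye_contact) / 4.0 > 0.5;
}

std::vector<Example> generate_examples(int n) {
  std::vector<Example> data(n);
  for (Example& e : data) {
    e.tone          = (double)rand() / RAND_MAX;  // randomized attributes
    e.body_language = (double)rand() / RAND_MAX;
    e.proximity     = (double)rand() / RAND_MAX;
    e.eye_contact   = (double)rand() / RAND_MAX;
    e.interested    = ground_truth(e);            // label via the function
  }
  return data;
}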

Goals and Deliverables

Plan to Achieve

A:

  • Live PC-side visualization and performance analysis while the FPGA is running
  • PC-standalone implementation of Neural Network application to compare benchmarks
  • (includes all below)

B:

  • In depth analysis of hardware size, speed, power, and complexity
  • Alternatively, analysis of why Neural Networks are not a good fit for FPGA implementation

C:

  • Full stack implementation of Neural Network application on FPGA and PC heterogeneous computing environment

Hope to Achieve

  • Interesting application of neural networks (AI, image processing, etc.)
  • Different hardware implementations to compare hardware stacks

Platform Choice

FPGAs are among the best hardware prototyping tools and are more popular than ever, given the new focus on solving problems on energy-sensitive devices. Machine learning has become a mature field, and its algorithms are only becoming more commonplace. It is foreseeable that mobile devices could include an integrated ML chip to open up a whole new area of applications. This project explores a specific case of these algorithms in the form of neural networks.

Checkpoint - Software Side

The software side of this project revolves around tuning the neural network implementation on the computer. We have written and tested a working parallel implementation of a neural network with one hidden layer. The implementation parallelizes over the nodes in each layer; the concurrency comes from computing the weighted sums as matrix multiplications. We used OpenMP to implement this.

We have decided to use a song popularity classification dataset to test our implementation, because the final FPGA implementation could be embedded in a cell phone and used to predict whether a song will be popular. Specifically, the input is an array of song properties such as length, BPM, and so on, and the output is a number representing how popular the song will be. The neural network will be trained using historical data from the past few years.

We have finished coding the bulk of the PC-side part of the project and plan to tune the parameters passed into the neural network before publishing our analysis. In addition, we have only tested the implementation on mock datasets, so we are still unsure whether it will end up being a useful application.

The most glaring issue is that our simple neural network may not be able to produce accurate results. While tweaking parameters, we may realize that we have chosen an inappropriate dataset. If that ends up being the case, we will switch to a simpler one: taking in exam grades and calculating the final average. The design of our neural network should make that task easy, though it has fewer compelling applications.

In terms of the software side, the only goal left is the final tuning and analysis. This should be completed by Bohan Li within the week. Afterwards, both partners will be working on porting the code to hardware.

Checkpoint - Hardware Side

The hardware side has been progressing well. The schedule is roughly being met and does not need to change, as it still reflects our progress.

The chosen FPGA is the Altera DE2-115, which uses the popular Quartus toolchain that is standard throughout CMU hardware courses. The DE2-115 board has an Ethernet port and ships with demo projects showing how to use the board as a web server, which should make host-to-board communication more seamless.

Ethernet will be used as the standard protocol for communicating with the FPGA, but the exact nature of the protocol has yet to be determined. It will likely be a multi-phase process with at least a training phase and a testing phase, during which the board will send different kinds of messages to the host machine.
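As a rough illustration only (the protocol is undecided, and every name and field below is hypothetical), the message framing might look like:

#include <cstdint>

// Hypothetical host <-> FPGA message framing; nothing here is final.
enum MsgType : uint8_t {
  TRAIN_EXAMPLE,  // host -> board: one training input/output pair
  TRAIN_DONE,     // host -> board: end training phase, begin testing
  TEST_EXAMPLE,   // host -> board: one test input
  TEST_RESULT     // board -> host: network output for a test input
};

struct MsgHeader {
  uint8_t  type;         // one of MsgType
  uint8_t  reserved;
  uint16_t payload_len;  // bytes of fixed-point payload that follow
};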

The actual hardware design has not yet begun, but is about to, in accordance with the schedule. The datapath will likely be relatively simple, dominated by matrix multiplication with no other significant components. A third-party IP block will likely be used for the Ethernet communication; we are currently determining which library to use.

Schedule

Week of 4/1 to 4/7

  • Investigate different available FPGAs and select one
  • Investigate PC host environment for FPGAs and what is the best way to interface with FPGA
  • Research Neural Networks to gain an initial idea of the hardware algorithm

Week of 4/8 to 4/14 (Project Checkpoint)

  • Figure out modules for communicating with FPGA and take care of as much boilerplate as possible
  • Pick a specific Neural Network application to implement on the FPGA

Week of 4/15 to 4/21

  • Begin implementing FPGA Neural Network algorithm
  • Begin PC-side Neural Network algorithm
  • Fix any remaining communication infrastructure

Week of 4/22 to 4/28

  • Continue developing FPGA algorithm
  • Continue developing PC algorithm
  • Begin analysis and testing harness for both implementations

Week of 4/29 to 5/5

  • Finish FPGA algorithm
  • Finish PC algorithm
  • Create analysis and contrast different hardware implementations

Week of 5/6 to 5/11 (Parallelism Competition)

  • Finish up any remaining analysis
  • Fix up anything which wasn't yet done