Deep Learning Project in C, C++ or Fortran#

The goal of this project is to implement and test a deep learning algorithm from scratch in either C, C++, or Fortran.

You will implement a simple feedforward neural network with backpropagation for training. It should include the following components:

  • A fully connected neural network with at least one hidden layer

  • An activation function (e.g., sigmoid, ReLU)

  • A loss function (e.g., mean squared error)

  • A backpropagation algorithm for training

  • Gradient descent for updating weights

Introduction#

Deep learning is a subset of machine learning that uses neural networks with multiple layers to model complex patterns in data. Implementing a deep learning algorithm from scratch can help you understand the underlying mechanics of how these models work. Machine learning in general is about teaching computers to learn a “function”, i.e., something that, given a certain input, produces a corresponding output. This might be something as simple as a linear function, or something more complex like a deep neural network. In the case of linear regression, the input is a bunch of (x, y) points, and the output is a line that best fits those points, where “best” needs to be defined (typically, by minimizing the mean squared error). In this case, we know the functional form of the model (a line \(y = mx + b\)), and we just need to find the parameters (\(m\) and \(b\)) that minimize the error.

In the case of a deep neural network, we don’t know the functional form of the model, and we need to learn both the parameters and the structure of the model. This is where neural networks come in. A neural network is a collection of interconnected nodes (neurons) that can learn to approximate complex functions. Each neuron takes in a weighted sum of its inputs, applies an activation function, and produces an output. By stacking multiple layers of neurons, we can create a deep neural network that can learn to model complex patterns in data.

If you choose this project, it will help if you have some prior understanding of neural networks – but you can also learn as you go. The key is to break down the problem into smaller pieces and implement each piece step by step. While I’m giving a high-level overview here, you will need to dive into the details of how to implement each component. For example, you will need to decide on the architecture of your neural network (number of layers, number of neurons per layer), choose an activation function, and implement the backpropagation algorithm correctly.

Here are two sample problems you can try to solve with your neural network – they’re both classification problems. The first one is a classic ML problem based on the “wheat seeds” dataset, which is a simple dataset that contains measurements of different types of wheat seeds. This problem is a good starting point because it’s relatively well defined, and it will allow you to test your neural network implementation on a real dataset.

The second problem is more helpful for conceptual understanding of different aspects of deep learning: it is basically to implement the TensorFlow playground in C, C++ or Fortran. The TensorFlow playground is an interactive visualization of a simple neural network that allows you to experiment with different architectures and activation functions. Implementing this from scratch will give you a deeper understanding of how neural networks work and how different components interact with each other. You can find the TensorFlow playground here: https://playground.tensorflow.org/

You can choose either of these problems to implement your neural network. In fact, if you do one, you’ll have most of the pieces in place to do the other. Of course, you can also go beyond these.

Neural Network Architecture#

We’ll stick with a simple feedforward neural network architecture for this project. A feedforward neural network consists of an input layer, one or more hidden layers, and an output layer. Each layer is made up of neurons that are connected to the neurons in the previous and next layers. The input layer receives the input data, the hidden layers process the data, and the output layer produces the final output. The connections between neurons have weights that are adjusted during training to minimize the loss function.

Wikipedia shows a simple feedforward neural network architecture with one hidden layer: the input layer has three neurons, the hidden layer has five neurons, and the output layer has two neurons. The connections between the layers are represented by arrows, and the weights of these connections are adjusted during training to minimize the loss function. As you can see, each output from a given layer is connected to every neuron in the next layer, which is why it’s called a “fully connected” neural network.

A given neuron receives \(n\) inputs, where \(n\) is the number of neurons in the previous layer. Each input is multiplied by a corresponding weight, and the weighted inputs are summed together to produce a single value. An additional weight (bias) is added to this sum. This value is then passed through an activation function to produce the output of the neuron. The activation function introduces non-linearity into the model, which allows it to learn more complex patterns in the data.

Activation Functions#

Activation functions are mathematical functions that determine the output of a neuron given its input. They introduce non-linearity into the model, which allows it to learn more complex patterns in the data. There are several common activation functions used in neural networks, including:

  • Sigmoid: The sigmoid function maps any input to a value between 0 and 1. It is defined as \(f(x) = \frac{1}{1 + e^{-x}}\). The sigmoid function is often used in the output layer of binary classification problems, where the output represents a probability.

  • ReLU (Rectified Linear Unit): The ReLU function is defined as \(f(x) = \max(0, x)\). It is commonly used in hidden layers of neural networks because it helps to mitigate the vanishing gradient problem and allows for faster training.

  • Tanh (Hyperbolic Tangent): The tanh function maps any input to a value between -1 and 1. It is defined as \(f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}\). The tanh function is often used in hidden layers of neural networks because it centers the data around zero, which can help with training.

Loss Function#

The loss function is a mathematical function that measures the difference between the predicted output of the neural network and the actual target values. The goal of training a neural network is to minimize the loss function by adjusting the weights of the connections between neurons. There are several common loss functions used in neural networks, including:

  • Mean Squared Error (MSE): The MSE loss function is defined as \(L = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\), where \(y_i\) is the actual target value and \(\hat{y}_i\) is the predicted output. The MSE loss function is commonly used in regression problems.

  • Cross-Entropy Loss: The cross-entropy loss function is defined as \(L = -\frac{1}{n} \sum_{i=1}^{n} [y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)]\), where \(y_i\) is the actual target value and \(\hat{y}_i\) is the predicted output. The cross-entropy loss function is commonly used in binary classification problems.

Training: Backpropagation Algorithm#

The backpropagation algorithm is a method for training neural networks by adjusting the weights of the connections between neurons to minimize the loss function. The algorithm consists of two main steps: the forward pass and the backward pass.

In the forward pass, the input data is passed through the neural network to produce a predicted output. The loss function is then calculated based on the predicted output and the actual target values. The goal of training is to minimize this loss function by adjusting the weights of the connections between neurons. So the question more or less becomes: If I increase this weight, or decrease that one, how does that affect the loss function? The backpropagation algorithm uses the chain rule of calculus to compute the gradients of the loss function with respect to each weight in the network. These gradients are then used to update the weights in the direction that minimizes the loss function. The update is typically done using gradient descent, which is an optimization algorithm that iteratively adjusts the weights based on the computed gradients.

Testing#

After training the neural network, you will need to test its performance on a separate dataset that was not used during training. This is important to evaluate how well the model generalizes to new data. Usually, you would split your dataset into a training set and a test set. The training set is used to train the model, while the test set is used to evaluate its performance using some appropriate metric.

Implementation Steps#

Prepare your dataset#

If you do the wheat seeds classification problem, you will need to prepare the dataset by loading it into your code. This dataset is available here: https://archive.ics.uci.edu/dataset/236/seeds. The dataset contains measurements of different types of wheat seeds, and the task is to classify the seeds into different categories based on these measurements.

You will need to read the dataset from the file – if you use xtensor, you can try to use its CSV reading capabilities, or you can write your own code to read the data. You may have to preprocess the downloaded data to get it into a format compatible with whatever you’re using to read it.
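If you write your own reader, a minimal approach is to read rows of numbers into a flat array. A sketch (assuming you have preprocessed the file into plain whitespace-separated numbers; the function name and layout are illustrative):

```c
#include <stdio.h>

/* Read up to max_rows rows of n_cols whitespace-separated numbers
   from a text file into a flat row-major array.
   Returns the number of rows read, or -1 if the file cannot be opened. */
int read_table(const char *path, double *data, int n_cols, int max_rows) {
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    int rows = 0;
    while (rows < max_rows) {
        int ok = 1;
        for (int c = 0; c < n_cols; c++)
            if (fscanf(f, "%lf", &data[rows * n_cols + c]) != 1) {
                ok = 0;
                break;
            }
        if (!ok) break;
        rows++;
    }
    fclose(f);
    return rows;
}
```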

If you use the TensorFlow playground problem, you can generate your own datasets similar to the ones you see by appropriately drawing from different distributions. For example, you can generate a dataset of points in 2D space where the points are drawn from two different Gaussian distributions, and the task is to classify which points belong to which distribution.

Define the architecture of your neural network (number of layers, number of neurons per layer)#

You should leave yourself some flexibility here to experiment with different architectures. The TensorFlow playground gives the user a lot of flexibility in not only defining the number of hidden layers, but also the number of neurons in each layer, as well as the activation function used. You might not need all of that flexibility, but you should at least be able to define the number of hidden layers and the number of neurons in each layer. For the seeds problem, the output layer should have as many neurons as there are classes (types of seeds), and the input layer should have as many neurons as there are features (measurements). The number of hidden layers and the number of neurons in each hidden layer can be chosen based on experimentation and the complexity of the problem.

In any case, it’s definitely not the goal to implement a nice GUI for this, so you can hardcode the architecture of your neural network in your code, or you could read it in from a configuration file. The key is to make it easy to change the architecture so that you can experiment with different configurations.

Initialize the weights and biases of the connections between neurons#

You will need to initialize the weights and biases of the connections between neurons before training the neural network. A common approach is to initialize the weights randomly, for example using a normal distribution or a uniform distribution. The biases can also be initialized randomly or set to zero. Proper initialization of weights can help with the convergence of the training process.
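A minimal per-layer initializer might look like this (a sketch; the symmetric uniform range `[-r, r]` is one common choice, e.g. with `r = 1/sqrt(n_in)`, not the only one):

```c
#include <stdlib.h>

/* Uniform random weights in [-r, r]; biases set to zero.
   Weights stored in a flat array of n_in * n_out entries. */
void init_layer(double *weights, double *biases,
                int n_in, int n_out, double r) {
    for (int i = 0; i < n_in * n_out; i++)
        weights[i] = r * (2.0 * rand() / RAND_MAX - 1.0);
    for (int j = 0; j < n_out; j++)
        biases[j] = 0.0;
}
```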

Implement the forward pass to compute the predicted output given an input.#

The forward pass involves taking the input data and passing it through the neural network to compute the predicted output. This involves computing the weighted sum of the inputs for each neuron, applying the activation function, and propagating the output to the next layer until you reach the output layer.

For the seeds problem, the sigmoid function is a common choice for the activation function, though of course it’s even nicer to be able to experiment with different activation functions. For the TensorFlow playground problem, you can see that it supports a variety of activation functions, but you don’t need to implement all of them.

Implement the loss function to compute the error between the predicted output and the actual target values.#

For the seeds problem, a simple approach is to just take the difference between the predicted output and the expected output – you can interpret the network’s output as the probability for each class, and you can compare that to the one-hot encoded true label. That is, the network might return something like [0.1, 0.7, 0.2], which means it thinks the probability of the seed being in class 1 is 0.1, class 2 is 0.7, and class 3 is 0.2. If the true label is class 2, then the expected output would be [0, 1, 0]. The loss function would then measure the difference between these two vectors.

By default, it’s not guaranteed at all that the output will be a valid probability distribution (i.e., that the values will be between 0 and 1 and will sum to 1). With some luck, it will work fine anyway. An optional improvement is to ensure that the output of the network can be interpreted as probabilities by applying a softmax function to the output layer. The softmax function takes a vector of real numbers and transforms it into a probability distribution by exponentiating each element and then normalizing by the sum of the exponentials. This way, the output of the network will be a valid probability distribution over the classes, which allows you to use the cross-entropy loss function for training.

The TensorFlow playground, when used for classification, uses the tanh activation function in the output layer, which means that the output values will be between -1 and 1 – matching the two values (orange / blue) used to label the input data. In this case, you can use the mean squared error loss function to measure the difference between the predicted output and the true labels, though of course you could also apply a softmax function to the output layer and use cross-entropy loss if you wanted to.

Implement the backpropagation algorithm to compute the gradients of the loss function with respect to each weight in the network.#

The backpropagation algorithm involves computing the gradients of the loss function with respect to each weight in the network using the chain rule of calculus. This involves propagating the error backward through the network, starting from the output layer and moving towards the input layer. For each layer, you will need to compute the error term for each neuron, which is based on the difference between the predicted output and the actual target values, as well as the weights and activation function used in that layer. You will then use these error terms to compute the gradients of the loss function with respect to each weight in that layer. The gradients will indicate how much each weight contributes to the error, and they will be used to update the weights in the next step.

While not hard to understand in principle, the backpropagation algorithm can be a bit tricky to implement correctly, especially when it comes to getting the indices right and ensuring that you’re applying the chain rule correctly. It’s important to carefully derive the equations for the gradients and to test your implementation on simple cases to ensure that it’s working correctly.
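As a sketch of the per-layer error-term computation for sigmoid layers (the indexing convention is an assumption: `w_next` stored row-major, one row of `n` weights per neuron of the next layer, matching the forward-pass layout):

```c
/* Backward pass through one sigmoid layer.  Given this layer's
   outputs `a` and the error terms `delta_next` of the layer after it
   (which has n_next neurons and weight matrix w_next), compute this
   layer's error terms:
     delta[j] = (sum_k w_next[k][j] * delta_next[k]) * a[j] * (1 - a[j])
   For the output layer with MSE loss, use instead
     delta[j] = (a[j] - y[j]) * a[j] * (1 - a[j]).
   The gradient for a weight feeding neuron j from input i is then
   delta[j] * input[i]. */
void layer_backward(const double *a, int n,
                    const double *w_next, const double *delta_next,
                    int n_next, double *delta) {
    for (int j = 0; j < n; j++) {
        double err = 0.0;
        for (int k = 0; k < n_next; k++)
            err += w_next[k * n + j] * delta_next[k];
        delta[j] = err * a[j] * (1.0 - a[j]);   /* sigmoid derivative */
    }
}
```

A useful debugging technique is to compare these analytic gradients against finite differences of the loss on a tiny network.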

Implement the gradient descent algorithm to update the weights based on the computed gradients.#

Once you have computed the gradients of the loss function with respect to each weight in the network, you can use these gradients to update the weights in the direction that minimizes the loss function. The most common approach is to use gradient descent, which involves updating each weight by subtracting a fraction of the gradient from the current weight. The fraction is determined by a learning rate hyperparameter, which controls how much the weights are updated in each iteration. A smaller learning rate will result in smaller updates and slower convergence, while a larger learning rate will result in larger updates and faster convergence, but it can also lead to overshooting and divergence if it’s too large.
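The update itself is the simplest piece of the whole project (sketch; `lr` is the learning rate):

```c
/* One gradient-descent step: w <- w - lr * grad, element-wise. */
void gd_update(double *weights, const double *grads, int n, double lr) {
    for (int i = 0; i < n; i++)
        weights[i] -= lr * grads[i];
}
```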

Train the neural network on your chosen dataset and evaluate its performance on a test set.#

After implementing the forward pass, loss function, backpropagation algorithm, and gradient descent, you can train your neural network on your chosen dataset. This involves iterating over the training data multiple times (epochs) and updating the weights based on the computed gradients in each iteration. You will need to monitor the loss function during training to ensure that it is decreasing, which indicates that the model is learning. After training, you will need to evaluate the performance of your neural network on a separate test set that was not used during training. This will allow you to assess how well the model generalizes to new data. You can use metrics such as accuracy, precision, recall, or F1 score to evaluate the performance of your model, depending on the nature of the problem (classification or regression).

There’s always more…#

There are many additional features you could implement to improve the performance of your neural network, such as regularization techniques (e.g., L1 or L2 regularization). You can also experiment with different hyperparameters, such as the learning rate, batch size, and number of epochs, to see how they affect the training process and the performance of your model. The key is to start with a simple implementation and then iteratively add features and improvements as you go along. One thing that can be quite enlightening is to visualize the decision boundaries of your trained model – that’s something that’s rather nicely done in the TensorFlow playground, and you can try to implement something similar in your code to see how the model is making its predictions.

Performance and Parallelizing with OpenMP#

Once you have a working implementation of your neural network, you can experiment with optimizing its performance. This could involve implementing more efficient algorithms for the forward pass and backpropagation, using more efficient data structures, or parallelizing the computations using OpenMP. Parallelizing the training process can significantly speed up the training time, especially for larger datasets (you can find or create some) and more complex neural network architectures. You can experiment with different levels of parallelism, such as parallelizing the computations within each layer or parallelizing the training across multiple batches of data.
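Parallelizing within a layer is often a one-line change with OpenMP: each output neuron’s weighted sum is independent of the others, so the outer loop can be split across threads with no data races. A sketch (compile with `-fopenmp`; without it, the pragma is simply ignored and the code runs serially):

```c
/* Forward pass of one fully connected layer, parallelized across
   output neurons.  The activation is omitted here; apply it
   afterwards or inline in the loop. */
void layer_forward_omp(const double *in, int n_in,
                       const double *w, const double *b,
                       double *out, int n_out) {
    #pragma omp parallel for
    for (int j = 0; j < n_out; j++) {
        double sum = b[j];
        for (int i = 0; i < n_in; i++)
            sum += w[j * n_in + i] * in[i];
        out[j] = sum;
    }
}
```

Note that for small layers the threading overhead can outweigh the gain, so measure before and after.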

The report#

As with the KdV project, you will need to write a report that describes your implementation and the results of your experiments. The report should include an introduction to the problem you are trying to solve, a description of the architecture of your neural network, an explanation of the training process, and an analysis of the results. You should also include any challenges you faced during implementation and how you addressed them. The report should be well-organized and clearly written, with appropriate figures and tables to illustrate your results.

While I’m not expecting you to re-derive the math of, e.g., the backpropagation algorithm, you should at least give a high-level overview of how it works and how you implemented it. You should also discuss any design choices you made in your implementation, such as the architecture of your neural network, the choice of activation function, and the choice of loss function. Finally, you should analyze the results of your experiments and discuss any insights you gained from them.