
WIP: Making a neural network in C


I have wanted to implement a simple neural network in C for quite a while now. Artificial intelligence and deep learning are changing the world around us, and although I took a deep learning course at my university, it was quite theoretical and the exercises were mostly based on existing Python libraries.

The goal of this project is to implement a dense neural network in C from scratch that can correctly recognise handwritten digits.

You can find the implementation here:

The basics
#

The perceptron
#

The image of a perceptron

The perceptron is the most basic neuron possible: it has several numerical inputs, each of which is multiplied by a weight. If the sum of all these weighted inputs is greater than a threshold, the neuron outputs a one, otherwise a zero.

$$ \text{output} = \begin{cases} 0 &\text{if } \sum_j w_j \cdot x_j \leq \text{threshold} \newline 1 &\text{if } \sum_j w_j \cdot x_j > \text{threshold} \end{cases} $$

We can introduce a bias \(b\) to simplify the notation (the bias is just a number that tells us how easily the threshold can be reached - in fact, it is equal to \(-\text{threshold}\)):

$$ \text{output} = \begin{cases} 0 &\text{if } \sum_j w_j \cdot x_j + b \leq 0 \newline 1 &\text{if } \sum_j w_j \cdot x_j + b > 0 \end{cases} $$
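To make this concrete, here is a minimal perceptron sketch in C following the bias formulation above. The function name and the example weights are my own choices, not from the original post:

```c
#include <stdio.h>

/* A perceptron: weighted sum of the inputs plus a bias,
 * followed by a hard threshold at zero. */
int perceptron_output(const double *x, const double *w, double b, int n)
{
    double sum = b;
    for (int j = 0; j < n; j++)
        sum += w[j] * x[j];
    return sum > 0.0 ? 1 : 0;
}

int main(void)
{
    /* Illustrative weights and inputs only. */
    double w[2] = {0.5, 0.5};
    double x[2] = {1.0, 1.0};
    /* 0.5 + 0.5 - 0.7 = 0.3 > 0, so the perceptron fires. */
    printf("output = %d\n", perceptron_output(x, w, -0.7, 2));
    return 0;
}
```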

These neurons can be connected into a network in which the first layer processes the external inputs and passes its results to the second layer, before the final layer makes (in this case) a binary decision:

The image of a basic neural network

This is in fact how the first neural networks in early animals worked - we can still see the early stages of simple neuronal networks in jellyfish, where the nerves form a net and a ring. The jellyfish responds to a small set of inputs (such as external pressure or temperature) and transforms them into a small set of possible outputs (such as to swim or not to swim) - except that the thresholds in animals are regulated by electrochemical potentials.

The image of a perceptron

Perceptrons can compute basic logical functions like AND, OR and NAND. Networks of perceptrons can be used to implement any logic function - so theoretically one could also build a computer out of neurons 🤠.
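For example (a standard construction, not spelled out in the post), a perceptron with weights \(w_1 = w_2 = -2\) and bias \(b = 3\) computes NAND: for inputs \(1, 1\) the sum is \(-2 - 2 + 3 = -1 \leq 0\), giving 0, while every other input pair gives a positive sum and therefore 1. A quick check in C:

```c
#include <stdio.h>

int main(void)
{
    /* A perceptron with weights -2, -2 and bias 3 behaves like a NAND gate. */
    double w1 = -2.0, w2 = -2.0, b = 3.0;
    for (int x1 = 0; x1 <= 1; x1++)
        for (int x2 = 0; x2 <= 1; x2++) {
            double sum = w1 * x1 + w2 * x2 + b;
            printf("NAND(%d, %d) = %d\n", x1, x2, sum > 0.0 ? 1 : 0);
        }
    return 0;
}
```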

However, unlike logic gates, which cannot change and must be designed explicitly, neurons can be trained - their weights change in order to solve a problem.

The Sigmoid neuron
#

Sigmoid neurons work similarly to the perceptron, but the inputs can be any number between 0 and 1 and the output is also a number between 0 and 1, as specified by the sigmoid function:

$$ \text{output} = \frac{1}{1+e^{-\sum_j w_j \cdot x_j - b}} $$

The sigmoid output function is continuous and smooth, unlike the perceptron’s output, so a small change in the weights and the bias produces a small change in the output of the neuron:

$$ \Delta\text{output} \approx \sum_j \frac{\partial \text{output}}{\partial w_j} \Delta w_j + \frac{\partial \text{output}}{\partial b} \Delta b$$
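In code, a sigmoid neuron is only a small change compared to the perceptron: the hard threshold is replaced by the sigmoid function. A minimal sketch (names are my own; link with `-lm` for `exp`):

```c
#include <math.h>

/* The sigmoid (logistic) function: a smooth version of the step function. */
double sigmoid(double z)
{
    return 1.0 / (1.0 + exp(-z));
}

/* A sigmoid neuron: sigmoid of the weighted input sum plus the bias. */
double sigmoid_output(const double *x, const double *w, double b, int n)
{
    double z = b;
    for (int j = 0; j < n; j++)
        z += w[j] * x[j];
    return sigmoid(z);
}
```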

Layers of neurons that are neither input nor output neurons are called hidden layers, though the name is a bit misleading.

Hidden layer

If we wanted to determine whether a 64x64 image shows a nine, we’d need 4096 input neurons and an output layer with only one neuron: if its output were greater than 0.5 we’d say we have a 9, and if it were less than that we’d say we don’t. The number of hidden neurons and layers is largely a question of heuristics.

Simple network for digit classification
#

We’re aiming for something like this (where each digit is a 28x28 greyscale image, so 784 inputs):

Goal network

The index of the neuron with the highest output value corresponds to the digit in the picture. I’m using the MNIST data set:

MNIST

The desired output \(y(x)\) for a given input image \(x\) showing a 6 is then the vector: $$y(x)=\begin{pmatrix} 0 \newline 0 \newline 0 \newline 0 \newline 0 \newline 0 \newline 1 \newline 0 \newline 0 \newline 0 \newline \end{pmatrix}$$
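In C, building this one-hot target vector from a label and reading off the network’s prediction (the index of the largest output, as described above) might look like the following sketch; the function names are mine:

```c
#define NUM_CLASSES 10

/* Fill y with the one-hot encoding of a digit label (0-9). */
void one_hot(double y[NUM_CLASSES], int label)
{
    for (int i = 0; i < NUM_CLASSES; i++)
        y[i] = (i == label) ? 1.0 : 0.0;
}

/* The predicted digit is the index of the largest output activation. */
int predict(const double output[NUM_CLASSES])
{
    int best = 0;
    for (int i = 1; i < NUM_CLASSES; i++)
        if (output[i] > output[best])
            best = i;
    return best;
}
```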

We need to define a metric to see how well our system determines the output - this can be done with a cost function:

$$C(w, b) = \frac{1}{2n} \sum_x ||y(x)-a||^2$$

Where \(w\) are the weights, \(b\) the biases, \(n\) the number of training inputs, \(a\) the vector of outputs of the network when \(x\) is the input, and \(y(x)\) the correct solution. \(||v||\) denotes the usual length (Euclidean norm) of a vector \(v\); up to constant factors, this quadratic cost is the mean squared error:

$$MSE= \frac{1}{n}\sum_{i=1}^n (y_i- \hat{y}_i)^2$$
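A sketch of the quadratic cost term for a single training example in C (the averaging over all \(n\) examples would be done by the caller; the function name is mine):

```c
/* 1/2 * ||y(x) - a||^2 for one training example; the full cost
 * C(w, b) is the average of this value over all n training inputs. */
double quadratic_cost_term(const double *y, const double *a, int len)
{
    double sum = 0.0;
    for (int i = 0; i < len; i++) {
        double d = y[i] - a[i];
        sum += d * d;
    }
    return 0.5 * sum;
}
```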

Obviously we wish to minimise the cost function, but how can we do so? The answer is to use gradient descent to find the (hopefully global) minimum of this multidimensional function. The gradient of \(C\) is equal to:

$$\nabla C(w, b) = \begin{pmatrix} \frac{\partial C(w,b)}{\partial w} & \frac{\partial C(w,b)}{\partial b} \end{pmatrix}^T$$

and we can write the change of the cost as \(\Delta C \approx \nabla C(v) \cdot \Delta v\). If we choose \(\Delta v = - \eta \nabla C(v)\), where \(\eta\) is a small positive number referred to as the learning rate (ideally \(\eta = \frac{||\Delta v||}{||\nabla C(v)||}\)), then \(\Delta C \approx -\eta ||\nabla C(v)||^2 \leq 0\) because \(||\nabla C(v)||^2 \geq 0\) - so \(C(v)\) never increases. This gives us the update rule:

$$w_{\text{new}}=w-\eta \frac{\partial C}{\partial w}$$ $$b_{\text{new}}=b-\eta \frac{\partial C}{\partial b}$$
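Applied to a flat array of parameters, this update rule is just a loop. A minimal sketch, assuming the gradient has already been computed by some other routine:

```c
/* One gradient descent step: move each parameter against its gradient.
 * Works the same way for weights and biases. */
void gradient_descent_step(double *params, const double *grad,
                           int count, double eta)
{
    for (int i = 0; i < count; i++)
        params[i] -= eta * grad[i];
}
```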

Sadly, calculating all of these partial derivatives numerically (for example with Newton’s difference quotient) for every update is too costly, simply because of the sheer number of weights.

In order to compute the gradient efficiently, we can instead estimate it from a small sample (a mini-batch) of randomly chosen training inputs - averaging over this sample gives a pretty good estimate of the true gradient. This is called stochastic gradient descent:

$$w_{\text{new}}=w-\frac{\eta}{m} \sum_{k=1}^m \frac{\partial C_{x_k}}{\partial w}$$ $$b_{\text{new}}=b-\frac{\eta}{m} \sum_{k=1}^m \frac{\partial C_{x_k}}{\partial b}$$

where \(m\) is the mini-batch size and \(C_{x_k}\) is the cost for the single training input \(x_k\).
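A sketch of one stochastic gradient descent update in C. The callback that computes a per-example gradient is hypothetical and would be supplied by the caller; it is not part of the original post:

```c
#include <stdlib.h>

/* Hypothetical callback: adds the gradient of the cost for the training
 * example with the given index into grad_sum. Supplied by the caller. */
typedef void (*example_grad_fn)(double *grad_sum, int param_count, int example_idx);

/* One SGD step: average the per-example gradients over a random
 * mini-batch of size m, then move the parameters against the average. */
void sgd_step(double *params, int param_count, int n_examples,
              int m, double eta, example_grad_fn grad_of_example)
{
    double *grad_sum = calloc((size_t)param_count, sizeof(double));
    if (!grad_sum)
        return;
    for (int k = 0; k < m; k++)
        grad_of_example(grad_sum, param_count, rand() % n_examples);
    for (int i = 0; i < param_count; i++)
        params[i] -= (eta / m) * grad_sum[i];  /* w <- w - (eta/m) * sum */
    free(grad_sum);
}
```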

A great visualisation of everything I summarise here is available in this video:

The implementation
#

This implementation is a mix of the Python code for stochastic gradient descent in Nielsen’s book and Bhole’s neural network implementation in C, but without backpropagation 😜.
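The actual code lives in the linked repository; as a rough idea of the data layout, a dense network in C could be organised along these lines (my own sketch, not necessarily how the repository structures it):

```c
/* One fully connected (dense) layer: out = activation(W * in + b). */
typedef struct {
    int inputs;        /* number of inputs to this layer            */
    int neurons;       /* number of neurons (outputs) in this layer */
    double *weights;   /* neurons * inputs values, row-major        */
    double *biases;    /* one bias per neuron                       */
    double *outputs;   /* activations computed in the forward pass  */
} layer_t;

typedef struct {
    int n_layers;
    layer_t *layers;   /* e.g. 784 -> hidden -> 10 for MNIST digits */
} network_t;
```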

Sources
#
