
WIP: Making a neural network in C


I've wanted to implement a simple neural network in C for quite a while now - artificial intelligence and deep learning are changing the world around us, and although I took a deep learning course at my university, it was quite theoretical and the exercises were mostly based on existing Python libraries.

The goal of this project is to implement a dense neural network in C from scratch that can correctly recognise handwritten numbers.

You can find the implementation in the Codeberg repository: Turingon/HomeNN

The basics
#

The perceptron
#

The image of a perceptron

The perceptron is the most basic neuron possible: it has several numerical inputs, each of which is multiplied by a weight. If the sum of all these weighted inputs is greater than a threshold, the neuron outputs a one, otherwise a zero.

$$ \text{output} = \begin{cases} 0 &\text{if } \sum_j w_j \cdot x_j \leq \text{threshold} \newline 1 &\text{if } \sum_j w_j \cdot x_j > \text{threshold} \end{cases} $$

Though we can introduce a bias \(b\) to simplify the notation (the bias is just a number that tells us how easily a threshold can be reached, in fact it is equal to \(-\text{threshold}\)):

$$ \text{output} = \begin{cases} 0 &\text{if } \sum_j w_j \cdot x_j + b \leq 0 \newline 1 &\text{if } \sum_j w_j \cdot x_j + b > 0 \end{cases} $$

These neurons can be connected together to form a network in which the first layer of neurons processes the external inputs and passes its results on to the second layer, before the final layer makes (in this case) a binary decision:

The image of a basic neural network

This is in fact how the first neural networks in early animals worked. We can still see the early stages of simple neuronal networks in jellyfish, where the nerves form a net and a ring: the jellyfish responds to a small set of inputs (such as external pressure and temperature) and transforms them into a small set of possible outputs (such as to swim or not to swim) - except that in animals the thresholds are regulated by electrochemical potentials.

The image of a perceptron

Perceptrons can compute basic logic functions like AND, OR and NAND. Networks of perceptrons can be used to implement any logic function - so theoretically one could also build a computer out of neurons 🤠.
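
As a quick sanity check - a minimal sketch of my own, not code from the repo - here is a NAND perceptron using the classic weights -2, -2 and bias 3:

#include <stdio.h>

// A single perceptron computing NAND: with weights -2, -2 and bias 3 the
// weighted sum is positive for every input combination except (1, 1).
int nand_perceptron(int x1, int x2) {
    int sum = -2 * x1 - 2 * x2 + 3;
    return sum > 0 ? 1 : 0;
}

int main(void) {
    for (int x1 = 0; x1 <= 1; x1++)
        for (int x2 = 0; x2 <= 1; x2++)
            printf("NAND(%d, %d) = %d\n", x1, x2, nand_perceptron(x1, x2));
    return 0;
}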

However, unlike logic gates, which must be designed explicitly and cannot change, neurons can be trained to adjust their weights in order to solve a problem.

The Sigmoid neuron
#

Sigmoid neurons work similarly to the perceptron, but the inputs can be any number between 0 and 1 and the output is also a number between 0 and 1, as specified by the sigmoid function:

$$ \text{output} = \frac{1}{1+e^{-\sum_j w_j \cdot x_j - b}} $$

The sigmoid output function is continuous and smooth unlike the perceptron’s output, so a small change in the weights and the bias produces a small change in the output of the neuron:

$$ \Delta\text{output} \approx \sum_j \frac{\partial \text{output}}{\partial w_j} \Delta w_j + \frac{\partial \text{output}}{\partial b} \Delta b$$

Layers of neurons that are neither input nor output neurons are called hidden layers - the name is a bit misleading, since "hidden" just means "not an input or output".

Hidden layer

If we wanted to determine whether a 64x64 image shows a nine, we'd need 4096 inputs and the output layer would only have one neuron: if its output were greater than 0.5 we'd say we have a 9, and if it were less than that we'd say we don't. The number of hidden neurons and layers is largely a question of heuristics.

Simple network for digit classification
#

We’re aiming for something like this (where each digit has 28x28 greyscale pixels):

Goal network

The index of the neuron with the highest output value corresponds to the digit in the picture. I'm using the MNIST data set:

MNIST

The desired output \(y(x)\) for a given input \(x=6\) is then the vector: $$y(x)=\begin{pmatrix} 0 \newline 0 \newline 0 \newline 0 \newline 0 \newline 0 \newline 1 \newline 0 \newline 0 \newline 0 \newline \end{pmatrix}$$
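
In code this is just a one-hot vector. A small helper along these lines (my own naming, not from the repo) will come in handy later when we compare network outputs against labels:

// Build the one-hot target vector y(x) for a digit label 0-9,
// e.g. label 6 -> (0,0,0,0,0,0,1,0,0,0).
void label_to_target(unsigned char label, float *target) {
    for (int i = 0; i < 10; i++) {
        target[i] = (i == label) ? 1.0f : 0.0f;
    }
}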

We need to define a metric to see how well our system determines the output - this can be done with a cost function:

$$C(w, b) = \frac{1}{2n} \sum_x ||y(x)-a||^2$$

Where \(w\) are the weights, \(b\) the biases, \(n\) the number of training inputs, \(a\) the vector of outputs when \(x\) is the input and \(y(x)\) the correct solution. \(||v||\) denotes the usual Euclidean length of a vector \(v\), so \(C\) is essentially (half) the mean squared error:

$$MSE= \frac{1}{n}\sum_{i=1}^n (y_i- \hat{y}_i)^2$$
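
To make the definition concrete, here is a small sketch of my own (not from the repo) of the quadratic cost for a single training example, i.e. the \(\frac{1}{2}||y(x)-a||^2\) term before averaging over \(n\):

// Quadratic cost for one training example: 0.5 * sum_i (y_i - a_i)^2.
// Averaging this over all n training examples gives C(w, b).
float quadratic_cost(const float *y, const float *a, int n_outputs) {
    float sum = 0.0f;
    for (int i = 0; i < n_outputs; i++) {
        float diff = y[i] - a[i];
        sum += diff * diff;
    }
    return 0.5f * sum;
}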

Obviously we wish to minimise the cost function, but how can we do so? The answer is to use gradient descent to find the (hopefully global) minimum of this multidimensional function. The gradient of \(C\) is equal to:

$$\nabla C(w, b) = \begin{pmatrix} \frac{\partial C(w,b)}{\partial w} & \frac{\partial C(w,b)}{\partial b} \end{pmatrix}^T$$

and we can write \(\Delta C(v) \approx \nabla C(v) \cdot \Delta v\). We choose \(\Delta v = - \eta \nabla C(v)\), where \(\eta\) is a small positive number referred to as the learning rate (to take a step of fixed size \(||\Delta v||\) one would pick \(\eta = \frac{||\Delta v||}{||\nabla C(v)||}\)). Then \(\Delta C(v) \approx -\eta ||\nabla C(v)||^2\), and since \(||\nabla C(v)||^2 \geq 0\) this change is always less than or equal to zero - so \(C(v)\) will decrease. So we can construct the rule:

$$w_{\text{new}}=w-\eta \frac{\partial C}{\partial w}$$ $$b_{\text{new}}=b-\eta \frac{\partial C}{\partial b}$$

Sadly, calculating all of these derivatives exactly for every single training input can be too costly, just because of the sheer number of weights and training examples.

In order to estimate the gradient efficiently, we can compute it only for a small sample of randomly chosen training inputs - averaging over this mini-batch gives a pretty good estimate. This is called stochastic gradient descent:

$$w_{\text{new}}=w-\frac{\eta}{m} \sum_{k=1}^m \frac{\partial C_{x_k}}{\partial w}$$ $$b_{\text{new}}=b-\frac{\eta}{m} \sum_{k=1}^m \frac{\partial C_{x_k}}{\partial b}$$

where \(m\) is the size of the mini-batch and \(C_{x_k}\) is the cost for the single training input \(x_k\).
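
To see the update rule in action on something tiny - a toy sketch of my own, not part of the network - here is gradient descent minimising \(f(w) = (w-3)^2\), whose gradient is \(2(w-3)\):

#include <stdio.h>

// Toy example: minimise f(w) = (w - 3)^2 with plain gradient descent.
// The gradient is f'(w) = 2 * (w - 3), and every step applies
// w_new = w - eta * f'(w).
int main(void) {
    float w = 0.0f;    // starting guess
    float eta = 0.1f;  // learning rate
    for (int step = 0; step < 50; step++) {
        float grad = 2.0f * (w - 3.0f);
        w -= eta * grad;
    }
    printf("w converged to %f (expected 3.0)\n", w);
    return 0;
}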

A great visualisation of everything I summarise here is available in this video:

The implementation
#

This implementation is a mix of the Python code for stochastic gradient descent and backpropagation in Nielsen's book and the architecture of Bhole's neural network implementation in C.

Loading the MNIST Dataset
#

One of the first challenges is to actually read the MNIST dataset correctly - since we're working with C, we have to know how the data is structured. The training images come in the .idx3-ubyte format, in which the data is laid out as follows:

magic number 
size in dimension 0 
size in dimension 1 
size in dimension 2 
..... 
size in dimension N 
data

The magic number (and the data) is written MSB first and tells us whether we are dealing with an image or a label dataset. Its third byte tells us what kind of data we're dealing with:

0x08: unsigned byte 
0x09: signed byte 
0x0B: short (2 bytes) 
0x0C: int (4 bytes) 
0x0D: float (4 bytes) 
0x0E: double (8 bytes)

The fourth byte encodes the number of dimensions of the vector/matrix: 1 for vectors, 2 for matrices, and so on. The sizes in each dimension are 4-byte integers (MSB first, i.e. big endian, like in most non-Intel processors). The data is stored like a C array, i.e. the index in the last dimension changes the fastest.

The only standard headers we'll be using are:

#include <stdio.h> // for reading and printing outputs
#include <stdlib.h> // for malloc

Since Intel/AMD processors are little endian (LSB first), we first need to reverse the byte order of the integers we read:

unsigned int reverseBytes(unsigned int num) {
    return ((num >> 24) & 0x000000FF) |
           ((num >> 8) & 0x0000FF00) |
           ((num << 8) & 0x00FF0000) |
           ((num << 24) & 0xFF000000);
}
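
If the loader should also work on big-endian machines, one possible refinement (my own sketch, not in the original code) is to detect the host byte order at runtime and only swap when necessary:

// Returns 1 if the host is little endian (LSB first), 0 otherwise.
// On little-endian hosts the MSB-first MNIST integers must be reversed;
// on big-endian hosts they can be used as-is.
int isLittleEndian(void) {
    unsigned int probe = 1;
    return *(unsigned char *)&probe == 1;
}

On the Intel/AMD machines targeted here the swap is always needed, so the code below simply calls reverseBytes unconditionally.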

We'll be saving the images as arrays of 1-byte pixels:

int main(void) {
   int numImg;
   int imgSize;
   unsigned char** img;
   return 0;
}

To test that everything works we'll print the first image. For this we'll create a PPM image - the same image format that I've already used in my raytracing adventure, Raytracing with Go.


This way we don't need any special headers or libraries, and at least on Linux PPM images are viewable by default.

void toPPM(const char* filename, unsigned char* img, int width, int height) {
    FILE* file = fopen(filename, "w");
    if (!file) {
        printf("Couldn't open %s\n", filename);
        return;
    }

    fprintf(file, "P3\n");
    fprintf(file, "%d %d\n", width, height);
    fprintf(file, "255\n");

    for (int i = 0; i < height; i++) {
        for (int j = 0; j < width; j++) {
            unsigned char pixel = img[i * width + j];
            // since we're working with greyscale it holds that R=G=B
            fprintf(file, "%d %d %d ", pixel, pixel, pixel);
        }
        fprintf(file, "\n");
    }

    fclose(file);
}

Next up is the actual reading of the dataset and saving it in a dynamically allocated array. First we open the file containing the data, then we check the magic number (it should be 2051 for the image dataset), read the number of images, rows and columns, and finally we read the pixel data image by image:

unsigned char** loadImg(const char* filename, int* numImg, int* imgSize) {
    FILE* file = fopen(filename, "rb");
    if (!file) {
        printf("Couldn't open %s\n", filename);
        return NULL;
    }

    unsigned int magicNumber, nItems, nRows, nCols;
    fread(&magicNumber, sizeof(unsigned int), 1, file);
    magicNumber = reverseBytes(magicNumber);
    if (magicNumber != 2051) {
        printf("Invalid magic number\n");
        fclose(file);
        return NULL;
    }

    fread(&nItems, sizeof(unsigned int), 1, file);
    *numImg = reverseBytes(nItems);
    fread(&nRows, sizeof(unsigned int), 1, file);
    fread(&nCols, sizeof(unsigned int), 1, file);
    *imgSize = reverseBytes(nRows) * reverseBytes(nCols);

    unsigned char** img = (unsigned char**)malloc(*numImg * sizeof(unsigned char*));
    for (int i = 0; i < *numImg; i++) {
        img[i] = (unsigned char*)malloc(*imgSize * sizeof(unsigned char));
        fread(img[i], sizeof(unsigned char), *imgSize, file);
    }

    fclose(file);
    return img;
}

So now we can finalise our main function:

int main(void) {
   int numImg;
   int imgSize;
   unsigned char** img = loadImg("train-images-idx3-ubyte", &numImg, &imgSize);

   toPPM("first_image.ppm", img[0], 28, 28);

   for (int i = 0; i < numImg; i++) {
      free(img[i]);
   }
   free(img);
   return 0;
}

And it works! The first two images are:

The next step is to read out the correct image label values, which are also provided as part of the dataset. The magic number for the label set is 2049.

unsigned char* loadLabels(const char* filename, int* numLabels) {
    FILE* file = fopen(filename, "rb");
    if (!file) {
        printf("Couldn't open %s\n", filename);
        return NULL;
    }

    unsigned int magicNumber, nItems;
    fread(&magicNumber, sizeof(unsigned int), 1, file);
    magicNumber = reverseBytes(magicNumber);
    if (magicNumber != 2049) {
        printf("Invalid magic number\n");
        fclose(file);
        return NULL;
    }

    fread(&nItems, sizeof(unsigned int), 1, file);
    *numLabels = reverseBytes(nItems);

    unsigned char* labels = (unsigned char*)malloc(*numLabels * sizeof(unsigned char));
    fread(labels, sizeof(unsigned char), *numLabels, file);

    fclose(file);
    return labels;
}

Giving the final main function:

int main(void) {
    int numImg;
    int imgSize;
    unsigned char** img = loadImg("train-images.idx3-ubyte", &numImg, &imgSize);
    int numLabels;
    unsigned char* labels = loadLabels("train-labels.idx1-ubyte", &numLabels);

    toPPM("first_image.ppm", img[0], 28, 28);
    printf("The labels are %u, %u\n", labels[0], labels[1]);

    for (int i = 0; i < numImg; i++) {
        free(img[i]);
    }
    free(img);
    free(labels);
    return 0;
}

And we get 5 and 0, just as expected! All of this could easily be done with a few Python functions, so it’s no surprise Python and R are so dominant in data science. Now we can start programming the actual neural network.

Architecture of the neural network
#

Here we’ll create two new structs starting with neuron:


typedef struct neuron {
    float a;       // Activation value
    float *w; // Weights to the next layer
    float b;       // Bias term
    float z;          // Weighted sum of inputs
    float da;      // Derivative of activation function
    float dz;      // Derivative of the "weighted input"
    float *dw;        // Gradient of cost weights
    float db;      // Gradient of cost bias
} neuron;

It will contain the activation value, bias, weights as well as values we’ll need for stochastic gradient descent:

typedef struct layer {
    int numNeuro;      // Number of neurons
    struct neuron *neuro; // Array of neurons
} layer;

The layer struct simply represents one layer of the network and holds its array of neurons.

Next, the layers, weights and neurons need to be initialised. We'll create 28^2 = 784 neurons in the input layer, 15 neurons in the hidden layer and of course 10 neurons in the output layer. In addition, we set the biases and weights to small random values (random numbers between 0.0 and 1.0):

void init_architecture(struct layer *layers) {
    srand(42);

    // Input Layer
    layers[0].numNeuro = 28 * 28;
    layers[0].neuro = (struct neuron *)malloc(layers[0].numNeuro * sizeof(struct neuron));
    for (int i = 0; i < layers[0].numNeuro; i++) {
        layers[0].neuro[i].w = (float *)malloc(15 * sizeof(float));
        layers[0].neuro[i].b = ((float)rand() / RAND_MAX);
        layers[0].neuro[i].z = 0.0;
        layers[0].neuro[i].a = 0.0;
        layers[0].neuro[i].da = 0.0;
        layers[0].neuro[i].dw = (float *)malloc(15 * sizeof(float));
        layers[0].neuro[i].dz = 0.0;
        layers[0].neuro[i].db = 0.0;
        for (int j = 0; j < 15; j++) {
            layers[0].neuro[i].w[j] = ((float)rand() / RAND_MAX);
            layers[0].neuro[i].dw[j] = 0.0; 
        }
    }

    // Hidden layer
    layers[1].numNeuro = 15;
    layers[1].neuro = (struct neuron *)malloc(layers[1].numNeuro * sizeof(struct neuron));
    for (int i = 0; i < layers[1].numNeuro; i++) {
        layers[1].neuro[i].w = (float *)malloc(10 * sizeof(float));
        layers[1].neuro[i].b = ((float)rand() / RAND_MAX);
        layers[1].neuro[i].z = 0.0;
        layers[1].neuro[i].a = 0.0;
        layers[1].neuro[i].da = 0.0;
        layers[1].neuro[i].dw = (float *)malloc(10 * sizeof(float));
        layers[1].neuro[i].dz = 0.0;
        layers[1].neuro[i].db = 0.0;
        for (int j = 0; j < 10; j++) {
            layers[1].neuro[i].w[j] = ((float)rand() / RAND_MAX);
            layers[1].neuro[i].dw[j] = 0.0;
        }
    }

    // Output layer
    layers[2].numNeuro = 10;
    layers[2].neuro = (struct neuron *)malloc(layers[2].numNeuro * sizeof(struct neuron));
    for (int i = 0; i < layers[2].numNeuro; i++) {
        layers[2].neuro[i].w = NULL; // Set to NULL because there are no more weights
        layers[2].neuro[i].b = ((float)rand() / RAND_MAX);
        layers[2].neuro[i].z = 0.0; 
        layers[2].neuro[i].a = 0.0;
        layers[2].neuro[i].da = 0.0;
        layers[2].neuro[i].dw = NULL;
        layers[2].neuro[i].dz = 0.0;
        layers[2].neuro[i].db = 0.0;
    }
}
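
init_architecture allocates quite a bit of memory, so a matching cleanup helper is useful. This is my own sketch for the fixed 784-15-10 layout, not code from the repo:

// Free everything allocated by init_architecture for the 3-layer network.
void free_architecture(struct layer *layers) {
    for (int l = 0; l < 3; l++) {
        for (int i = 0; i < layers[l].numNeuro; i++) {
            free(layers[l].neuro[i].w);  // free(NULL) is a no-op for the output layer
            free(layers[l].neuro[i].dw);
        }
        free(layers[l].neuro);
    }
}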

Preparing the forward pass
#

Firstly we'll have to implement the exponential function. It's the only function we would need from math.h, which is why we don't include that header. We'll use the Taylor series of degree 3 (within \([-1,1]\) this yields a small error) and restrict ourselves to floats (doubles would give more accuracy, but I'm aiming for memory efficiency):

$$e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!}\approx 1 + x + \frac{x^2}{2}+\frac{x^3}{6}$$

float exp(float x) {
    return 1.0 + x + x*x/2.0 + x*x*x/6.0;
}
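
The cubic Taylor polynomial is only accurate near zero; for weighted sums of larger magnitude, one possible refinement (my own sketch, not used in the post) is range reduction via \(e^x = (e^{x/2})^2\):

// Optional refinement: halve x until it lies in [-1, 1], evaluate the cubic
// Taylor polynomial there, then square the result once per halving,
// since e^x = (e^(x/2))^2.
float exp_reduced(float x) {
    int halvings = 0;
    while (x > 1.0f || x < -1.0f) {
        x *= 0.5f;
        halvings++;
    }
    float r = 1.0f + x + x * x / 2.0f + x * x * x / 6.0f;
    while (halvings-- > 0) {
        r *= r;
    }
    return r;
}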

We now define the sigmoid function:

float sigmoid(float x) {
    return 1.0 / (1.0 + exp(-x));
}

To implement the feed-forward function we need to apply this equation:

$$\textit{a}^i_j = \sigma(\sum_k \textit{w}^i_{jk} \cdot a_k^{i-1}+ b_j^i)$$

In this equation \(a\) is the activation, \(w^i_{jk}\) is the weight connecting neuron \(k\) of layer \(i-1\) to neuron \(j\) of layer \(i\), \(b\) are the biases, \(i\) denotes the layer and \(j\) the neuron number. This is in line with the notation in the visualisation below:

Visualisation of the notation

Before the feed-forward pass we also divide the input pixels by 256 to normalise them into \([0, 1)\) - this makes sure our values are in a range the sigmoid function handles well:

void normalize_input(float *input) {
    for (int i = 0; i < 784; i++) {
        input[i] /= 256.0;
    }
}
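
The MNIST pixels are loaded as unsigned char while the network works with float activations, so a little glue is needed before normalize_input can be applied. A helper along these lines (my own naming, hypothetical, not from the repo) copies one image into a float buffer:

// Hypothetical glue helper: copy one 28x28 MNIST image (bytes) into the
// float buffer the network works with; normalize_input() then scales it.
void image_to_input(const unsigned char *pixels, float *input) {
    for (int i = 0; i < 784; i++) {
        input[i] = (float)pixels[i];
    }
}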

and now the feedforward function itself:


void feedforward(struct layer *layers, float *input) {
    for (int i = 0; i < layers[0].numNeuro; i++) {
        layers[0].neuro[i].a = input[i];
    }

    for (int l = 1; l < 3; l++) {
        for (int j = 0; j < layers[l].numNeuro; j++) {
            layers[l].neuro[j].z = layers[l].neuro[j].b;
            for (int k = 0; k < layers[l-1].numNeuro; k++) {
                // weights live on the source neuron: w[j] of neuron k in
                // layer l-1 connects it to neuron j in layer l
                layers[l].neuro[j].z += layers[l-1].neuro[k].a * layers[l-1].neuro[k].w[j];
            }

            layers[l].neuro[j].a = sigmoid(layers[l].neuro[j].z);
        }
    }

    for (int i = 0; i < layers[2].numNeuro; i++) {
        printf("OUTPUT: %f\n", layers[2].neuro[i].a);
    }
}
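
Tying the pieces together, a minimal driver could look like this - a sketch under my own assumptions, reusing the hypothetical image_to_input helper and the free_architecture cleanup from above:

int main(void) {
    int numImg, imgSize;
    unsigned char** img = loadImg("train-images.idx3-ubyte", &numImg, &imgSize);
    if (!img) return 1;

    struct layer layers[3];
    init_architecture(layers);

    float input[784];
    image_to_input(img[0], input);  // first training image as network input
    normalize_input(input);
    feedforward(layers, input);     // prints the 10 (still untrained) outputs

    free_architecture(layers);
    for (int i = 0; i < numImg; i++) {
        free(img[i]);
    }
    free(img);
    return 0;
}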

Preparing the backward pass
#

The backward pass computes the gradients we need in order to update the weights and minimise the mean squared error (as defined above). In order to do so it is useful to know that the sigmoid function \(S(x)\) has a simple derivative:

$$S'(x) = S(x)(1-S(x))$$

Back to backpropagation. We first define the weighted input quantity \(z^l = w^l a^{l-1}+b^l\). This quantity allows us to define the error of neuron j in layer l with:

$$\delta^l_{j} = \frac{\partial C}{\partial z^l_j}$$

Using this definition it is possible to build backpropagation on four basic equations (here in the component form from Bhole, where \(dz\) plays the role of the error \(\delta\)):

$$\delta^{\textit{output layer}} = (a^i(x)-y(x))\cdot S'(z)$$

$$\delta^{\textit{hidden layer}} = (w^{i+1} \cdot dz^{i+1})\cdot S'(z)$$

$$dw_{jk}^i = dz_j^i \cdot a^{i-1}_k$$

$$db_j^i = dz_j^i$$

This all gives the following code:

void back_prop(struct layer *layers, float *desired_outputs) {
    int i, j, k;

    // Output Layer
    for (j = 0; j < layers[2].numNeuro; j++) {
        layers[2].neuro[j].dz = (layers[2].neuro[j].a - desired_outputs[j]) *
                                             layers[2].neuro[j].a *
                                             (1 - layers[2].neuro[j].a);

        for (k = 0; k < layers[1].numNeuro; k++) {
            layers[1].neuro[k].dw[j] = layers[2].neuro[j].dz * layers[1].neuro[k].a;
            layers[1].neuro[k].da += layers[1].neuro[k].w[j] * layers[2].neuro[j].dz;
        }

        layers[2].neuro[j].db = layers[2].neuro[j].dz;
    }

    // Hidden Layers
    for (i = 1; i > 0; i--) {
        for (j = 0; j < layers[i].numNeuro; j++) {
            layers[i].neuro[j].dz = layers[i].neuro[j].da * layers[i].neuro[j].a * (1 - layers[i].neuro[j].a);

            for (k = 0; k < layers[i - 1].numNeuro; k++) {
                layers[i - 1].neuro[k].dw[j] = layers[i].neuro[j].dz * layers[i - 1].neuro[k].a;

                if (i > 1) {
                    layers[i - 1].neuro[k].da += layers[i - 1].neuro[k].w[j] * layers[i].neuro[j].dz;
                }
            }

            layers[i].neuro[j].db = layers[i].neuro[j].dz;
        }
    }
}
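
back_prop only fills in the gradients (dw, db) - the actual stochastic-gradient-descent update from the equations above still has to be applied. The following is my own sketch of that missing step, assuming plain per-example updates and the fixed 3-layer layout; it also resets the accumulated da error signals, which back_prop only ever adds to:

// Apply w_new = w - eta * dw and b_new = b - eta * db after back_prop(),
// then clear the gradients and accumulated error signals for the next pass.
void update_params(struct layer *layers, float eta) {
    // weights (and dw) live on the source neuron, indexed by target neuron
    for (int l = 0; l < 2; l++) {
        for (int j = 0; j < layers[l].numNeuro; j++) {
            for (int k = 0; k < layers[l + 1].numNeuro; k++) {
                layers[l].neuro[j].w[k] -= eta * layers[l].neuro[j].dw[k];
                layers[l].neuro[j].dw[k] = 0.0f;
            }
            layers[l].neuro[j].da = 0.0f;
        }
    }
    // biases are only used in the hidden and output layers
    for (int l = 1; l < 3; l++) {
        for (int j = 0; j < layers[l].numNeuro; j++) {
            layers[l].neuro[j].b -= eta * layers[l].neuro[j].db;
            layers[l].neuro[j].db = 0.0f;
        }
    }
}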

Sources
#
