Parameters

- NORMALIZE: Whether or not the MNIST pixel values should be divided by 255, the maximum value for a pixel
- BATCH_SIZE: The mini-batch size used for training the model. The idea is that an initial BATCH_SIZE examples are processed, then the weights are updated, and the next BATCH_SIZE examples are considered
- NUM_EPOCHS: The number of training epochs used during the experiments
- NUM_FILTERS: The number of convolutional filters / feature maps applied
- NUM_POOL: The side length for max pooling operations
- NUM_CONV: The side length for convolution operations
- DROPOUT_RATE: The dropout rate. It acts as a form of regularization by randomly dropping connections in the network during training
- NUM_HIDDEN: Number of hidden neurons used in the dense network after applying convolutional and maxpooling operations
- VALIDATION_SPLIT: The percentage of training data used as validation data
- OPTIMIZER: One of SGD, Adam, or RMSprop
- REGULARIZER: Either L1, or L2, or L1+L2 (ElasticNet)
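The parameters above might be declared as follows. This is only a sketch: the names mirror the list, but the concrete values are illustrative assumptions, not prescribed by the text.

```python
# Hypothetical parameter values; the names follow the list above,
# the numbers are illustrative assumptions.
NORMALIZE = True          # divide pixel values by 255
BATCH_SIZE = 128          # examples processed per weight update
NUM_EPOCHS = 20           # passes over the training set
NUM_FILTERS = 32          # convolutional feature maps
NUM_POOL = 2              # 2x2 max pooling windows
NUM_CONV = 3              # 3x3 convolution kernels
DROPOUT_RATE = 0.25       # fraction of connections dropped
NUM_HIDDEN = 128          # neurons in the dense layer
VALIDATION_SPLIT = 0.2    # 20% of training data used for validation
OPTIMIZER = "adam"        # one of "sgd", "adam", "rmsprop"
REGULARIZER = "l2"        # "l1", "l2", or "l1_l2" (ElasticNet)
```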

At the beginning the MNIST images are loaded: the training set contains 60,000 examples, and the test set 10,000 examples. Then, the data is converted into float32, a format well suited to GPU computation. Optionally, the data is normalized. The true labels for training and test are converted from the original [0-9] range into a One Hot Encoding (OHE) representation, a prerequisite for the subsequent classification.
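These preprocessing steps can be sketched with plain NumPy. Random arrays stand in for the actual MNIST download so the snippet runs standalone; the shapes and value range match MNIST's 28x28 grayscale images.

```python
import numpy as np

NUM_CLASSES = 10

# Stand-in for a handful of MNIST images (28x28, pixel values 0-255)
# and their labels; in practice these come from the MNIST loader.
x_train = np.random.randint(0, 256, size=(6, 28, 28))
y_train = np.array([5, 0, 4, 1, 9, 2])

# Convert to float32 and (optionally) normalize into [0, 1].
x_train = x_train.astype("float32") / 255.0

# One Hot Encoding: label k becomes a length-10 vector with a 1 at index k.
y_train_ohe = np.eye(NUM_CLASSES, dtype="float32")[y_train]
```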

Then, we implement the proper LeNet architecture. An initial set of convolutional operations of size NUM_CONV x NUM_CONV is applied. It produces NUM_FILTERS outputs and uses a rectified linear (ReLU) activation function. The output of this layer is passed through a similar convolutional layer and a subsequent max pooling layer with size NUM_POOL x NUM_POOL. To avoid overfitting, an additional module drops out connections with rate DROPOUT_RATE. This initial block is then repeated with a second, identical block. Notice that Keras automatically computes the dimensions of the data as it moves and is transformed across the different blocks.
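The dimension bookkeeping that Keras does automatically can be traced by hand. The sketch below assumes "valid" (no-padding) convolutions and non-overlapping pooling, with the illustrative values NUM_CONV = 3 and NUM_POOL = 2:

```python
def conv_out(side, kernel):
    # a "valid" convolution shrinks each side by kernel - 1
    return side - kernel + 1

def pool_out(side, pool):
    # non-overlapping max pooling divides each side by the pool size
    return side // pool

NUM_CONV, NUM_POOL = 3, 2   # assumed kernel and pool sizes
side = 28                   # MNIST image side length

# First block: conv -> conv -> maxpool (dropout does not change the shape)
side = pool_out(conv_out(conv_out(side, NUM_CONV), NUM_CONV), NUM_POOL)
# Second, identical block
side = pool_out(conv_out(conv_out(side, NUM_CONV), NUM_CONV), NUM_POOL)

print(side)  # 28 -> 26 -> 24 -> 12 -> 10 -> 8 -> 4
```

After the two blocks, each of the NUM_FILTERS feature maps has shrunk from 28x28 to 4x4, which is what gets flattened and fed to the dense layers.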

After the proper convnet layers, two dense layers are introduced. The first one is a layer with NUM_HIDDEN neurons, and the second one has NUM_CLASSES outputs with softmax activation, one per class. You might wonder why we are adopting this particular architecture and not another one, perhaps simpler or more complex. Well, indeed the way in which convnet and maxpool operations are composed depends a lot on the specific domain, and there is not necessarily a theoretical motivation explaining the optimal composition. The suggestion is to start with something very simple, check the achieved performance, and then iterate by adding more layers as long as gains are observed and the cost of execution does not increase too much.
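To make the final layer concrete: softmax turns the NUM_CLASSES raw outputs into a probability distribution over the classes, and the predicted digit is the class with the highest probability. The logit values below are hypothetical, chosen only for illustration.

```python
import numpy as np

def softmax(z):
    # subtract the max for numerical stability before exponentiating
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical activations from the NUM_CLASSES = 10 output neurons
logits = np.array([0.2, 1.5, -0.3, 0.1, 3.0, 0.0, -1.2, 0.4, 0.7, 0.05])
probs = softmax(logits)

# the probabilities sum to 1, and the largest one picks the predicted digit
predicted_class = int(np.argmax(probs))
```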

I know it seems a kind of magic, but the important aspect to understand is that even a relatively simple network like this one outperforms traditional machine learning techniques. The model is then compiled using categorical_crossentropy as the loss function and accuracy as the metric. Besides that, an early stopping criterion is also adopted.
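Both ingredients of the compile step can be sketched in a few lines. The cross-entropy function below mirrors what the loss computes (average negative log-likelihood of the true class), and the early-stopping rule is a minimal version of the usual patience-based criterion; the `patience` value is an assumption.

```python
import numpy as np

def categorical_crossentropy(y_true_ohe, y_pred, eps=1e-7):
    # average negative log-likelihood of the true class, per example
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(-np.sum(y_true_ohe * np.log(y_pred), axis=1).mean())

def early_stop_epoch(val_losses, patience=2):
    # stop once the monitored loss has not improved for `patience`
    # consecutive epochs; return the epoch at which training stops
    best, wait = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None  # criterion never triggered

# a confident, correct prediction yields a loss close to zero
loss = categorical_crossentropy(np.array([[0.0, 1.0]]),
                                np.array([[0.0, 1.0]]))
```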