Sunday, May 8, 2016

Demystifying deep learning: explaining a LeNet like code

In a previous post we introduced a LeNet-like convnet. Now let's discuss the code. Keras implements both convolutional and max-pooling modules, together with L1 and L2 regularizers and several optimizers such as Stochastic Gradient Descent, Adam and RMSprop. In the following code, print_Graph is a utility function used to plot the results of different experiments when we change the hyper-parameters. Our convnet is defined by convNet_LeNet, a function accepting multiple input parameters.

Parameters

NORMALIZE: Whether or not the MNIST pixel values should be divided by 255, which is the maximum value for a pixel
BATCH_SIZE: The mini-batch size used for training the model. The idea is that the first BATCH_SIZE examples are used to compute an update of the weights, then the next BATCH_SIZE examples are considered, and so on
NUM_EPOCHS: The number of training epochs used during the experiments
NUM_FILTERS: The number of convolutional filters / feature maps applied
NUM_POOL: The side length for max pooling operations
NUM_CONV: The side length for convolution operations
DROPOUT_RATE: The dropout rate. It acts as a form of regularization by randomly dropping connections in the network during training
NUM_HIDDEN: Number of hidden neurons used in the dense network after applying convolutional and maxpooling operations
VALIDATION_SPLIT: The fraction of training data held out as validation data (for example, 0.2 means 20%)
OPTIMIZER: One among SGD, Adam and RMSprop
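REGULARIZER: Either L1, or L2, or L1+L2 (ElasticNet). Note that in the code from the previous post this is not an input parameter: an L2 regularizer with weight 0.01 is hard-coded on the output layer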
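
To make BATCH_SIZE and VALIDATION_SPLIT a bit more concrete, here is a minimal sketch of the arithmetic behind one epoch. It assumes the default values from the code (60,000 training images, BATCH_SIZE=128, VALIDATION_SPLIT=0.2). The helper updates_per_epoch is a hypothetical name I introduce only for illustration; it is not part of Keras.

```python
import math

def updates_per_epoch(num_examples, batch_size, validation_split):
    """Return (train_examples, val_examples, weight_updates) for one epoch.

    Hypothetical helper for illustration only, not a Keras function.
    """
    # Part of the training data is held out for validation.
    num_val = int(num_examples * validation_split)
    num_train = num_examples - num_val
    # One weight update per mini-batch; the last batch may be smaller.
    num_updates = math.ceil(num_train / batch_size)
    return num_train, num_val, num_updates

print(updates_per_epoch(60000, 128, 0.2))
```

With these defaults, 48,000 images are left for training and 12,000 for validation, so each epoch performs 375 weight updates. Doubling BATCH_SIZE roughly halves the number of updates per epoch.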

At the beginning the MNIST images are loaded: the training set contains 60,000 examples and the test set 10,000 examples. Then, the data is converted into float32, the format used for GPU computations. Optionally, the data is normalized. The true labels for training and test are converted from the original digits 0-9 into a One Hot Encoding (OHE) representation, a prerequisite for the following classification.
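
The two preprocessing steps can be sketched with plain NumPy. This is only an illustration on tiny made-up arrays, not real MNIST data; normalize_pixels and one_hot are hypothetical helper names, with one_hot standing in for what np_utils.to_categorical does in the real code.

```python
import numpy as np

def normalize_pixels(images):
    """Cast to float32 and scale pixel values from [0, 255] to [0, 1]."""
    return images.astype("float32") / 255

def one_hot(labels, num_classes):
    """Turn integer labels into one-hot rows.

    Illustrative stand-in for np_utils.to_categorical.
    """
    encoded = np.zeros((len(labels), num_classes), dtype="float32")
    # Put a single 1 in the column given by each label.
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

# Tiny made-up inputs, not real MNIST data.
pixels = np.array([[0, 128, 255]], dtype="uint8")
labels = np.array([4, 9])

print(normalize_pixels(pixels))
print(one_hot(labels, 10))
```

Each label becomes a row of ten values with a single 1 in the position of the digit, which is the shape the softmax output layer is trained against.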

Then, we implement the LeNet-like architecture proper. An initial convolution of size NUM_CONV x NUM_CONV is applied. It produces NUM_FILTERS outputs and uses a rectified linear activation function. The output of this layer is passed through a similar convolution layer and a subsequent max-pooling layer of size NUM_POOL x NUM_POOL. To avoid overfitting, an additional module drops out some connections with rate DROPOUT_RATE. This initial block is then followed by a second identical block. Notice that Keras automatically computes the dimensions of the data as it moves through and is transformed by the different blocks.
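
To see what Keras works out for us, here is a minimal sketch of the shape bookkeeping for the two blocks. It assumes unpadded ("valid") convolutions with stride 1 and non-overlapping pooling, which to my knowledge is what these layers default to; the function names are hypothetical and only serve the illustration.

```python
def conv_output_side(side, kernel):
    """Side length after a 'valid' convolution with stride 1 (no padding)."""
    return side - kernel + 1

def pool_output_side(side, pool):
    """Side length after non-overlapping max pooling."""
    return side // pool

def lenet_like_flatten_size(side=28, num_filters=32, num_conv=3, num_pool=2):
    """Number of values fed to the first dense layer (hypothetical helper)."""
    for _ in range(2):  # two identical conv-conv-pool blocks
        side = conv_output_side(side, num_conv)
        side = conv_output_side(side, num_conv)
        side = pool_output_side(side, num_pool)
    return num_filters * side * side

print(lenet_like_flatten_size())
```

With the defaults the side length goes 28, 26, 24, 12, 10, 8, 4, so the Flatten layer hands 32 x 4 x 4 = 512 values to the dense part. If an intermediate side length reaches zero or less, the chosen NUM_CONV and NUM_POOL shrink the image away before the dense layers are reached.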

After the convnet layers proper, two dense layers are introduced. The first one is a layer with NUM_HIDDEN neurons, and the second one has NUM_CLASSES outputs, which a softmax activation turns into a probability for each class. You might wonder why we are adopting this particular architecture and not another one, perhaps simpler or more complex. Well, indeed the way in which convolution and max-pooling operations are composed depends a lot on the specific domain, and there is not necessarily a theoretical motivation explaining the optimal composition. The suggestion is to start with something very simple, check the achieved performance, and then iterate by adding more layers as long as gains are observed and the cost of execution does not increase too much.
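
The softmax step on the output layer can be sketched in a few lines of NumPy. This is a generic illustration on made-up scores, not the code Keras runs internally.

```python
import numpy as np

def softmax(scores):
    """Map a vector of raw scores to probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

# Made-up scores for the 10 digit classes, for illustration only.
scores = np.array([0.1, 0.3, 0.2, 0.0, 2.5, 0.1, 0.4, 0.2, 0.3, 1.9])
probs = softmax(scores)

print(probs.sum())            # close to 1.0
print(int(np.argmax(probs)))  # index of the most likely digit
```

The predicted digit is simply the index with the highest probability, which is what the accuracy metric compares against the true label.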

I know it seems a kind of magic, but the important aspect to understand is that even a relatively simple network like this one can outperform traditional machine learning techniques on this task. The model is then compiled using categorical_crossentropy as the loss function and accuracy as the metric. Besides that, an early stopping criterion is adopted: the EarlyStopping callback monitors val_loss with patience=2, so training halts once the validation loss stops improving.
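
The idea behind early stopping with a patience value can be sketched as below. This is a simplified rule written for illustration, on made-up validation losses; the exact bookkeeping inside the Keras EarlyStopping callback may differ, and early_stopping_epoch is a hypothetical name.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the 0-based epoch at which training would stop, or None.

    Simplified patience rule for illustration; the exact bookkeeping in
    the Keras EarlyStopping callback may differ.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss  # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1    # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

# Made-up validation losses, not results from the experiments.
losses = [0.30, 0.12, 0.08, 0.09, 0.10, 0.07]
print(early_stopping_epoch(losses))
```

In this made-up run the loss stops improving after the third epoch, so training is cut short two epochs later and the last value is never reached. A larger patience trades extra training time for a lower risk of stopping on a temporary plateau.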

Saturday, May 7, 2016

Demystifying deep learning series: hands on experimental sessions with Convnets

This is the first of a series of hands-on posts where I'll explain deep learning step by step and with a lot of experimental results. Let's start from a classical but hard enough problem: recognizing handwritten digits.

How many times have you thought "is that a 4 or a 9?" when your best mate wrote a number on a piece of paper? Well, if that's hard for humans, how could it possibly be simpler for a computer to learn? Welcome to the kingdom of deep learning, where certain tasks can be taught to computers with super-human capacity. And when I say "taught" I mean it. Here, we don't code algorithms for solving problems. No, here we code algorithms for learning how to solve a problem. Then, we take a bunch of examples and the computer will learn from them. Kind of cool, no?

So let's start.

First, we need a dataset of handwritten characters and luckily we have one handy. That's MNIST (http://yann.lecun.com/exdb/mnist/), which is produced by Yann LeCun, the guru of deep learning, currently at Facebook. He invented something known as ConvNets, which broke previous records in many different application domains. I think he will get the Turing Award one day. Convnets are simple and effective, as we will see in follow-up posts.

Second, we need a high-level library for coding deep learning in a simple and effective way. Here we are super-lucky because in the last year there has been a Cambrian explosion of deep learning libraries, with all the big players contributing, from Google to Facebook, to Microsoft, to the academic world. After testing many (Theano, Google's TensorFlow, Lasagne, Blocks, Neon) I decided to go for Keras because it is clean and minimalist. Plus, it runs on top of Theano and TensorFlow, which are the state of the art today, and you can switch the backend transparently. Keras supports both CPU and GPU computation.

Third, let's go straight to some code which I wrote and which reaches an accuracy of >98%.

import numpy as np
import matplotlib.pyplot as plt
import time
np.random.seed(1111)  # for reproducibility
 
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.convolutional import Convolution2D, MaxPooling2D
from keras.utils import np_utils
from keras.regularizers import l2, activity_l2
from keras.utils.visualize_util import plot
from keras.optimizers import SGD, Adam, RMSprop
from keras.callbacks import EarlyStopping

import inspect

#
# save the graph produced by the experiment
#
def print_Graph(
 # Training log
 fitlog, 
 # elapsed time
 elapsed, 
 # input parameters for the experiment
 args, 
 # input values for the experiment
 values):

 experiment_label = "\n".join(['%s=%s' % (i, values[i]) for i in args])
 experiment_file = experiment_label+"-Time= %02d" % elapsed + "sec"
 experiment_file = experiment_file.replace("\n", "-")+'.png'

 fig = plt.figure(figsize=(6, 3))
 plt.plot(fitlog.history["val_acc"])
 plt.title('val_accuracy')
 plt.ylabel('val_accuracy')
 plt.xlabel('iteration')
 fig.text(.7,.15,experiment_label, size='6')
 plt.savefig(experiment_file, format="png")

#
# A LeNet-like convnet for classifying MNIST handwritten characters 28x28
#
def convNet_LeNet(

 VERBOSE=1,
 # normalize
 NORMALIZE = True,
 # Network Parameters
 BATCH_SIZE = 128,
 NUM_EPOCHS = 20,
 # Number of convolutional filters 
 NUM_FILTERS = 32,
 # side length of maxpooling square
 NUM_POOL = 2,
 # side length of convolution square
 NUM_CONV = 3,
 # dropout rate for regularization
 DROPOUT_RATE = 0.5,
 # hidden number of neurons first layer
 NUM_HIDDEN = 128,
 # validation data
 VALIDATION_SPLIT=0.2, # 20%
 # optimizer used
 OPTIMIZER = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
 ): 

 # Output classes, number of MNIST DIGITS
 NUM_CLASSES = 10
 # Shape of an MNIST digit image
 SHAPE_X, SHAPE_Y = 28, 28
 # Channels on MNIST
 IMG_CHANNELS = 1

 # LOAD the MNIST DATA split in training and test data
 (X_train, Y_train), (X_test, Y_test) = mnist.load_data()
 X_train = X_train.reshape(X_train.shape[0], 1, SHAPE_X, SHAPE_Y)
 X_test = X_test.reshape(X_test.shape[0], 1, SHAPE_X, SHAPE_Y)

 # convert in float32 representation for GPU computation
 X_train = X_train.astype("float32")
 X_test = X_test.astype("float32")

 if (NORMALIZE):
  # NORMALIZE each pixel by dividing by max_value=255
  X_train /= 255
  X_test /= 255
 print('X_train shape:', X_train.shape)
 print(X_train.shape[0], 'train samples')
 print(X_test.shape[0], 'test samples')
  
 # KERAS needs to represent each output class into OHE representation
 Y_train = np_utils.to_categorical(Y_train, NUM_CLASSES)
 Y_test = np_utils.to_categorical(Y_test, NUM_CLASSES)

 nn = Sequential()
  
 #FIRST LAYER OF CONVNETS, POOLING, DROPOUT
 #  apply a NUM_CONV x NUM_CONV convolution with NUM_FILTERS output
 #  for the first layer it is also required to define the input shape
 #  activation function is rectified linear 
 nn.add(Convolution2D(NUM_FILTERS, NUM_CONV, NUM_CONV, 
  input_shape=(IMG_CHANNELS, SHAPE_X, SHAPE_Y) ))
 nn.add(Activation('relu'))
 nn.add(Convolution2D(NUM_FILTERS, NUM_CONV, NUM_CONV))
 nn.add(Activation('relu'))
 nn.add(MaxPooling2D(pool_size = (NUM_POOL, NUM_POOL)))
 nn.add(Dropout(DROPOUT_RATE))

 #SECOND LAYER OF CONVNETS, POOLING, DROPOUT 
 #  apply a NUM_CONV x NUM_CONV convolution with NUM_FILTERS output
 nn.add(Convolution2D( NUM_FILTERS, NUM_CONV, NUM_CONV))
 nn.add(Activation('relu'))
 nn.add(Convolution2D(NUM_FILTERS, NUM_CONV, NUM_CONV))
 nn.add(Activation('relu'))
 nn.add(MaxPooling2D(pool_size = (NUM_POOL, NUM_POOL) ))
 nn.add(Dropout(DROPOUT_RATE))
  
 # FLATTEN the shape for dense connections 
 nn.add(Flatten())
  
 # FIRST HIDDEN LAYER OF DENSE NETWORK
 nn.add(Dense(NUM_HIDDEN))  
 nn.add(Activation('relu'))
 nn.add(Dropout(DROPOUT_RATE))          

 # OUTPUT LAYER with NUM_CLASSES OUTPUTS
 # ACTIVATION IS SOFTMAX, REGULARIZATION IS L2
 nn.add(Dense(NUM_CLASSES, W_regularizer=l2(0.01) ))
 nn.add(Activation('softmax') )

 #summary
 nn.summary()
 #plot the model
 plot(nn)

 # set an early-stopping value
 early_stopping = EarlyStopping(monitor='val_loss', patience=2)

 # COMPILE THE MODEL
 #   loss_function is categorical_crossentropy
 #   optimizer is parametric
 nn.compile(loss='categorical_crossentropy', 
  optimizer=OPTIMIZER, metrics=["accuracy"])

 start = time.time()
 # FIT THE MODEL WITH VALIDATION DATA
 fitlog = nn.fit(X_train, Y_train, \
  batch_size=BATCH_SIZE, nb_epoch=NUM_EPOCHS, \
  verbose=VERBOSE, validation_split=VALIDATION_SPLIT, \
  callbacks=[early_stopping])
 elapsed = time.time() - start

 # Test the network
 results = nn.evaluate(X_test, Y_test, verbose=VERBOSE)
 print('accuracy:', results[1])

 # just to get the list of input parameters and their value
 frame = inspect.currentframe()
 args, _, _, values = inspect.getargvalues(frame)
 # used for printing pretty arguments

 print_Graph(fitlog, elapsed, args, values)

 return fitlog  

# 2 epochs
#log = convNet_LeNet(OPTIMIZER = 'Adam', NUM_EPOCHS=2)
#print(log.history)
# 20 epochs
#log = convNet_LeNet(OPTIMIZER = 'Adam', NUM_EPOCHS=20)
#print(log.history)
# default optimizer = SGD
#log = convNet_LeNet(NUM_EPOCHS=20)
#print(log.history)
# default optimizer = RMSProp
#log = convNet_LeNet(OPTIMIZER=RMSprop(), NUM_EPOCHS=20)
#print(log.history)
## default optimizer 
#log = convNet_LeNet(OPTIMIZER='Adam', DROPOUT_RATE=0)
#print(log.history)
# default optimizer 
#log = convNet_LeNet(OPTIMIZER='Adam', DROPOUT_RATE=0.1)
#print(log.history)
# default optimizer 
#log = convNet_LeNet(OPTIMIZER='Adam', DROPOUT_RATE=0.2)
#print(log.history)
# default optimizer 
#log = convNet_LeNet(OPTIMIZER='Adam', DROPOUT_RATE=0.4)
#print(log.history)
# default optimizer 
#log = convNet_LeNet(OPTIMIZER='Adam', BATCH_SIZE=64)
#print(log.history)
#log = convNet_LeNet(OPTIMIZER='Adam', BATCH_SIZE=128)
#print(log.history)
#log = convNet_LeNet(OPTIMIZER='Adam', BATCH_SIZE=256)
#print(log.history)
#log = convNet_LeNet(OPTIMIZER='Adam', BATCH_SIZE=512)
#print(log.history)
#log = convNet_LeNet(OPTIMIZER='Adam', BATCH_SIZE=1024)
#print(log.history)
#log = convNet_LeNet(OPTIMIZER='Adam', BATCH_SIZE=2048)
#print(log.history)
#
#log = convNet_LeNet(OPTIMIZER='Adam', BATCH_SIZE=4096)
#print(log.history)
#log = convNet_LeNet(OPTIMIZER='Adam', VALIDATION_SPLIT=0.8)
#print(log.history)
#log = convNet_LeNet(OPTIMIZER='Adam', VALIDATION_SPLIT=0.6)
#print(log.history)
#log = convNet_LeNet(OPTIMIZER='Adam', VALIDATION_SPLIT=0.4)
#print(log.history)
#log = convNet_LeNet(OPTIMIZER='Adam', VALIDATION_SPLIT=0.2)
#print(log.history)
#log = convNet_LeNet(OPTIMIZER='Adam', VALIDATION_SPLIT=0.2, NORMALIZE=False)
#print(log.history)
#log = convNet_LeNet(OPTIMIZER='Adam', VALIDATION_SPLIT=0.2, NUM_FILTERS=64)
#print(log.history)
log = convNet_LeNet(OPTIMIZER='Adam', NUM_FILTERS=128)
print(log.history)
# log = convNet_LeNet(OPTIMIZER='Adam', NUM_FILTERS=256)
# print(log.history)
# x log = convNet_LeNet(OPTIMIZER='Adam', NUM_POOL=4)
# x print(log.history)
# log = convNet_LeNet(OPTIMIZER='Adam', NUM_POOL=8)
# print(log.history)
#log = convNet_LeNet(OPTIMIZER='Adam', NUM_CONV=4)
#print(log.history)
# x log = convNet_LeNet(OPTIMIZER='Adam', NUM_CONV=8)
# x print(log.history)
#log = convNet_LeNet(OPTIMIZER='Adam', NUM_HIDDEN=32)
#print(log.history)
#log = convNet_LeNet(OPTIMIZER='Adam', NUM_HIDDEN=64)
#print(log.history)
#log = convNet_LeNet(OPTIMIZER='Adam', NUM_HIDDEN=256)
#print(log.history)
#log = convNet_LeNet(OPTIMIZER='Adam', NUM_HIDDEN=512)
#print(log.history)
#log = convNet_LeNet(OPTIMIZER='Adam', NUM_HIDDEN=1024)
#print(log.history)



 # VERBOSE=1,
 # # normlize
 # NORMALIZE = True,
 # # Network Parameters
 # BATCH_SIZE = 128,
 # NUM_EPOCHS = 100,
 # # Number of convolutional filters 
 # NUM_FILTERS = 32,
 # # side length of maxpooling square
 # NUM_POOL = 2,
 # # side length of convolution square
 # NUM_CONV = 3,
 # # dropout rate for regularization
 # DROPOUT_RATE = 0.5,
 # # hidden number of neurons first layer
 # N_HIDDEN = 128,
 # # validation data
 # VALIDATION_SPLIT=0.2, # 20%
 # # optimizer used
 # OPTIMIZER = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)


#plt.show()

The next post describes the code. Then, you will see dozens of experiments exploring the hyper-parameter space and inferring some rules of thumb for fine-tuning our deep learning nets.

Stay tuned: during the next months we will see more than 20 nets for deep learning in different contexts, showing super-human capacity.