Overview of Experiment

  • The aim of this experiment is to demonstrate how data augmentation can be used as a regularizer to address the overfitting issue that arises when dealing with a limited image dataset, in the context of optical character recognition. The model used in this experiment is Lenet-5 (a variant of the Convolutional Neural Network, CNN).

  • The workflow will follow the diagram below.

workflow.jpg

  • One of the foreseeable challenges in this experiment is determining how to control the number of actual images to sample and the rate of data augmentation. This will be explored in the following sections.

Lenet-5 Architecture

  • The Lenet-5 architecture is one of the oldest variants of the Convolutional Neural Network. It was first developed by Yann LeCun in 1998 and has been used in real-world applications such as reading zip codes for mail sorting in postal services, automating the cheque deposit process, and many more.

Lenet5.jpg

Figure showing the architecture of Lenet-5, taken from (LeCun et al., 1998)

  • The Lenet-5 model takes a 32x32 pixel image as input. In total it employs 3 convolutional layers with 6, 16, and 120 filters respectively, each with a 5x5 filter size. The stride of each convolutional layer is 1, and the activation function is the hyperbolic tangent, tanh. Each of the first two convolutional layers is followed by an average pooling layer with a 2x2 filter size and stride 2. After the third convolutional layer come two fully connected layers. The first fully connected layer has 84 neurons, while the last one is fixed at 10 neurons with a softmax activation, since there are 10 digits (classes). Below is the summary of each layer:
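  • (As a quick sanity check against the table below: a convolutional layer has (filter height × filter width × input channels + 1 bias) × number of filters trainable parameters, so the first convolutional layer has (5×5×1 + 1) × 6 = 156 parameters.)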

Summary_Lenet5_table.jpg

Data preparation

In this section we will perform the following:

  1. Import MNIST dataset
  2. View MNIST dataset
  3. Add padding to dataset
  4. Add channel dimension to dataset
  5. Portion dataset for our experiments

Import MNIST dataset

In [ ]:
from tensorflow.keras import datasets

# get mnist dataset from tensorflow.keras
(X_train,y_train),(X_test,y_test) = datasets.mnist.load_data()

# view dataset size and pixel size
(X_train.shape, y_train.shape), (X_test.shape, y_test.shape)
Out[ ]:
(((60000, 28, 28), (60000,)), ((10000, 28, 28), (10000,)))
  • The MNIST dataset is in numpy array format with 3 dimensions.
  • It contains the size of the dataset, the height of each image, and the width of each image. Note that for most image datasets there is a 4th dimension in the array, the channel (e.g. the RGB colour channels of the images).
  • The numpy shape shows that the MNIST dataset contains 70,000 images, each image of height 28 and width of 28 pixels. It does not contain channel information as the MNIST images are in grayscale.
  • X_train is allocated 60,000 images with their corresponding labels y_train
  • X_test is allocated 10,000 images with their corresponding labels y_test

View MNIST dataset

In [ ]:
import matplotlib.pyplot as plt

# create a loop to display the first 10 images
for i in range(10):
  # create subplots so that images are displayed on the same line
  plt.subplot(1, 10, i + 1)
  plt.imshow(X_train[i], cmap="binary")
  plt.title(y_train[i])
  plt.axis('off')
plt.show()
  • We can see that the images in the MNIST dataset are arranged in random order
  • The images X_train are displayed with their corresponding labels from y_train

Add padding to dataset

In [ ]:
import tensorflow

# Lenet-5 takes 32x32 pixel images, so we need to pad the MNIST images as they are 28x28
X_train = tensorflow.pad(X_train, [[0, 0], [2,2], [2,2]])/255
X_test = tensorflow.pad(X_test, [[0, 0], [2,2], [2,2]])/255
Warning: The padding code should be run only once. Otherwise, new padding will be appended every time it is run.
  • The Lenet-5 model takes 32x32 pixel input images.
  • Since the MNIST images are 28x28, we pad them with 2 pixels on each side of the height and width.
  • The added pixels are given a value of 0, and all pixel values are divided by 255 to scale them into the range [0, 1].
  • The padding here was done using a TensorFlow function, but the same operation can also be done with numpy.pad, as sketched below.
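A minimal numpy-only sketch of the same padding step, assuming it is applied to the original 28x28 arrays (the shape check also guards against the re-running issue in the warning above):

import numpy as np

# pad only if the images are still 28x28, making the cell safe to re-run
if X_train.shape[1] == 28:
    X_train = np.pad(X_train, ((0, 0), (2, 2), (2, 2))) / 255
    X_test = np.pad(X_test, ((0, 0), (2, 2), (2, 2))) / 255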
In [ ]:
# view padded dataset size, height, width
X_train.shape, X_test.shape
Out[ ]:
(TensorShape([60000, 32, 32]), TensorShape([10000, 32, 32]))

Add channel dimension to dataset

In [ ]:
# Lenet-5 takes a dataset with 4 dimensions (size, height, width and channel)
# MNIST images are grayscale, so the dataset has no channel dimension
# Here we will add a dummy channel dimension
X_train = tensorflow.expand_dims(X_train, axis=3, name=None)    #axis = 3 because originally axis is [0,1,2] only
X_test = tensorflow.expand_dims(X_test, axis=3, name=None)
Warning: The code to add the channel dimension should be run only once. Otherwise, a new dimension will be appended every time it is run.
  • CNN models generally take an input array with 4 dimensions.
  • Therefore, the code above adds a dummy channel (4th dimension) to our image dataset, which originally had only 3 dimensions: size, height and width.
  • Adding a dimension to a numpy array can also be done with np.expand_dims, as sketched below.
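A minimal numpy equivalent of the step above, assuming X_train and X_test are still numpy arrays:

# add a trailing channel axis; shape goes from (60000, 32, 32) to (60000, 32, 32, 1)
X_train = np.expand_dims(X_train, axis=3)
X_test = np.expand_dims(X_test, axis=3)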
In [ ]:
# view padded dataset size, height, width and channel
X_train.shape, X_test.shape
Out[ ]:
(TensorShape([60000, 32, 32, 1]), TensorShape([10000, 32, 32, 1]))

Portion dataset for the experiment

  • Now that the dimensions of the dataset have been dealt with, we can portion our dataset into various sizes for our experiment.
  • The idea is to train the model using image datasets of 100, 1,000, and 10,000 images and see what the minimal number of images is that produces a model with favourable loss and accuracy when evaluated against the test dataset X_test
In [ ]:
# we will create training sets of 100, 1000, and 10000 images

# size 100
X_train_100 = X_train[:100,:,:,:] 
y_train_100 = y_train[:100]

# size 1000
X_train_1000 = X_train[550:1550,:,:,:] 
y_train_1000 = y_train[550:1550]

# size 10000
X_train_10000 = X_train[6550:16550,:,:,:] 
y_train_10000 = y_train[6550:16550]

# view the portioned training dataset
print (X_train_100.shape, X_train_1000.shape, X_train_10000.shape)
print (y_train_100.shape, y_train_1000.shape, y_train_10000.shape)
TensorShape([100, 32, 32, 1]) TensorShape([1000, 32, 32, 1]) TensorShape([10000, 32, 32, 1])
(100,) (1000,) (10000,)
In [ ]:
# now we will reserve the last 5000 images from MNIST dataset for validation

X_val = X_train[-5000:,:,:,:] 
y_val = y_train[-5000:]

print (X_val.shape)
print (y_val.shape)
TensorShape([5000, 32, 32, 1])
(5000,)
  • Since the images in the MNIST dataset are arranged in random order, as previously identified, we segregated the first 100 images and their labels into X_train_100, images 550 to 1549 into X_train_1000, and images 6550 to 16549 into X_train_10000.
  • The last 5000 images are reserved as the validation dataset X_val with labels y_val.
  • The validation dataset is helpful for evaluating the model, identifying issues such as overfitting or underfitting, and guiding further parameter tuning.

Building Lenet-5 Model

In [ ]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten, Conv2D, AveragePooling2D, Activation
from tensorflow.keras.utils import plot_model
from tensorflow.keras import optimizers

# define a function to build the Lenet-5 model
def lenet5_model():
  
  return Sequential ([
    Conv2D(6, 5, activation='tanh', input_shape=(32, 32, 1)),
    AveragePooling2D(2),
    Conv2D(16, 5, activation='tanh'),
    AveragePooling2D(2),
    Conv2D(120, 5, activation='tanh'),
    Flatten(),
    Dense(84, activation='tanh'),
    Dense(10, activation='softmax')
  ])
In [ ]:
plot_model(lenet5_model())
Out[ ]:
  • The Lenet-5 model is built according to the architecture discussed in the previous section.
  • Here we defined a python function called lenet5_model that builds and returns the model; a quick way to inspect it is shown below.
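As a usage sketch, Keras can also print the layer output shapes and parameter counts, which can be checked against the summary table in the architecture section:

# build a fresh model and print its layer shapes and parameter counts
model = lenet5_model()
model.summary()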
In [ ]:
# define a function to train the Lenet-5 model
def train_lenet5_model(X_train, y_train, batch_size, epochs, run_name, X_val, y_val, X_test, y_test):
 
  model = lenet5_model()
  model.compile(optimizer='adam',
                loss='sparse_categorical_crossentropy',
                metrics=['accuracy'])
 
  my_callbacks = [
    tensorflow.keras.callbacks.TensorBoard(log_dir='./logs/run_' + run_name)    #save run logs using callbacks at path = /content/logs
    ]
 
  model.fit(x = X_train, 
            y = y_train,
            batch_size = batch_size,
            epochs = epochs, 
            validation_data = (X_val, y_val), 
            callbacks = my_callbacks,
            verbose = 0)    
  # note! verbose = 0 is set since run logs are saved (there's no need to print iteration progress). Loss and accuracy will be plotted and tracked in TensorBoard
 
  print ('\nModel trained using -> ' + run_name + ' evaluated on TEST dataset:')
  model.evaluate(X_test, y_test)
 
  return (model)
  • The model can now be compiled with several arguments defined. adam is chosen as the optimizer. It is one of the more popular gradient descent optimization algorithms that facilitate learning. The optimizer aims at minimizing the loss by finding the optimal way to update the model parameters during backpropagation.
  • sparse_categorical_crossentropy is the loss function used to compute the loss between integer labels and predictions when dealing with 2 or more classes, which applies in our case.
  • Note that we also use a callback function. We use this to save the log of each model training run. More specifically, callbacks are used here to plot the model convergence behaviour in TensorBoard (a visualisation tool from TensorFlow). Other applications for callbacks include ModelCheckpoint (to save the model in h5 format), EarlyStopping (to stop the model training after it has reached convergence), and many more; see the sketch after this list.
  • Once the model is compiled, we then fit / train it using the input data of our choice.
  • All of this is wrapped in the function train_lenet5_model. The user has to define:
      X_train = training data
      y_train = training labels
      batch_size = batch size
      epochs = number of epochs to run
      run_name = name of the folder that stores the log file (if using Google Colab, the log is saved at /content/logs/'run_name')
      X_val = validation data
      y_val = validation labels
      X_test = testing data
      y_test = testing labels

  • Note that we also include model.evaluate in the function, which evaluates model performance against the testing data.
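For illustration, here is a sketch of those alternative callbacks; the file path, monitored metric and patience value are assumptions for the example, not values used in this experiment:

extra_callbacks = [
    # save the best model so far in h5 format (the path is illustrative)
    tensorflow.keras.callbacks.ModelCheckpoint(filepath='./best_model.h5',
                                               save_best_only=True),
    # stop training once val_loss has not improved for 5 consecutive epochs
    tensorflow.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)
]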

Training model using actual dataset

In [ ]:
model1 = train_lenet5_model (X_train_100, y_train_100, 64, 30, 'actual_image_size_100', X_val, y_val, X_test, y_test)
model2 = train_lenet5_model (X_train_1000, y_train_1000, 64, 30, 'actual_image_size_1000', X_val, y_val, X_test, y_test)
model3 = train_lenet5_model (X_train_10000, y_train_10000, 64, 30, 'actual_image_size_10000', X_val, y_val, X_test, y_test)
Model trained using -> actual_image_size_100 evaluated on TEST dataset:
313/313 [==============================] - 3s 10ms/step - loss: 0.9721 - accuracy: 0.7072

Model trained using -> actual_image_size_1000 evaluated on TEST dataset:
313/313 [==============================] - 3s 10ms/step - loss: 0.3108 - accuracy: 0.9085

Model trained using -> actual_image_size_10000 evaluated on TEST dataset:
313/313 [==============================] - 3s 10ms/step - loss: 0.0921 - accuracy: 0.9747
  • We built model1, model2, and model3 using 100, 1000, and 10000 training images respectively. All of this training data consists of actual images from MNIST.
  • Furthermore, batch_size is defined as 64. This means the model parameters are updated after every batch of 64 images; each epoch iterates over the whole training set in batches of 64 (e.g. ceil(10000/64) = 157 updates per epoch for model3).
  • Each model is run for 30 epochs so that the models can be compared fairly when evaluated.
  • run_name is provided to name the folder for the log files.
In [ ]:
# we can display the convergence plot using TensorBoard (visualization toolkit from TensorFlow)
%load_ext tensorboard
%tensorboard --logdir='./logs'

tensorboard_1.jpg

Note: You might need to re-run the whole .ipynb file so that the interactive TensorBoard is re-created. Here we include a screenshot of the TensorBoard for reference.
  • The TensorBoard tool allows us to inspect each model's performance:

convergence_comparison_actualimages.jpg

  • model1 (using 100 training images) shows high loss for both the training and validation datasets and a significant overfitting issue

  • model2 (using 1000 training images) shows significant improvement in both training and validation loss. The overfitting issue seems to have been largely overcome as well.
  • model3 (using 10,000 training images) converges the fastest, at around epoch 10, with minimal loss for both the training and validation datasets. The overfitting issue is successfully eliminated.

Comment:

  • As expected, the model with the largest training dataset performs best in terms of loss and accuracy, and also generalizes well. This means the model is able to predict the unseen data X_test very well, as it has been trained with a sufficient amount of data.
  • In the case where we have limited data, say around 100 actual images, the resulting model will have less desirable performance. Having said that, collecting a vast amount of actual images is a time-consuming, costly, and labour-intensive task.
  • To address the challenge of limited input data, the data augmentation technique has been proposed to diversify the existing data. We will experiment with data augmentation in the next section. We will use only 100 actual images and create synthetic data from them with a suitable set of augmentation parameters.

Creating synthetic dataset (data augmentation)

Determine the augmentation parameter to use

In [ ]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# use the ImageDataGenerator function from tensorflow to perform data augmentation
datagen = ImageDataGenerator(
    rotation_range=45,  # Int. Degree angle range for random rotation - from keras
    shear_range=0.6,    # Float. Shear Intensity (Shear angle in counter-clockwise direction in degrees) - from keras
    zoom_range=0.4)     # Float or [lower, upper]. Range for random zoom. If a float, [lower, upper] = [1-zoom_range, 1+zoom_range] - from keras
  • In the code above we define the rotation, shear and zoom parameters for data augmentation.

  • Due to the slanting variation in handwriting styles, as shown in the picture below (taken from actual images in the MNIST dataset), techniques such as rotation and shearing are applicable. In addition, some handwriting styles can be small or large, so the rescaling (zoom in or out) option can be useful too.

Picture7.jpg

Controlling the actual images to augment, and augmentation volume

In [ ]:
i = 0
for batch in datagen.flow(X_train, y_train, batch_size=10,   #take 10 actual images per batch to perform augmentation
                            save_to_dir='/content/drive/MyDrive/Colab Notebooks/Assignment/test_image',
                            save_format='png'):
  i = i+1
  if i > 2:
    break
  • We first tested data augmentation with the code above to understand how it works.

  • The input data (X_train) is augmented using datagen (created from ImageDataGenerator), and the results are saved in .png format at the specified path: '/content/drive/MyDrive/Colab Notebooks/Assignment/test_image'.

  • With batch_size=10, each iteration takes 10 random images from the input X_train and augments them randomly.

  • Notice that data augmentation is performed through a loop. The loop breaks after 3 iterations, as we specified if i > 2: break. This returns 3 batches of 10 augmented images (30 in total), each transformed randomly with the augmentation parameters (rotation_range, shear_range, zoom_range) chosen previously.

In [ ]:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

img1 = mpimg.imread('/content/drive/MyDrive/Colab Notebooks/Assignment/_7_2861.png')
img2 = mpimg.imread('/content/drive/MyDrive/Colab Notebooks/Assignment/_7_1173.png')
img3 = mpimg.imread('/content/drive/MyDrive/Colab Notebooks/Assignment/_7_3115.png')
images = [img1, img2, img3]
for i in range(3):  
  plt.subplot(1, 3, i+1) 
  imgplot = plt.imshow(images[i], cmap='binary')
  plt.axis('off') 
plt.show()
  • Here we can see 3 different synthetic images of the digit 6, augmented from actual MNIST images.
  • The first image appears to have been rotated and zoomed in, the second sheared, and the third slightly zoomed out.
  • All of these images could resemble the real handwriting of different individuals; this demonstrates how data augmentation can diversify the input dataset used to train the model, given limited actual images.
  • Now that we have established how data augmentation works, we can create a full augmented dataset as input to train our model.

Creating augmented data for model input

In [ ]:
# generate augmented data with control on number of images
import numpy as np

augmented_data = []
augmented_data_label = []

for i in range(10):
  num_augmented = 0
  # augment only the first 100 actual images; with shuffle=False the
  # iterator cycles over these 100 images in order, 10 per batch
  for X_batch, y_batch in datagen.flow(X_train[:100], y_train[:100], batch_size=10, shuffle=False):
      augmented_data.append(X_batch)
      augmented_data_label.append(y_batch)
      num_augmented += 1
      if num_augmented >= 100:
          break
augmented_data = np.concatenate(augmented_data)
augmented_data_label = np.concatenate(augmented_data_label)

augmented_data.shape, augmented_data_label.shape
Out[ ]:
((10000, 32, 32, 1), (10000,))
  • With batch_size=10 and shuffle=False, the code above iterates over the first 100 actual images from X_train in batches of 10, producing a randomly augmented copy of each image on every pass; each inner loop collects 100 batches (1,000 augmented images) before breaking.
  • Since we defined an outer loop that runs 10 times, the augmentation process is repeated 10 times, and the output augmented (synthetic) data totals 10,000 images.
  • Simply put, we sample only 100 actual images from X_train and use the data augmentation technique to produce 10,000 synthetic images from them; the bookkeeping is sketched after this list.
  • Using the code above, we can control the number of actual images taken from MNIST and how much augmented data is produced.
  • The label of each image is preserved, which we can verify by plotting the images with their corresponding labels.
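The arithmetic behind the augmentation volume above, as a small sketch (the variable names are just for illustration):

actual_images = 100                # images sliced from X_train
batch_size = 10
batches_per_run = 100              # inner loop breaks at num_augmented >= 100
outer_runs = 10
total_augmented = outer_runs * batches_per_run * batch_size    # = 10,000 images
variations_per_image = total_augmented // actual_images        # = 100 variations each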
In [ ]:
# to view the augmented data, we need to reduce its dimension to 3 (as required by plt.imshow function)
images_3_dim = np.squeeze(augmented_data, axis=3) 
 
# plot the first 30 augmented images, 10 per row (one figure per row)
for row in range(3):
  for col in range(10):
    i = row * 10 + col
    # create subplots so that images are displayed on the same line
    plt.subplot(1, 10, col + 1)
    plt.imshow(images_3_dim[i], cmap="binary")
    plt.title(augmented_data_label[i])
    plt.axis('off')
  plt.show()
  • It appears that the augmented images are correctly paired with their corresponding labels.
  • Hence, we can use this augmented dataset augmented_data and its labels augmented_data_label to train our next model.

Training model using synthetic dataset (created from data augmentation)

In [ ]:
model6 = train_lenet5_model(augmented_data, 
                   augmented_data_label, 
                   64, 
                   30, 
                   'synthetic_image_size_10000',
                   X_val,
                   y_val,
                   X_test,
                   y_test)
Model trained using -> synthetic_image_size_10000 evaluated on TEST dataset:
313/313 [==============================] - 3s 8ms/step - loss: 0.1980 - accuracy: 0.9499
  • model6 is trained using augmented_data of size 10,000, validated against X_val, and tested against X_test (both the validation and test datasets are actual images from MNIST).
  • The batch size and number of epochs are kept the same as for the models trained with actual images, for the sake of comparison.
  • The log for the model training is saved in /content/logs/run_synthetic_image_size_10000
In [ ]:
%load_ext tensorboard
%tensorboard --logdir='./logs'
The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard
Reusing TensorBoard on port 6006 (pid 11279), started 0:25:53 ago. (Use '!kill 11279' to kill it.)

tensorboard_2.jpg

Note: You might need to re-run the whole .ipynb file so that the interactive TensorBoard is re-created. Here we include a screenshot of the TensorBoard for reference.

Comment:

  • The convergence plot shows that model6, trained with 10,000 synthetic images (from only 100 actual images), has significantly lower loss and higher accuracy. Although overfitting is still observed, it is much more acceptable compared to using only 100 actual images.
  • Having said that, the model's performance is still not as good as that of model3, which was trained using 10,000 actual images.

convergence_comparison.jpg

In [ ]:
# generate the confusion matrix for model6
import tensorflow

y_pred = model6.predict(X_test)
predicted_categories = tensorflow.argmax(y_pred, axis=1)
print ('Confusion Matrix:')
# note: tf.math.confusion_matrix expects (labels, predictions); passing the
# predictions first means rows are predicted classes and columns are true labels
cm = tensorflow.math.confusion_matrix(predicted_categories, y_test)
cm
Confusion Matrix:
Out[ ]:
<tf.Tensor: shape=(10, 10), dtype=int32, numpy=
array([[ 952,    0,    0,    0,    1,   11,    3,    0,   10,    2],
       [   0, 1118,    0,    0,    3,    0,    4,    3,    2,    2],
       [   4,    6,  992,    9,    2,    1,    2,   48,   20,    0],
       [   0,    2,    9,  959,    1,    9,    0,    9,    6,    5],
       [   0,    0,    6,    1,  906,    0,    3,    2,   12,   10],
       [   1,    1,    3,   13,    2,  854,    8,    1,   10,   13],
       [   7,    3,    0,    0,    6,   12,  938,    0,    7,    0],
       [   2,    2,   12,   13,    6,    1,    0,  956,    2,   10],
       [   1,    2,    2,    9,    8,    2,    0,    1,  862,    5],
       [  13,    1,    8,    6,   47,    2,    0,    8,   43,  962]],
      dtype=int32)>
In [ ]:
# plot confusion matrix
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(10,10))         # Sample figsize in inches
sns.heatmap(cm, annot=True, linewidths=.5, ax=ax)
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f59f1cb9b50>
  • The confusion matrix shows that model6, trained with 10,000 synthetic images, predicts all the classes (different handwritten digits) well. This suggests that the data augmentation process created a balanced dataset, as samples are randomly chosen and augmented for each batch. A quick per-class recall check is sketched below.
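As a small sketch of how to read the matrix above: since the columns hold the true labels with the argument order used, per-class recall is the diagonal divided by the column sums (the variable names are illustrative):

import numpy as np

cm_np = cm.numpy()
# recall for class k = correctly predicted k / all true k (column sum)
recall = cm_np.diagonal() / cm_np.sum(axis=0)
print(np.round(recall, 3))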

Discussion and Conclusion

  • In this experiment we applied the Lenet-5 model, one of the earliest variants of a CNN, to a text image dataset.
  • We first trained different models using 100, 1,000, and 10,000 actual images from the MNIST dataset. As expected, the model trained with just 100 input images overfitted (loss: 0.9721 - accuracy: 0.7072), which means the model is only good at predicting the limited images it was trained on but generalizes poorly (it cannot properly predict unseen data). In contrast, the model trained with 10,000 actual images converged to loss: 0.0921 - accuracy: 0.9747.
  • Having said that, in reality, collecting and compiling a vast amount of handwritten text is often a time-consuming and costly process. Much of the time one must make do with a limited number of samples.
  • To address the issues of dealing with a limited dataset, the data augmentation technique was applied. It is a way to generate synthetic (augmented) data from existing images by manipulating the pixel values in the array. In the case of text images, only certain manipulations are suitable. As demonstrated in this experiment, we used rotation and shearing to capture the patterns in most individuals' handwriting styles. We know this from visually inspecting the MNIST dataset and identifying that variation in styles is dominated by different slanting of the characters. In addition, handwriting styles can also vary in size, so we used the zoom parameter to incorporate such variation into the augmented dataset.
  • Moreover, we also demonstrated how to control the amount of augmented data by defining the batch size and the number of iterations for which data augmentation is performed.
  • We generated 10,000 synthetic images from only 100 actual images. For each actual image sampled, 100 variations of it are created randomly based on the defined augmentation parameters. The intuition is that training with 10,000 synthetic images acts as a regularizer to address the overfitting issue we previously saw when training the model with just 100 actual images.
  • The results show that the model trained using 10,000 synthetic images performs well (loss: 0.1980 - accuracy: 0.9499) when tested on the 10,000 actual test images. The model performs significantly better than the model trained with just 100 actual images. However, the performance is still not as good as that of the model trained with 10,000 actual images. We suspect there is still variation in handwriting styles that was not sampled for data augmentation.

Reference

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324