The aim of this experiment is to demonstrate how data augmentation can be used as a regularizer to address overfitting when working with a limited image dataset in the context of optical character recognition. The model used in this experiment is LeNet-5 (a variant of the Convolutional Neural Network, CNN).
The workflow will follow the diagram below.
Figure showing the architecture of LeNet-5, taken from (LeCun et al., 1998)
In this section we will perform the following:
from tensorflow.keras import datasets
# get mnist dataset from tensorflow.keras
(X_train,y_train),(X_test,y_test) = datasets.mnist.load_data()
# view dataset size and pixel size
(X_train.shape, y_train.shape), (X_test.shape, y_test.shape)
X_train holds 60,000 images with their corresponding labels in y_train.
X_test holds 10,000 images with their corresponding labels in y_test.
import matplotlib.pyplot as plt
# create a loop to display 10 images
count = 0
for i in range(10):
    count = count + 1
    images = X_train[i]
    # create subplot so that images are displayed in same line
    plt.subplot(1, 10, count)
    plt.imshow(images, cmap="binary")
    plt.title(y_train[i])
    plt.axis('off')
plt.show()
The first 10 images of X_train are displayed with their corresponding labels from y_train.
import tensorflow
# LeNet-5 model takes 32x32 pixel images, so we need to pad the MNIST images as they are 28x28
X_train = tensorflow.pad(X_train, [[0, 0], [2,2], [2,2]])/255
X_test = tensorflow.pad(X_test, [[0, 0], [2,2], [2,2]])/255
The same padding could also be done with numpy.pad.
# view padded dataset size, height, width
X_train.shape, X_test.shape
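The padding and scaling step can be checked on a small synthetic array before running it on the real data. The sketch below uses numpy.pad on a random stand-in batch (the array `batch` is made up for illustration and is not MNIST data); it only illustrates the shape change from 28x28 to 32x32 and the scaling to the range [0, 1].

```python
import numpy as np

# Synthetic stand-in for a small MNIST batch: 3 greyscale images of 28x28 pixels.
batch = np.random.randint(0, 256, size=(3, 28, 28), dtype=np.uint8)

# Pad 2 zero pixels on each side of the height and width axes;
# the first axis (number of images) is left untouched.
padded = np.pad(batch, [(0, 0), (2, 2), (2, 2)], mode="constant", constant_values=0)

# Scale pixel values from [0, 255] to [0, 1].
scaled = padded / 255.0

print(padded.shape)
print(scaled.min(), scaled.max())
```

The original pixels sit in the centre of the padded image, so the digit itself is not resized or distorted.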
# Lenet-5 model takes dataset with 4 dimensions (size, height, width and channel)
# MNIST dataset is greyscale, whereby it doesn't contain channel dimension
# Here we will add a dummy channel dimension
X_train = tensorflow.expand_dims(X_train, axis=3, name=None) #axis = 3 because originally axis is [0,1,2] only
X_test = tensorflow.expand_dims(X_test, axis=3, name=None)
The channel dimension could also be added with np.expand_dims.
# view padded dataset size, height, width and channel
X_train.shape, X_test.shape
X_train and X_test now each carry a trailing channel dimension of size 1.
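As a small illustration of what adding a dummy channel dimension does to the shape, the sketch below uses NumPy on an array of zeros (a stand-in, not the actual dataset). It also shows that np.squeeze reverses the operation, which is useful later when images have to be plotted.

```python
import numpy as np

# Stand-in for 5 padded greyscale images: (size, height, width)
images = np.zeros((5, 32, 32), dtype=np.float32)

# Add a channel axis at position 3: (size, height, width, channel)
with_channel = np.expand_dims(images, axis=3)

# Indexing with np.newaxis gives the same result
same_result = images[..., np.newaxis]

# np.squeeze removes the size-1 channel axis again
restored = np.squeeze(with_channel, axis=3)

print(images.shape, with_channel.shape, restored.shape)
```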
# we will create training subsets containing 100, 1000 and 10000 images
# size 100
X_train_100 = X_train[:100,:,:,:]
y_train_100 = y_train[:100]
# size 1000
X_train_1000 = X_train[550:1550,:,:,:]
y_train_1000 = y_train[550:1550]
# size 10000
X_train_10000 = X_train[6550:16550,:,:,:]
y_train_10000 = y_train[6550:16550]
# view the portioned training dataset
print (X_train_100.shape, X_train_1000.shape, X_train_10000.shape)
print (y_train_100.shape, y_train_1000.shape, y_train_10000.shape)
# now we will reserve the last 5000 images from MNIST dataset for validation
X_val = X_train[-5000:,:,:,:]
y_val = y_train[-5000:]
print (X_val.shape)
print (y_val.shape)
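A quick way to confirm that the training subsets and the validation set do not share images is to compare their index ranges. The sketch below assumes the full MNIST training set of 60,000 images and uses a hypothetical helper `overlap`; it only works on index ranges and does not touch the data.

```python
# Index ranges used by the slices above, assuming 60,000 training images in total.
n_total = 60_000
subsets = {
    "X_train_100": range(0, 100),
    "X_train_1000": range(550, 1550),
    "X_train_10000": range(6550, 16550),
    "X_val": range(n_total - 5000, n_total),
}

def overlap(a, b):
    """Number of indices shared by two contiguous ranges (hypothetical helper)."""
    return max(0, min(a.stop, b.stop) - max(a.start, b.start))

sizes = {name: len(r) for name, r in subsets.items()}
val_overlaps = {
    name: overlap(r, subsets["X_val"])
    for name, r in subsets.items()
    if name != "X_val"
}

print(sizes)
print(val_overlaps)
```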
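The first 100 images are placed in X_train_100, images 550 to 1549 in X_train_1000, and images 6550 to 16549 in X_train_10000.
The last 5,000 images are reserved for validation as X_val with labels y_val.
from tensorflow.keras.models import Sequential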
from tensorflow.keras.layers import Dense, Dropout, Flatten, Conv2D, AveragePooling2D, Activation
from tensorflow.keras.utils import plot_model
from tensorflow.keras import optimizers
# define a function to build the LeNet-5 model
def lenet5_model():
    return Sequential([
        Conv2D(6, 5, activation='tanh', input_shape=(32, 32, 1)),
        AveragePooling2D(2),
        #Activation('sigmoid'),
        Conv2D(16, 5, activation='tanh'),
        AveragePooling2D(2),
        #Activation('sigmoid'),
        Conv2D(120, 5, activation='tanh'),
        Flatten(),
        Dense(84, activation='tanh'),
        Dense(10, activation='softmax')
    ])
plot_model(lenet5_model())
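To see why the network expects 32x32 inputs, it helps to trace the output shape and parameter count of each layer by hand. The sketch below does this in plain Python, assuming the Keras defaults for the layers above (stride 1 and no padding for Conv2D, pool size equal to stride for AveragePooling2D). The helper names are hypothetical and the numbers are derived from the layer definitions, not read from a trained model.

```python
def conv_valid(side, kernel):
    """Output side length of an unpadded, stride-1 convolution."""
    return side - kernel + 1

def conv_params(kernel, in_channels, filters):
    """Weights plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters

def dense_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out

side, channels = 32, 1
summary = []

# Three 5x5 convolutions; the first two are followed by 2x2 average pooling.
for filters, pool in [(6, True), (16, True), (120, False)]:
    params = conv_params(5, channels, filters)
    side, channels = conv_valid(side, 5), filters
    summary.append((f"Conv2D({filters}, 5)", (side, side, channels), params))
    if pool:
        side //= 2
        summary.append(("AveragePooling2D(2)", (side, side, channels), 0))

flat = side * side * channels
summary.append(("Flatten", (flat,), 0))
summary.append(("Dense(84)", (84,), dense_params(flat, 84)))
summary.append(("Dense(10)", (10,), dense_params(84, 10)))

total_params = sum(p for _, _, p in summary)
for name, shape, params in summary:
    print(f"{name:22s} {str(shape):14s} {params}")
print("total:", total_params)
```

With a 32x32 input the third convolution reduces the feature map to exactly 1x1x120, which is why the 28x28 MNIST images are padded first.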
The model definition is stored in the function lenet5_model.
# define a function to train the LeNet-5 model
def train_lenet5_model(X_train, y_train, batch_size, epochs, run_name, X_val, y_val, X_test, y_test):
    model = lenet5_model()
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    my_callbacks = [
        tensorflow.keras.callbacks.TensorBoard(log_dir='./logs/run_' + run_name)  # save run logs using callbacks at path = /content/logs
    ]
    model.fit(x=X_train,
              y=y_train,
              batch_size=batch_size,
              epochs=epochs,
              validation_data=(X_val, y_val),
              callbacks=my_callbacks,
              verbose=0)
    # note! set verbose = 0 since run logs are saved (there's no need to print iteration progress). Loss and accuracy will be plotted and tracked in TensorBoard
    print('\nModel trained using -> ' + run_name + ' evaluated on TEST dataset:')
    model.evaluate(X_test, y_test)
    return model
adam is chosen as the optimizer. It is one of the more popular gradient descent optimization algorithms. The optimizer aims to minimize the loss by deciding how the model parameters are updated during backpropagation.
sparse_categorical_crossentropy is the loss function used to compute the loss between labels and predictions when there are two or more classes and the labels are given as integers, which applies in our case.
We also pass a callback to the training. We use this to save the log of each model training run. In our case, more specifically, callbacks are used to plot the model convergence behaviour using TensorBoard (a visualisation tool from TensorFlow). Other applications for callbacks include ModelCheckpoint (to save the model, for example in h5 format), EarlyStopping (to stop training once a monitored metric has stopped improving), and many more.
All of this is wrapped in the function train_lenet5_model. The user has to define:
X_train = training data
y_train = training labels
batch_size = batch size
epochs = number of epochs to run
run_name = name of the folder that stores the log file (if using Google Colab, the log is saved at /content/logs/'run_name')
X_val = validation data
y_val = validation labels
X_test = testing data
y_test = testing labels
Note that we also include model.evaluate in the function, which evaluates model performance against the testing data.
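To make the loss function less abstract, the sketch below computes a sparse categorical cross-entropy by hand with NumPy for a tiny made-up batch. The probabilities and labels are invented for illustration, and the function is a simplified stand-in for the Keras loss (it skips the numerical clipping a library implementation would normally apply).

```python
import numpy as np

def sparse_categorical_crossentropy(labels, probs):
    """Mean negative log-probability assigned to the true class.

    labels: integer class ids, shape (n,)
    probs:  predicted class probabilities, shape (n, n_classes)
    """
    labels = np.asarray(labels)
    probs = np.asarray(probs, dtype=float)
    picked = probs[np.arange(len(labels)), labels]  # probability of the true class
    return float(-np.log(picked).mean())

# Made-up softmax outputs for two samples and three classes
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1]]
labels = [0, 1]  # integer labels, as in y_train

loss = sparse_categorical_crossentropy(labels, probs)
print(round(loss, 4))
```

The "sparse" variant takes integer labels directly, so the MNIST labels do not need to be one-hot encoded first.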
model1 = train_lenet5_model (X_train_100, y_train_100, 64, 30, 'actual_image_size_100', X_val, y_val, X_test, y_test)
model2 = train_lenet5_model (X_train_1000, y_train_1000, 64, 30, 'actual_image_size_1000', X_val, y_val, X_test, y_test)
model3 = train_lenet5_model (X_train_10000, y_train_10000, 64, 30, 'actual_image_size_10000', X_val, y_val, X_test, y_test)
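The three calls above share a batch size of 64 and 30 epochs, so the number of weight updates differs a lot between the runs. The arithmetic below is a rough sketch, assuming the last (smaller) batch of each epoch is still used for an update.

```python
import math

batch_size = 64
epochs = 30

for n_images in (100, 1000, 10000):
    steps_per_epoch = math.ceil(n_images / batch_size)  # batches needed to see every image once
    total_updates = steps_per_epoch * epochs
    print(n_images, steps_per_epoch, total_updates)

# collected for reference: {training size: (steps per epoch, total updates)}
updates = {
    n: (math.ceil(n / batch_size), math.ceil(n / batch_size) * epochs)
    for n in (100, 1000, 10000)
}
```

The run with 100 images therefore performs far fewer updates in 30 epochs than the run with 10,000 images, which is worth keeping in mind when comparing the convergence plots.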
We train model1, model2, and model3 using 100, 1000, and 10000 training images respectively. All of these training images are actual images from MNIST.
batch_size is defined as 64. This means each training step takes 64 images from the respective input dataset to update the model, and an epoch runs through the whole dataset in such batches.
run_name is provided to name the folder where the log files are placed.
# we can display the convergence plot using TensorBoard (visualization toolkit from TensorFlow)
%load_ext tensorboard
%tensorboard --logdir='./logs'
The TensorBoard tool allows us to inspect each model's performance:
model1 (using 100 training images) shows high loss for both the training and validation datasets and significant overfitting.
model2 (using 1000 training images) shows significant improvement in both training and validation loss. The overfitting issue seems to have been overcome as well.
model3 (using 10,000 training images) converges the fastest, at epoch 10, with minimal loss for both the training and validation datasets. The overfitting issue is successfully eliminated.
Comment: model3 predicts X_test very well as it has been trained with a sufficient amount of data.
from tensorflow.keras.preprocessing.image import ImageDataGenerator
# use ImageDataGenerator from tensorflow to perform data augmentation
datagen = ImageDataGenerator(
rotation_range=45, # Int. Degree angle range for random rotation - from keras
shear_range=0.6, # Float. Shear Intensity (Shear angle in counter-clockwise direction in degrees) - from keras
zoom_range=0.4) # Float or [lower, upper]. Range for random zoom. If a float, [lower, upper] = [1-zoom_range, 1+zoom_range] - from keras
In the code above we define the rotation, shear and zoom parameters for data augmentation.
Due to the variation in slant across handwriting styles, as shown in the picture below (taken from actual images in the MNIST dataset), techniques such as rotation and shearing are applicable. In addition, handwriting can be small or large, so rescaling (zooming in or out) could be useful too.
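To get a feel for the ranges that the chosen parameters imply, the sketch below draws random augmentation settings in plain Python. It assumes, based on my reading of the Keras documentation, that rotation and shear are sampled uniformly from [-range, +range] degrees and that a float zoom_range z is turned into [1 - z, 1 + z]. The function `sample_augmentation` is hypothetical and is not part of Keras.

```python
import random

def sample_augmentation(rotation_range=45, shear_range=0.6, zoom_range=0.4, rng=None):
    """Draw one random set of augmentation parameters (illustrative stand-in)."""
    rng = rng or random.Random()
    lower, upper = 1 - zoom_range, 1 + zoom_range  # float zoom_range -> [lower, upper]
    return {
        "rotation_deg": rng.uniform(-rotation_range, rotation_range),
        "shear_deg": rng.uniform(-shear_range, shear_range),
        "zoom_x": rng.uniform(lower, upper),
        "zoom_y": rng.uniform(lower, upper),
    }

rng = random.Random(0)  # fixed seed so the draws are repeatable
samples = [sample_augmentation(rng=rng) for _ in range(1000)]

print(samples[0])
print(min(s["zoom_x"] for s in samples), max(s["zoom_x"] for s in samples))
```

Under this assumption a shear_range of 0.6 is a shear of less than one degree, so most of the visible variation would come from rotation and zoom.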
i = 0
for batch in datagen.flow(X_train, y_train, batch_size=10,  # take 10 actual images (each digit) for 1 batch to perform augmentation
                          save_to_dir='/content/drive/MyDrive/Colab Notebooks/Assignment/test_image',
                          save_format='png'):
    i = i + 1
    if i > 2:
        break
We first tested data augmentation using the code above to understand how it works.
The input data (X_train) is augmented by datagen (created from ImageDataGenerator) and the results are saved in .png format at the specified path: '/content/drive/MyDrive/Colab Notebooks/Assignment'.
With batch_size=10, each batch takes 10 random images from the input X_train and augments them randomly.
Notice that data augmentation is performed through a loop. The loop breaks after 3 iterations, as we specified if i > 2, then break. This returns 3 batches of augmented images, produced with the augmentation parameters (rotation_range, shear_range, zoom_range) chosen previously.
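The explicit break is needed because the generator keeps producing batches for as long as it is asked. The sketch below mimics that behaviour with a plain Python generator (`endless_batches` is a hypothetical stand-in for datagen.flow that yields indices instead of images) and counts how many batches and images the loop consumes.

```python
import itertools

def endless_batches(n_items, batch_size):
    """Yield batches of indices forever, wrapping around at the end of the data."""
    for start in itertools.count(step=batch_size):
        yield [(start + k) % n_items for k in range(batch_size)]

batches_drawn = 0
images_seen = 0
for batch in endless_batches(n_items=60_000, batch_size=10):
    batches_drawn += 1
    images_seen += len(batch)
    if batches_drawn > 2:  # same stopping rule as `if i > 2: break`
        break

print(batches_drawn, images_seen)
```

Without the break the loop would never terminate, which is why the stopping condition controls how many augmented images end up on disk.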
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
img1 = mpimg.imread('/content/drive/MyDrive/Colab Notebooks/Assignment/_7_2861.png')
img2 = mpimg.imread('/content/drive/MyDrive/Colab Notebooks/Assignment/_7_1173.png')
img3 = mpimg.imread('/content/drive/MyDrive/Colab Notebooks/Assignment/_7_3115.png')
images = [img1, img2, img3]
for i in range(3):
    plt.subplot(1, 3, i+1)
    imgplot = plt.imshow(images[i], cmap='binary')
    plt.axis('off')
plt.show()
# generate augmented data with control on number of images
import numpy as np
augmented_data = []
augmented_data_label = []
for i in range(10):
    num_augmented = 0
    # note: with shuffle=False every new flow() call starts again at the first image,
    # so as written each pass reads the first 1,000 images of X_train in order;
    # pass X_train_100, y_train_100 to draw from only 100 source images
    for X_batch, y_batch in datagen.flow(X_train, y_train, batch_size=10, shuffle=False):
        augmented_data.append(X_batch)
        augmented_data_label.append(y_batch)
        num_augmented += 1
        if num_augmented >= 100:
            break
augmented_data = np.concatenate(augmented_data)
augmented_data_label = np.concatenate(augmented_data_label)
augmented_data.shape, augmented_data_label.shape
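How many distinct source images feed the 10,000 augmented images depends on the size of the array handed to the generator. The sketch below reproduces only the indexing of the nested loop, assuming that an unshuffled generator walks through the data in order, wraps around at the end, and restarts from the first image whenever it is created anew. The function `sequential_sources` is a hypothetical helper and no images are involved.

```python
from collections import Counter

def sequential_sources(n_source, batch_size=10, batches_per_run=100, runs=10):
    """Count how often each source index is used by the nested augmentation loop."""
    counts = Counter()
    for _ in range(runs):
        position = 0  # a freshly created generator starts at the first image
        for _ in range(batches_per_run):
            for k in range(batch_size):
                counts[(position + k) % n_source] += 1
            position = (position + batch_size) % n_source
    return counts

full = sequential_sources(n_source=60_000)   # generator fed with the full training set
small = sequential_sources(n_source=100)     # generator fed with a 100-image subset

print(sum(full.values()), len(full))
print(sum(small.values()), len(small))
```

Both settings produce 10,000 augmented images, but under these assumptions they draw on 1,000 and 100 distinct source images respectively, so the input array is what decides how limited the source data really is.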
With batch_size=10, the code above takes 10 actual images from X_train per batch and collects 100 augmented batches in each of the 10 runs. In this way we use data augmentation to produce 10,000 synthetic images.
# to view the augmented data, we need to reduce its dimension to 3 (as required by the plt.imshow function)
images_3_dim = np.squeeze(augmented_data, axis=3)
# plotting for the first 10 augmented data
count = 0
for i in range(10):
    count = count + 1
    images = images_3_dim[i]
    # create subplot so that images are displayed in same line
    plt.subplot(1, 10, count)
    plt.imshow(images, cmap="binary")
    plt.title(augmented_data_label[i])
    plt.axis('off')
plt.show()
# plotting for the next 10 augmented data
count = 0
for i in range(10,20):
    count = count + 1
    images = images_3_dim[i]
    # create subplot so that images are displayed in same line
    plt.subplot(1, 10, count)
    plt.imshow(images, cmap="binary")
    plt.title(augmented_data_label[i])
    plt.axis('off')
plt.show()
# plotting for the next 10 augmented data
count = 0
for i in range(20,30):
    count = count + 1
    images = images_3_dim[i]
    # create subplot so that images are displayed in same line
    plt.subplot(1, 10, count)
    plt.imshow(images, cmap="binary")
    plt.title(augmented_data_label[i])
    plt.axis('off')
plt.show()
We use augmented_data and its label augmented_data_label to train our next model.
model6 = train_lenet5_model(augmented_data,
                            augmented_data_label,
                            64,
                            30,
                            'synthetic_image_size_10000',
                            X_val,
                            y_val,
                            X_test,
                            y_test)
model6 is trained using augmented_data of size 10,000, validated against X_val, and tested against X_test (both the validation and test datasets are actual images from MNIST).
%load_ext tensorboard
%tensorboard --logdir='./logs'
Comment:
model6, which was trained with 10,000 synthetic images (from only 100 actual images), has significantly lower losses and higher accuracy. Although overfitting is still observed, it is much more acceptable compared to using only 100 actual images. Its performance is still below that of model3, which was trained using 10,000 actual images.
# generate the confusion matrix for model6
import tensorflow
y_pred = model6.predict(X_test)
predicted_categories = tensorflow.argmax(y_pred, axis=1)
print ('Confusion Matrix:')
# confusion_matrix expects (labels, predictions): rows are true digits, columns are predicted digits
cm = tensorflow.math.confusion_matrix(y_test, predicted_categories)
cm
# plot confusion matrix
import matplotlib.pyplot as plt
import seaborn as sns
fig, ax = plt.subplots(figsize=(10,10)) # Sample figsize in inches
sns.heatmap(cm, annot=True, linewidths=.5, ax=ax)
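Reading the heatmap is easier with the orientation of the matrix in mind. The sketch below builds a small confusion matrix by hand with NumPy, using made-up labels and predictions for three classes, with rows as true classes and columns as predicted classes. It also derives the per-class recall from the diagonal. The names `toy_confusion`, `toy_true` and `toy_pred` are illustrative only.

```python
import numpy as np

def toy_confusion(y_true, y_pred, n_classes):
    """Rows index the true class, columns index the predicted class."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        matrix[t, p] += 1
    return matrix

# Made-up labels and predictions for three classes
toy_true = [0, 0, 1, 1, 2, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0, 2]

toy_cm = toy_confusion(toy_true, toy_pred, n_classes=3)

# Fraction of each true class that was predicted correctly
recall = toy_cm.diagonal() / toy_cm.sum(axis=1)

print(toy_cm)
print(recall)
```

Large values on the diagonal and small values elsewhere indicate that each digit is mostly predicted as itself; an off-diagonal cell shows which pair of digits is being confused.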
model6, trained with 10,000 synthetic images, can predict all the classes (different handwritten digits) well. This suggests that the data augmentation process created a balanced dataset, as samples are randomly chosen and augmented for each batch.
The model trained with 100 actual images converged to loss: 0.9721 - accuracy: 0.7072, which means the model is only good at predicting the limited images that it has been trained on and generalizes poorly (it is unable to properly predict unseen data). In contrast, the model trained with 10,000 actual images converged to loss: 0.0921 - accuracy: 0.9747.
The model trained with 10,000 synthetic images reached loss: 0.1980 - accuracy: 0.9499 when tested on 5000 actual images. We learn that it performs significantly better than the model trained with just 100 actual images. However, its performance is still not as good as that of the model trained with 10,000 actual images. We suspect that there are still variations in handwriting style that have not been sampled for data augmentation.
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.