In this video, I will be using a convolutional

neural network, implemented with Keras and written in Python, to recognize handwritten

digits and perform basic operations between them. I simultaneously aim to give you an

intuition for, and understanding of, convolutional neural networks and their awesome potential.

I will start off with a demonstration of the program. I enter “13” one digit at a time,

clicking “save image” after each digit. Then I click “multiply”, and enter “14”,

one digit at a time, clicking “save image” after each digit. When I click “equals”,

I get the correct solution of 182. Pretty cool, huh? The core of this program is a convolutional

neural network. Convolutional neural networks are loosely based on the manner in which mammals visually perceive the world around them. This manner involves a hierarchical series of feature

recognitions, starting off with simpler features like diagonal lines, curved edges, etc., and

progressing towards complex or abstract recognitions, like combinations of shapes, and finally,

the classification of entire objects. That’s pretty simple, but how would you mathematically

model this process? Let’s dive deeper into how convolutional neural networks do what

they do. Like the standard neural networks discussed

in previous videos, convolutional neural networks involve input layer neurons, weights, biases,

hidden layer neurons and output layer neurons. However, they are also a bit different. Let’s

begin at the start of the convolutional neural network and work our way through it. Inputs

are a 2-d matrix for black and white images, and a 3-d tensor for colored images. By the way, a tensor is just a generalization of an array to any number of dimensions: 0-d, 1-d, and 2-d tensors are more often called scalars, vectors, and matrices, and a 3-d array is a 3rd-order tensor. The 3rd dimension in a colored image holds the different color channels, usually red, green, and blue in an RGB image. In our case, images are black and white. Next, weights are arranged into

2d matrices. This differs from a regular neural net, where inputs and weights are scalars,

or just single numbers. In a regular neural net, you simply multiply inputs by weights.

How does this work with matrices or even tensors? Instead, a dot product is computed. Essentially, to perform the dot product between the image, which is a matrix, and a smaller matrix, which is the weight matrix and is often called a filter, you multiply each value in the filter by the corresponding value in a section of the image, sum these products, and place the sum in the corresponding position in the resulting matrix. (Dividing this sum by the number of values to get an average would only rescale the result by a constant, which the learned weights absorb, so the plain sum is standard.) You then repeat this process, sliding the section of the image that is multiplied,

or the receptive field, until all the values for the resulting matrix are found. Note that with a 3 by 3 filter, this resulting matrix loses one pixel at each edge in every direction. To counter this, programmers sometimes perform padding, in which a border of zeros is added around the input image before convolution, so that the output keeps the input's size. With padding, the resulting matrices for a 16 by 16 input stay 16 by 16 instead of shrinking to 14 by 14. Also note that initially,

like any other weight, filters contain random values and are adjusted in training. Performing

these repeated dot-products in this manner is known as convolution. Convolution plays

a central role in a convolutional neural network, hence, the name convolutional neural networks.
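To make the sliding-window arithmetic concrete, here is a minimal NumPy sketch of the convolution just described, including zero padding. This is a hypothetical toy example, not the program's actual code:

```python
import numpy as np

def convolve2d(image, kernel):
    """'Valid' 2-d convolution: slide the filter over the image, multiply it
    elementwise with each receptive field, and sum the products.
    (What deep-learning libraries call convolution is technically
    cross-correlation: the filter is not flipped.)"""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            receptive_field = image[r:r + kh, c:c + kw]
            out[r, c] = np.sum(receptive_field * kernel)
    return out

image = np.arange(16.0).reshape(4, 4)  # toy 4x4 "image"
kernel = np.ones((3, 3)) / 9           # toy 3x3 filter
result = convolve2d(image, kernel)     # shape (2, 2): one pixel lost at each edge
padded = np.pad(image, 1)              # zero padding applied before convolution
same = convolve2d(padded, kernel)      # shape (4, 4): input size preserved
```

A real framework vectorizes this instead of looping, but the arithmetic is the same.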

To understand why, let’s take a look at two examples, one simple and one slightly

more complex. I’ll start with the simpler example. Say that this is the input image,

and we apply this filter to it. This filter can be thought of as a low-resolution horizontal

line, and look what happens in the resulting image: horizontal lines are kept, and almost

everything else fades into a dark gray and is ignored. In this way, filters keep what

is relevant and ignore everything else. ‘What is relevant’ is adjusted based on the needs

of the neural network in training. Mathematically, this occurred because the resulting value

in the matrix was large if and only if the receptive field was very similar to the filter.

If they were not similar, then values in the image matrix were multiplied by values close

to or equal to zero in the filter matrix or vice versa, making the resulting value small.
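Numerically, this "similar receptive field means a large value" behaviour can be sketched with a horizontal-line filter and two made-up receptive fields:

```python
import numpy as np

# A 3x3 filter that looks like a bright horizontal line on a dark background.
line_filter = np.array([[0., 0., 0.],
                        [1., 1., 1.],
                        [0., 0., 0.]])

# A receptive field that also contains a horizontal line: large response.
line_patch = np.array([[0., 0., 0.],
                       [1., 1., 1.],
                       [0., 0., 0.]])

# A receptive field containing a vertical line instead: small response,
# because most of its bright pixels meet zeros in the filter.
vertical_patch = np.array([[0., 1., 0.],
                           [0., 1., 0.],
                           [0., 1., 0.]])

line_response = np.sum(line_patch * line_filter)          # 3.0
vertical_response = np.sum(vertical_patch * line_filter)  # 1.0
```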

So far in this video, filters have contained only positive values. This, however, is a

simplification. Like in standard neural networks, weights in a convolutional neural network

can contain negative values. For these two examples, when visualizing any values in filters,

including negative values, I set the lowest value in the matrix to black and the highest

value in the matrix to white. While visualization is a powerful tool, it is also important to

consider the numbers, because the numbers are all that the computer sees. In this next

example, I will use negative values in the filter and a more complex image. This is the

new filter. While it looks exactly the same as the filter that was just used, it contains

-1s where there were once zeros. The image itself is much more complex, with many details.

In the resulting convolved image, the horizontal edges are highlighted and most other information

is lost. The general principle of how certain values are kept or even amplified and others

are ignored through convolution remains the same–that is, if the filter and the receptive

field are similar, then the resulting value is large, and if they are not, then the resulting

value is small. However, negative values differ from zeros in an important way. With the filter

that was used in the first example, the values in the receptive field corresponding to the

zeros in the filter don’t matter–they can be zero, or any other value, and since any

value multiplied by zero is zero, the resulting value will be zero. So, with this filter,

you will keep any horizontal line, regardless of what values are above or below it. When

the zeros are replaced with -1s, however, the filter will preserve a horizontal white

line if it has low, or ideally, negative values above and below it. This is because if the

values above and below the line are large, then large values will be made negative, making

the resulting sum much smaller. Similar logic can be used to explain why a horizontal black

line surrounded by large values will be preserved. Hence, this filter may look for horizontal

lines making up an edge, if the edge has a height of one pixel and is surrounded by values

that contrast the line. This fact explains the emphasis on edges rather than any horizontal

lines in the resulting convolved image. In this way, negative values in filters allow

the model to emphasize a wider variety of features. In a convolutional neural network, it is almost

always advantageous to have multiple filters, in order to highlight multiple different features

separately, but more on how this plays out later. The next step involves making the output,

after applying the filter, have a lower resolution, but still retain the most important features.

The point of this is to make the neural network more efficient and less computationally expensive,

especially during training. This step is pretty simple and is called pooling. First, you define

a pool size. This will be the dimensions of a section of the matrix that you will compress

into one value. Perhaps the most popular method of this ‘compression’ is simply

taking the maximum value, where the value represents the pixel intensity. In a black

and white image, which is what we will deal with, the value represents how white a pixel

is. So, the network simply slides across the image, taking the maximum value in a certain

area. Note that for both convolution and pooling, there is something known as a stride length.

Stride length represents how much you move the window after performing convolution or

pooling, in both the horizontal and vertical directions. For convolution, the stride length

is usually 1 by 1 or 1 horizontally and 1 vertically, and in pooling, the stride length

is usually the same as the pool size, which is the area from which the model takes the

maximum value. Changing the stride length from the standard values is usually unnecessary

and importantly, will change the dimensions of the resulting matrix. Finally, we apply

an activation function to all the values in the now compressed matrix. Remember, an activation

function serves the purpose of allowing a neural network to approximate a non-linear

function and this remains the case in convolutional neural networks. By the way, you could technically

apply an activation function before pooling and directly after convolution; however, this would be slightly more computationally expensive, as the matrices will be larger (with max pooling and a monotonic activation like relu, the result itself is identical either way). These 3 parts,

or convolution, pooling, and activation can be repeated to further add complexity to the

model. Note, that you don’t necessarily have to implement these parts precisely in

this order. As an example, it may prove beneficial to perform convolution, activation, convolution,

and only then pooling. What happens when you add more convolutional layers is really cool.
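Putting the three parts together, one conv → pool → activation pass can be sketched in plain NumPy. These are hypothetical helper functions for intuition only, with the usual defaults: pool size and pooling stride both 2 by 2, convolution stride 1 by 1:

```python
import numpy as np

def convolve2d(image, kernel):
    """'Valid' convolution with stride 1: slide, multiply elementwise, sum."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for r in range(oh):
        for c in range(ow):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

def max_pool(x, pool=2):
    """Max pooling with stride equal to the pool size."""
    oh, ow = x.shape[0] // pool, x.shape[1] // pool
    out = np.zeros((oh, ow))
    for r in range(oh):
        for c in range(ow):
            out[r, c] = np.max(x[r * pool:(r + 1) * pool,
                                 c * pool:(c + 1) * pool])
    return out

def relu(x):
    return np.maximum(x, 0)

image = np.random.rand(28, 28)   # stand-in for one MNIST digit
kernel = np.random.randn(3, 3)   # a randomly initialized filter
feature_map = relu(max_pool(convolve2d(image, kernel)))  # 28 -> 26 -> 13
```

Note how the shapes track the text: a 28 by 28 input shrinks to 26 by 26 after unpadded convolution, then to 13 by 13 after 2 by 2 pooling.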

Say that I start with a 28 by 28 black and white image and I have 32 filters in my first

convolutional layer. So, I now have 32 26 by 26 matrices, assuming that I don’t perform

padding. Whether I apply pooling and activation is irrelevant to this scenario. Let’s say

that I don’t, and I add another convolutional layer with 32 filters. I actually don’t

get 32 times 32, or 1024, new matrices after I apply this convolutional layer. That is, I don't

apply 32 filters to each of the previously filtered images. Let’s break down what actually

happens. Each of the 32 filters will actually have

a depth of 32, corresponding to the depth of 32 in the previous layer’s outputted

tensor. From this point, the dot product is calculated between the receptive field in

the first matrix and the first matrix in the first filter, between the receptive field

in the second matrix and the second matrix in the first filter, between the receptive

field in the third matrix and the third matrix in the first filter, and so on. Then all of these values are summed, and the sum is placed in the top left position of the first resulting matrix. (Standard convolution sums across the channels rather than averaging; dividing by the number of channels would only rescale the result by a constant that the learned weights absorb.) Then the rest of the values in the first matrix are computed by sliding the receptive field across the matrices, computing the dot products, summing them, and placing the result in the corresponding position in the first resulting matrix. So, the first resulting matrix was generated with the first filter. This process is repeated with each of the filters, until there are 32 resulting matrices. Essentially, 2d convolution between a matrix and a filter is repeated for each channel of the tensor and the filter, the results are summed across channels to make the result a 2d matrix, and this process is repeated with each of the filters. By performing convolution and combining channels in this way, I combine the highlighted features in a computationally inexpensive way. By doing this, we add a level of complexity that will force the model to make progressively

more and more intricate feature recognitions or highlights as it finds the optimum filter

values deeper in the convolutional neural network. So, deep in a CNN, where convolution,

pooling, and activation have already been performed quite a few times, a filter that looks like

a horizontal line, for example, will likely result in the highlighting of a much more

complex feature in the original image. Generally, by the last convolutional layer, each filter

will represent a relatively complex feature. I say “relatively” because depending on

the scenario a curved edge could be a complex feature, like in our case, or an entire human

face could be a complex feature. While all this may be slightly complicated, it is important

to remember that at a high level, we are still just computing a product between an input

and a weight, and because of this, eventually allowing the neural net to find the relationship

between input and output. Then, you flatten all the pixels into a single

vector. Remember, a vector in programming is a one-dimensional array. You connect each

pixel to a neuron in either another hidden layer or the output layer. Remember, that

by this point in the convolutional neural network, a pixel should represent the presence

or absence of a relatively complex feature. In this way, we add a fully connected section

to our neural network. This is done in order to take advantage of all the different features

that have been learned. In other words, the section of a Convolutional Neural Network

with convolution, pooling, and activation performs feature extraction and in the fully

connected section, the network finds the relationship between the presence or absence of these features

and the different classes. Like in the last video, with breast cancer diagnosis, the output

will be a set of probabilities. These probabilities represent the chances that a particular image

belongs to a certain class, according to the model. You can then take the digit corresponding

to the highest probability and voila, you have a predicted digit. Initially, as with

the neural networks dealt with in previous videos, the predictions are inaccurate. However,

accuracy is gradually improved by minimizing the loss or cost during training. For training the model, I use the MNIST dataset.
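The step just described, from output-layer values to probabilities to a predicted digit, can be sketched as follows. The raw output values here are made up for illustration:

```python
import numpy as np

def softmax(z):
    """Turn raw output-layer values into probabilities that sum to 1."""
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

# Hypothetical raw outputs of the 10 output neurons for one image.
raw_outputs = np.array([0.1, 2.0, 0.3, 0.1, 5.0, 0.2, 0.1, 0.4, 0.3, 0.1])

probabilities = softmax(raw_outputs)             # one probability per digit 0-9
predicted_digit = int(np.argmax(probabilities))  # the digit with the highest probability
```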

This dataset is popular and widely used among programmers. It contains 70000 handwritten

digits that are already split into training and testing data, with 60000 training digits

and 10000 testing digits. Each digit is a 28 by 28 pixel, black and white image. Each

of these images obviously contains a corresponding correct, human-made label. Let’s begin writing the CNN in Python with

the help of Keras. First, as always, I import all the packages that I will need in order

to implement the CNN. Then, I load the MNIST dataset. From where? Actually, Keras has a

copy of it because of how popular it is. Also, as I previously said, the dataset is already

split into training and testing data, so all I have to do is assign it to X_train, X_test,

y_train, and y_test. Remember, that X_train and X_test are the images, and y_train and

y_test are the classes or labels for the images. Next, I reshape both X_train and X_test to

-1 by 28 by 28 by 1. What in the world does that mean? Well, 28 by 28 is simply the resolution

of each of the images, and 1 represents the single color channel. The -1 is a special

number that essentially tells Keras to figure out what the actual value is. The actual value

will be the number of images assigned to that variable. So, the only thing that will actually

change in this line is the addition of the one color channel, and this is added to match

the format expected by Keras. Next, I apply to_categorical to y_train and y_test. For

a complete description and explanation of what this does, see my breast cancer diagnosis

video. Basically, it converts the classes in y_train and y_test into a form readable

by Keras. In the following lines, I normalize the input data by taking each pixel intensity

value, which is currently between 0 and 255, and dividing by 255. In order to do this,

I must first change the type of the values in these numpy arrays to float32 so that they

can contain decimals. Then, as always, I define the model in the line model = Sequential().

From here, I define the input shape as 28 by 28 by 1. Note that since we only input

one image at a time, the input is 3 dimensional, whereas X_train and X_test are 4 dimensional,

with the 4th dimension being the image index. In the same line, I define 32 filters of size 3 by 3,

and even though I don’t explicitly write it out, Keras includes a bias since this is

the default. I also don't choose to change the default of no padding before convolution,

simply because it is not necessary. Next I add a pooling layer, and more specifically,

a max-pooling layer with a pool_size of 2 by 2. There are actually other types of pooling,

like average pooling, for example, in which instead of taking the maximum value in a defined

area, you take the average value. In truth, there isn’t all that much of a difference

in the resulting performance between the two, especially for a simpler example, like the

one that we are dealing with. As mentioned earlier, you can change the stride length

in both pooling and convolution from the default, however, this is unnecessary. Hence, I use

the default values. Then, I apply the relu activation function. After this, I flatten

all the matrices into one very long list of values, where each value represents a pixel.

Then, I add a dense layer with 128 inputs. So, the value of each pixel will be connected

to each of the 128 neurons in this layer. This connection is often computationally expensive,

especially with much higher-resolution images. I also apply the relu activation function

to each of these neurons. Finally, I add an output layer with 10 different neurons. Each

of these neurons will represent a class, or, more specifically, a digit from 0 to 9. I

apply the softmax activation function, which is used in multiclass classification problems

like this one, as discussed in the last video. Finally, I compile the model. I use categorical

crossentropy as the loss, which is used in tandem with the softmax activation function,

I use the Adam optimizer, which will dictate how the weights are updated during training,

and I state that I want to keep track of the accuracy metric. Now we have defined the structure

of our convolutional neural network. Next, I perform training in the line model.fit.
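Assembled from the walkthrough above, the whole training script might look roughly like this. This is a sketch, not the video's exact code: the layer sizes and their order follow the description, while the batch size, epoch count, and validation split are illustrative assumptions:

```python
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Activation, Flatten, Dense
from tensorflow.keras.utils import to_categorical

# Keras ships a copy of MNIST, already split into training and testing data.
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# Add the single color channel (-1 lets Keras infer the number of images),
# convert to float32, and normalize pixel intensities from 0-255 to 0-1.
X_train = X_train.reshape(-1, 28, 28, 1).astype('float32') / 255
X_test = X_test.reshape(-1, 28, 28, 1).astype('float32') / 255

# Convert the integer labels into a one-hot form readable by Keras.
y_train_cat = to_categorical(y_train, 10)
y_test_cat = to_categorical(y_test, 10)

model = Sequential([
    Conv2D(32, (3, 3), input_shape=(28, 28, 1)),  # 32 3-by-3 filters, bias by default
    MaxPooling2D(pool_size=(2, 2)),               # pooling, then activation, as described
    Activation('relu'),
    Flatten(),                                    # one long vector of "pixels"
    Dense(128, activation='relu'),                # fully connected layer
    Dense(10, activation='softmax'),              # one neuron per digit 0-9
])
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])

# Batch size, epoch count, and validation split here are illustrative guesses.
model.fit(X_train, y_train_cat, batch_size=128, epochs=1,
          verbose=1, validation_split=0.1)
score = model.evaluate(X_test, y_test_cat, verbose=0)  # [loss, accuracy]
```

Even after a single pass over the training data, a network like this typically reaches well over 95% accuracy on the MNIST test set.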

I pass X_train and y_train_cat, and I define the batch size, epochs, verbose, and validation

split. For a complete explanation of all of these, see my neural network regression video.

Finally, I evaluate the model and find that the accuracy is good. And that’s it. I have

now written a very simple convolutional neural network in Keras. With this simple dataset,

there is very little that I can do to improve the model’s performance, let alone by a

substantial amount. For example, I could add more layers to the model, or I could implement

dropout; however, none of this is necessary and again, it doesn’t have a substantial

effect on the model’s accuracy, which is already quite high. The next section of the code is responsible

for the GUI, or graphical user interface, that you saw at the start of this video, and

for the operations performed between the numbers. Basically, here’s how it works. First, the

program opens a window, into which the user can write with their mouse. Whenever the user

clicks, holds, and drags, the program will plot circles in a location corresponding to

the position of their mouse. The program keeps track of every location in which a circle

was plotted, so that an exact copy of the displayed image can be made, that will be

inputted to the convolutional neural network, when the user clicks ‘save image’. Before

being inputted, however, the image is resized to 28 by 28, or the dimensions of images in

the MNIST dataset. The program also makes the pixel intensity values float32 variables

and divides each of them by 255, just as we did when we trained the network. Then this

slightly-modified image is inputted to the network and the program takes the digit with

the highest probability in the network’s output. This digit is then displayed. If the

user clicks the ‘click here if the number is incorrect’ button, then the program will

clear the current screen and will later replace the digit with a new one, when entered by

the user. A user at this point can either enter the second digit in a number, in which

case the process will be repeated and the two digits will be joined together, or the

user can click an operation. When the user clicks an operation, the program stores the

name of that operation in a variable. After this, the initial process is repeated and

the user can enter another number. When they click equals, the program simply performs

the requested operation between the two numbers and prints the result. If the user happens

to click the ‘reset’ button at any point in this process, then all the existing information

will be cleared or overwritten, and the process will start over again. And that’s it. If

you want to try the program out for yourself, you can download the code from my GitHub, a link to which is in this video’s description. Note that if you do so, you will

need to install pillow in your virtual environment. This is done with the following command. And

that’s it! We now have a basic, functional calculator that can recognize handwritten

digits and perform simple operations between them. In case you didn’t already know, this video

is part of a series in which I find really cool datasets like this one, and for each

of them I show you how to implement a Neural Network. In doing this, I hope to both entertain

and educate you. Subscribe and click on the notification bell to be notified when I release

a new video, and also hit the like button and leave a comment if you want this video

to reach more people. Thanks for watching.


