Hello everyone, welcome to our computer vision module, COMP 338. Today we will continue on convolutional neural networks and deep learning, and we will introduce network training.
Let's have a quick recap of what we covered in the previous lectures. In the previous lectures we introduced the basics of convolutional neural networks. We have seen these convolutions — basically the same filters we have seen before for extracting features. We have seen convolutional features, we have seen masks, and we have seen how to apply convolution in convolutional layers. We have seen strides, we have seen padding, and we have seen how to calculate the output size based on the stride size and the padding size. OK, and then we put everything together with some pooling layers, ReLU activations, and one final fully connected layer, and so we got the full convolutional neural network.
Today we will continue on these convolutional neural networks and see how to train them, OK.
First of all, let's have a look at the pipeline of training neural networks.
To have a quick recap of the object recognition framework: you have seen it in our lab sessions and in our previous lectures, and you have also practiced it in the first assignment. For object recognition we have two phases: we have training and we have testing.
For training, in this pipeline, we have a lot of data and we train the model on these data. We have seen these apples, we have seen these pears, and we have seen some other objects.
For assignment one, you have seen we have some example images for each class, like car or person. Basically it means the model will see these classes in training, OK, so that it can recognize these classes at test time. That's what we do here: we use these training images to train this model, and to do that we need to learn some features from each image in the training set.
These features can be SIFT features — that's what we had for the traditional, handcrafted computer vision pipeline — or they can come from convolutional layers. In a CNN there are layers of convolutions, and we can use these convolutional layers to learn features from the images. These features can be corners, can be lines, but usually we use the convolution operations to let the network learn these features by itself. We don't design specific features in this process; we just use different layers of convolutions to extract the features.
Then we have training labels for each object — for example, this is an apple. Practically, we only give a class label for each class, like 1, 2, 4, or 5, as you have seen in assignment one.
And for training, we have a classifier to classify these objects.
For the handcrafted features, we cluster them and we get a dictionary, and then we use a histogram of these clustered features to represent the objects. OK, and then we can use KNN, or a Bayes classifier, or a softmax classifier to classify these histograms, and that's how we get the classifier. But in convolutional neural networks, in deep learning, we have one fully connected layer to classify the learned features. That's the difference between the handcrafted features and the deep learning approaches. But either way, we have a classifier, OK.
Then we get a model for classifying these objects, and when we have a test image, we do the same: we extract features, either by SIFT descriptors or by these convolutional layers, and then we feed them through the learned classifier. It can be the KNN classifier in the traditional computer vision pipeline, or it can be the fully connected layer in a convolutional neural network, and then we predict the label of the input image. Given an image, we get the prediction. That's the whole pipeline of object recognition.
For deep learning, for convolutional neural networks, we will follow this pipeline: OK, we train the network and we do the classification. But for deep learning we update the weights to optimize the network. So we have this loop, we have iterations to optimize the network so that it can do better and better on this dataset.
For the traditional computer vision techniques, we specify the hyperparameters manually — for example, we can specify K equal to 1 or K equal to 5, as you have seen in assignment one. But for deep learning, for convolutional neural networks, the network learns by itself, updating the weights through these iterations. That's why we have loops, these iterations.
First we sample a batch of data from the training set. We don't use all of the data for the training at once; for each loop we take some samples — for example, a few images from these training images. So we get a batch of data, and then we input this batch of data, these examples, into the network. We forward-propagate this data through the network and then we get a loss. This is one very important concept in network training: the loss.
In the traditional computer vision pipelines, we have accuracy to evaluate how a classifier performs — for example, how accurately it can predict. That's what we had with the evaluation metrics in our previous lectures, right? Similarly, in deep learning we have the loss to evaluate how well the network performs on this batch of data. OK, so we have a loss.
If the loss is large, we need to change the weights very much. If the loss is small, it means the current weights do well for this batch of data, OK, and we don't have to adapt the weights very much. Then we backpropagate the loss to calculate the gradients — that's the backpropagation. OK, we backpropagate the loss through the network, and then we adapt the weights so the network can perform better on the data. And then we update the parameters using the gradients; this update step completes one iteration. So that's the same as what we have seen in artificial neural networks.
You can say this is the training process, OK: we take some data, and then we use this data to evaluate whether the weights in the current network perform well or not on this data. Then we get the loss, we backpropagate the loss through the network, and then we adapt the weights according to how well the network performs on this batch of data. OK, that's the whole training process; that's how we train the network.
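The loop just described can be sketched in a few lines of code. This is only a minimal illustration, not the lecture's actual implementation: the one-layer "network" and the squared-score loss are made-up placeholders so that the loop runs end to end; a real network would use a framework such as PyTorch or TensorFlow.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.standard_normal((100, 4))   # 100 samples, 4 features each
W = 0.1 * rng.standard_normal((4, 3))     # weights of a single toy layer
learning_rate = 0.1
losses = []

for step in range(20):
    idx = rng.choice(len(X_train), size=8, replace=False)
    batch = X_train[idx]                           # 1. sample a batch of data
    scores = batch @ W                             # 2. forward propagation
    loss = np.mean(scores ** 2)                    # 3. loss on this batch (placeholder)
    grad_W = 2 * batch.T @ scores / scores.size    # 4. backpropagation (gradient of the loss)
    W = W - learning_rate * grad_W                 # 5. update the weights
    losses.append(loss)

print(losses[0] > losses[-1])  # True: the loss decreases over the iterations
```

The point is the five-step structure of each iteration, not the particular loss: sample, forward, loss, backward, update.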
Now for each step we will give more details. First of all, how to sample a batch of data — that's the first step, right?
Usually we need to do some data preprocessing. You may ask why: when we sample some data from the training images, from the training data, we may have some bias. For example, for one batch we may take a lot of data for one class and much less data for another class, OK, right? Which means the distribution of the data in this batch may be quite different from the distribution of the data in the whole training dataset. That's why we need this data preprocessing: we use data normalization to normalize this batch of data.
You can see the original data is offset from the center, and the variance will vary from batch to batch. So what we can do is first zero-center the data, OK — something similar to what we did for principal component analysis — so you can see it now has a mean of 0. That's why we call it zero-centered data. Then we can normalize the data so that the standard deviation is equal to one. You can see the standard deviation was quite large before, but here we normalize it to one. So in this way we get the normalized data.
Practically, we use two lines of code to achieve this data normalization. You can see here: first we subtract the mean of the input data X, so that the mean becomes 0 — that's how we get the zero-centered data. Then, to normalize the data, we divide the zero-centered X by its standard deviation, so that the standard deviation of the output is one. That's how we get the normalized data.
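The two normalization steps can be written out in NumPy like this; the small data matrix here is made up just for illustration:

```python
import numpy as np

# Hypothetical batch of data: 3 samples, 2 features on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

X = X - np.mean(X, axis=0)   # zero-center: each feature now has mean 0
X = X / np.std(X, axis=0)    # normalize: each feature now has std 1

print(np.allclose(np.mean(X, axis=0), 0))  # True
print(np.allclose(np.std(X, axis=0), 1))   # True
```

These are exactly the "two lines" mentioned above: subtract the mean, then divide by the standard deviation.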
So now we have the data normalization and we get the normalized data. But another problem with data in deep learning — we always say deep learning is data-hungry — is that we need a lot of data. For example, we may take three photos of this clock tower, from three angles, which means the data may not cover some other angles.
To make the model see these variations, so it can cover different angles, we can do data augmentation to augment the dataset. For example, here we have three photos from three different angles; with data augmentation we can get like 10 angles or 100 angles by changing the angles of these photos. We can have flipping, we can have scaling, we can have rotating. By rotating the available data we get some new data, some new angles of the clock tower. That's how we augment the available data to get a larger dataset. This makes the model see more variations in the data, so that in the test phase the model can better predict the classes of the input, right? That's why we do data augmentation.
You can see here we have one original image and we do flipping, so the shoes are flipped. This means we don't have to collect new data; we just do some manipulation of the existing data. If we later see another image where the shoes appear this way, it may not be at exactly this angle, but it will be similar, right? So the model can still recognize that this is a shoe. OK, so it means we can augment the training data. In the example on the right you can see we have one input image of a cat, and we can do flipping, we can do scaling, we can do rotations, so that we get new training data. We augment the training data so that the model is better able to recognize new variations of these objects — it can be a different angle, it can be a different scale. Either way, we augment the dataset so that the model can better recognize these objects at test time.
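As a rough sketch of the idea — in real projects one would typically use a library such as torchvision or tf.image for this — simple flips and rotations can be generated directly with NumPy:

```python
import numpy as np

def augment(image):
    """Generate simple variants of one image array of shape (H, W, C)."""
    return [
        image,                 # the original
        np.fliplr(image),      # horizontal flip
        np.rot90(image),       # 90-degree rotation
        np.rot90(image, 2),    # 180-degree rotation
    ]

# A hypothetical 32x32 RGB image, used only to show the shapes.
image = np.zeros((32, 32, 3))
versions = augment(image)
print(len(versions))  # 4 training images from 1 original
```

Each transformed copy counts as a new training example, so the dataset grows without collecting any new photos.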
As we train with batches of data, we also have a concept called an epoch. An epoch means every sample in the training dataset has had an opportunity to update the internal model parameters. We feed one batch into the network and we update the network using this batch of data; each time we feed one batch of data into the network and do the feedforward and backward information propagation, that's one loop, one iteration. One epoch can consist of one batch, or it can consist of more batches: it is one full pass over the training data, updating the model parameters, updating the weights. OK, that's what we call one epoch.
And usually we use mini-batch training: we take a small batch of the data and we make updates using this small part of the data. One of the reasons why we don't feed all of the data into the network at once is that the dataset is usually very large — it can be thousands of images, or it can be millions of images — so we can't feed all of the training data into the network in one go; that's why we feed small batches of the dataset into the network. Another reason is that if we feed a small batch of data, it can be more efficient to train the network. OK, so when training, the data is split into small batches, and each batch is a mini-batch.
You can see the relationship between the mini-batch and the whole training dataset: the size of the mini-batch is usually greater than one. For sure, you could take one as the mini-batch size, but that's what we don't do. If you take the size of a mini-batch as one, you only input one training example into the network at a time, so you can't capture the distribution of the data in the dataset, right? That's what we don't do in training. And the size of the mini-batch is smaller than the size of the training data. Here I give you one example.
For example, here we have a training dataset that has 32,000 instances. If the size of a mini-batch — we call it the batch size — is set to 32, then there will be 1,000 mini-batches, OK, and if one epoch consists of all of these batches, it means the weights of the network will be updated 1,000 times per epoch. Another example: if the batch size is set to 16, we will have 2,000 mini-batches, right, and the weights of the network will be updated 2,000 times. And if we take the batch size as eight, we will have 4,000 mini-batches and the network will be updated 4,000 times. Every time we feed a batch of training examples into the network, we update the network once. OK, that's how we relate the mini-batch to the whole dataset.
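The batch-size arithmetic in the example above is easy to check; the figure of 32,000 training instances is the hypothetical dataset size used in the example:

```python
num_instances = 32_000  # hypothetical size of the training set

for batch_size in (32, 16, 8):
    num_batches = num_instances // batch_size  # also the weight updates per epoch
    print(f"batch size {batch_size}: {num_batches} mini-batches per epoch")
# batch size 32 -> 1000, batch size 16 -> 2000, batch size 8 -> 4000
```

Halving the batch size doubles the number of mini-batches, and therefore doubles the number of weight updates in one epoch.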
Again, this is the training, with the forward pass and the backward pass; it is actually the same as what we have seen with backpropagation in artificial neural networks.
You can see here we have the training: we take some data from the training dataset — for example, we get one batch of eight images — and we feed these images into the network. We have the forward pass and then we get the predictions: OK, this is a dog, this is a person, this is an apple. Then we compare these predicted labels against the ground truth — actually this image is a dog, this image is a person — and we get the error.
If the prediction is the same as the ground truth, we take an error of zero, as the prediction is right. If it's not right, we need to get an error of some value as the loss, and then we backpropagate this error through the network and we update the weights according to this error. So those are the forward and backward passes.
But here there is one question: what is the error, and how do we update the weights according to the error? That's the question of training.
You can see here, for the forward pass, for the forward propagation, we go through the network and we get the loss — we calculate the error.
And you may ask the question: how do we update the weights if we don't have any weights yet? We initialize the weights, and they can be random numbers. OK, that's what we do: we randomize the weights of the network first, and then we optimize these weights. That's why you will usually get very poor performance at the beginning of training: as we have randomized weights, for this apple, if we train this network from scratch, for the first epochs the predictions will be poor.
In coding, in programming, this is what we do: we generate random numbers using NumPy, and we make them smaller. Usually we want small initial weights, otherwise the weights will be too large, so we multiply the random numbers by 0.01. That's the random initialization of the weights for the network. Then we propagate the information from the input to the prediction: OK, we go through this network structure and pass the information from input to output. Once we get the output, which is the prediction, right, we compare the prediction against the ground truth, and then we get the loss. The loss function tells us how well the network, with its current weights, performs for the task.
Before we go into the loss functions, I would like to say something about the initialization of weights. This is actually an active area of research: in recent years a lot of work has been proposed to address this initialization problem, so there are a few different papers on this topic. If you're interested, you can check out these papers.
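The simple scheme described above — small random numbers scaled down by a factor like 0.01 — can be sketched as follows; the layer sizes here are arbitrary examples:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer with 100 inputs and 10 outputs:
# small random weights, and biases that often just start at zero.
W = 0.01 * rng.standard_normal((100, 10))
b = np.zeros(10)

print(W.shape)             # (100, 10)
print(abs(W).max() < 0.1)  # True: all initial weights are small
```

The 0.01 scale keeps the initial weights small, which is the point made in the lecture; more sophisticated initializations from the research mentioned above adjust this scale based on the layer sizes.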
Now let's introduce the loss functions. Just to have a quick recap of deep learning: we have two passes. The first pass is feedforward, where we go from the input to the prediction, so the information is fed forward.
You can see here we have X and we learn the features from the inputs. Actually we are doing linear combinations and nonlinear functions of the inputs: we have linear combinations of the inputs, we get some features, then we use activations like ReLU to get nonlinear features, then another layer of linear combinations, and so on, and we learn a lot of different features. For convolutional neural networks, as we mentioned before, these convolutional features are actually linear combinations with these weights, and so on, right, as we explained in the last lecture.
And then we get the prediction — here we get the prediction — and then we get the loss. Previously, in the artificial neural networks, we introduced the loss: it can be the square of the difference between the predicted value and the ground truth value. OK, so for example, for a stock price, we predict it will be 100, but actually it is 120; we get the difference of 20, and we take the square as the loss. Once we get the loss, we propagate the loss through the network and we update the weights accordingly, so that the network can perform better and minimize the loss. But for computer vision we have different loss functions, as we will see in the coming slides.
Before we go into these loss functions, I would like to introduce the softmax function, which is the basis for the loss function for classification. OK, you may remember the softmax regression in lecture 15; for deep neural networks we also have the softmax function.
The basic idea of the softmax function, as in softmax regression, is to squash the values into the range of zero and one. For softmax regression, we compute e to the x divided by the sum of all of the e to the x terms, right? And all of these outputs will be in the range from zero to one, which means it is perfect for representing probabilities, as probabilities lie in the range of zero to one: the probability of one event can be zero, it can be 0.1, it can be one, but it will not go beyond this range. So it's the perfect function for mapping values into the range of zero and one. This is the same for the softmax function in the neural network: we take the hidden features, these numbers, as the input, and we squash these numbers into the range of zero to one. Here's a diagram: we have the input X, we have a few layers, like linear layers and activation functions, to learn the features, and at the end, you may remember, we have the fully connected layer. OK, the fully connected layer predicts the scores of each class: it can be 2.2, it can be 1.0, it can be 0.1. But these numbers can't represent probabilities directly — we want to say, for example, that we are 80% sure this is an apple. So is it possible to convert these values into probabilities? That's why we use the softmax function again: we use the softmax function to map these scores to probabilities.
We get 0.7, 0.2, and 0.1. Basically, what we compute for the first class is

e^2.2 / (e^2.2 + e^1.0 + e^0.1) ≈ 0.7.

For the second class we have

e^1.0 / (e^2.2 + e^1.0 + e^0.1) ≈ 0.2,

and for the third class, with score 0.1, we get

e^0.1 / (e^2.2 + e^1.0 + e^0.1) ≈ 0.1.

Again, that's how we convert these scores into probabilities.
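The conversion worked through above can be sketched in a few lines of NumPy, using the example class scores 2.2, 1.0, and 0.1:

```python
import numpy as np

def softmax(scores):
    """Map raw class scores to probabilities in [0, 1] that sum to 1."""
    exps = np.exp(scores - np.max(scores))  # subtract max for numerical stability
    return exps / np.sum(exps)

probs = softmax(np.array([2.2, 1.0, 0.1]))
print(np.round(probs, 2))              # approximately [0.7, 0.21, 0.09]
print(np.isclose(probs.sum(), 1.0))    # True: a valid probability distribution
```

Subtracting the maximum score before exponentiating does not change the result but avoids overflow when the scores are large.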
OK, so in this example we have three classes, we have these inputs, and we have these predictions for the classes. It means the model is 70% sure this is class 0, the model is 20% sure this is class 1, and 10% sure this is class 2. That's the softmax function: here we predict the class of the input. So how can we get the loss? For classification we usually use cross entropy.
The cross entropy is based on the softmax function, so we continue from this example. Here we have three classes: we have cat, we have car, we have frog. As in the last slide, we have predicted 70% this is a cat, 20% this is a car, and 10% this is a frog. That's our prediction, OK, and these numbers are the probabilities of assigning this input image to each class. And in training — as we always mention, classification is a supervised learning problem — we have the ground truth labels for these inputs. We know this image is a cat, so we have the ground truth numbers for the input: for an input of a cat, the correct probabilities will be 100% sure this is a cat, and zero for the others. Then we compare this prediction and the ground truth using the cross entropy loss. The cross entropy evaluates the information loss between these distributions over the different classes, and that's how we get the loss.
Actually, the cross entropy comes from information theory. OK, you can check the background and more details of cross entropy in Wikipedia or other online materials; here you just need to know how to calculate the cross entropy. For this example we use this equation, and you can see we have three classes, so for each class we calculate one term. For the first class, the cat, we have a ground truth of 1, multiplied by the log of the predicted probability, which here is 0.7. For the second class, the ground truth probability is zero — it will not be a car — multiplied by the log of the predicted probability, 0.2. For the third class, the ground truth is also zero — it will not be a frog — multiplied by log 0.1. Then we sum them up and negate; remember there is a minus sign. So the loss is

-(1 × log 0.7 + 0 × log 0.2 + 0 × log 0.1) = -log 0.7.

That's the loss for this example.
You can imagine that if the model is perfect, it will predict 100% that this is a cat, which means the loss will be -log 1, which is zero. So the loss will be 0 if the model is doing very well. But if the model is not doing well, the predicted probability for the true class will get very small, which means the loss will get larger. So that's how it works: if the model is doing well, the loss will be smaller; if the model is not doing well, the loss will be bigger. OK, that's why we use this cross entropy as the loss for classification.
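The cat/car/frog calculation above can be checked with a few lines of NumPy:

```python
import numpy as np

predicted    = np.array([0.7, 0.2, 0.1])  # softmax output: cat, car, frog
ground_truth = np.array([1.0, 0.0, 0.0])  # one-hot label: the image is a cat

# Cross entropy: -(1*log 0.7 + 0*log 0.2 + 0*log 0.1) = -log 0.7
loss = -np.sum(ground_truth * np.log(predicted))
print(round(float(loss), 3))  # 0.357

# A perfect prediction gives zero loss; a worse one gives a larger loss.
print(float(-np.log(1.0)) == 0.0)          # True
print(bool(-np.log(0.1) > -np.log(0.7)))   # True: worse prediction, bigger loss
```

Only the term for the true class survives the one-hot multiplication, which is why the loss reduces to -log of the probability assigned to the correct class.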
But of course we may have other tasks. Here we have multi-category classification, where we predict the classes of the inputs, but for some other tasks we may have real-number prediction. For classification, the predicted labels are classes like 1, 2, 3, 4. But for tasks like landmark prediction — in this example we have different landmarks on the face, so that we can locate the eyes, the nose, the mouth — the locations of these landmarks in the face are real numbers: it can be 1.0, it can be 1.4, it can be 1.5. For such a case the prediction may be, like, 1.5, while the ground truth is 1.4. In this case we can use the loss we mentioned in the artificial neural networks: the mean squared error. OK, we take the difference, we square it, and we sum up all of these squared differences as the loss — that gives the mean squared error — and then we can use this loss to update the weights of the network. So that's the other case.
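The mean squared error for real-valued predictions, such as the landmark coordinates mentioned above, looks like this; the values here are made up for illustration:

```python
import numpy as np

# Hypothetical landmark coordinates: predicted vs. ground truth.
predicted    = np.array([1.5, 2.0, 0.8])
ground_truth = np.array([1.4, 2.2, 1.0])

# Differences: 0.1, -0.2, -0.2 -> squares: 0.01, 0.04, 0.04 -> mean: 0.03
mse = np.mean((predicted - ground_truth) ** 2)
print(round(float(mse), 4))  # 0.03
```

Unlike cross entropy, this loss directly penalizes the numeric distance between prediction and ground truth, which is what a regression task needs.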
Now we have seen the losses, and then we want to optimize the networks, so there are some other terms we need to know. We have introduced backpropagation and we have introduced gradient descent, but there are a lot of techniques for doing gradient descent in deep learning, so there is a term we call the optimizer. If you train a network in deep learning frameworks like TensorFlow or PyTorch, you need to set up this optimizer. The optimizer optimizes the model based on the criterion of minimizing the loss: you have seen the loss, and the optimizer's job is to minimize it. There are different methods. OK, the backpropagation we have seen in the previous lectures is quite simple, quite straightforward, but in real practice we often have to set up different optimizers, and some may work well for some scenarios and some not. So we have some options: we have stochastic gradient descent (SGD), which makes gradient descent updates on randomly sampled batches, and we have different variants of SGD, such as RMSProp, and also adaptive learning-rate gradient descent, which we call Adam — one of the most popular optimizers for deep learning networks.
Also, to update the parameters — you may still remember this from the artificial neural networks lecture — we used this example: we start from one point, we follow the slope and go to the next point, and we have a step size with which to update the network, OK? We follow the slope and arrive at the bottom of the function. One important concept is how to set the step size. Usually at the beginning we may use a large step size — from here we may jump to here — as we have a very large loss and we want to optimize the function quickly. But as we approach the bottom of the function, we need to be careful updating our weights, so we want a smaller step size, OK? The step size in deep learning is called the learning rate: how quickly we want to update our weights.
You can see here how we update our trainable parameters W and b. Usually, after backpropagation, this is what we do — the same as for artificial neural networks:

W = W - learning_rate × ∂L/∂W

where the learning rate is the step size and ∂L/∂W is the gradient of the loss with respect to the weights. And for the offset b we have

b = b - learning_rate × ∂L/∂b

to update the bias.
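The two update rules can be written out directly; the weights and gradients here are made-up placeholders standing in for real backpropagation output:

```python
import numpy as np

learning_rate = 0.1
W = np.array([[0.5, -0.3]])      # hypothetical weights
b = np.array([0.2])              # hypothetical bias

grad_W = np.array([[0.1, 0.2]])  # pretend dL/dW from backpropagation
grad_b = np.array([0.05])        # pretend dL/db from backpropagation

W = W - learning_rate * grad_W   # step against the gradient
b = b - learning_rate * grad_b

print(W)  # approximately [[0.49, -0.32]]
print(b)  # approximately [0.195]
```

The minus sign is the important part: we move the parameters against the gradient, downhill on the loss, by a step whose size is the learning rate.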
Also, we need to select a proper learning rate. If we have a large learning rate, a large step size, we may jump over the minimum, OK? If we have a small step size, a small learning rate, the training may be very slow. So you can see here: with a very high learning rate we will get a very high loss after a few epochs; with a low learning rate the training will be very slow; and with a good learning rate we minimize the loss nicely after a few epochs.
So that was just a quick preview of learning rates and the optimizers. In the next lecture we will introduce more details of optimizers and also learning rates. That's all for today's lecture; see you in the next one. Thank you.