Hello everyone, welcome to our module, and today we will continue.
Give me one second, let me share my screen. OK.
So today we will continue convolutional neural networks. In the last lecture we covered network training, and today we will continue with network training.
To have a quick recap of what we covered in the last lecture: we introduced the pipeline for training the network.
Basically this is the same as we had before. We have training data and we have test data, and we use the training examples to train the network, which basically means adapting the weights to minimize the loss function.
We compared the training pipelines for the traditional machine learning approach and the convolutional neural network approach, and we compared different loss functions. For classification, we have the softmax function to predict the probabilities of assigning the input to a label. We have these scores, and they are not in the range of 0 to 1.
As a way of predicting the label of the input, we want probabilities: OK, it is a cat, 80%; it is a dog, 10%; it is a person, and so on. So we get the probabilities of assigning the input to each label. We map these scores into probabilities, right? We use this softmax function to map the scores into the range of 0 and 1, and we get the probabilities.
Once we get these probabilities, we compare them against the ground truth. For the ground truth we have these numbers: 100% for the cat class, 0% for the other classes.
And we compare these two vectors: for the predicted probabilities we have these numbers, and for the ground truth we have these numbers. The reason why we have one for one class is that the input image was labeled by a human labeler, so that's why we are sure this is a cat or this is a dog, and that's how we provide the ground truth label for the input.
Once we have the predicted probabilities and the ground truth, we can compute the loss for classification. We use cross-entropy for the classification problem. For the cross-entropy, we have the sum over classes of the ground truth t multiplied by the log of the predicted probability.
So as here: in the last lecture you may have seen these probabilities, and here we compare them together. Here we have the first class as one and we have log 0.7. Also in the last lecture we showed you the intuition of cross-entropy: what loss we will get if we have a very good prediction, and what loss we will get if we have a bad prediction. OK, so that's the classification loss.
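As a concrete sketch (not shown in the lecture slides, and the function names are just for illustration), here is the softmax and cross-entropy computation in NumPy; the 0.7 probability for the true class matches the lecture's log 0.7 example:

```python
import numpy as np

def softmax(scores):
    # Subtract the max score for numerical stability, then normalize.
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

def cross_entropy(t, p):
    # L = -sum_i t_i * log(p_i); with a one-hot ground truth t,
    # only the true class contributes to the sum.
    return -np.sum(t * np.log(p))

scores = np.array([3.0, 1.0, 0.2])    # raw network scores, not in [0, 1]
probs = softmax(scores)               # mapped into [0, 1], summing to 1

p_good = np.array([0.7, 0.2, 0.1])    # a good prediction: 0.7 for the cat
p_bad = np.array([0.1, 0.2, 0.7])     # a bad prediction for the same image
t = np.array([1.0, 0.0, 0.0])         # one-hot ground truth: it is a cat
loss_good = cross_entropy(t, p_good)  # -log(0.7), a small loss
loss_bad = cross_entropy(t, p_bad)    # -log(0.1), a much larger loss
```

You can see the intuition directly: the better the predicted probability for the true class, the smaller the loss.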
And for regression, it is our old friend. We have seen regression before, right? We predict real numbers and we use mean squared error as the loss of the network. For example, if we want to predict the location of a facial landmark in a face, we use mean squared error. That's a regression problem: we regress to these location values, and that's why we call it regression.
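The landmark example can be sketched in a few lines; the pixel coordinates below are hypothetical numbers, not from the lecture:

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean squared error: the average of the squared differences.
    return np.mean((y_true - y_pred) ** 2)

# Hypothetical numbers: the true (x, y) pixel location of a facial
# landmark versus the network's predicted location.
y_true = np.array([64.0, 80.0])
y_pred = np.array([60.0, 85.0])
loss = mse(y_true, y_pred)   # ((-4)**2 + 5**2) / 2 = 20.5
```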
Today we will introduce different optimizers. Actually, we have seen backpropagation before, and today we revisit backpropagation in convolutional neural networks and see what optimizers we can use. An optimizer performs the updates in backpropagation, and we have different techniques based on this very basic gradient descent: how to compute the gradients and optimize the weights based on the gradients. So we have gradient descent, and we have different variants of gradient descent: some have momentum, some have adaptive learning rates. We will see AdaGrad, RMSProp, and also Adam, which are based on adaptive learning rates.
For this lecture, we will see some equations and math, but remember: you don't have to go through the details of these optimizers or these methods. You just need to know their names. We won't cover these equations in the exam or in the assignment, but you need to know that we have Adam, we have AdaGrad, and so on, as options when you train your networks in practice, for example in your assignment. Even if you don't know all the details of the equations for Adam, you know: OK, I have this Adam option to use to optimize my neural network. That's the main aim of this lecture, and that's an overview of today's lecture.
So first of all, what is an optimizer?
To have a quick recap, as I have mentioned, this is related to backpropagation learning. We have seen backpropagation learning in the previous lecture, right? We have gradient descent, and we have seen this one-dimensional example. What we want to achieve with optimizers is to find the absolute minimum of this loss function, this error function. So f(x) is the error function, the loss function. You can see we have the axis as x, and we have this curve to show how the error function changes according to x. And we want to reach the absolute minimum, which is here. What can we do?
Imagine the simplest way: this is a curve, and we can put a ball at any location on this curve. For example, we put the ball here and we let it move freely. The ball will definitely move this way, right? After some time it will arrive at the minimum of this part; it will stop at this location, and we get a minimum of this error function. But there may be a problem: if we put the ball here, it will move this way, right? It will arrive here, and if it moves very slowly it will stay here. That's not the best solution, as we haven't arrived at the absolute minimum of the whole function, which is here, but we arrive at a relatively good solution, as the error here is also small. But it's a local minimum, right? And this one is the global minimum. When we say global minimum, it means it is the minimum for the whole function, for any x. For the local minimum, it means it is the minimum only for some region, some part of the function. So for this area it is the local minimum.
OK, that's the basic idea behind gradient descent. That's what we mentioned before, right? So mathematically, what can we do to mimic this ball, so that it can find the minimum of this function?
What we do is we first pick an initial location, that is x₀, and we get the slope of this curve at the point x₀. That's the slope we get.
For the gradient here: the gradient points this way, and as we want to find the minimum of this function, we take the negative of the gradient and go the other way, right? We follow the slope in the inverse direction of the gradient. That's why we call it gradient descent, right?
We follow this slope and we get somewhere here. That's another location: we follow the slope, we get this location, and we find another point x₁, and we evaluate f(x) again. That's how we find x₁. You can see here we start from x₀ and go in the negative direction of the derivative of f at x₀. So that's the gradient, f′(x₀), and as we mentioned, we go in the negative, the inverse, direction of this gradient; that's why we have the negative sign here. And we have a step size: how big a step we want to take in this direction along the slope. We have a learning rate, or we call it a step size, η (eta). So you can see we move along this slope from x₀ by this distance, that is η·f′(x₀), η multiplied by the gradient, and we find another point: x₁ = x₀ − η·f′(x₀). Then we do it again: we follow the gradient and we will arrive at a point whose gradient is 0, and that's the minimum point. We repeat this process again and again, iteratively, until for some xᵢ the gradient f′(xᵢ) is sufficiently close to 0; then we have the minimum of this function and we have the minimum of the error. That's what we expect: we train the network to minimize the error function.
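The iteration above can be sketched directly in code; this is a minimal one-dimensional version, with a made-up example function f(x) = (x − 3)², not anything from the lecture slides:

```python
def gradient_descent_1d(f_prime, x0, eta=0.1, tol=1e-6, max_iter=10000):
    # Repeat x <- x - eta * f'(x) until the gradient is close to zero.
    x = x0
    for _ in range(max_iter):
        g = f_prime(x)
        if abs(g) < tol:
            break
        x = x - eta * g
    return x

# Example error function: f(x) = (x - 3)^2, so f'(x) = 2 * (x - 3)
# and the minimum is at x = 3.
x_min = gradient_descent_1d(lambda x: 2.0 * (x - 3.0), x0=0.0)
```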
We have seen this one-dimensional example. What can we do for training the network practically? We have different ways. First we have stochastic gradient descent. What we do is we take one example, one sample, from the training data set and we update the weights using this sample. As this sample is taken from the input randomly, we call it stochastic gradient descent; we take one sample each time. On the other hand, we can have batch gradient descent: we take all the samples in the training set for each iteration, all of them, to update the network. The loss is the sum of the error functions over all the samples. That's why we call it batch gradient descent: one batch means we take all of the samples as the input at once. This is in contrast with stochastic gradient descent: for stochastic gradient descent, SGD, we take one sample each time and we update the weights, and for batch gradient descent we feed all of the training samples into the network and we update the network once. That's batch gradient descent.
And there's something that lands in between: we have mini-batch gradient descent. We take some samples from the training data set, so a mini-batch of samples is used in each iteration. For example, say we have 10,000 training images, and for each mini-batch we take, for example, 32 images from this training data. So we have the batch size, the number of samples in the whole training data set, and for the mini-batch, every time we take 32 samples from the data set, so we have m samples in each mini-batch. You can see the relationship between the size of a mini-batch and the batch size: the mini-batch size m will definitely be lower than the number of samples in the whole training data set, and it is greater than one. And if m is equal to 1, it becomes stochastic gradient descent, OK?
We have seen this in the previous lecture, and here we revisit gradient descent. Before, we mentioned how to feed the data into the network; here we see how we use the input data to train our neural network with mini-batches. So we have stochastic gradient descent, we have batch gradient descent, and we have mini-batch gradient descent. In practice you can try feeding the whole data set into the network, you can try feeding one sample each time into the network, and you can try feeding a mini-batch of the data into the network. Quite obviously, they have different features with these different inputs. For example, for stochastic gradient descent, as we update the weights every time we feed one sample into the network, the optimization may be very unstable, as every sample will be different. For batch gradient descent, it will take a lot of time to update the network once, as we feed all of the data into the network. And mini-batch gradient descent combines the features of stochastic gradient descent and batch gradient descent, so it lies in between: it won't take very much time to update the weights of the network for one iteration, but also it won't be as unstable as stochastic gradient descent.
Here is a comparison between batch gradient descent and mini-batch gradient descent. As I said before, if we take mini-batches for gradient descent, it may be unstable compared to batch gradient descent. For batch gradient descent, we feed all of the data into the network, so we are always updating the error function with the same data; that's why the cost decreases steadily over the number of iterations. This is the x-axis, the number of iterations, and here you can see the cost decreasing along the number of mini-batches, which also corresponds to epochs, right? You can see here, as every time we sample a mini-batch of the data set, each mini-batch may be different from the others, so we are updating the cost according to different data. That's why we have these fluctuations, and that's why it is unstable compared to batch gradient descent. You can see we have these curves for this cost J(W), but you can see the trend of the cost in mini-batch gradient descent is very similar to batch gradient descent; the trend will be the same. Because we feed one mini-batch into the network each time, it saves time, right? We only take a part of the data set into the network, and on the other hand the trend is similar to batch gradient descent. That's why in practice we usually use mini-batch gradient descent.
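As a sketch of the sampling scheme (the data and the linear model here are hypothetical, just to keep the example self-contained; the lecture's setting is a neural network):

```python
import numpy as np

def minibatches(X, y, m, rng):
    # Shuffle the data set once per epoch, then yield mini-batches of size m.
    idx = rng.permutation(len(X))
    for start in range(0, len(X), m):
        batch = idx[start:start + m]
        yield X[batch], y[batch]

# Mini-batch gradient descent on a toy linear model y = 2x + 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1))
y = 2.0 * X[:, 0] + 1.0
w, b, eta = 0.0, 0.0, 0.1
for epoch in range(50):
    for Xb, yb in minibatches(X, y, m=32, rng=rng):
        err = w * Xb[:, 0] + b - yb
        # Gradient of the (half) mean squared error w.r.t. w and b.
        w -= eta * np.mean(err * Xb[:, 0])
        b -= eta * np.mean(err)
```

Setting m=1 in `minibatches` gives stochastic gradient descent, and m=len(X) gives batch gradient descent, matching the spectrum described above.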
There are some problems with gradient descent; that's what we are addressing here. For example, in this 2D example, we want to arrive at the global minimum of this function here, and we have two axes. The cost, the loss, changes differently along these two directions. If we go in this direction, you can see the loss changes very quickly and we can arrive at the global minimum very quickly, but if we go in this direction, the value of the error function won't change very much, right? So this is a good example of the loss changing quickly in one direction and slowly in another. What does gradient descent do?
That's what we will see for this example. You can see we make very slow progress along the shallow dimension: we start from this location and move in this direction, and as we have mentioned, the loss changes very slowly in this direction. Then, after a very long distance, the gradient changes and it goes back the other way, so you can see we have some jitter from the initial location to the global minimum of the error function. We don't want this jitter, as we want to arrive at the global minimum as quickly as possible, which means we can use some solutions to minimize it.
We have different solutions to this problem. Before we show the solutions, let's have another look at backpropagation learning. This is the one-dimensional example again: we have this error function and we follow the slope to get the minimum, and we repeat the process until the gradient is equal to 0.
You can see here we have different ways to minimize the jitter we saw before. Again, we have a ball here, we move this ball along this curve, and it will arrive at the global minimum. We can have a few solutions. From this location, we can give the ball momentum: it has been going in this direction, so we can give it something like gravity to keep moving in this direction, so it can move faster towards the global minimum. That's one solution. The second solution concerns the step size, η. η is multiplied by the gradient, so if we change the step size, every time we will move differently, and if we change η according to how the ball moves, we may arrive at the global minimum more quickly. For example, if the error function is large, as here, we give it a large step size so it can change quickly and jump to this location. But if we arrive near the global minimum and we still have a very large step size, we may step over the global minimum, right? We may go to this point. In this case we need a small step size, which means setting η to a small value, so that the ball can find the minimum in this region. So we adapt this learning rate, this step size, during the process. There are two solutions: one is we give it momentum; the second is we give it adaptive learning rates. Let's see how they work in practice.
OK, the first solution: we add momentum to gradient descent. You can see here, originally, if we follow this slope, we have this jitter. Imagine you put a ball here with an initial gradient, so it moves this way. If we use momentum, then after some of these movements we accumulate a momentum: the ball has been going this way and this way and this way, so we sum up these gradients, how it moved in the past, and we get the conclusion that the momentum points in this direction. It's always going that way; we have a sum of these moving directions and we conclude it is going this way. For the next few steps, we still have this jitter, but again we take the sum of these directions, the ball keeps going in that direction, and by using this momentum the ball moves more quickly towards the global minimum.
That's how we represent this momentum in the actual step. We have the ball at this point, we have the gradient in this direction, and from the past directions we get this momentum, which we call the velocity of the ball. So the velocity is the momentum: we know how the ball moved in the past, we follow this velocity direction and adjust our moving direction. We combine the gradient and the velocity and we get a combined moving vector; that is the actual step. That's what we have for the next step. So in this case we add momentum to the updates; basically we add a velocity to the movement of the ball on this error function.
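The "velocity plus gradient" step above can be written as a small sketch; the loss function below is a made-up 2D example with one steep and one shallow direction, like the jitter example on the slide:

```python
import numpy as np

def momentum_step(theta, grad, velocity, eta=0.01, beta=0.9):
    # The velocity accumulates past gradients (the momentum); the actual
    # step combines the current gradient with this velocity.
    velocity = beta * velocity - eta * grad
    return theta + velocity, velocity

# A loss shaped like the 2D example: f(x, y) = 10*x**2 + y**2 changes
# quickly along x and slowly along y; its gradient is (20x, 2y).
theta = np.array([1.0, 1.0])
velocity = np.zeros(2)
for _ in range(200):
    grad = np.array([20.0 * theta[0], 2.0 * theta[1]])
    theta, velocity = momentum_step(theta, grad, velocity)
```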
So that was one solution, and next we can see how we can have adaptive learning rates for gradient descent.
Again, this is the one-dimensional example of gradient descent, for backpropagation learning. We have this slope, we follow the slope, we get another location, and we find the minimum of the error function here.
We want to adapt the learning rate on this error function. As I said before, what do we expect? At the beginning, if the error function is large, as here, we want to move quickly, as we are far from the minimum, so we can have a large step size and move to this location. But here the error function is much smaller than at that point, so then we can use a smaller step size, a smaller learning rate: we use this learning rate to update, we move slowly with a small step size, and we can find the minimum of the error function at this point. By using a smaller step size we move slowly along this curve, and we can find the minimum of the error function.
OK, let's see how we can update the learning rate. Here is a formulation of gradient descent in a mathematical way. You can see we have the updated weights. We usually use θ (theta) to represent the weights of the network, or the parameters of the network. You may remember that in the logistic regression models we also used θ to represent the parameters, and we use θ to represent the parameters of the neural network as well. After updating the network we get the updated weights, that is θ_{t+1}. Before the update we have the weights θ_t, and here we use α as the learning rate, or the step size: how big a step we want to take each time. That's the α, and we have the gradient of the cost, that is, the loss function, with respect to the parameters θ. J is the cost function, and basically here we have the partial derivative of J with respect to θ; that's the derivative, or the gradient. Again, as I said before, we want to move in the inverse direction of the gradient; that's why we have the minus sign. So we have θ_{t+1} = θ_t − α multiplied by the gradient. It is exactly the same as before, where we had x₁ = x₀ − η·f′(x₀): we have the learning rate and we have the gradient. That's how we update the weights θ.
It may be confusing: here x is actually not the input of the network; this is just an example of how the error function works. f(x) is not ŷ; it is the error function. Remember, this is the error function, and x is actually θ, the parameters. We mentioned this in backpropagation learning: we are updating this θ. We want to find the weights, the parameters θ, that is, the x at which we get the minimum of f(x), the minimum error; that's what we need to clarify. This x is not the input of the neural network; it is the parameter, the weight, of the network, and f(x) is the error function, not ŷ, the predicted value of the network.
OK, back to our adaptive learning rates. That's how we update the weights: θ_{t+1} = θ_t − α multiplied by the gradient. And we have a few options: we have AdaGrad, we have RMSProp, and we have Adam. Adam is short for adaptive moment estimation; Adam combines momentum and adaptive learning rates.
First let's have a look at AdaGrad. You can see we take g as the gradient of this error, this cost function. And we keep one intermediate value r: r will be equal to the previous r plus g². We use this r to update the learning rate here; you can see this is the learning rate. We take one hyperparameter, α, divided by δ plus the square root of r. This small value δ is to avoid zero, to make sure that δ plus the square root of r is never zero. At the beginning, if r is equal to 0, this small non-zero number, for example 10⁻⁷, keeps the denominator from being zero. We want to avoid the denominator being zero, right? That's why we have this δ in the denominator. And we use this number to multiply g, where g is the gradient. You can see: if g is larger, then r will be larger, right? So this number, α divided by δ plus the square root of r, will be smaller. So if the gradient is larger, we get a smaller step size, and we update θ using θ minus this scaled gradient to get the new θ. You can see we evaluate how large the gradients have been, and we update θ according to the value of the gradients. OK, so that's AdaGrad; basically this is an adaptive learning rate for gradient descent.
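The AdaGrad update just described can be sketched as follows; the toy loss f(x, y) = 10x² + y² and the hyperparameter values are illustrative choices, not from the slides:

```python
import numpy as np

def adagrad_step(theta, grad, r, alpha=0.5, delta=1e-7):
    # Accumulate the squared gradients in r, then scale the step by
    # alpha / (delta + sqrt(r)): the larger the past gradients, the
    # smaller the effective step size. delta keeps the denominator non-zero.
    r = r + grad ** 2
    theta = theta - (alpha / (delta + np.sqrt(r))) * grad
    return theta, r

theta = np.array([1.0, 1.0])
r = np.zeros(2)
for _ in range(500):
    grad = np.array([20.0 * theta[0], 2.0 * theta[1]])  # grad of 10x^2 + y^2
    theta, r = adagrad_step(theta, grad, r)
```

Note that each parameter gets its own effective step size, since r is accumulated per parameter.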
And for RMSProp, you can see it is very similar to what we have seen before. We use g to represent the gradient of the cost function, and then we have ρ multiplying the previous expected value of g², and (1 − ρ) multiplying the current g², where ρ is the weighting factor between these two terms. First we have the previous expected square of the gradient, and here we have the current squared gradient; we combine them together and we get the expected value of g² at time t. Then we use a very similar equation: you can see this is the learning rate. If g is bigger, the gradient is bigger, so E[g²] will be bigger and this term will be smaller. Then we update the weights θ using the same rationale, so that we can include the gradients in the learning rate and update the weights. That's RMSProp.
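Here is the same kind of sketch for RMSProp; again the toy loss and the hyperparameter values (α = 0.01, ρ = 0.9) are illustrative assumptions:

```python
import numpy as np

def rmsprop_step(theta, grad, eg2, alpha=0.01, rho=0.9, delta=1e-7):
    # Exponentially weighted moving average of the squared gradient:
    # rho weights the history against the current squared gradient.
    eg2 = rho * eg2 + (1.0 - rho) * grad ** 2
    theta = theta - (alpha / np.sqrt(eg2 + delta)) * grad
    return theta, eg2

theta = np.array([1.0, 1.0])
eg2 = np.zeros(2)
for _ in range(500):
    grad = np.array([20.0 * theta[0], 2.0 * theta[1]])  # grad of 10x^2 + y^2
    theta, eg2 = rmsprop_step(theta, grad, eg2)
```

Compared with AdaGrad, the moving average lets old gradients decay instead of accumulating forever, so the effective step size does not shrink monotonically.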
Finally, we have Adam, which is adaptive moment estimation. This is a very common default setting for training a neural network. For Adam we have a very similar case: we have g to represent the gradient of the network, and we compute a weighted moment of this gradient, so it combines momentum and the gradient, and we get this number. Then we also adapt the learning rate using this second term, the moving average of the squared gradient. So here we use the moments to update the parameters. There is a time limit, so I won't go into the details of Adam, but you need to know that Adam is one of the most common default settings for training a network. If you're interested in more details, you can check out the Adam paper; you can see how it works there. I will upload this paper to the course page.
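For completeness, here is a sketch of the Adam update combining both ideas (the first-moment momentum term and the RMSProp-style second moment, with the bias correction from the Adam paper); the toy loss and step counts are illustrative:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.05,
              beta1=0.9, beta2=0.999, eps=1e-8):
    # First moment: a momentum-like moving average of the gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    # Second moment: a moving average of the squared gradient,
    # used for the adaptive learning rate (as in RMSProp).
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    # Bias correction compensates for initializing m and v at zero.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0, 1.0])
m, v = np.zeros(2), np.zeros(2)
for t in range(1, 1001):
    grad = np.array([20.0 * theta[0], 2.0 * theta[1]])  # grad of 10x^2 + y^2
    theta, m, v = adam_step(theta, grad, m, v, t)
```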
To wrap up today's lecture: we have introduced different optimizers. We started from the most straightforward way, gradient descent. We have seen stochastic gradient descent, we have seen batch gradient descent, we have seen mini-batch gradient descent, and we have seen different solutions to the problems of gradient descent. For the first solution, we include momentum in the optimizer, and for the second approach we have adaptive learning rates: we have AdaGrad, we have RMSProp, and we have Adam. You can try out different optimizers in coding, see how they work, and compare their performance.
And as I said before, you don't have to know the details of the equations, how these learning rates are updated or how the momentum works, but you need to know the names of these options you can use to update the network. In the next lecture, we will cover regularization.
See you in the next one, thank you.