Hello everyone, welcome to Module 338. Today we will continue on deep learning, which we started in the last lecture. So we have introduced what deep learning is: deep learning is to learn deep representations of the inputs. Compare this to what we had for handcrafted features, for example SIFT or LBP. There, we design what features we want to extract from the data. In deep learning, instead, we use a neural network to learn the features for us. We don't say "OK, we will always use this particular feature"; instead we have neurons that learn the features from the inputs, and the weights are adapted to the input data. This is in contrast to that kind of feature extraction, which we call handcrafted because there is a person crafting the features. In deep learning we use the network to extract the features, and we also have the network do the classification as well. But for both of them, traditional machine learning and deep learning, we have f(x) = y: we have the input x and we have the output y. The only difference is f. For traditional machine learning, f uses handcrafted features; for deep learning, f is a neural network.

We have seen some applications of deep learning and we have introduced some basics: artificial neural networks with their thresholds, neurons, and perceptrons; activation functions such as ReLU and threshold functions; and linear separability. From that we saw why we need multilayer neural networks: with one perceptron, a single node, we can classify linearly separable inputs, but we cannot classify, for example, the XOR function. That is why we have multilayer networks.

Today we will continue on this route. We will introduce the feedforward neural network, and we will introduce how to set the weights and thresholds. You may remember the TLU, the threshold logic unit: it has weights and a threshold theta, and in that case we picked the numbers by hand so that we could implement AND and OR (see the short sketch below). Like I said before, we need a way to set the weights and thresholds automatically. This is the topic of today, and we use backpropagation to set the weights and thresholds.

OK, first question: how do we set weights and thresholds, and why should we set them at all? We have seen this neural network in the last lecture: we have these different layers with their nodes, we have the inputs, and we get the outputs, so there is a mapping from inputs to outputs. Here we draw it in a different way: we have the input vector x1, x2, x3 at the input layer; we have the hidden layer, where the nodes compute different functions; and then we get the outputs y1 and y2 at the output layer. The information flows from the input layer through the hidden layer to the output layer. That is the feedforward neural network: we feed in the data and we get the prediction y. For example, we have an image of a car and we want to classify what object is in the image; at the output layer the network predicts "this is a car". So that is the information flow, from input to output. Typically this is supervised learning.
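Before we automate the weight setting, here is a minimal sketch of the TLU recap in Python. The weight values and thresholds below are illustrative assumptions; any settings with the same effect would do.

```python
# A threshold logic unit (TLU): output 1 when the weighted sum of the
# inputs reaches the threshold theta, otherwise output 0.
def tlu(inputs, weights, theta):
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1 if weighted_sum >= theta else 0

# Hand-picked weights and thresholds (illustrative values):
# AND fires only when both inputs are 1 (the sum must reach 2),
# OR fires as soon as one input is 1 (the sum must reach 1).
for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", tlu((a, b), (1, 1), theta=2),
              "OR:", tlu((a, b), (1, 1), theta=1))
```

Backpropagation, introduced below, is what lets us learn such weights and thresholds from data instead of picking them by hand.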
In supervised learning, we have one input image and we have a label for the input, and we have training and testing. In training, we have one image, for example this clock tower, and we have the label. Here we give the label "Clock Tower" for this image, but usually, when encoding, we give numbers to the classes, like 1, 2, 3, 4, 5 for the different classes. So we have a lot of training images with labels: for every image, we know there is an object in the image and we have its label y. In testing, we have a new image, another clock tower, and we need to classify it as a clock tower. This image is the input, and we have the trained model, this neural network; the information goes through the network and we predict the label y, "this is a clock tower". This is supervised learning.

So in supervised learning, we train an artificial neural network with a set of vector pairs, so-called examples or training data. By providing enough examples (we have seen this clock tower many, many times, from different angles and different lighting conditions) the network may learn good features that describe the data in general. We get this trained model, and once we have a testing image, the image goes through the network and we get the prediction: "this is a clock tower". That is how we work with an artificial neural network. A training pair (x, y) consists of an input vector x and a corresponding output vector y. Like I said before, we have y = f(x), the mapping from the input vector to the output vector. The network receives input x, we would like it to produce output y, and the examples describe the function that we want the network to learn. That function is the network: we use the network to predict y.

Besides learning the examples, we would like our network to generalize. Once we have testing images, that is, new data, we want our model to give good predictions: to give plausible outputs for inputs that the network hasn't seen and hasn't been trained on. That is what we want to achieve with the network. There is a tradeoff between the network's ability to precisely learn the given examples, that is, the training data, and its ability to generalize, that is, to predict the labels of new data. Sometimes we may train a model that works very well on the examples we have in the database, but it may not work well on new data. For example, we collect a lot of data for this clock tower in the dataset and we train our model, and the model can precisely learn the given examples and predict "this is a clock tower". But if we collect a new test image of this clock tower from the internet and use the same model to predict, it probably cannot recognize that this is a clock tower, which means it is not able to generalize.

This problem is similar to fitting a function to a given set of data points. Let's assume that we want to find a fitting function, from one space to another space, for a set of three data points. We can try to do this with polynomials of different degrees, that is, polynomials of different complexities. Let's see this example.
OK, so basically this is function approximation; the goal is to recover the function. A neural network is also a function, right? So it is a function approximation: we have the input x and we have the prediction f(x). Now we have three points, like we said before, and each of them has a label. For the first data point x1, there is a corresponding y1, which we call the ground truth: the label that we are given. The next point x2 also has a ground truth y2, and x3 has y3. Then we use a polynomial as the function approximation, to approximate how these three points are distributed in the space; that is why we call it function approximation.

First we fit a line to approximate the distribution of these three points. We can write this line as y = wx + b, so you can see the degree is 1. For this line there will be some error between the ground truth y1 and the predicted value for x1; you can see the difference here. Likewise we have a predicted value for x2, and we compare it with y2, and the same for x3. We can also use a degree of 2, y = w1*x^2 + w2*x + b. Here you can see this approximation goes through all of the points, so the error on the training points will be zero. And we can also use a degree of 9, for example y = w1*x^9 + w2*x^8 + ... + b, so it has degree 9 and you can see it is very complex. The same as the degree-2 curve, this curve goes through all of the data points, and it has an error of zero, as there is no difference between the predicted values and the ground truth values.

So both degree 2 and degree 9 achieve very good performance on our training examples, these three points. But when we have a new data point, a test data point, we want to evaluate how the models generalize to it. So this is the case: we have a new test point and we look at the difference between the predicted value at this new data point and its ground truth value. You can see that for degree 2 this error is also very small, but degree 9 has a large difference. So both degree 2 and degree 9 do well on the training examples, but the degree-2 model generalizes much better than the degree-9 model (see the short fitting sketch below).
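Here is a minimal sketch of this fitting experiment with made-up numbers: three training points drawn from an assumed underlying function y = x^2 + 1, fitted by polynomials of degree 1, 2, and 9, then evaluated on one unseen test point. The function and the specific points are assumptions for illustration.

```python
import numpy as np

x_train = np.array([0.0, 1.0, 2.0])
y_train = x_train**2 + 1.0      # ground-truth labels from an assumed quadratic

x_test = 3.0                    # an unseen test point from the same function
y_test = x_test**2 + 1.0

for degree in (1, 2, 9):
    # NumPy warns that degree 9 is underdetermined by three points --
    # that is exactly the over-complex model discussed above.
    coeffs = np.polyfit(x_train, y_train, degree)
    train_error = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_error = (np.polyval(coeffs, x_test) - y_test) ** 2
    print(f"degree {degree}: train error {train_error:.3f}, test error {test_error:.3f}")
```

Degrees 2 and 9 both drive the training error to (nearly) zero, but the degree-9 fit typically lands far from the test point, which is exactly the generalization gap discussed above.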
To evaluate these different models, these different functions, and likewise to evaluate a neural network, the basic idea is to define an error function and measure the error on unseen data, that is, the testing set. As before, we can use different error functions; basically, an error function is a kind of distance evaluation. Here we take the squared differences between the predicted values and the ground truth outputs: d is the desired output, that is, the ground truth, and o is the actual output, the value predicted by the function. We compare the desired and actual outputs by calculating the distance between them, the sum of the squared differences.

For classification we can have another form of error function: the number of correctly classified samples divided by the total number of samples. To give you an example: we have 100 test images of fruits and we predict their labels; 80 of the images are correctly classified, so the measure is 80 divided by the total number of samples, 100, which gives an accuracy of 0.8 (equivalently, an error rate of 0.2). That is how we define it for classification (see the short sketch below). But for sure there are many other error functions, like the softmax-based error function for classification, so we can use different error functions for the neural network. Like we said in previous lectures for comparing histograms, we may use histogram intersection or we may use the Euclidean distance; we have different options. Some of them may perform well for this task, some others may perform well for other tasks.

OK, so now we have this error function, we want a good model that approximates the target function, and we want to minimize the error between the predicted values and the ground truth. This is backpropagation: we adapt the weights and thresholds of the activation functions to minimize the error function. The backpropagation algorithm was popularized by Rumelhart, Hinton, and Williams in 1986. It is quite an old algorithm, but it is quite efficient and quite powerful, especially for training neural networks. The algorithm solves the credit assignment problem, that is, crediting or blaming individual neurons across layers for particular outputs. The error at the output layer is propagated backwards to units at lower layers, so that the weights of all neurons can be adapted appropriately. You can see why we call it backpropagation: we propagate the error at the output backwards through the neural network, and we adapt the weights to minimize the error function.

The goal of the backpropagation algorithm is to modify the network's weights so that the actual output vector o_p is as close as possible to the desired output vector d_p. So for k output neurons and input patterns p, we have the set of input-output pairs, the examples: we have x_p, for example one image, and we have the desired label y, "this is a Clock Tower", so we have a number for this class, say class 1. And we have all of these images and labels for all of the training examples. Then, first, we need a cumulative error function that is to be minimized. Like I said before, this error function can be something like histogram intersection, or it can be accuracy, or another error function; here, just take it as the distance between the predicted values and the ground truth values. The typical choice is the mean squared error: we take the difference between the predicted value and the ground truth value, square it, and take the mean, the average of these squared errors over all examples: E = mean over p and k of (d_pk - o_pk)^2.
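To make the two error measures concrete, here is a small sketch with made-up predictions and labels; the second example mirrors the 80-out-of-100 case above.

```python
import numpy as np

def mean_squared_error(desired, actual):
    """Mean of the squared differences between desired (ground truth) and actual outputs."""
    desired, actual = np.asarray(desired), np.asarray(actual)
    return np.mean((desired - actual) ** 2)

def accuracy(true_labels, predicted_labels):
    """Fraction of correctly classified samples."""
    return np.mean(np.asarray(true_labels) == np.asarray(predicted_labels))

# Mean squared error on some made-up network outputs:
print(mean_squared_error([1.0, 0.0, 1.0], [0.9, 0.2, 0.7]))   # ~0.047

# 80 of 100 fruit images classified correctly -> 0.8, as in the example above:
true_labels = [1] * 80 + [0] * 20
predicted_labels = [1] * 100          # this model predicts class 1 for everything
print(accuracy(true_labels, predicted_labels))                # 0.8
```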
And the problem now is how to minimize the mean squared error. We use gradient descent to do this optimization: the idea is to backpropagate the error at the output through the neural network and adapt the weights. Gradient descent is a very common technique to find the minimum of a function, and it is especially useful for high-dimensional functions. Gradient descent is used to iteratively minimize the network's error by computing the gradient of the error surface in weight space and adjusting the weights in the opposite direction.

Here is a simple example. For gradient descent, our aim is to find the minimum of a one-dimensional error function. You can see here we have a curve, and what we want is to find the minimum of this function; it is the same as for a neural network, where we want to minimize the error. Here we just have the one-dimensional case: we have a curve and we want to find the global minimum of this curve. We start from x0 and evaluate the error f(x0); this f(x) is the error function. Then we compute the slope, that is, the derivative of f(x), and we move along the slope, so we go to x1, and then we go further, until we find the minimum of the function. We repeat this iteratively until, for some xi, the derivative f'(xi) is sufficiently close to 0; then the derivative is (almost) 0 and we have found a minimum in the space (see the short sketch below). But you may argue: in this case we might follow another slope and reach a derivative of zero here as well, where we only get a local minimum. If we collect several local minima, we can probably find the global minimum of this error function among them.

This was a very simple one-dimensional case; let's have a look at a two-dimensional function. Here we have this error function, and similarly to following the slope, we can move through this 2D space and find where the derivative equals 0, and probably we will find a local minimum or the global minimum. It could also be a maximum, but we can compare the error values to find out whether it is a minimum or a maximum. Here is a heatmap of this function, a different representation of the same function. You can see different arrows: they represent the gradients of the function at different locations. The gradient always points in the direction of the steepest increase of the function, so in order to find the function's minimum, we should always move against the gradient. As we want to find the minimum of this function, we should go against these arrows, because the arrows show the direction in which the function increases. Like I said before, we want to minimize the error function, so that is why we should always move against the gradient.

Now, here is a three-dimensional example. For high-dimensional spaces it is hard to visualize, but imagine this in the 3D world where we live. If you are on a mountain and you want to get to the bottom of the mountain, you walk around and follow the pull of gravity: you go downhill until you reach the bottom. Gravity here is against the gradient (the gradient points uphill), so by moving against the gradient you find the minimum of the height of the mountain, the bottom of the mountain.
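Here is a minimal sketch of the one-dimensional procedure just described. The error function f, its derivative, the starting point, and the learning rate are all assumptions for illustration.

```python
def gradient_descent_1d(df, x0, learning_rate=0.1, tolerance=1e-6, max_steps=10_000):
    """Step against the derivative df until it is sufficiently close to zero."""
    x = x0
    for _ in range(max_steps):
        slope = df(x)
        if abs(slope) < tolerance:     # derivative ~ 0: a (local) minimum
            break
        x -= learning_rate * slope     # move against the gradient
    return x

# Example error function f(x) = (x - 3)^2, with derivative f'(x) = 2(x - 3);
# starting from x0 = 0, the iterates approach the minimum at x = 3.
print(gradient_descent_1d(lambda x: 2 * (x - 3), x0=0.0))   # ~3.0
```

Note that, as in the lecture, this only finds the local minimum reachable from x0; restarting from several starting points is one simple way to hunt for the global minimum.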
That is how we work with gradient descent. Now, this will involve some math, some calculation of derivatives: partial derivatives of multivariable functions. Let's see one simple example. We have a function f(x, y) = x*y, and what we want is to get the minimum of this function. We can take its partial derivatives: the partial derivative of f with respect to x, df/dx, is y, and the partial derivative with respect to y, df/dy, is x. For another function, f(x, y) = x + y, the partial derivative df/dx is 1 and df/dy is 1. You can check these partial derivatives; there are tables for most of the common functions online. If you are given a question that requires partial derivatives, for sure we will tell you how to compute them in the question.

There are also rules for derivatives of compound functions. For example, here we have f(x, y, z) = (x + y)*z. We break the expression down into q = x + y and f = q*z, and we have df/dq = z and df/dz = q, and also dq/dx = 1 and dq/dy = 1. But we want the partial derivatives of f with respect to x and y, and here we use the chain rule: df/dx can be decomposed into df/dq multiplied by dq/dx, which equals z, and df/dy can be decomposed into df/dq multiplied by dq/dy, which also equals z. That is how we get df/dx and df/dy, and this is quite useful for neural networks, for backpropagation using gradient descent.

Again, this is a simple kind of network: we have three inputs x, y, z, and we compute the output as (x + y)*z. In the forward pass, the forward information flow, we get the output: first we sum x and y, minus 2 plus 5, so we get 3; then we multiply 3 by z, which is minus 4, so we get the output minus 12. That is the forward information flow. Now we have the output and we take it as the error function, so we want to make this minus 12 even smaller. What shall we do? We go backwards from f, from this output. First we look at f: f is q*z, and we take the partial derivatives. From f we get df/dq, which is z; as z is equal to minus 4, we get minus 4. We also have df/dz, which is q; as q is 3, we get 3. Then we can go from q back to x and y. As you can see here, df/dx can be decomposed with the chain rule: df/dx = df/dq * dq/dx = z, so it will be minus 4, and the same for df/dy, which is also minus 4 (see the short sketch below).
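Here is the same worked example as code: a forward pass through f(x, y, z) = (x + y)*z with the lecture's numbers, then the backward pass applying the chain rule by hand.

```python
x, y, z = -2.0, 5.0, -4.0

# Forward pass
q = x + y          # q = 3
f = q * z          # f = -12

# Backward pass (chain rule)
df_dq = z          # df/dq = z        = -4
df_dz = q          # df/dz = q        =  3
df_dx = df_dq * 1  # dq/dx = 1, so df/dx = -4
df_dy = df_dq * 1  # dq/dy = 1, so df/dy = -4

print(f, df_dx, df_dy, df_dz)   # -12.0 -4.0 -4.0 3.0
```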
So now we have all of these derivatives: df/dx is minus 4, df/dy is minus 4, and df/dz is 3. Then we adapt these values based on gradient descent to get new values. For example, here x is minus 2 and df/dx is minus 4, so we adapt x to a new value by moving against the gradient: with a learning rate of 1, x becomes -2 - (-4) = 2, and the output becomes (2 + 5)*(-4) = -28, which is smaller than minus 12, as we wanted. That is how we adapt x, y, and z. So you can see there are two information flows. The first is the feedforward pass: we have the inputs, we have the weights, and we compute the output; we have something like w*x + b, we get the output, and we get the loss. The second flow goes backwards: we backpropagate this loss through the network and we adapt the weights w, which in this small network play the role of x, y, and z. This is a very simple example of gradient descent for backpropagation learning.

And here I will give you a homework: backpropagate through this function. We have the inputs x and we have the weights; try to construct a neural network using this function and backpropagate the error function through the network. Have a try, and you will get a better understanding of backpropagation learning. (A generic end-to-end sketch, not the homework function itself, follows after the wrap-up below.)

To wrap up today's lecture: first we have seen the feedforward neural network. We have the inputs, the inputs are combined with the weights, and we get the output, the predicted value; that is the feedforward network. Then we calculate the error between the predicted value and the ground truth value, and we want to minimize this error; that is why we need to adapt the weights and the thresholds in the neural network. For that we use the gradient descent method for backpropagation: we backpropagate the error through the network and we adapt the weights of the network so that we minimize the error. Basically, this is the training: gradient descent with backpropagation is how we train our network. You will also practice backpropagation in our lab session, where you can find more details and get more insight into backpropagation in practice. In our next lecture we will introduce convolutional neural networks, which are especially useful for image recognition in computer vision, where we have images as the inputs. OK, see you in the next lecture, thank you.
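Finally, as promised above, here is a minimal end-to-end sketch of backpropagation training for a single sigmoid neuron, connecting the two information flows before the lab session. This is an illustrative stand-in, not the homework function from the slides; the toy task (learning OR), the initial weights, and the learning rate are all assumptions.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Toy data: learn OR from two binary inputs.
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

w = [0.1, -0.1]   # weights (arbitrary small initial values)
b = 0.0           # bias, playing the role of the threshold
lr = 1.0          # learning rate

for epoch in range(2000):
    for (x1, x2), t in data:
        # Forward pass: compute the actual output o.
        o = sigmoid(w[0] * x1 + w[1] * x2 + b)
        # Backward pass: for squared error E = (o - t)^2 / 2, the chain rule
        # gives dE/dw_i = (o - t) * o * (1 - o) * x_i.
        delta = (o - t) * o * (1 - o)
        w[0] -= lr * delta * x1   # move each weight against its gradient
        w[1] -= lr * delta * x2
        b -= lr * delta

print([round(sigmoid(w[0] * x1 + w[1] * x2 + b), 2) for (x1, x2), _ in data])
# approaches the OR targets [0, 1, 1, 1]
```

The forward pass computes the output and the error; the backward pass applies the chain rule, exactly as in the (x + y)*z example, and the update moves each weight against its gradient.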