Hello everyone, welcome to our computer vision module. So, 38, and today we will continue with deep learning.
Just a quick recap of what we covered in the previous lectures. In the previous few lectures we introduced deep learning. We have seen artificial neural networks, layer by layer, we have seen perceptrons, and we have seen why we need multilayer neural networks: for example, to realize the XOR function we need a multilayer neural network.
We have also seen backpropagation in the last lecture, used together with gradient descent, and we have seen one very simple example of backpropagation.
And these networks are quite general, and therefore they work for all kinds of problems in processing data. But for vision we have a very typical data type, the image. We have a 2D image, so we have an array, and to process arrays we have convolutional neural networks.
And you may remember we covered convolution in our previous lectures, at the very beginning of our module. We have seen filtering: we have seen these filters, or we call them masks, that we convolve with images to extract features, OK?
In those cases we have predefined filters or masks. We have, for example, Gaussian masks, and we specify the parameters of the Gaussian filters ourselves. Here we want the neural network to learn these weights by itself, and that's why we have the convolution layer, with these weights as its parameters.
We put them together and the network can learn its weights by using backpropagation. So, the same as before, we use backpropagation for convolutional neural networks, but here we have images as the inputs and convolution as the operation that applies the weights, OK.
So basically we still have w x + b, but here the weights are applied as a convolution. That's the connection between artificial neural networks and convolutional neural networks.
So today we will first recap what convolution is, then we will see the convolution layer, and then we put them together as convolutional neural networks.
So first let's have a quick recap of convolution. Convolution applies an operation to the image by using a filter function. We have the input image and we have the filter, and we combine them through this convolution operator. So this is the convolution operator.
So we have this image, and in 2D convolution you can see here we have this shift.
It means that, like we did for convolution and filtering, we flip the filter first, we align the filter with the image, then we do the multiplication and take the sum. That's what we did in our previous lectures on convolution and filtering.
It's the same for the convolutions in convolutional neural networks: we have one input image and we have a mask H.
What we do first is flip the rows and columns of the mask. So focus on the entries here: 1, 1, 0, 1, 0, -1. Once we flip its rows and columns, you can see this -1 will go here, and this one will go here, and this element will go here. OK, that's how we do the flipping. We do the flipping first, and then we multiply the corresponding elements in the flipped mask and the input image, OK, and we get the result.
Usually what we have is a symmetric mask or filter, so once we flip the rows and columns it is the same as itself, and we don't need to do the flipping anymore. That's usually what we have, but sometimes we may get a different matrix after we flip the rows and columns.
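As a minimal sketch of the flip, multiply and sum steps described above, written in plain Python: the 4 by 4 image and the 3 by 3 masks below are illustrative values, not the ones on the lecture slide, and the helper names `flip` and `convolve2d` are made up for this example.

```python
def flip(mask):
    """Flip a 2D mask along both rows and columns (a 180-degree rotation)."""
    return [row[::-1] for row in mask[::-1]]


def convolve2d(image, mask):
    """2D convolution without padding: flip the mask, then slide, multiply and sum."""
    flipped = flip(mask)
    f = len(mask)                      # assumes a square f-by-f mask
    rows = len(image) - f + 1
    cols = len(image[0]) - f + 1
    output = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            total = 0
            for i in range(f):
                for j in range(f):
                    total += flipped[i][j] * image[r + i][c + j]
            out_row.append(total)
        output.append(out_row)
    return output


# Illustrative values only (assumption, not taken from the slides).
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
mask = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]
symmetric_mask = [
    [1, 2, 1],
    [2, 4, 2],
    [1, 2, 1],
]

flipped_mask = flip(mask)              # the -1 column moves to the left side
result = convolve2d(image, mask)       # one number per window position
```

For the symmetric mask, `flip(symmetric_mask)` returns the same matrix, which is why the flipping step can be skipped in that case.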
Now let's have a look at how we do this convolution operation. OK, so we have this image and we have this mask, we do the convolution, and we get the output.
So I' (I prime) is the output of the convolution operation.
Before we go into the details of this convolution operation, let's have a look at some terms. So first, stride.
What is stride? It's the distance, kind of the step, when we move this mask along the input image, OK.
So when we apply this filter to the input, first we apply the filter here, OK, and then we slide the filter along the input image with a step size. If we have a stride equal to two, we will put the mask here next, so you can see the center has stepped from this point to this point.
So we have a step size of two, and we call it a stride of two. OK, that's what we have: from this window to this window it has a step size of two, so we have a stride of two.
And once we apply this stride, from this window to this one, and to this one, you can see we will have a smaller size for the output image compared to the input.
So here you can see, with this stride, we have the first window, the second window and the third window, and we do the same for the rest. Going down we also have this sliding window, so we have the next one here, the next one here, and the next one here.
And at the end we will get an output of 3 by 3. You can do it by yourself and you will find we get 3 by 3. For each sliding window we have one output. You may remember that for each convolution we multiply each element in the mask with the corresponding element in the patch of the input image, we sum them up, and we get one number for that region, OK, so one number for each window.
That's why we get a 3 by 3 output.
It means that if we have a stride larger than one, we shrink the output: we make the output image smaller than the input image. So that's what we have for stride.
And if we have a stride of three, what will happen?
Again, we have the center of this sliding window here. If we move the sliding window by three, 1, 2, 3, it will be here, right?
Where will the next one be?
It'll be 1, 2, 3, so it would be somewhere here.
It doesn't fit. We don't have any more pixels.
So it means we can't apply a 3 by 3 filter on this 7 by 7 input image with stride 3.
OK.
So it means we need to set the stride very carefully. What we usually have is a stride of 1 or 2, and something else for other input images. For the output size, what we usually have is (N - F) / stride + 1, where N is the input size and F is the filter size.
So if we have a stride of one, you can see that from this sliding window we go to the next one here, one pixel at a time, right?
And at the end we will get a 5 by 5 image. You can do it by yourself. The reason why we get this equation is that we subtract the filter size, as we start from this point. Here we don't consider any padding, OK, so we only start from this pixel, and that's why the output is smaller than the input image.
And if we apply a stride of 2, we will get an output image of 3 by 3.
And you can see that if we apply a stride of three, it will be (7 - 3) / 3 + 1, and 4 divided by 3 is not a whole number.
It won't work. OK, so it means we can't use a stride of three for this case. So we need to check the output size using this equation.
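The output size rule above can be checked with a few lines of Python. This is only a sketch: the function name `output_size` is made up for illustration, and returning `None` when the filter does not fit is one possible convention, not something the lecture prescribes.

```python
def output_size(n, f, stride):
    """Return (N - F) / stride + 1, or None when the windows do not fit evenly."""
    if (n - f) % stride != 0:
        return None                    # e.g. 7x7 input, 3x3 filter, stride 3
    return (n - f) // stride + 1


# The 7 by 7 input and 3 by 3 filter from the lecture, with strides 1, 2 and 3.
sizes = {stride: output_size(7, 3, stride) for stride in (1, 2, 3)}
```

Here `sizes` maps stride 1 to 5, stride 2 to 3, and stride 3 to `None`, matching the 5 by 5 and 3 by 3 outputs and the stride that does not fit.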
So like I mentioned before, we have padding. You have seen padding before, right? So we can have zero padding.
We can have the 7 by 7 input image here and pad zeros around the edges of this input image, these zeros you can see.
In practice it's quite common to use zero padding for the borders, but for sure you can also use the same pixel values as the edges. That's another option, but usually what we do is zero padding, OK?
And with this zero padding, the input image gets a larger size, right? It extends the input image.
So it means that in this case we can maintain the size of the input image, as we have this padding.
So again, we have a 3 by 3 filter applied with stride one. In this case, as we get a one-pixel border, the input image actually becomes 9 by 9, right? As we have the zero padding on both sides, and on the top and on the bottom, we have 9.
And if we have a stride of three, what will happen?
OK, we use the same equation as before.
So it will be 9 minus the size of our filter, which is 3, divided by the stride, plus one. So what we get is...
Um, sorry, here we have a stride of one, right?
So it is (9 - 3) / 1 + 1 = 7, which means we get 7 by 7.
Before, with a stride of 1 and without any padding, it would be 5 by 5, right?
That means we had a smaller output image. Here, as we have this padding, we can maintain the image size, so we get 7 by 7. That's why we have padding.
So this is another example. We have this input, and we get this 7 by 7 output, so the output size is equal to the input size. It means we maintain the size of the image. Sometimes, imagine we have one input image and we want to segment this image. That's the semantic segmentation we have seen before, right? It means the size of the output image has to be equal to the size of the input image, and that's why we need this padding.
And in general, it's quite common to see convolutional layers with stride one, filters of size F by F, and zero padding of (F - 1) / 2. This will preserve the size spatially.
So for example, if we have a filter size of 3, we can zero pad with one pixel.
If we have a filter size of 5 by 5, we can zero pad with two pixels along the edges. And if we have a filter size of 7, we can zero pad with three pixels along the edges. OK, that's how we can preserve the size.
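A short sketch of the padding rule above, assuming stride one and odd filter sizes. The helper names (`same_padding`, `zero_pad`, `padded_output_size`) are illustrative, and the all-ones 7 by 7 image is just a stand-in input.

```python
def same_padding(f):
    """Zero padding (F - 1) / 2 that preserves the size for stride 1 and odd F."""
    return (f - 1) // 2


def zero_pad(image, pad):
    """Surround a 2D image (list of rows) with `pad` rows and columns of zeros."""
    width = len(image[0]) + 2 * pad
    top = [[0] * width for _ in range(pad)]
    middle = [[0] * pad + list(row) + [0] * pad for row in image]
    bottom = [[0] * width for _ in range(pad)]
    return top + middle + bottom


def padded_output_size(n, f, stride, pad):
    """Output size after padding: the padded input has size N + 2 * pad."""
    return (n + 2 * pad - f) // stride + 1


image = [[1] * 7 for _ in range(7)]           # stand-in 7 by 7 input
padded = zero_pad(image, same_padding(3))     # becomes 9 by 9
paddings = {f: same_padding(f) for f in (3, 5, 7)}
preserved = padded_output_size(7, 3, 1, same_padding(3))
```

With a 3 by 3 filter the 7 by 7 image is padded to 9 by 9, and the output size comes back to 7, as in the lecture example.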
So now we have seen convolution, padding and stride, and we have seen them in our traditional computer vision lectures, right? In deep learning for vision we still use convolution. We still use these filters, these masks.
But we have another piece, and that is layers: we use these filters to construct convolutional layers.
So you can see here we have this input image.
Say we have 32 by 32 by 3: 32 is its height, 32 is its width, and 3 is the RGB, the three channels of the input, right? That's what we have seen before.
And then we have one filter.
We have a 5 by 5 by 3 filter, and you can notice the spatial size of this filter is 5 by 5. That's what we have seen before, but here we have another dimension, 3, and you can see this dimension is equal to the number of channels of the input, which is 3.
That's how we convolve this volume filter with the volume input, OK: we align this volume with one volume of the input.
OK, and then we convolve the filter with the image: we slide it over the image spatially, computing dot products between the elements in the filter and the elements in the patch of the input image.
You can see here, like I mentioned, the filter always extends the full depth of the input volume, so you can see here they are the same.
Only in this case can we align this volume against this volume, so we need to make the depth of these two the same.
And we can relate this back to artificial neural networks. As I said before, convolutional neural networks are a special form of artificial neural networks.
Here we have images as the input instead of plain numbers, but the concept is the same. Here we have the weights as these filters, so we have a 5 by 5 by 3 filter as w.
And we have x, these images, the 32 by 32 by 3, three-dimensional inputs.
And we get the outputs using this w x + b, so you can see here we have the weights and we have the offset.
OK, and the output is the result of taking a dot product between the filter and a small 5 by 5 by 3 patch of the image.
So you can see here we align this 5 by 5 by 3 volume with one volume of the input, which is also 5 by 5 by 3. They align with each other, and for each position we do the multiplication, then we sum the products up and we get the output for that specific x, so we get one number.
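To make the "one number per location" idea concrete, here is a small sketch of the w x + b computation for a single 5 by 5 by 3 patch. The random values, the bias of 0.1 and the name `filter_response` are assumptions made for illustration only.

```python
import random

F, DEPTH = 5, 3                        # filter is 5 x 5 x 3, as in the lecture


def make_volume(size, depth, value_fn):
    """Build a size x size x depth nested list, one value per element."""
    return [[[value_fn() for _ in range(depth)] for _ in range(size)]
            for _ in range(size)]


def filter_response(filt, patch, bias):
    """Dot product between a filter and an equally sized patch, plus the offset b."""
    total = bias
    for i in range(len(filt)):
        for j in range(len(filt[0])):
            for d in range(len(filt[0][0])):
                total += filt[i][j][d] * patch[i][j][d]
    return total


random.seed(0)
filt = make_volume(F, DEPTH, lambda: random.uniform(-1.0, 1.0))   # weights w
patch = make_volume(F, DEPTH, lambda: random.uniform(0.0, 1.0))   # one patch x
bias = 0.1                                                        # offset b (assumed)

num_weights = F * F * DEPTH            # 75 weights are multiplied and summed
value = filter_response(filt, patch, bias)   # a single number for this location
```

Each filter position therefore contributes exactly one scalar to the activation map, computed from 75 multiplications and one added offset.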
OK, let's have a look at how we do this for the convolutional layer. You can see here we start from the top left of the image and we slide this filter from the left to the right, and again from the top to the bottom.
Through this we will arrive at the bottom right of the image, and then we get this output.
OK, so you can see here, as we are convolving, as we are sliding this filter over the input image from top to bottom and from left to right, we generate one output, and we call it an activation map.
So you can see here, it's the same as we did before to get the spatial size: we have N as the size of the input and F as the size of the filter, and (N - F) divided by the stride, plus one, gives us the size of the output.
So we have 32 minus the filter size, which is 5, divided by the size of the stride, and here we have a stride of one, and then plus one. So we have (32 - 5) / 1 + 1 = 28.
So we get the spatial size 28 by 28. And as we said before, for each volume, so between the filter and the patch, we get only one number for each location, right?
It means that after aligning the filter at all of these locations, we get one number for each location, and we get 28 by 28 by 1 as the output activation map.
So in this case we only get a depth of one after convolving this filter over all spatial locations.
But here is a problem. In the input we have a depth of three, but in the output we only have a depth of one.
So in some sense we may lose some information, right?
So is it possible to get more information from the input image?
It is possible. Before, we only used one filter and all of the output was generated by this filter. If we apply different filters, we will generate different activation maps. So now we apply another 5 by 5 by 3 filter.
As this is the same filter size, in the same way it generates an activation map of the same output size. But remember, here we have a different filter, which means the numbers in this filter may be different from the previous one, and so we can have a different output. For example, you may remember from our previous lectures the x derivative and the y derivative. Say the first filter is the x derivative and this one is the y derivative: they will extract different features from the input image, right? This map is generated from the x derivative and this one is generated from the y derivative, so they represent different features of the input image.
And we can have several of these filters, and we can also get a depth larger than the depth of the input. For example, here we apply six filters, and in the output we get 28 by 28 by 6.
As we have six filters, we get six separate activation maps.
We stack all of these activation maps and we get the new output, or we call it a new image, with a size of 28 by 28 by 6. That's our output.
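The stacking of activation maps can be sketched in plain Python as below. The image and filter values are random placeholders and `conv_layer` is a hypothetical helper name. As a simplifying assumption, the sketch slides each filter without flipping it first, since for randomly initialized or learned weights the flip only changes which position each weight sits at.

```python
import random


def conv_layer(volume, filters, biases, stride=1):
    """Slide each F x F x D filter over an H x W x D volume (no padding).

    Returns an out_h x out_w x K volume: one activation map per filter,
    stacked along the depth axis.
    """
    h, w, d = len(volume), len(volume[0]), len(volume[0][0])
    f = len(filters[0])
    out_h = (h - f) // stride + 1
    out_w = (w - f) // stride + 1
    output = [[[0.0] * len(filters) for _ in range(out_w)] for _ in range(out_h)]
    for k, (filt, b) in enumerate(zip(filters, biases)):
        for r in range(out_h):
            for c in range(out_w):
                total = b
                for i in range(f):
                    for j in range(f):
                        for ch in range(d):
                            total += (filt[i][j][ch]
                                      * volume[r * stride + i][c * stride + j][ch])
                output[r][c][k] = total    # one number per location and filter
    return output


def random_volume(height, width, depth):
    return [[[random.uniform(-1.0, 1.0) for _ in range(depth)]
             for _ in range(width)] for _ in range(height)]


random.seed(0)
image = random_volume(32, 32, 3)                     # 32 x 32 x 3 input
filters = [random_volume(5, 5, 3) for _ in range(6)]  # six 5 x 5 x 3 filters
biases = [0.0] * 6

activation_maps = conv_layer(image, filters, biases)
shape = (len(activation_maps), len(activation_maps[0]), len(activation_maps[0][0]))
```

The resulting `shape` is (28, 28, 6): six 28 by 28 activation maps stacked into one output volume.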
And we can have different settings. Say we have six filters, they are 5 by 5 by 3 filters, with stride one, and pad two. Before, we didn't have this padding.
What output volume size will we get?
I will give you five seconds to think about it.
First of all, we have six filters, so the depth is 6.
For the rest, we have (N - F) divided by the stride, plus one, and here we also add 2P, right? That's what we get as we have padding: (N + 2P - F) / stride + 1.
So we have 32 + 2 x 2 - 5, divided by 1, plus one, and we get the spatial size, which is 32. As a result we get 32 by 32 by 6. That's our output, 32 by 32 by 6.
So that's another case, where we have a stride of 1 and a pad of 2.
So for each convolution layer we have some filters, OK, and we have the inputs and outputs. The input is actually the output of the previous layer, OK, and it has a volume of size W1 x H1 x D1, so the width, height and depth. And we have four hyperparameters: the number of filters K, which determines the depth of the output, the filter size F, the stride S, and the amount of zero padding P.
Then we can get the size of the output volume. Like we had before, we have W2 = (W1 - F + 2P) / S + 1, and H2 = (H1 - F + 2P) / S + 1. Usually what we have is W1 equal to H1 and W2 equal to H2, so the width is equal to the height and we have a square image. That's what we usually have, but remember, that's how we get the output size.
Now we have the spatial size, and we have the depth D2 as the number of filters K, and then we get the size of the output volume as W2 x H2 x D2. That's how we get the output volume size.
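The four hyperparameters and the output volume formula can be collected into one small function. This is a sketch with an illustrative name (`conv_output_volume`); raising an error for settings that do not fit is an assumption of this example, not part of the lecture.

```python
def conv_output_volume(w1, h1, d1, k, f, s, p):
    """Output volume (W2, H2, D2) of a convolution layer.

    w1, h1, d1: input width, height, depth (d1 only fixes the filter depth)
    k: number of filters, f: filter size, s: stride, p: zero padding
    """
    if (w1 - f + 2 * p) % s != 0 or (h1 - f + 2 * p) % s != 0:
        raise ValueError("filter size, stride and padding do not fit the input")
    w2 = (w1 - f + 2 * p) // s + 1
    h2 = (h1 - f + 2 * p) // s + 1
    d2 = k                             # one activation map per filter
    return w2, h2, d2


# The two settings worked through in the lecture: six 5x5x3 filters, stride 1.
no_padding = conv_output_volume(32, 32, 3, k=6, f=5, s=1, p=0)
with_padding = conv_output_volume(32, 32, 3, k=6, f=5, s=1, p=2)
```

These give (28, 28, 6) without padding and (32, 32, 6) with a pad of 2, matching the two examples above.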
And we have some common settings for convolutional layers. Usually we have K as a power of two, so we may have 32, 64, 128 or 512, OK.
And there are common settings for filter size, stride and padding, OK. We may have a filter size of 3, stride 1 and padding 1, or we have other settings. These are just the common ones we may use.
So now we have seen convolution and how it works. Basically it's exactly the same as we had before with a filter, or we call it a mask. And then we have seen the convolutional layer: we use these filters to slide over the input of this layer, the input image, we do the convolution and we get the output, OK, and we have seen how to get the size of the output. So now we have, say, one layer of a convolutional neural network. If we stack several convolution layers, we get convolutional neural networks, or in short we call it a ConvNet.
OK, that's what we have seen before, right? We have these input images and we have these outputs. Here we have these filters, six 5 by 5 by 3 filters, and we get this output.
You may notice that with this filtering we apply a linear transformation to the inputs. You may remember that in artificial neural networks we also need a nonlinear transformation of the inputs, as nonlinear features are very common in our world. So we need both a linear transformation and a nonlinear transformation of the input.
And we have different options: we have seen threshold logic units and we have seen ReLU. Here we use ReLU, the rectified linear unit, to include some nonlinear features in the outputs.
Usually what we have is this: we apply the convolution to the inputs, we apply ReLU to get the nonlinear features, and then we get the output, 28 by 28 by 6.
And then we apply another set of filters. After applying these filters we get an output, and this output will be the input for the next layer.
So we apply more filters. Here we have ten 5 by 5 by 6 filters, and we also apply ReLU-based activation functions, and then we get another layer, the output.
And then we apply some other filters and some ReLU activation functions, and we get layer after layer, OK.
So you can see convolutional neural networks are a sequence of convolution layers, interspersed with activation functions.
OK, that's what we have for convolutional neural networks.
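A brief sketch of how the shapes evolve through the two convolution layers described above, together with the ReLU nonlinearity applied element by element. The helper names are illustrative. The 24 by 24 by 10 shape of the second layer is not stated in the lecture; it follows from the same output size formula, assuming stride 1 and no padding.

```python
def relu(x):
    """Rectified linear unit: keep positive values, set negative values to zero."""
    return max(0.0, x)


def conv_shape(shape, num_filters, f, stride=1, pad=0):
    """Shape of the activation volume after one convolution layer."""
    h, w, _ = shape
    return ((h - f + 2 * pad) // stride + 1,
            (w - f + 2 * pad) // stride + 1,
            num_filters)


# (number of filters, filter size): six 5x5x3 filters, then ten 5x5x6 filters.
layers = [(6, 5), (10, 5)]

shape = (32, 32, 3)
shapes = [shape]
for num_filters, f in layers:
    shape = conv_shape(shape, num_filters, f)   # ReLU does not change the shape
    shapes.append(shape)

# ReLU applied to a few example activations (illustrative values).
activated = [relu(v) for v in (-2.0, -0.5, 0.0, 1.5)]
```

Note that the depth of each filter has to match the depth of its input: the ten filters in the second layer are 5 by 5 by 6 because the first layer outputs six activation maps.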
So this is the input image. We apply filters and activation functions, and then we get some features.
For example, we get some low-level features, these corners and these lines in the image. We apply more filters to the low-level features and then we get mid-level features, for example these ones, like the textures of the dog. Then we apply further filters so that we get high-level features like these ones. For these, we may not be able to interpret the high-level features.
So that's why we use these deep neural networks to learn these representations. They may not be interpretable, but they may get very good performance for tasks like recognition or segmentation. And then we use a classifier to get the output. What we usually have is a softmax classifier, the one we have seen before, so it has some nodes and some weights to predict the label of the input image.
For example, here we have a dog, so we predict a number: OK, this image has the class "dog". So this is a classifier. For sure, we have many other classifiers, but for classification what we usually have is a softmax layer to predict the label, OK. And that's what we have for convolutional neural networks: an input image, a few layers of filtering and activation functions, and one final classifier to predict the label. OK, that's all we have.
In these convolutional neural networks we have seen the convolutional layers, but there are two more types of layers: we have pooling and we have fully connected layers.
A pooling layer makes the representations smaller and more manageable, and it operates over each activation map independently.
For example, if we have a very large image, we may need a lot of space to process the input image. So we need this pooling layer to reduce the size of these representations so that they can be better managed.
So you can see here we have this activation map of 224 by 224 by 64, and then we apply pooling. Actually, it is essentially downsampling, OK.
We downsample the 224 by 224, and we have these 64 slices of this volume, right? For each slice, we downsample it to 112 by 112, so we downsample it by two, and it becomes one slice of the output.
So you can see here basically what we do is reduce the spatial size of each slice, and we have the output as 112 by 112 by 64.
For each downsampling, so for each slice, it generates a smaller output, and we stack them together and then we get the output.
So here is an example of pooling.
Here we have max pooling. We have this input slice of 4 by 4, and what we generate is a 2 by 2 output. For pooling, we divide the input into a 2 by 2 grid, and for each grid region we take the maximum in that region. For this part the maximum is 6, so we put 6 here. For this region we get 8, so we put it here. For this part the output is 3, and here the output is 4, so we put 4 here.
And that's how we do max pooling.
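Max pooling on a single slice can be sketched as follows. The 4 by 4 input values are an assumption: they are chosen so that the four region maxima come out as 6, 8, 3 and 4, as in the lecture, but the actual slide values are not in the transcript. The name `max_pool` is illustrative.

```python
def max_pool(slice2d, f=2, stride=2):
    """Max pooling on one square 2D slice: take the maximum of each f x f window."""
    n = len(slice2d)
    out = (n - f) // stride + 1
    return [
        [
            max(slice2d[r * stride + i][c * stride + j]
                for i in range(f) for j in range(f))
            for c in range(out)
        ]
        for r in range(out)
    ]


# Assumed example slice; only the maxima (6, 8, 3, 4) come from the lecture.
activation = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]

pooled = max_pool(activation, f=2, stride=2)
```

In a full pooling layer the same function would be applied to every depth slice independently, so the depth of the volume stays the same while the spatial size is halved.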
OK, and we also have these hyperparameters: we have the filter size and we have the stride.
For the filter we also have this sliding window, here a 2 by 2 sliding window.
And we have this stride, which is basically the step size in moving, OK, so here we have a stride of two.
And we get this output as 2 by 2, so you can take pooling as a special form of convolution, OK. We still have (N - F) divided by the stride, plus one.
You can see here we have the filter size F and the stride S as well, and we use the same formula to calculate the output size, exactly (N - F) / S + 1. Here we don't usually use padding. OK, it's not common to use zero padding for pooling layers, as pooling is about reducing the size of the activation maps, so that's why we don't usually use zero padding. And we also have some common settings for pooling.
What we usually have is a filter size of 2 by 2 and a stride of two.
Sometimes we have a filter of 3 by 3 and a stride of two.
Another important layer is the fully connected layer. Actually, this is the softmax layer I mentioned before.
So here is a problem. Before, we have these convolutional layers, and all of these activation maps are three-dimensional, OK. They have depth, they have height, they have width.
But for classification problems we need to predict the probability of each label, so basically the output is a vector, right? For example, if we have 10 classes, then for this input we have 0.3 for this class, 0.1 for this class, 0.5 for this class, and so on.
So we have a 10 by 1 vector. These numbers are the probabilities of this image belonging to these classes, OK. So for the label we have one vector as the output.
The problem is how to map this three-dimensional volume to one vector.
What we can do is stretch this volume into one vector, OK. We put all of the numbers in this volume into one vector, one by one, so we get 3072, basically 32 by 32 by 3. We stretch it into a vector, and then we connect each node in this vector to the output nodes.
So essentially this is an artificial neural network, right? We have W x, and we have 10 by 3072 weights, as we connect each node in this layer with each node in this layer, and we have these weights.
And that's the fully connected layer. That's the last layer we have. So you can see here, putting them together, we have these convolutional layers and we use ReLU to get the nonlinear features. And we have this pooling, as sometimes we want to reduce the size of the activation maps, so we apply pooling.
By stacking all of these layers together, and at the end stretching the activation map, this volume, into a vector and connecting it with the output layer, we have this fully connected layer to predict the probabilities of these classes.
Actually this is kind of a histogram, right? We have one number, one probability, for each class.
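The stretch-and-connect step can be sketched as below: a 32 by 32 by 3 volume is flattened into a 3072-element vector, multiplied by a 10 by 3072 weight matrix, and passed through softmax to give one probability per class. The random values and the zero biases are placeholders, and the helper names are illustrative.

```python
import math
import random


def flatten(volume):
    """Stretch an H x W x D volume into one long vector, element by element."""
    return [value for row in volume for pixel in row for value in pixel]


def fully_connected(x, weights, biases):
    """Compute W x + b: every input node is connected to every output node."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def softmax(scores):
    """Turn scores into probabilities that sum to one."""
    shift = max(scores)                # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


random.seed(0)
volume = [[[random.random() for _ in range(3)] for _ in range(32)]
          for _ in range(32)]                                   # 32 x 32 x 3
x = flatten(volume)                                             # 3072 numbers
weights = [[random.uniform(-0.01, 0.01) for _ in range(len(x))]
           for _ in range(10)]                                  # 10 x 3072
biases = [0.0] * 10

probabilities = softmax(fully_connected(x, weights, biases))   # one per class
predicted_class = probabilities.index(max(probabilities))
```

The ten probabilities sum to one, which is the histogram over classes mentioned above; the predicted label is the class with the largest probability.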
And this is a convolutional neural network. That's how we use a convolutional neural network to classify this input image.
To wrap up today's lecture: first we recapped the convolution and filtering operations and put them in the context of convolutional neural networks. For each convolution layer, we slide this filter along the input image from top to bottom and from left to right, and we get the outputs. For each location we get one number: we multiply the elements in the filter with the corresponding elements in the input volume, sum them, and get one number for that location, and so we get one activation map. If we apply multiple filters, we get multiple activation maps.
And we can maintain the depth of these activation maps, OK, or we can even have deeper activation maps by having different numbers of filters.
After that we put them together and we have layers of convolution.
Besides these convolution layers we also have the ReLU functions, and we have pooling layers to manage the size of the activation maps. At the end we stretch the activation map into a vector and connect it to the output layer, so we have this fully connected layer. As a whole we have convolutional neural networks.
And that's all for today's lecture. In the next one we will cover network training, OK, so we will see some technical details of training our network. OK, see you in the next one.