# Lesson 6 Deep Learning 2019: Regularization; Convolutions; Data Ethics



when we’re actually doing those weight updates in backpropagation — that’s how dropout works at training time. At test time we turn off dropout; we’re not going to do dropout anymore, because we want to be as accurate as possible. We’re not training, so we can’t cause it to overfit when we’re doing inference, so we remove dropout. But what that means is that if previously p was 0.5, then half the activations were being removed, which means now that they’re all there, our overall activation level is twice what it used to be. Therefore, in the paper, they suggest multiplying all of your weights at test time by p. Interestingly, you can dig into the PyTorch source code and find the actual C code where dropout is implemented, and here it is — you can see they’re doing something quite interesting. First of all, they do a Bernoulli trial. A Bernoulli trial means: with probability 1−p, return the value 1; otherwise return the value 0. That’s all it means. In this case p is the probability of dropout, so 1−p is the probability that we keep the activation, and we end up here with either a 1 or a 0. Then — and this is the interesting part — they divide in place (remember, a trailing underscore means in place in PyTorch) that 1 or 0 by 1−p. If it’s a 0, nothing happens — it’s still 0. If it’s a 1 and p was 0.5, that 1 now becomes 2. Finally, we multiply our input in place by this noise, this dropout mask. So in other words, in PyTorch we don’t do the change at test time; we do the change at training time, which means you don’t have to do anything special at inference time. It’s not just PyTorch — it’s quite a common pattern — but it’s kind of nice to look inside the PyTorch source code and see that dropout, this incredibly cool, incredibly valuable thing, is really just these three lines of code, which they do in C because I guess it ends up a bit faster when it’s all fused together. But lots of libraries do it in Python, and that works well as well. You could even write your own dropout layer, and it should give exactly the same results as this. That would be a good exercise to try: see if you can create your own dropout layer in Python and see if you can replicate the results that we get with this dropout layer. So that’s dropout.

In this case, we’re going to use a tiny bit of dropout on the first layer and a little bit of dropout on the next layer, and then we’re going to use special dropout on the embedding layer. Now, why do we do special dropout on the embedding layer? If you look inside the fastai source code at our tabular model, you’ll see that in the section that checks whether there are some embeddings, we call the embeddings, concatenate them into a single matrix, and then call embedding dropout. Embedding dropout is simply just a dropout — it’s just an instance of a dropout module. This kind of makes sense. For continuous variables, a continuous variable is just one column; you wouldn’t want to do dropout on that, because you’re literally deleting the existence of that whole input, which is almost certainly not what you want. But an embedding is just effectively a matrix multiply by a one-hot-encoded matrix, so it’s just another layer. So it makes perfect sense to have dropout on the output of the embedding, because you’re putting dropout on the activations of that layer — you’re basically saying, let’s delete at random some of the results of that embedding, some of those activations. So that makes sense. The other reason we do it that way is that I did very extensive experiments about a year ago where, on this dataset, I tried lots of different ways of doing kind of everything, and you can actually see it here. I put them all into a spreadsheet (Microsoft Excel, of course), put them into a pivot table to summarize them all together, to find out which different choices, hyperparameters, and architectures worked well and which worked less well, and then I created all these little graphs — little summary training graphs for different combinations of hyperparameters and architectures — and I found that there was one of them which consistently got good predictive accuracy, where the kind of

bumpiness of the training was pretty low — you can see it was just a nice smooth curve. This is an example of the kind of experiments that I do that end up in the fastai library. So embedding dropout was one of those things that I just found worked really well, and basically the results of these experiments are why it looks like this rather than something else. Well, it’s a combination of these experiments — but then, why did I do these particular experiments? Because I was very influenced by what worked well in that Kaggle prize-winners’ paper, but there were quite a few parts of that paper where I thought there were some other choices they could have made — I wondered why they didn’t. So I tried them out and found what actually works and what doesn’t work as well, and found a few little improvements. That’s the kind of experimenting you can play around with as well when you try different models and architectures: different dropouts, layer numbers, numbers of activations, and so forth.

So, having created a learner, we can type learn.model to take a look at it, and as you would expect, in there is a whole bunch of embeddings. Each of those embedding matrices tells you the number of levels of the input for each input — you can match these up with your list of cat_names. So the first one will be store, and it’s not surprising there are 1,116 stores. The second number, of course, is the size of the embedding, and that’s a number you get to choose. fastai has some defaults which actually work really, really well nearly all the time, so I almost never change them, but when you create your tabular learner you can absolutely pass in an embedding-size dictionary which maps variable names to embedding sizes, for anything where you want to override the defaults.

Then we’ve got our embedding dropout layer, and then we’ve got a batch norm layer with 16 inputs. The 16 inputs make sense because we have 16 continuous variables — the length of our continuous-variable list is 16 — so this is something for our continuous variables. Specifically it’s bn_cont over here, applied to our continuous variables, and bn_cont is a BatchNorm1d. What’s that? Well, the first short answer is: it’s one of the things I experimented with — having batch norm in this or not — and I found that it worked really well. What specifically it’s doing is extremely unclear, but let me describe it to you. It’s kind of a bit of regularization, and it’s kind of a bit of a training helper. It’s called batch normalization, and it comes from this paper.

Actually, before I do this, I want to mention one other really funny thing about dropout. I mentioned it was a master’s thesis. Not only was it a master’s thesis, and one of the most influential papers of the last ten years — it was rejected from the main neural nets conference, what was then called NIPS, now called NeurIPS. I think this is very interesting, because it’s a reminder that our academic community is generally extremely poor at recognizing which things are going to turn out to be important. Generally people are looking for stuff that’s in the field they’re working on and understand; dropout kind of came out of left field, and it’s kind of hard to understand what’s going on, so that’s kind of interesting. It’s a reminder that as you develop beyond being just a practitioner into actually doing your own research, don’t just focus on the stuff everybody’s talking about — focus on the stuff you think might be interesting, because the stuff everybody’s talking about generally turns out not to be very interesting. The community is very poor at recognizing high-impact papers when they come out. Batch normalization, on the other hand, was immediately recognized as high-impact. I definitely remember everybody talking about it in 2015 when it came out, and that was because it was so obvious: they showed this picture showing the then state-of-the-art ImageNet model, Inception — this is how long it took to get a pretty good result — and then they tried the same thing with this new thing called batch norm, and they got there way, way more quickly. And that was enough to convince pretty much everybody.

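The exercise suggested above — writing your own dropout layer that reproduces the three-line trick (Bernoulli mask, divided in place by 1−p, multiplied into the input) — might look something like this minimal sketch. The class name `CustomDropout` is mine, not fastai's or PyTorch's:

```python
import torch
import torch.nn as nn

class CustomDropout(nn.Module):
    """Inverted dropout: scale at training time so inference needs no change."""
    def __init__(self, p: float = 0.5):
        super().__init__()
        self.p = p

    def forward(self, x):
        if not self.training or self.p == 0.0:
            return x  # at inference time, dropout is a no-op
        # Bernoulli trial: 1 with probability (1 - p), otherwise 0
        mask = torch.empty_like(x).bernoulli_(1 - self.p)
        # divide in place by (1 - p): surviving activations are scaled up
        mask.div_(1 - self.p)
        # multiply the input by the dropout mask
        return x * mask

layer = CustomDropout(p=0.5)
x = torch.ones(4, 4)
layer.eval()
print(layer(x))   # eval mode: input passes through unchanged
layer.train()
print(layer(x))   # train mode: entries are either 0 or 2 (= 1 / (1 - 0.5))
```

Because the scaling happens during training, this layer — like PyTorch's own `nn.Dropout` — needs no weight adjustment at test time.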
Any questions? “In what proportion would you use dropout versus other regularization methods, like weight decay, L2 norms, etc.?” So, remember that L2 regularization and weight decay are kind of two ways of doing the same thing, and we should always use the weight decay version, not the L2 regularization version. So there’s weight decay; there’s batch norm, which kind of has a regularizing effect; there’s data augmentation, which we’ll see soon; and there’s dropout. Batch norm you pretty much always want, so that’s easy. Data augmentation we’ll see in a moment. So then it’s really dropout versus weight decay, and honestly, I have no idea — I don’t think I’ve seen anybody provide a compelling study of how to combine those two things. Can you always use one instead of the other? Why or why not? I don’t think anybody has figured that out. In practice, it seems that you generally want a bit of both: you pretty much always want some weight decay, but you often also want a bit of dropout. But honestly, I don’t know why — I’ve not seen anybody really explain why, or how to decide. So this is one of those things you have to try out and get a feel for what tends to work for your kinds of problems. I think the defaults that we provide in most of our learners should work pretty well in most situations, but definitely play around with it.

OK, the next kind of regularization we’re going to look at is data augmentation. Data augmentation is one of the least well-studied types of regularization, but it’s the kind I’m most excited about. The reason is that there’s basically almost no cost to it: you can do data augmentation and get better generalization without it taking longer to train and without underfitting — to an extent, at least. So let me explain. What we’re going to do now is come back to computer vision, and come back to our pets dataset again. So let’s load it in. In our pets dataset, the images were inside the images subfolder, and I call get_transforms as per usual. But when we call get_transforms, there’s a whole long list of things we can provide, and so far we haven’t been varying that much at all. In order to really understand data augmentation, I’m going to ratchet up all of the defaults. There’s a parameter here for the probability of an affine transform happening, and one for the probability of a lighting transform happening, and I set them both to 1, so they’re all going to get transformed: more rotation, more zoom, more lighting transforms, and more warping. What do all those mean? Well, you should check the documentation, and you do that by typing doc. That gives you the brief documentation, but the real documentation is in the docs, so I’ll click on “Show in docs”, and here it is. This tells you what all those parameters do, but generally the most interesting parts of the docs tend to be at the top, where you get the summaries of what’s going on. Here there’s something called the list of transforms, and you can see every transform has something showing you lots of different values of it — here’s brightness, for example. So make sure you read these, and remember, you can open up these notebooks and run this code yourself and get this output. All of these HTML documentation pages are auto-generated from the notebooks in the docs_src directory in the fastai repo, so you will see the exact same cats if you try this. Sylvain really likes cats, so there are a lot of cats in the documentation, and I think, because he’s been so awesome at creating great documentation, he gets to pick the cats. So, for example, looking at different values of brightness, what I do here is look for two things. The first is: for which of these levels of transformation is it still clear what the picture is a picture of? This is kind of getting to a point where it’s pretty unclear.

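The difference between the two padding styles discussed here — zero padding, which produces the artificial black borders, and reflection padding, which mirrors the pixels just inside the edge — can be seen on a tiny array. This is just an illustrative sketch using NumPy's `pad` (PyTorch's `F.pad` offers the same modes):

```python
import numpy as np

img = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

# zero padding: a border of 0s -- the "black border" effect
zero = np.pad(img, 1, mode="constant", constant_values=0)

# reflection padding: the border mirrors the pixels just inside the edge
refl = np.pad(img, 1, mode="reflect")

print(zero)
print(refl)
```

The reflected version contains only pixel values that already occur in the image, which is why augmented images padded this way look more like something you would see in the real world.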
reflected at the bottom here. So reflection padding tends to create images that are much more naturally reasonable — in the real world you don’t get black borders like this — so it does seem to work better. OK, so because we’re going to study convolutional neural networks, we’re going to create a convolutional neural network. You know how to create them, so I’ll go ahead and create one, fit it for a little bit, unfreeze it, then create a larger version of the dataset, 352 by 352, fit for a little bit more, and save it. So we have a CNN, and we’re going to try to figure out what’s going on in our CNN. The way we’re going to do that, specifically, is by learning how to create this picture. This is a heat map: a picture which shows me what part of the image the CNN focused on when it was trying to decide what this picture is. We’re going to make this heat map from scratch. We’re kind of at a point now in the course where I’m assuming that if you’ve got to this point, and you’re still here (thank you!), then you’re interested enough that you’re prepared to dig into some of these details. So we’re actually going to learn how to create this heat map with almost no fastai stuff: we’re going to use pure tensor arithmetic in PyTorch, and we’re going to use that to really understand what’s going on. To warn you: none of it’s rocket science, but a lot of it’s going to look really new, so don’t expect to get it the first time. Instead, expect to listen, jump into the notebook, try a few things, test things out — look particularly at tensor shapes, and at inputs and outputs, to check your understanding — then go back and listen again. Try it a few times, because you will get there; it’s just that there are going to be a lot of new concepts, because we haven’t done that much stuff in pure PyTorch.

OK, so what we’re going to do is have a seven-minute break, and then we’re going to come back and learn all about the innards of a CNN. So I’ll see you at 7:50.

So, let’s learn about convolutional neural networks. You know, the funny thing is, it’s pretty unusual to get close to the end of a course and only then look at convolutions. But when you think about it, knowing how batch norm works, or how dropout works, or how convolutions work, isn’t nearly as important as knowing how it all goes together, what to do with them, and how to figure out how to do those things better. But we’re at a point now where we want to be able to do things like that, and although we’re adding this functionality directly into the library, so that you can just run a function to do it, the more you do, the more you’ll find things that you want to do a little bit differently from how we do them, or there’ll be something in your domain where you think, “oh, I could do a slight variation of that.” So you’re getting to a point in your experience now where it helps to know how to do more stuff yourself, and that means you need to understand what’s really going on behind the scenes. What’s really going on behind the scenes is that we’re creating a neural network that looks a lot like this, but rather than doing a matrix multiply here and here and here, we’re instead going to do a convolution. A convolution is just a kind of matrix multiply which has some interesting properties. You should definitely check out the website setosa.io/ev (“Explained Visually”), from which we have stolen this beautiful animation — it’s actually a JavaScript thing you can play around with yourself — to show you how convolutions work. It’s showing you a convolution as we move around this little red square. So here’s a picture, a black-and-white or grayscale picture, and as this red thing moves around each 3×3 bit of the picture, it shows you a different 3×3 part, and it shows you, over here,

the values of the pixels. In fastai’s case, our pixel values are between 0 and 1; in this case, they’re between 0 and 255. So here are nine pixel values; this area’s pretty white, so they’re pretty high numbers. As we move around, you can see the nine big numbers change, and you can also see their colors change. Up here is another nine numbers, and you can see those in the little x1, x2, x1 — here, 1 2 1. What you might see going on is that as we move this little red block, as these numbers change, we multiply them by the corresponding numbers up here. So let’s start using some nomenclature: the thing up here we’re going to call the kernel — the convolutional kernel. We’re going to take each little 3×3 part of this image and do an element-wise multiplication of each of the 9 pixels that we’re mousing over with each of the 9 items in our kernel, and once we multiply each pair together, we add them all up. That’s what’s shown on the right: as the little bunch of red things moves over there, you can see there’s one red thing that appears over here. The reason there’s one red thing over here is that each set of 9, after going through the element-wise multiplication with the kernel, gets added together to create one output. Therefore, the size of this output image has one pixel less on each edge than the original — see how there are black borders on it? That’s because at the edge, the 3×3 kernel can’t go any further, so the farthest you can go is to end up with a dot in the middle, just off the corner. OK, so why are we doing this? Well, perhaps you can see what’s happened: this face has turned into some white parts outlining the horizontal edges. How? Well, the “how” is just by doing this element-wise multiplication of each set of nine pixels with this kernel, adding them together, and sticking the result in the corresponding spot over here. Why is that creating white spots where the horizontal edges are? Well, let’s think about it. Let’s look up here: if we’re just in this little bit here, then the spots above it are all pretty white, so they have high numbers. So the bits above — big numbers — are getting multiplied by (1 2 1), which is going to create a big number. The ones in the middle are all zeros, so we don’t care about those. And the ones underneath are all small numbers, because they’re all close to zero, so that really doesn’t do much at all. Therefore, that little spot there is going to end up bright white. Whereas on the other side, right down here, you’ve got light pixels underneath, so they’re going to get multiplied by a lot of negatives, and dark pixels on top, which are very small, so not much happens — and therefore over here, we’re going to end up with a very negative output. So this thing — where we take each 3×3 area, element-wise multiply it with a kernel, and add each of those up together to create one output — is called a convolution. That’s it. That’s a convolution. This might look familiar to you, because what we did back a while ago is look at that Zeiler and Fergus paper, where we saw each different layer and visualized what the weights were doing — and remember how the first layer was basically finding diagonal edges and gradients? That’s because that’s what a convolution can do; each of those layers is just a convolution, so the first layer can do nothing more than this kind of thing. But the nice thing is, the next layer can take the results of this and combine channels. (The output of one convolutional filter is called a channel.) So it can take one channel that found top edges, and another channel that finds left edges, and then the layer above that can take those two as input and create something that finds top-left corners, as we saw when we looked at those Zeiler

and Fergus visualizations. So let’s take a look at this from another angle — or, quite a few other angles. We’re going to look at a fantastic post from Matt Kleinsmith, who was actually a student in the first year that we did this course, and he wrote this as part of his project work back then. What he shows here is: here’s our image, a 3×3 image, and our kernel is a 2×2 kernel. What we’re going to do is apply this kernel to the top-left 2×2 part of this image, so the pink bit will be correspondingly multiplied by the pink bit, the green by the green, and so forth, and they all get added up together to create this top-left point of the output. In other words, P = αA + βB + γD + δE, there it is, plus b, which is a bias — that’s fine, that’s just a normal bias. So you can see how basically each of these output pixels is the result of a different linear equation. That makes sense. And you can see these same four weights are being moved around, because this is our convolutional kernel. Here’s another way of looking at it, which is the classic neural network view: P is now the result of multiplying every one of these inputs by a weight and adding them all together — except the gray ones have to have a value of zero, because remember, P was only connected to A, B, D, and E. In other words, remembering that this represents a matrix multiplication, we can represent this as a matrix multiplication: here’s our list of pixels in our 3×3 image flattened out into a vector, and here’s a matrix-vector multiplication, plus bias, with a whole bunch of entries that we just set to zero. You can see here we’ve got zero, zero, zero, which corresponds to zero, zero, zero over there. So in other words, a convolution is just a matrix multiplication where two things happen: some of the entries are set to zero all the time, and all of the entries that are the same color always have the same weight. When you’ve got multiple things with the same weight, that’s called weight tying. So clearly we could implement a convolution using matrix multiplication, but we don’t, because it’s slow; in practice, our libraries have specific convolution functions that we use, and they’re basically doing this — which is this, which is this equation, which is the same as this matrix multiplication. And as we discussed, we have to think about padding, because if you have a 3×3 kernel and a 3×3 image, that can only create one pixel of output — there’s only one place that 3×3 can go. So if we want to create more than one pixel of output, we have to do something called padding, which is to put additional numbers all around the outside. What most libraries do is put a layer of zeros all around the outside; for a 3×3 kernel, a single zero on every edge. Once you’ve padded it like that, you can now move your 3×3 kernel all the way across, and get the same output size that you started with. Now, as we mentioned, in fastai we don’t normally use zero padding — where possible we use reflection padding — although for these simple convolutions we often use zero padding, because it doesn’t matter too much; in a big image, it doesn’t make too much difference. OK, so that’s what a convolution is. A convolutional neural network wouldn’t be very interesting if it could only find top edges, so we have

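The operation described above — element-wise multiply each 3×3 patch by the kernel, sum, and write one output pixel — can be sketched directly. This is a slow, illustrative implementation (real libraries use optimized convolution routines, as noted above); the function name and the 6×6 test image are mine:

```python
import numpy as np

def conv2d_single(img, kernel, pad=0):
    """Slide a kernel over img: element-wise multiply each patch and sum."""
    if pad:
        img = np.pad(img, pad, mode="constant")  # zero padding
    kh, kw = kernel.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # one output pixel = sum of (patch * kernel)
            out[i, j] = (img[i:i+kh, j:j+kw] * kernel).sum()
    return out

# a top-edge detector like the (1 2 1 / 0 0 0 / -1 -2 -1) kernel in the animation
top_edge = np.array([[ 1.,  2.,  1.],
                     [ 0.,  0.,  0.],
                     [-1., -2., -1.]])

img = np.zeros((6, 6))
img[:3] = 1.0  # bright top half, dark bottom half: one horizontal edge

print(conv2d_single(img, top_edge).shape)         # (4, 4): shrinks by 2 without padding
print(conv2d_single(img, top_edge, pad=1).shape)  # (6, 6): same size with padding
```

The output lights up only along the row where bright pixels sit above dark pixels, exactly the behavior the transcript walks through, and the shapes show why padding is needed to preserve the output size.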
bottom-right edges. OK, so that’s my tensor. Now, one complexity is that that 3×3 kernel can’t be used for this purpose as-is, because I need two more dimensions. First, I need a third dimension to say how to combine the red, green, and blue. So what I do is call .expand: this is my 3×3, and I pop another 3 on the front. What .expand does is create a 3×3×3 tensor by simply copying this one three times. (Honestly, it doesn’t actually copy it — it pretends to have copied it, but just refers to the same block of memory, so it copies it in a memory-efficient way.) So this here is now three copies of that, and the reason for that is that I want to treat red, green, and blue the same way for this little manual kernel I’m showing you. Then we need one more axis, because rather than actually having separate kernels — I’ve kind of printed these as if they were multiple kernels — what we actually use is a rank-4 tensor, where the very first axis is for every separate kernel that we have. In this case, I’m just going to create one kernel, so to do a convolution I still have to put this unit axis on the front. You can see k.shape is now (1, 3, 3, 3): it’s a 3×3 kernel, there are three of them, and that’s just the one kernel that I have. It takes a while to get a feel for these higher-dimensional tensors, because we’re not used to writing out 4D tensors, but just think of it like this: a 4D tensor is just a bunch of 3D tensors sitting on top of each other. OK, so this is our 4D tensor, and then you can just call conv2d, passing in an image and the kernel. The image I’m going to use is the first image from my validation dataset. There’s one more trick, which is that in PyTorch, pretty much everything expects to work on a mini-batch, not on an individual thing, so in our case we have to create a mini-batch of size 1. Our original image is 3 channels by 352 by 352, height by width — remember, PyTorch is channel by height by width. To make a mini-batch, I need a rank-4 tensor where the first axis is 1 — in other words, a mini-batch of size 1, because that’s what PyTorch expects. There’s something you can do in both PyTorch and NumPy, which is that you can index into an array or a tensor with the special value None, and that creates a new unit axis at that point. So t is my image of dimensions 3 by 352 by 352; t[None] is a rank-4 tensor — a mini-batch of one image — of 1 by 3 by 352 by 352. And now I can call conv2d and get back, OK, specifically my Maine Coon. So that’s how you can play around with convolutions yourself.

So how are we going to use this to create a heat map? This is where things get fun. Remember what I mentioned: I basically have my input — red, green, blue — and it goes through a bunch of convolutional layers (let’s draw a little line to represent a convolutional layer) to create activations which have more and more channels, and smaller and smaller heights and widths, until eventually — remember, we looked at the summary — we ended up with something which was 11 by 11 by 512. And there’s a whole bunch more layers

that we skipped over. Now, there are 37 classes — remember, data.c is the number of classes we have — and we can see that at the end here we end up with 37 features in our model. That means we end up with a probability for every one of the 37 breeds of cat and dog: a vector of length 37. That’s the final output we need, because that’s what we’re going to compare (implicitly) to our one-hot-encoded target, which will have a 1 in the location for Maine Coon. So somehow we need to get from this 11 by 11 by 512 to this 37, and the way we do it is by taking the average of every one of these 11 by 11 faces — we just take the mean. We take the mean of this first face, which gives us one value; then we take the second of the 512 faces and take its mean, which gives us one more value; and we do that for every face, which gives us a 512-long vector. Now all we need to do is pop that through a single matrix multiply of 512 by 37, and that gives us an output vector of length 37. This step, where we take the average of each face, is called average pooling. So let’s go back to our model and take a look. There it is: here’s our final 512, and here — we’ll talk about what concat pooling is in part two; for now we’ll just focus on this, which is a fastai speciality; everybody else just does this — is an average pool, AdaptiveAvgPool2d with an output size of one. And then again there’s a bit of a special fastai thing, in that we actually have two layers here, but normally people just have the one linear layer with an input of 512 and an output of 37. So what that means is that this little box over here, where we want a 1 for Maine Coon — we need a spot over here with a high value in that place, so that the loss will be low. If we’re going to have a high value there, the only way to get it with this matrix multiplication is for it to be a simple weighted linear combination of all of the 512 values here. So if we’re going to be able to say, “I’m pretty confident this is a Maine Coon,” just by taking a weighted sum of a bunch of inputs, those inputs are going to have to represent features like: how fluffy is it, what color is its nose, how long are its legs, how pointy are its ears — all the kinds of things that can be used. Because for the other thing it figures out — is this a bulldog? — it’s going to use exactly the same 512 inputs with a different set of weights. That’s all a matrix multiplication is: just a bunch of weighted sums, a different weighted sum for each output. So we know that these potentially dozens or even hundreds of layers of convolutions must have eventually come up with an 11 by 11 face for each of these features, saying, in this little bit here, how much is that part of the image like a pointy ear, how much is it fluffy, how much is it like a long leg, how much is it like a very red nose. So that’s what all of those faces must represent — each of them represents a different feature, so the outputs of these we can think of as different features. What we really want to know, then, is not so much the average across the 11 by 11 to get this set of

outputs, but what’s in each of these 11 by 11 spots. So what if, instead of averaging across the 11 by 11, we instead average across the 512? If we average across the 512, that gives us a single 11 by 11 matrix, and each grid point in that 11 by 11 matrix will be the average of how activated that area was when it came to figuring out that this was a Maine Coon — how many signs of Maine-Coon-ishness were there in that part of the 11 by 11 grid. And that’s exactly what we do to create our heat map. I think maybe the easiest way is to work backwards. Here’s our heat map, and it comes from something called average activations, and it’s just a little bit of matplotlib and fastai: fastai to show the image, and then matplotlib to take the heat map which we passed in, which was called average activations (hm for heat map). alpha=0.6 means make it a bit transparent; in matplotlib, extent means expand it from 11 by 11 to 352 by 352; use bilinear interpolation so it’s not all blocky; and use a different color map to highlight things. That’s just some matplotlib — it’s not important. The key thing here is that average activations is the 11 by 11 matrix we wanted: its shape is 11 by 11. To get there, we took the mean of the activations across dimension 0. As I just said, in PyTorch the channel dimension is the first dimension, so the mean across dimension 0 took us from something of size 512 by 11 by 11 to, as promised, something of 11 by 11. So the activations variable contains the activations we’re averaging. Where did they come from? They came from something called a hook. A hook is a really cool, more advanced PyTorch feature that lets you — as the name suggests — hook into the PyTorch machinery itself and run any arbitrary Python code you want. It’s a really amazing and nifty thing, because normally, when we do a forward pass through a PyTorch module, it gives us this set of outputs, but we know that in the process it calculated these. So what I would like to do is hook into that forward pass and tell PyTorch: hey, when you calculate this, can you store it for me, please? So what is “this”? It’s the output of the convolutional part of the model — the convolutional part of the model is everything before the average pool, which is basically all of that. Thinking back to transfer learning: remember, with transfer learning we cut off everything after the convolutional part of the model and replaced it with our own little bit. With fastai, the original convolutional part of the model is always going to be the first thing in the model. In this case, I’m taking my model and just calling it m, and you can see m is this big thing, but always — at least in fastai — m[0] will be the convolutional part of the model. In this case, we created a ResNet-34, so the main part of the ResNet-34 — the pre-trained bit we hold onto — is in m[0], and this is basically it: this is a printout of the ResNet-34, and at the end of it there are the 512 activations. In other words, what we want to do is grab m[0] and hook its output. This is such a useful thing to be able to do that fastai has actually created something to do it for you: you literally say hook_output and pass in the PyTorch module whose output you want to hook, and most likely the thing you want to
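fastai's hook_output is a wrapper over PyTorch's raw `register_forward_hook` mechanism, so the whole heat-map recipe above — hook the convolutional body, run a forward pass, then average the stored activations across the channel dimension — can be sketched in plain PyTorch. The tiny model below is an invented stand-in for the ResNet-34 body and head; only the shapes matter for the illustration:

```python
import torch
import torch.nn as nn

# a tiny stand-in for the convolutional "body" (m[0]) and the pooling/linear head
body = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                     nn.Conv2d(8, 16, 3, padding=1), nn.ReLU())
head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 37))
model = nn.Sequential(body, head)

stored = {}
def hook_fn(module, inputs, output):
    stored["acts"] = output.detach()   # store the body's activations

handle = model[0].register_forward_hook(hook_fn)  # hook m[0], the conv part

t = torch.randn(3, 11, 11)   # one image: channels x height x width
x = t[None]                  # t[None] adds a unit axis: a mini-batch of size 1
_ = model(x)                 # the forward pass fills stored["acts"]
handle.remove()              # tidy up: don't leave hooks dangling

acts = stored["acts"][0]     # 16 x 11 x 11 activations from the body
heatmap = acts.mean(0)       # average across channels (dim 0) -> 11 x 11
print(heatmap.shape)
```

Averaging over dimension 0 here is exactly the step described above: collapsing the channel faces into a single grid that says how activated each spatial location was.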