### Lesson 6: Deep Learning 2019 - Regularization; Convolutions; Data ethics

All right, welcome to lesson 6, where we’re going to do a deep dive into computer vision, convolutional neural networks, and what a convolution is. We’re also going to learn the final regularization tricks, after last lesson learning about weight decay and L2 regularization.

I want to start by showing you something that I’m really excited about and that I’ve had a small hand in helping to create. For those of you that saw my talk on ted.com, you might have noticed this really interesting demo that we did about four years ago, showing a way to quickly build models with unlabeled data. It’s been four years, but we’re finally at a point where we’re ready to put this out in the world and let people use it, and the first people we’re going to let use it are you folks. The company is called platform.ai, and the reason I’m mentioning it here is that it’s going to let you create models on different types of datasets to what you can do now, that is to say datasets that you don’t have labels for yet. We’re actually going to help you label them. This is the first time this has been shown, so I’m pretty thrilled about it. Let me give you a quick demo.

If you go to platform.ai and choose "get started", you’ll be able to create a new project. When you create a new project, you can either upload your own images or start with one of the existing collections if you want to play around. If you upload your own, 500 or so works pretty well; you can upload a few thousand, but to start, upload 500 or so. They all have to be in a single folder, and we’re assuming that you’ve got a whole bunch of images that you haven’t got any labels for. I’ve started with the cars collection, kind of going back to what we did four years ago. This is what happens when you first go into platform.ai and look at the collection of images you uploaded: a random sample of them will appear on the screen, and as you’ll probably recognize, they are projected from a deep learning
space into a 2D space using a pretrained model. For this initial version it’s an ImageNet model we’re using; as things move along, we’ll be adding more and more pretrained models. What I want to do is add labels to this dataset representing which angle a photo of the car was taken from, which is something that ImageNet is actually going to be really bad at, isn’t it? ImageNet has learned to recognize the difference between cars versus bicycles, and ImageNet knows that the angle you take a photo from actually doesn’t matter. So we want to try and create labels using the kind of thing that ImageNet specifically learned to ignore.

For the projection that you see, we can click these layer buttons at the top to switch to a projection using a different layer of the neural net. Here’s the last layer, which is going to be a total waste of time for us because it’s really going to be projecting things based on what kind of thing it thinks it is. The first layer is probably going to be a waste of time for us as well, because there’s very little interesting semantic content there. But if I go into the middle, in layer 3, we may well be able to find some differences there.

Then what you can do is click on the projection button here, and you can actually just press up and down, rather than pressing the arrows at the top, to switch between projections, or left and right to switch between layers. You can basically look around until you notice a projection which has kind of separated out things you’re interested in. In this one I notice that it’s got a whole bunch of cars that are kind of from the front right over here. If we zoom in a little bit we can double-check: yeah, that looks pretty good, they’re all kind of front right. So we can click on here to go to selection mode, we can kind of grab a few, and then you should check. What we’re
doing here is trying to take advantage of the combination of human plus machine. The machine is pretty good at quickly doing calculations, but as a human I’m pretty good at looking at a lot of things at once and seeing the odd one out. In this case I’m looking for cars that aren’t front right, and by laying them all out in front of me I can do that really quickly. It’s like, okay, definitely that one. So just click on the ones that you don’t want. All right, it’s all good.

Then you can go back, and you can either put them into a new category by creating a new label, or you can click on one of the existing ones. Before I came I created a few, so here’s front right; I’ll just click on it here. There we go. That’s the basic idea: you keep flicking through different layers or projections to try and find groups that represent the things you’re interested in.

Over time you’ll start to realize that there are some things that are a little bit harder. For example, I’m having trouble finding sides. I can see over here there are a few sides, so I can zoom in here and click on a couple of them, like this one, this one, that one, and that one, and then I’ll say "find similar". This is going to look in that projection space, not just at the images that are currently displayed but at all of the images that you uploaded, to see if any of them have projections in this space which are similar to the ones I’ve selected. Hopefully we’ll find a few more of what I’m interested in, and I might be able to label a few more side images.

Now, if I want to try to find a projection that separates the sides from the front right, I can click on each of those two labels, and then over here this button is now called "switch to the projection that maximizes the distance between the labels". What this is going to do is try and find the best projection that separates out those classes. The goal here is to help me visually inspect and quickly
find a bunch of things that I can use to label. So they’re kind of the key features, and it’s done a good job. You can see down here we’ve now got a whole bunch of sides, which I can now grab, because I was having a lot of trouble finding them before. It’s always worth double-checking. It’s kind of interesting to see how the neural nets behave: there seem to be more sports cars in this group than average as well, so it’s kind of found side angles of sports cars. So I’ve got those, and now I’ll click "side", and there we go.

Once you’ve done that a few times, I find that if you’ve got a hundred or so labels, you can then click on the train model button, and it’ll take a couple of minutes and come back and show you your trained model. After it’s trained, which I did on a smaller number of labels earlier, you can switch on this vary opacity button and it’ll fade out the ones that are already predicted pretty well. It’ll also give you an estimate of how accurate it thinks the model is.

The main reason I mention this for you is so that you can now click the download button. It’ll download the predictions, which is what we hope will be interesting to most people, but what I think will be interesting to you as deep learning students is that it’ll download your labels. So now you can use that labeled subset of data, along with the set that you haven’t labeled yet, to see if you can build a better model than platform.ai has done for you, and see if you can use that initial set of data to get going creating models of things which you weren’t able to label before.

Clearly there are some things that this system is better at than others. For things that require really zooming in closely and taking a very, very close inspection, this isn’t going to work very well. It’s really designed for things that the human eye can pick up fairly
readily. But we’d love to get feedback as well; you can click on the Help button to give feedback, and there’s also a platform.ai discussion topic in our forum. Arshak, if you can stand up: Arshak is the CEO of the company, and he’ll be there helping out, answering questions and so forth. So I hope people find that useful. It’s been many years getting to this point, and I’m glad we’re finally there.

One of the reasons I wanted to mention this today is that we’re going to be doing a big dive into convolutions later in this lesson, so I’m going to circle back to this to try and explain a little bit more about how it is working under the hood and give you a sense of what’s going on.

But before we do, we have to finish off last week’s discussion of regularization. We were talking about regularization specifically in the context of the tabular learner: this was the init method in the tabular learner, and our goal was to understand everything here, and we’re not quite there yet. Last week we were looking at the adult dataset, which is a really simple, kind of over-simple dataset that’s just for toy purposes. So this week let’s look at a dataset that’s much more interesting: a Kaggle competition dataset. That way we know what the best in the world is, and Kaggle competition results tend to be much harder to beat than academic state-of-the-art results, because a lot more people work on Kaggle competitions than on most academic datasets. So it’s a really good challenge to try and do well on a Kaggle competition dataset.

This one is the Rossmann dataset. They’ve got 3,000 drugstores in Europe, and you’re trying to predict how many products they’re going to sell in the next couple of weeks. One of the interesting things about this is that the test set is from a time period that is more recent than the training set. This is really common: if you want to predict things, there’s no point predicting things that are in the middle of your training set; you want to predict things in the future.

Another interesting thing about it is that the evaluation metric they provided is the root mean squared percent error. This is just a normal root mean squared error, except we take actual minus prediction divided by actual; in other words, it’s the percent error that we’re taking the root mean squared of. So there are a couple of interesting features. It’s always interesting to look at the leaderboard: the winner was 0.1, the paper that we’ve roughly
replicated was 0.105 or 0.106, and tenth place out of 3,000 was 0.11-ish, a bit less.

We’re going to skip over a little bit here. The data provided was a small number of files, but they also let competitors provide additional external data as long as they shared it with all the competitors. So in practice the dataset we’re going to use contains, I can’t remember, six or seven tables. The way that you join tables and so on isn’t really part of a deep learning course, so I’m going to skip over it, and instead I’m going to refer you to Introduction to Machine Learning for Coders, which will take you step by step through the data preparation for this. We’ve provided it for you in `rossman_data_clean`, so you’ll see the whole process there, and you’ll need to run through that notebook to create these pickle files that we read here. Can you see this in the back? Okay.

I just want to mention one particularly interesting part of the Rossmann data clean notebook, which is that you’ll see there’s something that says `add_datepart`, and I wanted to explain what’s going on here. I’ve been mentioning for a while that we’re going to look at time series, and pretty much everybody who I’ve spoken to about it has assumed that I’m going to do some kind of recurrent neural network. But I’m not. Interestingly, the main academic group that studies time series is econometrics, but they tend to study one very specific kind of time series, which is where the only data you have is a sequence of time points of one thing. In real life that’s almost never the case. Normally we would have some information about the store that it represents or the people that it represents; we’d have metadata; we’d have sequences of other things measured at similar time periods or different time periods. So most of the time I find in practice that the state-of-the-art results when it comes to
competitions on more real-world datasets don’t tend to use recurrent neural networks. Instead they tend to take the time piece, which in this case was a date we were given in the data, and add a whole bunch of metadata. In our case, for example, we were given a date, and we’ve added year, month, week of year, day of month, day of week, day of year, and then a bunch of Booleans (is it at the start or end of a month, quarter, or year), plus elapsed time since 1970, and so forth. If you run this one function, `add_datepart`, and pass it a date, it’ll add all of these columns to your dataset for you.

What that means is, let’s take a very reasonable example: purchasing behavior probably changes on payday. Payday might be the fifteenth of the month. So if you have a column here called day of month, then the model will be able to recognize every time the value is fifteen and associate it with a higher, in this case, embedding matrix value. We can’t expect a neural net to do all of our feature engineering for us. We can expect it to find nonlinearities and interactions and things like that, but for something like taking a date like this and figuring out that the fifteenth of the month is when interesting things happen, it’s much better if we can provide that information for it. So this is a really useful function to use, and once you’ve done this you can treat many kinds of time-series problems as regular tabular problems. I say many kinds, not all: if there’s very complex state involved in a time series, such as equity trading or something like that, this probably won’t be the case, or this won’t be the only thing you need. But in this case it’ll get us a really good result, and in practice, most of the time, I find this works well.

Tabular data is normally in pandas, so we just stored them as standard Python pickle files. We can read them in and take a look at the first five records. The key thing here is that, on a particular date for a particular store ID, we want to predict the number of sales. Sales is the dependent variable.

The first thing I’m going to show you is something called preprocessors. You’ve already learned about transforms. Transforms are bits of code that run every time something is grabbed from a dataset, so they’re really good for data augmentation
that we’ll learn about today, because it’s going to get a different random value every time it’s sampled. Preprocessors are like transforms, but they’re a little bit different in that they run once, before you do any training. Really importantly, they run once on the training set, and then any kind of state or metadata that’s created is shared with the validation and test sets.

Let me give you an example. When we’ve been doing image recognition and we’ve had a set of classes for all the different pet breeds and they’ve been turned into numbers, the thing that’s actually doing that for us is a preprocessor that’s being created in the background. That makes sure that the classes for the training set are the same as the classes for the validation set and the classes for the test set. We’re going to do something very similar here.

For example, let’s create a small subset of the data for playing with; this is a really good idea when you start with a new dataset. I’ve just grabbed 2,000 IDs at random, then I’m going to grab a little training set and a little test set, half and half of those 2,000 IDs, and grab five columns. Then we can play around with this nice and easy. Here are the first few rows from the training set, and you can see one of the columns is called PromoInterval; it has these strings, and sometimes it’s missing. In pandas, missing is NaN.

The first preprocessor I’ll show you is Categorify. Categorify does basically the same thing that the classes thing for image recognition does for our dependent variable: it takes these strings, finds all of the possible unique values, creates a list of them, and then turns the strings into numbers. If I call it on my training set, that’ll create categories there, and then I call it on my test set passing in `test=True`, which makes sure it’s going to use the same categories that I had before. Now when I say
`.head()` it looks exactly the same, and that’s because pandas has turned this into a categorical variable, which internally is storing numbers but externally is showing me the strings. But I can look inside PromoInterval at `cat.categories` (this is all standard pandas here) to show me a list of all of what we would call classes in fastai, or what would be called just categories in pandas. Then if I look at `cat.codes`, you can see this list here is the numbers that are actually stored: -1, -1, 1, -1, 1. What are these minus ones? The minus ones represent NaN; they represent missing. pandas uses the special code -1 to mean missing. Now, as you know, these are going to end up in an embedding matrix, and we can’t look up item -1 in an embedding matrix, so internally in fastai we add one to all of these.

Another useful preprocessor is FillMissing. Again, you can call it on the data frame, and you can call it on the test set passing in `test=True`. For anything that has a missing value, it’ll create an additional column with the column name plus `_na`, so `CompetitionDistance_na`, and it will set it to true any time the value was missing. Then what we do is replace the missing CompetitionDistance values with the median. Why do we do this? Well, very commonly the fact that something’s missing is of itself interesting; it turns out the fact that this is missing helps you predict your outcome. So we certainly want to keep that information in a convenient Boolean column so that our deep learning model can use it to predict things. But then we need CompetitionDistance to be a continuous variable so we can use it in the continuous variable part of our model. We can replace it with almost any number, because if it turns out that the missingness is important, the model can use the interaction of `CompetitionDistance_na` and CompetitionDistance to make predictions. So that’s what FillMissing does.

You don’t have to manually call preprocessors yourself. When you call any kind of item list creator, you can pass in a list of preprocessors, which you can create like this. This is saying: I want to fill missing, I want to categorify, and I want to normalize, which for continuous variables will subtract the mean and divide by the standard deviation to help it train more easily. You just say those are my procs, and then you pass them in there, and that’s it. Later on you can go `data.export` and it’ll save
all the metadata for that data bunch, so you can later load it in knowing exactly what your category codes are, exactly what median values were used for replacing the missing values, and exactly what means and standard deviations you normalized by.

The main thing you have to do if you want to create a data bunch of tabular data is tell it what your categorical variables are and what your continuous variables are. As we discussed briefly last week, your categorical variables are not just strings and things; I also include things like day of week, month, and day of month. Even though they’re numbers, I make them categorical variables, because, for example, with day of month I don’t think it’s going to have a nice smooth curve. I think that the fifteenth of the month and the first of the month and the thirtieth of the month are probably going to have different purchasing behavior to other days of the month. If I make it a categorical variable, it’s going to end up creating an embedding matrix, and those different days of the month can get different behaviors. So you’ve actually got to think carefully about which things should be categorical variables. On the whole, if in doubt, and there are not too many levels in your category (that’s called the cardinality), if your cardinality is not too high, I would put it as a categorical variable. You can always try each and see which works best.

Our final data frame that we’re going to pass in is going to be a training set with the categorical variables, the continuous variables, the dependent variable, and the date. The date we’re just going to use to create a validation set, where we’re going to say the validation set is going to be the same number of records at the end of the time period that the test set is for Kaggle. That way we should be able to validate our model nicely.

Now we can create a tabular list. This is our standard data block API that you’ve seen
a few times: from a data frame, passing in all of that information; split it into valid versus train; label with a dependent variable. And here’s something I don’t think you’ve seen before: `label_cls`. As you can see, the dependent variable is sales, and it’s not a float, it’s an int64. If this were a float, then fastai would automatically know, or guess, that you want to do a regression. But this is not a float, it’s an int, so fastai is going to assume you want to do a classification. So when we label it, we have to tell it that the class of the labels we want is a list of floats, not a list of categories, which would otherwise be the default. This is the thing that’s going to automatically turn this into a regression problem for us. Then we create a data bunch.

I wanted to remind you again about `doc`, which is how we find out more information about this stuff. In this case, all of the labeling functions in the data block API will pass on any keywords they don’t recognize to the label class. One of the things I’ve passed in here is `log`, so that’s actually going to end up in FloatList. If I go `doc(FloatList)` I can see a summary, and I can even jump into the full documentation, and it shows me here that `log` is something which, if true, takes the logarithm of my dependent variable. Why am I doing that? This is the thing that’s going to automatically take the log of my y. The reason I’m doing that is because, as I mentioned before, the evaluation metric is root mean squared percentage error, and neither fastai nor PyTorch has a root mean squared percentage error loss function built in. I don’t even know if such a loss function would work super well. But if you want to spend the time thinking about it, you’ll notice that this ratio, if you first take the log of y and y hat, becomes a difference rather than a ratio. In other words, if you take the log of y, then this roughly becomes root mean squared error. So that’s what we’re going to do: we’re going to take the log of y, and then we’re just going to use root mean squared error, which is the default for a regression problem, so we won’t even have to mention it.

The reason that we have this here is because this is so common. Basically any time you’re trying to predict something like a population or a dollar amount of sales, these kinds of things tend to have long-tailed distributions where you care more about percentage differences than exact, absolute differences. So you’re very likely to want to do things with `log=True` and to measure the root mean squared percent error.

We’ve
learned about the `y_range` before, which is going to use that sigmoid to help us get in the right range. Because this time the y values are going to have the log taken of them first, we need to make sure that the `y_range` we want is also the log. So I’m going to take the maximum of the sales column, multiply it by a little bit (remember how we said it’s nice if your range is a bit wider than the range of the data), and then take the log, and that’s going to be our maximum. Then our `y_range` will be from zero to a bit more than the maximum.

Now we’ve got our data bunch, we can create a tabular learner from it, and then we have to pass in our architecture. As we briefly discussed, for a tabular model our architecture is literally the most basic fully connected network, just like we showed in this picture: it’s an input, matrix multiply, nonlinearity, matrix multiply, nonlinearity, matrix multiply, nonlinearity, done. One of the interesting things about this is that this competition is three years old, but I’m not aware of any significant advances, at least in terms of architecture, that would cause me to choose something different to what the third-placed folks did three years ago. We’re still basically using simple fully connected models for this problem.

Now, the intermediate weight matrix is going to have to go from a 1,000-activation input to a 500-activation output, which means there are going to have to be 500,000 elements in that weight matrix. That’s an awful lot for a dataset with only a few hundred thousand rows, so this is going to overfit, and we need to make sure it doesn’t. The way to make sure it doesn’t is to use regularization, not to reduce the number of parameters. One way to do that is to use weight decay, which fastai will use automatically, and you can vary it to something other than the default if you wish. It turns out in this case we’re
going to want more regularization, so we’re going to pass in something called `ps`, which is going to provide dropout, and also this one here, `emb_drop`, which is going to provide embedding dropout. So let’s learn about what dropout is. The short version is that dropout is a kind of regularization. This is the dropout paper, by Nitish (how do we say this?) Srivastava.
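
As a rough illustration of what "dropout is a kind of regularization" looks like in code, here is a minimal sketch in plain Python. It is not the fastai or PyTorch implementation: the function name `dropout`, its arguments, and the use of plain lists are all assumptions made for illustration. It uses the "inverted" form that this lesson walks through later, where each activation is zeroed with probability p and the survivors are divided by 1 - p at training time, so nothing has to change at inference time.

```python
import random


def dropout(activations, p=0.5, training=True, rng=random):
    """Illustrative inverted dropout over a flat list of activations.

    p is the probability of dropping (zeroing) each activation.
    The name and signature are hypothetical, not a library API.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        # Inference: dropout is switched off, activations pass through unchanged.
        return list(activations)
    keep = 1.0 - p
    out = []
    for a in activations:
        # Bernoulli trial: 1 with probability (1 - p), otherwise 0.
        mask = 1.0 if rng.random() < keep else 0.0
        # Scale survivors by 1 / (1 - p) so the expected activation is unchanged.
        out.append(a * mask / keep)
    return out


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs match
    acts = [1.0] * 10
    # Training mode: every entry is either 0.0 (dropped) or 2.0 (kept and rescaled).
    print(dropout(acts, p=0.5, training=True, rng=rng))
    # Inference mode: the input comes back unchanged.
    print(dropout(acts, p=0.5, training=False, rng=rng))
```

With p = 0.5 every surviving activation is doubled, so the average activation level stays roughly the same between training and inference. A different random mask is drawn on every call, which plays the role of throwing away a different subset of activations for each mini-batch.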

It was Srivastava’s master’s thesis under Geoffrey Hinton, and this picture from the original paper is a really good picture of what’s going on. This first picture is a picture of a standard fully connected network; it’s a picture of this. What each line shows is a multiplication of an activation times a weight. When you’ve got multiple arrows coming in, that represents a sum. So this activation here is the sum of all of these inputs times all of these weights. That’s what a normal fully connected neural net looks like.

For dropout, we throw that away. At random, we throw away some percentage of the activations; not the weights, not the parameters. Remember, there are only two types of number in a neural net: parameters (also called weights, kind of) and activations. So we’re going to throw away some activations. You can see that when we throw away this activation, all of the things that were connected to it are gone too. For each mini-batch, we throw away a different subset of activations. How many do we throw away? We throw each one away with a probability p. A common value of p is 0.5.

So what does that mean? You’ll see in this case not only have they deleted at random some of these in the hidden layers, but they’ve actually deleted some of the inputs as well. Deleting the inputs is pretty unusual; normally we only delete activations in the hidden layers. So what does this do? Every time I have a mini-batch going through, I at random throw away some of the activations, and then for the next mini-batch I put them back and throw away some different ones. It means that no one activation can memorize some part of the input, because that’s what happens if we overfit: some part of the model is basically learning to recognize a particular image, or a particular item, rather than a feature in general. With dropout, it’s going to be very hard for it to do that. In fact, Geoffrey Hinton described one of the kind
of parts of the thinking behind this as follows. He said he noticed every time he went to his bank that all the tellers and staff moved around, and he realized the reason for this must be that they’re trying to avoid fraud. If they keep moving them around, nobody can specialize so much in the one thing they’re doing that they can figure out a conspiracy to defraud the bank. Now of course it depends when you ask Hinton. At other times he says that the reason for this was that he thought about how spiking neurons work. He’s a neuroscientist by training, and there’s a view that spiking neurons might help regularization, and dropout is kind of a way of matching this idea of spiking neurons. It’s interesting: when you actually ask people where the idea for some algorithm came from, it basically never comes from math; it always comes from intuition and thinking about physical analogies and things like that. So anyway, the truth is that a bunch of ideas, I guess, were all flowing around, and they came up with this idea of dropout. The important thing to know is that it worked really, really well, and so we can use it in our models to get generalization for free.

Now, too much dropout of course reduces the capacity of your model, so it’s going to underfit. So you’ve got to play around with different dropout values for each of your layers to decide. In pretty much every fastai learner there’s a parameter called `ps`, which will be the p value for the dropout for each layer. You can pass in a list, or you can pass in a single number and it’ll create a list with that value everywhere. Sometimes it’s a little different: for a CNN, for example, if you pass in a single number it will use that for the last layer and half that value for the earlier layers. We basically try to do things that represent best practice, but you can always pass in your own list to get exactly the dropout that you want.

There is an interesting feature of dropout, which is
that we talk about training time and test time; test time we also call inference time. Training time is when we’re actually doing those weight updates, the backpropagation. At training time, dropout works the way we just saw. At test time we turn off dropout. We’re not going to do dropout anymore, because we want the model to be as accurate as possible; we’re not training, so we can’t cause it to overfit when we’re doing inference. So we remove dropout. But what that means is that if previously p was 0.5, then half the activations were being removed, which means that when they’re all there, our overall activation level is twice what it used to be. So in the paper they suggest multiplying all of your weights at test time by p (where, in the paper’s notation, p is the probability of keeping a unit; at 0.5 the two are the same).

Interestingly, you can dig into the PyTorch source code and find the actual C code where dropout is implemented. Here it is, and you can see what they’re doing is something quite interesting. They first of all do a Bernoulli trial. A Bernoulli trial is: with probability 1 - p return the value 1, otherwise return the value 0. That’s all it means. In this case p is the probability of dropout, so 1 - p is the probability that we keep the activation. So we end up here with either a 1 or a 0. Then, and this is interesting, we divide in place (remember, underscore means in place) that 1 or 0 by 1 - p. If it’s a 0, nothing happens; it’s still 0. If it’s a 1 and p was 0.5, that 1 now becomes 2. Finally, we multiply in place our input by this noise, this dropout mask.

In other words, in PyTorch we don’t do the change at test time; we actually do the change at training time, which means that you don’t have to do anything special at inference time with PyTorch. It’s not just PyTorch; it’s quite a common pattern. But it’s kind of nice to look inside the PyTorch source code and see that dropout, this incredibly cool, incredibly valuable thing, is really just these three lines of code, which they do in C because I guess it ends up a bit faster when it’s all fused
together. But lots of libraries do it in Python, and that works well too. You can even write your own dropout layer, and it should give exactly the same results as this. That would be a good exercise to try: see if you can create your own dropout layer in Python, and see if you can replicate the results that we get with this dropout layer.

So that's dropout. In this case we're going to use a tiny bit of dropout on the first layer, a little bit of dropout on the next layer, and then special dropout on the embedding layer. Why do we do special dropout on the embedding layer? If you look inside the fastai source code, here is our tabular model. You'll see that in the section that checks whether there are some embeddings, we call each embedding, then we concatenate the embeddings into a single matrix, and then we call embedding dropout. Embedding dropout is simply a dropout; it's just an instance of a dropout module. This makes sense. A continuous variable is just in one column, and you wouldn't want to do dropout on that, because you're literally deleting the existence of that whole input, which is almost certainly not what you want. But an embedding is effectively just a matrix multiplied by a one-hot encoded matrix, so it's just another layer. It makes perfect sense to have dropout on the output of the embedding, because you're putting dropout on the activations of that layer. You're basically saying: let's delete, at random, some of the results of that embedding, some of those activations.

The other reason we do it that way is that I did very extensive experiments about a year ago, where on this dataset I tried lots of different ways of doing pretty much everything. You can actually see it here. I put them all in a spreadsheet (Microsoft Excel, of course) and put them into a pivot table to summarize them all together, to find out which choices, hyperparameters, and architectures worked well and which worked less well. Then I created all these little graphs, which are summary training graphs for different combinations of hyperparameters and architectures. I found that there was one of them which consistently ended up getting a good predictive accuracy, the bumpiness of the training was pretty low, and you can see it was just a nice smooth curve.

This is an example of the kind of experiments I do that end up in the fastai library. Embedding dropout was one of those things that I just found worked really well, and the results of these experiments are why it looks like this rather than something else. Well, it's a combination of these experiments. But then why did I do these particular experiments? Because I was very influenced by what worked well in that Kaggle prize winner's paper. There were quite a few parts of that paper where I thought there were other choices they could have made, and I wondered why they didn't, so I tried them out, found out what actually works and what doesn't work as well, and found a few little improvements. Those are the kinds of experiments you can play around with as well when you try different models and architectures: different dropouts, numbers of layers, numbers of activations, and so forth.

Having created a learner, we can type `learn.model` to take a look at it. As you would expect, in there is a whole bunch of embeddings. Each of those embedding matrices tells you the number of levels for each input, and you can match these with your list `cat_vars`. The first one will be store, so it's not surprising that there are 1,116 stores. The second number, of course, is the size of the embedding, and that's a number you get to choose. fastai has some defaults which work really well nearly all the time, so I almost never change them, but when you create your tabular learner you can absolutely pass in an embedding size dictionary, which maps variable names to embedding sizes, for anything where you want to override the defaults. Then we've got our embedding dropout layer, and then a batch norm layer with 16 inputs. The 16 inputs make sense
because we have 16 continuous variables: the length of `cont_names` is 16. So this is something for our continuous variables. Specifically, it's over here, `bn_cont`, on our continuous variables, and `bn_cont` is a `BatchNorm1d`. What's that? The short answer is that it's one of the things I experimented with (having batch norm or not in this model), and I found that it worked really well. Specifically what it is is extremely unclear, so let me describe it to you. It's kind of a bit of regularization, and it's kind of a bit of a training helper. It's called batch normalization, and it comes from this paper.

Actually, before I do this, I just want to mention one other really funny thing about dropout. I mentioned it was a master's thesis. Not only was it a master's thesis and one of the most influential papers of the last ten years, it was rejected from the main neural nets conference, what was then called NIPS and is now called NeurIPS. I think this is very interesting, because it's a reminder that our academic community is generally extremely poor at recognizing which things are going to turn out to be important. Generally, people are looking for stuff in the field they're working on and understand. Dropout came out of left field, and it's kind of hard to understand what's going on. So it's a reminder that as you develop beyond being just a practitioner into doing your own research, don't just focus on the stuff everybody's talking about. Focus on the stuff you think might be interesting, because the stuff everybody is talking about generally turns out not to be very interesting. The community is very poor at recognizing high-impact papers when they come out.

Batch normalization, on the other hand, was immediately recognized as high-impact. I definitely remember everybody talking about it in 2015 when it came out, and that was because it was so obvious. They showed this picture of the then state-of-the-art ImageNet model, Inception: this is how long it took them to get a pretty good result, and then they tried the same thing with this new thing called batch norm, and they did it way, way more quickly. That was enough for pretty much everybody to go, wow, this is interesting.

Specifically, they said this thing is called batch normalization and it accelerates training by reducing internal covariate shift. So what is internal covariate shift? Well, it doesn't matter, because this is one of those things where researchers came up with some intuition and some idea about a thing they wanted to try, they did it, it worked well, and they then post hoc added on some mathematical analysis to try to claim why it worked. And it turned out they were totally wrong. It took three years for people to really figure this out: in the last two months there have been two papers showing that batch normalization doesn't reduce covariate shift at all, and even if it did, that has nothing to do with why it works. I think that's an interesting insight again into why we should be focusing on being practitioners and experimentalists and developing an intuition.

What batch norm does is what you see in this picture here, in this paper. Here are steps, or batches, and here is loss. The red line is what happens when you train without batch norm: very, very bumpy. The blue line is what happens when you train with batch norm: not very bumpy at all. What that means is you can increase your learning rate with batch norm, because these big bumps represent times when you're really at risk of your set of weights jumping off into some awful part of the weight space that it can never get out of again. If it's less bumpy, you can train at a higher learning rate. So that's actually what's going on.

Here's what it is. This is the algorithm, and it's really simple. The algorithm is going to take a mini-batch. Remember, this is a layer, so the thing coming into it is activations. It's going to take in some activations, which it calls x1, x2, x3, and so forth. The first thing we
do is find the mean of those activations: the sum divided by the count is just the mean. The second thing we do is find the variance of those activations: the mean of the squared differences from the mean is the variance. Then we normalize: the values minus the mean, divided by the standard deviation, is the normalized version. It turns out that bit is actually not that important. We used to think it was; it turns out it's not. The really important bit is the next bit. We take those values and we add a vector of biases (they call it beta here). We've seen that before; we've used a bias term before, so we're just going to add a bias term as per usual. Then we're going to use another thing that's a lot like a bias term, but rather than adding it, we multiply by it. So there are these parameters, gamma and beta, which are learnable parameters. Remember, in a neural net there are only two kinds of numbers: activations and parameters. These are parameters; they're things that are learned with gradient descent. Beta is just a normal bias layer, and gamma is a multiplicative bias layer. Nobody calls it that, but that's all it is: it's just like a bias, but we multiply rather than add. That's what batch norm is; that's what the layer does.

So why is that able to achieve this fantastic result? I'm not sure anybody has exactly written this down before. If they have, I apologize for failing to cite it, because I haven't seen it. But let me explain what's actually going on here. The value of our predictions, y hat, is some function of our various weights (there could be millions of them, weight 1 up to weight 1 million), and it's also a function, of course, of the inputs to our layer. This function here is our neural net function, whatever is going on in our neural net. And then our loss, let's say it's mean squared error, is just our actuals minus our predictions, squared.

Let's say we're trying to predict movie review outcomes, and they're between 1 and 5. We've been trying to train our model, and the activations at the very end are currently between -1 and 1. So they're way off where they need to be: the scale is off, and the mean is off. What can we do? One thing we could do would be to try to come up with a new set of weights that causes the spread to increase and causes the mean to increase as well. But that's going to be really hard to do, because remember, all these weights interact in very intricate ways. We've got all those nonlinearities, and they all combine together. So just to move the outputs up is going to require navigating through this complex landscape. We use all these tricks like momentum and Adam to help us, but it still requires a lot of twiddling around to get there. That's going to take a long time, and it's going to be bumpy.

But what if we did this: what if we went times g, plus b? We've added two more parameter vectors, and now it's really easy. To increase the scale, that number g has a direct gradient to increase the scale. To change the mean, that number b has a direct gradient to change the mean. There are no interactions or complexities; it's just straight up and down, straight in and out. That's what batch norm does. Batch norm is basically making it easier to do this really important thing, which is to shift the outputs up and down, and in and out, and that's why we end up with these results.

So those details, in some ways, don't matter terribly. The really important thing to know is that you definitely want to use it, or if not it, something like it. There are various other types of normalization around nowadays, but batch norm works great. The other main normalization type we use in fastai is something called weight norm, which is a much more recent development, just in the last few months.

OK, so that's batch norm. What we do is create a batch norm layer for every continuous variable. `n_cont` is the number of continuous variables; in fastai, `n_` something always means the count of that thing, and `cont` always means continuous. Then here is where we use it: we grab our continuous variables and throw them through a batch norm layer, and over here you can see it in the model.
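The batch norm steps described above (mean, variance, normalize, then multiply by gamma and add beta) can be sketched in a few lines. This is a minimal illustration in plain Python for a single feature across one mini-batch, not the PyTorch implementation; the function name `batch_norm`, the small `eps` constant, and the example numbers are all assumptions made for the illustration.

```python
import math

def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature's activations over a mini-batch, then scale and shift.

    gamma (multiplicative) and beta (additive) stand in for the learnable
    parameters; eps is a small constant assumed here to avoid dividing by zero.
    """
    n = len(xs)
    mean = sum(xs) / n                              # sum divided by the count
    var = sum((x - mean) ** 2 for x in xs) / n      # mean of squared differences
    normed = [(x - mean) / math.sqrt(var + eps) for x in xs]
    return [gamma * z + beta for z in normed]

# Hypothetical activations that currently sit between -1 and 1.
acts = [-1.0, -0.5, 0.0, 0.5, 1.0]
out = batch_norm(acts, gamma=2.0, beta=3.0)

out_mean = sum(out) / len(out)
out_std = math.sqrt(sum((o - out_mean) ** 2 for o in out) / len(out))
print(out_mean, out_std)
```

Whatever the incoming activations look like, the output mean is set directly by beta and the output spread directly by gamma, which is the "direct gradient" point made above.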
One interesting thing is this momentum here. This is not momentum like in optimization; this is momentum as in exponentially weighted moving average. Specifically, for this mean and standard deviation, we don't actually use a different mean and standard deviation for every mini-batch. If we did, it would vary so much that it would be very hard to train. Instead, we take an exponentially weighted moving average of the mean and standard deviation. If you don't remember what I mean by that, look back at last week's lesson to remind yourself about exponentially weighted moving averages, which we implemented in Excel for the momentum and Adam gradient-squared terms.

You can vary the amount of momentum in a batch norm layer by passing a different value to the constructor in PyTorch. If you use a smaller number, the mean and standard deviation will vary less from mini-batch to mini-batch, and that will have less of a regularization effect. A larger number means the variation will be greater from mini-batch to mini-batch, and that will have more of a regularization effect. So as well as training more nicely because it's parameterized better, this momentum term in the mean and standard deviation is the thing that adds a nice regularization piece. When you add batch norm, you should also be able to use a higher learning rate.

So that's our model. Then you can go `lr_find` and have a look, then you can fit, save it, plot the losses, and fit a bit more. We end up at 0.103; tenth place in the competition was 0.108, so it's looking good. Again, take it with a slight grain of salt, because what you actually need to do is use the real training set and submit it to Kaggle. But you can see we're very much amongst the cutting edge of models, at least as of 2015. And as I say, there haven't really been any architectural improvements since then. There wasn't batch norm when this was around, so the fact that we added batch norm means we should get better results, and certainly more quickly. If I remember correctly, in their model they had to train at a lower learning rate for quite a lot longer. As you can see, this is less than 45 minutes of training, so that's nice and fast.
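The effect of the batch norm momentum value described above can be illustrated with a small sketch. It assumes the PyTorch convention, in which `momentum` is the weight given to the newest mini-batch statistic; the helper names `update_running` and `spread`, and the made-up sequence of batch means, are hypothetical.

```python
def update_running(running, batch_stat, momentum=0.1):
    """Exponentially weighted moving average of a batch statistic.

    Assumes the convention new = (1 - momentum) * old + momentum * batch.
    """
    return (1 - momentum) * running + momentum * batch_stat

def spread(batch_means, momentum):
    """Track the running mean over a run of batches and return how much it
    moved (max minus min) over the second half of the run."""
    running, history = 0.0, []
    for m in batch_means:
        running = update_running(running, m, momentum)
        history.append(running)
    tail = history[len(history) // 2:]
    return max(tail) - min(tail)

# Hypothetical noisy mini-batch means that bounce between 0 and 1.
batch_means = [0.0, 1.0] * 20
small = spread(batch_means, momentum=0.1)
large = spread(batch_means, momentum=0.9)
print(small, large)
```

With the smaller momentum the running statistic moves much less from one mini-batch to the next, which matches the description above of a smaller number giving less variation.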

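As a starting point for the dropout exercise suggested earlier (writing your own dropout layer in Python), here is a minimal sketch of the three steps read from the PyTorch source: a Bernoulli mask, a division by 1 - p, and a multiplication with the input. It uses plain Python lists rather than tensors, and the name `dropout` and its arguments are assumptions for illustration, not the library's API.

```python
import random

def dropout(activations, p=0.5, training=True, rng=None):
    """Dropout that rescales at training time, so inference needs no change."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(activations)          # inference: leave activations alone
    rng = rng or random.Random()
    keep = 1.0 - p
    out = []
    for a in activations:
        noise = 1.0 if rng.random() < keep else 0.0   # Bernoulli: 1 with probability 1 - p
        noise /= keep                                  # a 1 becomes 1 / (1 - p); a 0 stays 0
        out.append(a * noise)                          # multiply the input by the mask
    return out

rng = random.Random(0)                    # fixed seed so the sketch is repeatable
acts = [1.0] * 100_000
dropped = dropout(acts, p=0.5, rng=rng)
print(sorted(set(dropped)))               # the distinct values left after dropout
print(sum(dropped) / len(dropped))        # average activation level after dropout
```

Because the kept activations are scaled up during training, the average activation level stays roughly where it was, which is why nothing special is needed at inference time.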
Any questions?

In what proportion would you use dropout versus other regularization, like weight decay, L2 norms, etc.?

Remember that L2 regularization and weight decay are two ways of doing the same thing, and we should always use the weight decay version, not the L2 regularization version. So there's weight decay; there's batch norm, which has a regularizing effect; there's data augmentation, which we'll see soon; and there's dropout. Batch norm we pretty much always want, so that's easy. Data augmentation we'll see in a moment. So then it's really between dropout and weight decay. I have no idea. I don't think I've seen anybody provide a compelling study of how to combine those two things. Can you always use one instead of the other? Why, or why not? I don't think anybody has figured that out. In practice, it seems that you generally want a bit of both. You pretty much always want some weight decay, but you often also want a bit of dropout. Honestly, I don't know why; I've not seen anybody really explain why, or how to decide. So this is one of those things you have to try out and get a feel for what tends to work for your kinds of problems. I think the defaults we provide in most of our learners should work pretty well in most situations, but definitely play around with it.

The next kind of regularization we're going to look at is data augmentation. Data augmentation is one of the least well studied types of regularization, but it's the kind I'm most excited about. The reason is that there's basically almost no cost to it: you can do data augmentation and get better generalization without it taking longer to train and without underfitting, to an extent at least. Let me explain. We're going to come back to computer vision and to our pets dataset again, so let's load it in. In our pets dataset the images were inside the `images` subfolder. Then I call `get_transforms` as per usual, but when we call `get_transforms` there's a whole long list of things we can provide, and so far we haven't been varying that much at all. In order to really understand data augmentation, I'm going to ratchet up all of the defaults. There's a parameter here for the probability of an affine transform happening and one for the probability of a lighting transform happening, so I set them both to 1; they're all going to get transformed. There's going to be more rotation, more zoom, more lighting transforms, and more warping.

What do all those mean? You should check the documentation, and you do that by typing `doc`. That shows the brief documentation, but the real documentation is in the docs, so I'll click on "Show in docs" and here it is. This tells you what all those do, but generally the most interesting parts of the docs tend to be at the top, where you get the summaries of what's going on. Here there's something called "List of transforms", and you can see every transform has something showing you lots of different values of it. Here's brightness. Make sure you read these, and remember that these are notebooks you can open up and run yourself to get this output. All of these HTML documentation pages are auto-generated from the notebooks in the `docs_src` directory in the fastai repo, so you will see the exact same cats if you try this. Sylvain really likes cats, so there are a lot of cats in the documentation, and because he's been so awesome at creating great documentation, he gets to pick the cats.

So, for example, looking at different values of brightness, I look for two things. The first is for which of these levels of transformation it is still clear what the picture is a picture of. This one is getting to a point where it's pretty unclear; this one is possibly getting a little unclear. The second thing I do is look at the actual dataset I'm modeling, or particularly the dataset I'll be using as a validation set, and try to get a sense of what the variation (in this case, in lighting) is. If they're nearly all professionally taken photos, I would probably want them all to be about in the middle. But if the photos are taken by some pretty amateur photographers, there are likely to be some that are very overexposed and some that are very underexposed. So you should pick a value of this data augmentation for brightness that both allows the image to still be seen clearly and also represents the kind of data you're going to be using this model on in practice. It's the same thing for contrast. It would be unusual to have a dataset with such ridiculous contrast, but perhaps you do, in which case you should use data augmentation up to that level. If you don't, then you shouldn't.

This one called dihedral does every possible rotation and flip. Obviously most of your pictures are not going to be upside-down cats, so you would probably say, hey, this doesn't make sense, I won't use this for this dataset. But if you're looking at satellite images, of course you would. On the other hand, flip makes perfect sense, so you would include that.

A lot of the things you can do with fastai let you pick a padding mode, and this is what padding mode looks like. You can pick zeros; you can pick border, which just replicates; or you can pick reflection, which, as you can see, is as if the last few pixels are in a mirror. Reflection is nearly always better, by the way. I don't know that anybody else has really studied this, but we have studied it in some depth. We haven't actually written a paper about it, but we've done enough for our own purposes to say reflection works best most of the time, so that's the default. Then there's a really cool bunch of perspective
warping ones, which I'll show you using symmetric warp. We've added black borders to this so it's more obvious what's going on. As you can see, what symmetric warp is doing is as if the camera is being moved above or to the side of the object, literally warping the whole thing like that. The cool thing is that, as you can see, in each of these pictures it's as if the cat was being photographed from different angles, so they're all optically sensible. This is a really great type of data augmentation. It's also one that I don't know of any other library doing, or at least not in a way that's both fast and keeps the image crisp, as it is in fastai. So if you're looking to win a Kaggle competition, this is the kind of thing that's going to get you above the people who aren't using the fastai library.

Having looked at all that, we're going to have a little `get_data` function that just does the usual data block stuff, but we're going to add padding mode explicitly, so that we can turn on a padding mode of zeros just to see what's going on better. fastai has this handy little function called `plot_multi`, which creates a three by three grid of plots, and each one will contain the result of calling this function, which receives the plot coordinates and the axis. I'm actually going to plot the exact same thing in every box, but because this is a training dataset, it's going to use data augmentation. So you can see the same doggy with lots of different kinds of data augmentation, and you can see why this is going to work really well: these pictures all look pretty different, but we didn't have to do any extra hand labeling or anything. It's like free extra data. So data augmentation is really, really great.

One of the big opportunities for research is to figure out ways to do data augmentation in other domains. How can you do data augmentation with text data, or genomic data, or histopathology data, or whatever? Almost nobody is looking at that, and to me it's one of the biggest opportunities, one that could let you decrease data requirements by five to ten times.

Here's the same thing again, but with reflection padding instead of zero padding. You can see that this doggy's legs are actually being reflected at the bottom here. Reflection padding tends to create images that are much more naturally reasonable. In the real world you don't get black borders like this, so reflection does seem to work better.

Because we're going to study convolutional neural networks, we're going to create a convolutional neural network. You know how to create them, so I'll go ahead and create one. I'll fit it for a little bit, unfreeze it, then create a larger version of the dataset, 352 by 352, fit for a little bit more, and save it. OK, we have a CNN, and we're going to try to figure out what's going on in our CNN. The way we're going to figure it out is, specifically, by learning how to create this picture. This is a heat map: a picture that shows what part of the image the CNN focused on when it was trying to decide what this picture is. We're going to make this heat map from scratch.

We're at a point now in the course where I'm assuming that if you've got this far and you're still here (thank you), then you're interested enough that you're prepared to dig into some of these details. So we're actually going to learn how to create this heat map with almost no fastai stuff. We're going to use pure tensor arithmetic in PyTorch, and we're going to use that to really understand what's going on. To warn you, none of it is rocket science, but a lot of it is going to look really new, so don't expect to get it the first time. Expect to listen, jump into the notebook, try a few things, test things out, look particularly at things like tensor shapes and inputs and outputs to check your understanding, and then go back and listen again. Try it a few times, because you will get there. It's just that there are going to be a lot of new concepts, because we haven't done that much stuff in pure PyTorch. OK, so what we're going to do is
have a seven minute break, and then we're going to come back and learn all about the innards of a CNN. So I'll see you at 7:50.

So let's learn about convolutional neural networks. The funny thing is that it's pretty unusual to get close to the end of a course and only then look at convolutions. But when you think about it, knowing how batch norm works, or how dropout works, or how convolutions work isn't nearly as important as knowing how it all goes together, what to do with them, and how to figure out how to do those things better. We're at a point now, though, where we want to be able to do things like that. And although we're adding this functionality directly into the library, so you can run a function to do it, the more you do, the more you'll find things you want to do a little differently from how we do them, or there'll be something in your domain where you think, oh, I could do a slight variation of that. So you're getting to a point in your experience where it helps to know how to do more stuff yourself, and that means you need to understand what's really going on behind the scenes.

What's really going on behind the scenes is that we're creating a neural network that looks a lot like this, but rather than doing a matrix multiply here and here and here, we're going to do a convolution instead. A convolution is just a kind of matrix multiply which has some interesting properties. You should definitely check out this website, setosa.io/ev (Explained Visually), from which we have stolen this beautiful animation. It's actually a JavaScript thing that you can play around with yourself, and it shows you how convolutions work. It's showing you a convolution as we move around these little red squares.

Here's a picture, a black and white or grayscale picture. As this red thing moves around, it shows you a different 3×3 part of the picture, and it shows you over here the values of the pixels. In fastai's case our pixel values are between 0 and 1; in this case they're between 0 and 255. So here are nine pixel values. This area is pretty white, so they're pretty high numbers. As we move around, you can see the nine big numbers change, and you can also see their colors change. Up here is another nine numbers, and you can see those in the little ×1, ×2, ×1 here: 1, 2, 1. What you might see going on is that as we move this little red block and these numbers change, we multiply them by the corresponding numbers up here.

Let's start using some nomenclature. The thing up here we're going to call the kernel, the convolutional kernel. We're going to take each little 3×3 part of this image and do an element-wise multiplication of each of the nine pixels we're mousing over with each of the nine items in our kernel. Once we've multiplied each pair together, we add them all up, and that is what's shown on the right. As the little bunch of red things moves over there, you can see there's one red thing that appears over here. The reason there's one red thing over here is that each set of nine, after going through the element-wise multiplication with the kernel, gets added together to create one output. Therefore the size of this image is one pixel less on each edge than the original. See how there are black borders on it? That's because at the edge the 3×3 kernel can't go any further; the farthest you can go is to end up with a dot in the middle just off the corner.

So why are we doing this? Well, perhaps you can see what's happened. This face has turned into some white parts outlining the horizontal edges. How? Just by doing this element-wise multiplication of each set of nine pixels with this kernel, adding them together, and sticking the result in the corresponding spot over here. Why is that creating white spots where the horizontal edges are? Well, let's think
about it. Let's look up here. If we're just in this little bit here, then the spots above it are all pretty white, so they have high numbers. The bits above, the big numbers, are getting multiplied by 1, 2, 1, so that's going to create a big number. The ones in the middle are all multiplied by zeros, so we don't care about those. And the ones underneath are all small numbers, because they're all close to zero, so that really doesn't do much at all. Therefore that little set there is going to end up with bright white. Whereas on the other side, down here, you've got light pixels underneath, so they're going to get a lot of negative, and dark pixels on top, which are very small, so not much happens. Therefore over here we're going to end up with very negative.

This thing, where we take each 3×3 area, element-wise multiply it with a kernel, and add the results together to create one output, is called a convolution. That's it. That's a convolution.

That might look familiar to you, because a while ago we looked at that Zeiler and Fergus paper, where we saw each different layer and visualized what the weights were doing. Remember how the first layer was basically finding diagonal edges and gradients? That's because that's what a convolution can do. Each one of the layers is just a convolution, so the first layer can do nothing more than this kind of thing. But the nice thing is that the next layer can then take the results of this and combine channels (the output of one convolutional filter is called a channel). It could take one channel that finds top edges and another channel that finds left edges, and then the layer above that could take those two as input and create something that finds top-left corners, as we saw when we looked at those Zeiler and Fergus visualizations.

So let's take a look at this from another angle, or quite a few other angles. We're going to look at a fantastic post from Matt Kleinsmith, who was actually a student in the first year that we did this course, and he wrote this as part of his project work back then. What he shows is this: here's our image, a 3 by 3 image, and our kernel is a 2 by 2 kernel. We're going to apply this kernel to the top-left 2×2 part of this image. The pink bit will be multiplied by the corresponding pink bit, the green by the green, and so forth, and they all get added up together to create this top-left value in the output. In other words, P = α·A + β·B + γ·D + δ·E + b, where b is a bias. That's just a normal bias. So you can see how each of these output pixels is basically the result of a different linear equation, and you can see these same four weights being moved around, because this is our convolutional kernel.

Here's another way of looking at it, which is a classic neural network view. P is now the result of multiplying every one of these inputs by a weight and then adding them all together, except the gray ones are going to have a value of zero, because remember, P was only connected to A, B, D, and E. In other words, remembering that this represents a matrix multiplication, we can represent this as a matrix multiplication. Here is our list of pixels in our 3×3 image flattened out into a vector, and here is a matrix-vector multiplication plus a bias, and a whole bunch of the entries we're just going to set to zero. You can see here we've got 0, 0, 0, 0, 0, which corresponds to 0, 0, 0, 0, 0 over here. So in other words, a convolution is just a matrix multiplication where two things happen: some of the entries are set to zero all the time, and all of the
ones are the same color always have the same weight so when you’ve got multiple things with the same weight that’s called weight time okay so clearly we could implement a convolution using matrix multiplication but we don’t because it’s slow at swim practice our libraries have specific convolution functions that we use and they’re basically doing this which is this which is this equation which is says the same as this matrix multiplication and as we discussed we have to think about padding because if you have a 3×3 kernel and a 3×3 image then that can only create one pixel of output there’s only one place that this 3×3 can go so if we want to create more than one pixel of output we have to do something called padding which is to put additional numbers all around the outside so what most libraries do is that they just put a layer of zeros no layer a bunch of zeros of all around the outside so for 3×3 kernel a single zero on every edge piece here and so once you’ve pattern it like that you can now move your 3×3 kernel all the way across and give you the same output size that you started with okay now as we mention in fast AI we don’t normally necessarily use zero padding where possible we use reflection padding although for these simple convolutions we often use zero padding because it’s doesn’t matter too much in a big image it doesn’t make too much difference okay so that’s what a convolution is so a convolutional neural network wouldn’t be very interesting if it can only create top edges so we have
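As a small check on the matrix-multiplication view described above, here is a minimal sketch in plain Python. The pixel values, the four kernel weights, and the bias are made up for illustration; only the structure (a 3×3 image, a 2×2 kernel, tied weights, and zeros everywhere else) comes from the lesson.

```python
# 3x3 image; reading row by row the pixels are A B C / D E F / G H J.
image = [[1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0],
         [7.0, 8.0, 9.0]]
alpha, beta, gamma, delta = 0.5, -1.0, 2.0, 0.25  # the four shared kernel weights
bias = 0.1

def conv2x2(img):
    """Slide the 2x2 kernel over every 2x2 patch (no padding)."""
    out = []
    for r in range(2):
        for c in range(2):
            out.append(alpha * img[r][c] + beta * img[r][c + 1]
                       + gamma * img[r + 1][c] + delta * img[r + 1][c + 1] + bias)
    return out

# The same operation as a 4x9 weight matrix applied to the flattened image.
# Each row reuses the same four weights (weight tying); the rest stay zero.
weights = [[0.0] * 9 for _ in range(4)]
for r in range(2):
    for c in range(2):
        row = weights[r * 2 + c]
        row[r * 3 + c] = alpha
        row[r * 3 + c + 1] = beta
        row[(r + 1) * 3 + c] = gamma
        row[(r + 1) * 3 + c + 1] = delta

flat = [pixel for line in image for pixel in line]
direct = conv2x2(image)
via_matrix = [sum(w * x for w, x in zip(row, flat)) + bias for row in weights]
```

Both routes give the same four outputs, and every row of `weights` has five zeros, matching the five grayed-out connections per output in the diagram.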

To take it a little bit further: if we have an input, and it might be a standard kind of red-green-blue picture, then we can create a 3×3 kernel like so, and we could pass that kernel over all of the different pixels. But if you think about it, we actually don’t have a 2D input anymore, we have a 3D input, a rank 3 tensor. We probably don’t want to use the same kernel values for each of red and green and blue, because, for example, if we’re creating a green frog detector, we would want more activations on the green than we would on the blue. Or if we’re trying to find something that could actually find a gradient that goes from green to blue, then the kernels for each channel need to have different values in them. So therefore we need to create a 3 by 3 by 3 kernel.

This is still our kernel, and we’re still going to vary it across the height and the width, but rather than doing an element-wise multiplication of 9 things, we’re going to do an element-wise multiplication of 27 things, 3 by 3 by 3, and we’re still going to then add them up into a single number. As we pass this cube over this, and the little bit that’s going to be sitting behind it, as we do that part of the convolution, it’s still going to create just one number, because we do an element-wise multiplication of all 27 and add them all together. We can do that across the whole single-unit-padded input. We started with one, two, three, four, five by five, so we’re going to end up with an output that’s also 5 by 5. But now our input was three channels and our output is only one channel.

Now, we’re not going to be able to do very much with just one channel, because all we’ve done is found the top edge. How are we going to find a side edge, and a gradient, and an area of constant white? Well, we’re going to have to create another kernel, and we’re going to have to convolve that over the input, and that’s going to create another 5×5. Then we can just stack those together across another axis, and we can do that lots and lots of times, and that’s going to give us another rank 3 output.

So that’s what happens in practice. In practice we start with an input which is H by W by (for images) 3. We pass it through a bunch of convolutional kernels, and we get to pick how many we want, and it gives us back an output of height by width by however many kernels we had. Often that might be something like 16 in the first layer. So now we’ve got 16 channels, they’re called, representing things like how much left edge was on this pixel, how much top edge was on this pixel, how much blue-to-red gradient was on this set of nine pixels, each with RGB.

Then you can just do the same thing. You can have another bunch of kernels, and that’s going to create another output rank 3 tensor, again height by width by whatever, it might still be 16. Now, what we really like to do is, as we get deeper in the network, we actually want to have more and more channels. We want to be able to find a richer and richer set of features, so that after a few layers we get what we saw in the Zeiler and Fergus paper.
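The multi-channel convolution described above can be sketched with NumPy loops. This is a deliberately slow, illustrative version with random stand-in values, not how a library implements it; the 5×5 RGB input and the 16 kernels follow the numbers used in the lesson, and the channel-first layout is a choice made here to match PyTorch’s convention.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((3, 5, 5))        # channels x height x width (RGB, 5x5)
kernels = rng.random((16, 3, 3, 3))  # 16 kernels, each 3 channels x 3 x 3

def conv_layer(img, ks):
    """Zero-pad by one pixel, then slide every 3x3x3 kernel over the image."""
    n_kernels = ks.shape[0]
    _, h, w = img.shape
    padded = np.pad(img, ((0, 0), (1, 1), (1, 1)))  # pad height and width only
    out = np.zeros((n_kernels, h, w))
    for k in range(n_kernels):
        for i in range(h):
            for j in range(w):
                patch = padded[:, i:i + 3, j:j + 3]   # 3x3x3 = 27 values
                out[k, i, j] = np.sum(patch * ks[k])  # multiply and add up
    return out

activations = conv_layer(image, kernels)
print(activations.shape)  # (16, 5, 5)
```

Each kernel turns the three input channels into one 5×5 output channel, and stacking the 16 results along a new axis gives the rank 3 output.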

In the Zeiler and Fergus paper, by layer four or five we’ve kind of got eyeball detectors and fur detectors and things, right? So we really need a lot of channels.

So in order to avoid our memory going out of control, from time to time we create a convolution where we don’t step over every single set of 3 by 3, but instead we skip over two at a time. We would start with a 3 by 3 centered at (2, 2), and then we’d jump over to (2, 4), (2, 6), (2, 8), and so forth. That’s called a stride 2 convolution. It looks exactly the same, it’s still just a bunch of kernels, but we’re just jumping over two at a time; we’re skipping every alternate input pixel. So the output from that will be H/2 by W/2, and when we do that, we generally create twice as many kernels, so we can now have, say, 32 activations in each of those spots. That’s what modern convolutional neural networks tend to look like.

We can actually see that if we go into our pets notebook and grab our CNN. We’re going to take a look at this particular cat: if we go `x,y` equals the validation dataset at some index, so let’s just grab the 0th, we’ll go `x.show()` and print out the value of `y`. Apparently this cat is of category Maine Coon. Until a week ago I was not at all familiar that there’s a cat called a Maine Coon; having spent all week with this particular cat, I am now deeply familiar with this Maine Coon.

If we go `learn.summary()`, remember that the input we asked for was 352 by 352 pixels. Generally speaking, the very first convolution tends to have a stride of 2, so after the first layer it’s 176 by 176. `learn.summary()` will print out for you the output shape up to every layer: 176 by 176, and the first set of convolutions has 64 activations. We can actually see that if we type in `learn.model`: you can see here it’s a `Conv2d` with 3 input channels and 64 output channels, and it’s stride of 2. Interestingly, it actually starts with a kernel size of 7 by 7. Nearly all of the convolutions are 3 by 3, see, they’re all 3 by 3, but for reasons we’ll talk about in part 2 we often use a larger kernel for the very first one. If you use a larger kernel, you have to use more padding, so we have to use kernel size integer-divided by 2 as the padding to make sure we don’t lose anything.

Anyway, we now have 64 output channels, and since it was stride 2, it’s now 176 by 176. Then as we go along, you’ll see that from time to time we go from 88 by 88 to 44 by 44 for the grid size, so that was a 2D conv, and when we do that we generally double the number of channels. We keep going through a few more convs, and as you can see they’ve got batch norm and ReLUs, that’s kind of pretty standard. Eventually we do it again, another stride 2 conv, which again doubles the channels. We now have 512 by 11 by 11, and that’s basically where we finish the main part of the network: we end up with 512 channels, 11 by 11.

Okay, so we’re actually at a point where we’re going to be able to do this heat map now, so let’s try and work through it. Before we do, I want to show you how you can do your own manual convolutions, because it’s kind of fun. We’re going to start with this picture of a Maine Coon, and I’ve created a convolutional kernel. As you can see, this one has a right edge and a bottom edge with positive numbers, and just inside that it’s got negative numbers, so I’m thinking this should show me the edges at the bottom right.
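The grid sizes read off the summary above follow from a little arithmetic. The sketch below assumes padding of kernel size integer-divided by 2, as described in the lesson; the helper name `conv_out_size` and the list of downsampling steps are illustrative assumptions, not a printout of the real model. (In the ResNet family the second downsampling step is a max-pooling layer rather than a convolution, but its size arithmetic is the same.)

```python
def conv_out_size(size, kernel_size, stride):
    """Output width/height of a convolution padded with kernel_size // 2."""
    padding = kernel_size // 2
    return (size + 2 * padding - kernel_size) // stride + 1

size = 352
# (kernel size, stride, output channels) for the downsampling steps only.
# These are illustrative choices, not the full list of layers.
downsampling_steps = [(7, 2, 64), (3, 2, 64), (3, 2, 128), (3, 2, 256), (3, 2, 512)]

shapes = []
for kernel_size, stride, channels in downsampling_steps:
    size = conv_out_size(size, kernel_size, stride)
    shapes.append((channels, size, size))

for shape in shapes:
    print(shape)
```

With stride 1 the same formula returns the input size unchanged, which is why the 3×3 convolutions in between do not alter the grid.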

Bottom-right edges, okay. So that’s my tensor. Now, one complexity is that that 3×3 kernel cannot be used for this purpose, because I need two more dimensions. The first is I need the third dimension to say how to combine the red, green, and blue. So what I do is I say `.expand`: this is my 3×3, and I pop another 3 on the start. What `.expand` does is it says: create a 3 by 3 by 3 tensor by simply copying this one 3 times. I mean, honestly it doesn’t actually copy it, it pretends to have copied it, but it just basically refers to the same block of memory, so it kind of copies it in a memory-efficient way. So this one here is now 3 copies of that, and the reason for that is that I want to treat red and green and blue the same way for this little manual kernel I’m showing you.

Then we need one more axis, because rather than actually having separate kernels, like I’ve kind of printed these as if they were multiple kernels, what we actually do is use a rank 4 tensor, and the very first axis is for every separate kernel that we have. In this case I’m just going to create one kernel, so to do a convolution I still have to put this unit axis on the front. You can see `k.shape` is now 1, 3, 3, 3: it’s a 3×3 kernel, there are three of them, and then that’s just the one kernel that I have. It takes a while to get the feel for these higher-dimensional tensors, because we’re not used to writing out a 4D tensor, but just think of them like this: a 4D tensor is just a bunch of 3D tensors sitting on top of each other.

Okay, so this is our 4D tensor, and then you can just call `conv2d`, passing in some image, and the image I’m going to use is the first part of my validation dataset, and the kernel. There’s one more trick, which is that in PyTorch pretty much everything is expecting to work on a mini-batch, not on an individual thing. So in our case we have to create a mini-batch of size 1. Our original image is 3 channels by 352 by 352, height by width; remember, PyTorch is channel by height by width. I need to create a rank 4 tensor where the first axis is 1, in other words a mini-batch of size 1, because that’s what PyTorch expects. There’s something you can do in both PyTorch and NumPy, which is you can index into an array or a tensor with a special value `None`, and that creates a new unit axis at that point. So `t` is my image of dimensions 3 by 352 by 352, and `t[None]` is a rank 4 tensor, a mini-batch of one image, of 1 by 3 by 352 by 352. Now I can go `conv2d` and get back a cat, specifically my Maine Coon. Okay, so that’s how you can play around with convolutions yourself.

So how are we going to do this to create a heat map? This is where things get fun. Remember what I mentioned was that I basically have my input, red-green-blue, and it goes through a bunch of convolutional layers (let’s just draw a little line to say a convolutional layer) to create activations which have more and more channels and smaller and smaller height by widths, until eventually, remember we looked at the summary, we ended up with something which was 11 by 11 by 512. And there’s a whole bunch more layers in between.
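Going back to the manual convolution for a moment, the two shape tricks it relies on can be tried in NumPy. This is a sketch under assumptions: `np.broadcast_to` stands in here for PyTorch’s `.expand` (both give a view rather than a copy), the kernel values are illustrative rather than the lesson’s exact numbers, and a zero array stands in for the real image.

```python
import numpy as np

# A 3x3 kernel with positive values on the right and bottom edges
# (illustrative values only).
k = np.array([[0.0, -1.0, 1.0],
              [-1.0, -1.0, 1.0],
              [1.0, 1.0, 1.0]])

k3 = np.broadcast_to(k, (3, 3, 3))  # same kernel for R, G and B: a view, not a copy
k4 = k3[None]                       # unit axis in front: "one kernel"

t = np.zeros((3, 352, 352))         # stand-in image: channels x height x width
batch = t[None]                     # unit axis in front: mini-batch of one image

print(k4.shape)     # (1, 3, 3, 3)
print(batch.shape)  # (1, 3, 352, 352)
print(np.shares_memory(k, k3))  # True: no data was copied
```

Indexing with `None` works the same way for both arrays: it only adds an axis of length 1 and leaves the data untouched.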

We skipped over those layers. Now, there are 37 classes, because remember `data.c` is the number of classes we have, and we can see that at the end here we end up with 37 features in our model. That means that we end up with a probability for every one of the 37 breeds of cat and dog. So it’s a vector of length 37, that’s our final output that we need, because that’s what we’re going to compare implicitly to our one-hot encoded matrix, which will have a 1 in the location for Maine Coon.

So somehow we need to get from this 11 by 11 by 512 to this 37. The way we do it is we actually take the average of every one of these 11 by 11 faces; we just take the mean. We take the mean of this first face, and that gives us one value, and then we take the second of the 512 faces and take that mean, and that gives us one more value. We do that for every face, and that gives us a 512-long vector. Now all we need to do is pop that through a single matrix multiply of 512 by 37, and that’s going to give us an output vector of length 37. This step here, where we take the average of each face, is called average pooling.

Let’s go back to our model and take a look. There it is, here is our final 512. We will talk about what a concat pooling is in part 2; for now we’ll just focus on this. This is a fastai specialty. Everybody else just does this: `AdaptiveAvgPool2d` with an output size of one. So here it is, `AdaptiveAvgPool2d` with an output size of one. Then again there’s a bit of a special fastai thing that we actually have two layers here, but normally people then just have the one linear layer with the input of 512 and the output of 37.

What that means is that for this little box over here, where we want a one for Maine Coon, we’ve got to have a box over here which needs to have a high value in that place so that the loss will be low. If we’re going to have a high value there, the only way to get it with this matrix multiplication is that it’s going to represent a simple weighted linear combination of all of the 512 values here. So if we’re going to be able to say “I’m pretty confident this is a Maine Coon” just by taking the weighted sum of a bunch of inputs, those inputs are going to have to represent features like how fluffy is it, what color is its nose, how long are its legs, how pointy are its ears, all the kinds of things that can be used. Because for the other thing, which figures out “is this a bulldog?”, it’s going to use exactly the same 512 inputs with a different set of weights. That’s all a matrix multiplication is, right? It’s just a bunch of weighted sums, a different weighted sum for each output.

Therefore we know that these potentially dozens or even hundreds of layers of convolutions must have eventually come up with an 11 by 11 face for each of these features, saying: in this little bit here, how much is that part of the image like a pointy ear, how much is it fluffy, how much is it like a long leg, how much is it like a very red nose? That’s what all of those things must represent. So each face represents a different feature; the outputs of these we can think of as different features.

What we really want to know, then, is not so much what’s the average across the 11 by 11 to get this set of outputs, but what’s in each of these 11 by 11 spots. So what if, instead of averaging across the 11 by 11, we average across the 512? If we average across the 512, that’s going to give us a single 11 by 11 matrix, and each grid point in that 11 by 11 matrix will be the average of how activated that area was when it came to figuring out that this was a Maine Coon: how many signs of Maine Coon-ishness were there in that part of the 11 by 11 grid?

That’s actually what we do to create our heat map. I think maybe the easiest way is to kind of work backwards. Here’s our heat map, and it comes from something called average activations. It’s just a little bit of matplotlib and fastai: fastai to show the image, and then matplotlib to take the heat map which we passed in, which was called average activations, `hm` for heat map. Alpha 0.6 means make it a bit transparent, and matplotlib `extent` means expand it from 11 by 11 to 352 by 352. Use bilinear interpolation so it’s not all blocky, and use a different color map to kind of highlight things. That’s just matplotlib, it’s not important.

The key thing here is that average activations is the 11 by 11 matrix we wanted. Here it is: the shape of average activations is 11 by 11. To get there, we took the mean of activations across dimension 0, which is what I just said: in PyTorch the channel dimension is the first dimension, so the mean across dimension 0 took us from something of size 512 by 11 by 11, as promised, to something of 11 by 11. Therefore activations, `acts`, contains the activations we’re averaging. Where did they come from? They came from something called a hook.

A hook is a really cool, more advanced PyTorch feature that lets you, as the name suggests, hook into the PyTorch machinery itself and run any arbitrary Python code you want to. It’s a really amazing and nifty thing, because normally when we do a forward pass through a PyTorch module, it gives us this set of outputs.
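The two ways of averaging described above differ only in which axes get collapsed. Here is a minimal NumPy sketch with random stand-in values; the name `acts` follows the lesson, while `weights`, `pooled`, `scores`, and `heat_map` are names chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
acts = rng.random((512, 11, 11))  # stand-in conv output: channels x height x width
weights = rng.random((512, 37))   # stand-in for the final 512-by-37 linear layer

pooled = acts.mean(axis=(1, 2))   # average pooling: one number per 11x11 face
scores = pooled @ weights         # one weighted sum per class
heat_map = acts.mean(axis=0)      # average over the 512 channels instead

print(pooled.shape, scores.shape, heat_map.shape)  # (512,) (37,) (11, 11)
```

Averaging over height and width throws away the "where" and keeps the "what" (512 feature strengths); averaging over channels does the opposite, which is why it can be drawn over the image as a heat map.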
But we know that in the process it’s calculated these. So what I would like to do is hook into that forward pass and tell PyTorch: hey, when you calculate this, can you store it for me please? So what is this? This is the output of the convolutional part of the model. The convolutional part of the model, which is everything before the average pool, is basically all of that.

Thinking back to transfer learning, you remember that with transfer learning we actually cut off everything after the convolutional part of the model and replaced it with our own little bit. So with fastai, the original convolutional part of the model is always going to be the first thing in the model, and specifically it’s always going to be called `m[0]`. In this case I’m taking my model and I’m just going to call it `m`. You can see `m` is this big thing, but always, at least in fastai, `m[0]` will be the convolutional part of the model. In this case we created, let’s go back and see, a ResNet-34, so the main part of the ResNet-34, the pre-trained bit we hold on to, is in `m[0]`. This is basically it, this is a printout of the ResNet-34, and at the end of it there are the 512 activations.

In other words, what we want to do is grab `m[0]` and hook its output. This is a really useful thing to be able to do, so fastai has actually created something to do it for you, which is literally: you say `hook_output` and you pass in the PyTorch module that you want to hook the output of. Most likely there is one particular thing you want to hook.
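To make the idea of hooking an output concrete, here is a framework-free toy sketch in plain Python. `ToyLayer` and `OutputHook` are hypothetical names invented for this illustration; they are not PyTorch or fastai classes. In PyTorch itself the underlying mechanism is `register_forward_hook` on a module, which fastai’s `hook_output` wraps; the sketch only mimics the pattern of storing an output during the forward pass and cleaning up afterwards.

```python
class ToyLayer:
    """A stand-in for a module: doubles its input and notifies any hooks."""
    def __init__(self):
        self.forward_hooks = []

    def __call__(self, x):
        output = [2 * v for v in x]
        for hook in self.forward_hooks:
            hook(self, x, output)  # hooks see the layer, its input and its output
        return output


class OutputHook:
    """Keeps a layer's latest output; removes itself when the with block ends."""
    def __init__(self, layer):
        self.layer = layer
        self.stored = None
        layer.forward_hooks.append(self._record)

    def _record(self, layer, inputs, output):
        self.stored = output

    def remove(self):
        self.layer.forward_hooks.remove(self._record)

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.remove()


body = ToyLayer()
with OutputHook(body) as hook:
    prediction = body([1, 2, 3])  # forward pass; the hook fires here
captured = hook.stored            # still available after the block
```

Removing the hook on exit is the point of the context-manager form: once the block ends, later forward passes no longer store anything.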

The thing to hook is the convolutional part of the model, and it’s always going to be `m[0]`, or `learn.model[0]`. So we give that hook a name. Don’t worry about this part, we’ll learn about it next week.

Having hooked the output, we now need to actually do the forward pass. Remember, in PyTorch, to actually get it to calculate something, which is called doing the forward pass, you just act as if the model is a function, so we just pass in our x mini-batch. We already had a Maine Coon image called `x`, but we can’t quite pass that into our model: it has to be normalized, turned into a mini-batch, and put onto the GPU. fastai has this thing called a data bunch, which we have in `data`, and you can always say `data.one_item` to create a mini-batch with one thing in it. As an exercise at home, you could try to create a mini-batch without using `data.one_item`, to make sure that you learn how to normalize and so forth yourself, if you want to. But this is how you can create a mini-batch with just one thing in it, and then I can pop that onto the GPU by saying `.cuda()`, and that’s what I pass to my model.

The predictions I get out, I actually don’t care about, because the predictions is this thing, which is not what I want. So I’m not actually going to do anything with the predictions. The thing I care about is the hook that I just created.

Now, one thing to be aware of is that when you hook something in PyTorch, that means every single time you run that model, assuming you’re hooking outputs, it’s storing those outputs. So you want to remove the hook when you’ve got what you want, because otherwise if you use the model again it’s going to keep hooking more and more outputs, which will be slow and memory intensive. So we’ve created this thing, which Python calls a context manager: you can use any hook as a context manager, and at the end of that `with` block it’ll remove the hook.

Okay, so we’ve got our hook. PyTorch hooks, sorry, fastai hooks, or at least the output hooks, always give you something called `.stored`, which is where it stores away the thing you asked it to hook, and that’s where the activations now are. So we did a forward pass after hooking the output of the convolutional section of the model, we grabbed what it stored, and we checked the shape: it was 512 by 11 by 11, as we predicted. We then took the mean of the channel axis to get an 11 by 11 tensor, and if we look at that, that’s our picture.

So there’s a lot to unpack. But take your time going through these two sections of this notebook, the convolution kernel section and the heatmap section, running those lines of code and changing them around a little bit. And remember, the most important thing to look at is shape. You might have noticed that when I’m showing you these notebooks I very often print out the shape. When you look at this shape, you want to be looking at how many axes there are (that’s the rank of the tensor) and how many things there are in each axis, and try and think why. Try going back to the printout of the summary, try going back to the actual list of the layers, and try going back and thinking about the actual picture we drew, and think about what’s actually going on.

Okay, so that’s a lot of technical content. What I’m going to do now is switch from technical content to something much more important, unless we have some questions first. Okay. In the next lesson we’re going to be looking at generative models, both text and image generative models. Generative models are where you can create a new piece of text, or a new image, or a new video, or a new sound, and as you’re probably aware, this is the area that deep learning has developed the most in over the last 12 months. We’re now at a point where we can generate realistic-looking videos, images, audio, and to some extent even text. So there are many things in this

journey which have ethical considerations, but perhaps this area of generative modeling is one of the largest ones. So before I got into it, I wanted to specifically touch on ethics and data science. Most of the stuff I’m showing you actually comes from Rachel, and Rachel has a really cool TEDx San Francisco talk that you can check out on YouTube, and a more extensive analysis of ethical principles and bias principles in AI, which you can find at this talk here, and she has a playlist that you can check out.

We’ve already touched on an example of bias, which was this Gender Shades study where, if you remember, for example, lighter-skinned males on IBM’s main computer vision system were 99.7% accurate, and darker females are some hundreds of times less accurate in terms of error. So, extraordinary differences. It’s first important to be aware that not only can this happen technically, but that this can happen on a massive company’s rolled-out, publicly available, highly marketed system that hundreds of quality control people have studied and lots of people are using. It’s out there in the world. It all looks kind of crazy, right?

So it’s interesting to think about why. One of the reasons why is the data we feed these things. We tend to use, me included, a lot of these datasets kind of unthinkingly. But ImageNet, which is the basis of a lot of the computer vision stuff we do, is over half American and Great Britain. When it comes to the countries that actually have most of the population in the world, I can’t even see them here; they’re somewhere in these impossibly thin lines. Because, remember, these datasets are being created almost exclusively by people in the US, Great Britain, and nowadays increasingly also China. So there’s a lot of bias in the content we’re creating because of a bias in the kind of people that are creating that content, even when in theory it’s being created in a very kind of neutral way. But you can’t argue with the data; it’s obviously not neutral at all.

When you have biased data creating biased algorithms, you then need to ask: what are we doing with that? We’ve been spending a lot of time talking about image recognition. A couple of years ago this company DeepGlint advertised their image recognition system, which can be used to do mass surveillance on large crowds of people and find any person passing through who is a person of interest, in theory. Putting aside even the question of whether it’s a good idea to have such a system, you kind of think: is it a good idea to have such a system where certain kinds of people are 300 times more likely to be misidentified?

Then, thinking about it, this is now starting to happen in America. These systems are being rolled out, and so there are now systems in America that will identify a person of interest in a video and send a ping to the local police. These systems are extremely inaccurate and extremely biased. What happens then, of course, is if you’re in a predominantly black neighborhood, where the probability of successfully recognizing you is much lower, and you’re much more likely to be surrounded by black people, suddenly all of these black people are popping up as persons of interest, or in a video of a person of interest, all the people in the video are recognized as being in the vicinity of a person of interest. You suddenly get all these pings going off at the local police department, causing the police to run down there, and therefore likely leading to a larger number of arrests, which is then likely to feed back into the data being used to develop the systems.

So this is happening right now. Thankfully, a very small number of people are actually bothering to look into these things. I mean ridiculously small, but at least it’s better than nothing. One of the best ways that people get publicity is to do kind of funny

experiments, like: let’s try the mug shot image recognition system that’s being widely used, and try it against the members of Congress, and find out that there are 28 members of Congress who would have been identified by this system, obviously incorrectly. Oh, I didn’t know that. Okay, black members of Congress. Not at all surprised to hear that. Thank you, Rachel.

We see this kind of bias in a lot of the systems we use, not just image recognition but text translation. When you convert “She is a doctor. He is a nurse.” into Turkish, you quite correctly get a gender-unspecific pronoun, because that’s what Turkish uses. You could then take that and feed it back in, with your gender-unspecific pronoun, and you will now get “He is a doctor. She is a nurse.” So the bias, again, is in a massively widely rolled out, carefully studied system. And it’s not like these kinds of things are little one-off things that then get fixed quickly: these issues have been identified in Google Translate for a very long time, and they’re still there, and they don’t get fixed.

The results of this are, in my opinion, quite terrifying, because what’s happening is that in many countries, including America, where I’m speaking from now, algorithms are increasingly being used for all kinds of public policy, judicial, and so forth purposes. For example, there’s a system called COMPAS, which is very widely used to decide who’s going to jail. It does that in a couple of ways: it tells judges what sentencing guidelines they should use for particular cases, and it also tells them which people the system says should be let out on bail. But here’s the thing: with white people, it keeps on saying let this person out even though they end up reoffending, and vice versa. It’s systematically out by double compared to what it should be, in terms of getting it wrong with white people versus black people.

This is kind of horrifying because, amongst other things, the data that it’s using in this system is literally asking people questions about things like “Did any of your parents ever go to jail?” or “Do any of your friends do drugs?” They’re asking questions about other people, who they have no control over. So not only are these systems very systematically biased, but they’re also being done on the basis of data which is totally out of your control. So this is kind of [inaudible]. Oh yeah, “Are your parents divorced?” is another question that’s being used to decide whether you go to jail or not.

When we raise these issues on Twitter or in talks or whatever, there’s always a few people, always white men, who will say, “That’s just the way the world is. It’s just reflecting what the data shows.” But when you actually look at it, it’s not. It’s actually systematically erroneous, and systematically erroneous against people of color, minorities, the people who are less involved in creating the systems that these products are based on.

Sometimes this can go a really long way. For example, in Myanmar there was a genocide of Rohingya people, and that genocide was very heavily created by Facebook. Not because anybody at Facebook wanted it, I mean heavens no. I know a lot of people at Facebook, I have a lot of friends at Facebook, and they’re really trying to do the right thing. They’re really trying to create a product that people like, but not in a thoughtful enough way. Because when you roll out something where, literally, in Myanmar, a country where maybe half of people didn’t have electricity until very recently, you say “hey, you can all have free internet as long as it’s just Facebook”, you’ve got to think carefully about what you’re doing. And then you use algorithms to feed people the stuff they will click on, and of course what people click on is stuff which is controversial, stuff that makes their blood boil. So when they actually started asking the generals in the Myanmar army that were literally throwing babies onto bonfires, they were saying: we know that these are

not humans we know that they are animals because we read the news we read the internet but there because this is the that this is the stories that the algorithms are pushing at and the algorithms are pushing the stories because the algorithms are good they know how to create eyeballs how to get people watching and how can I get people clicking and again putting it Facebook said let’s cause a massive genocide in Myanmar they said let’s maximize the engagement with people in this new market on our platform right so they very successfully maximized engagement yes please it’s just it’s important to note people warned executives at Facebook how the platform was being used to incite violence as far back as 2013 2014 2015 and 2015 someone even warned executives that Facebook could be used in Myanmar in the same way that the radio broadcast used in Rwanda during the Rwandan genocide and as of 2015 Facebook only had four for contractors who spoke Burmese working for them they really did not put many resources into the issue at all even though they were getting very very alarming warnings about it so I mean why does this happen right a part of the issue is that ethics is complicated and you will not find Rachel or I telling you how to do ethics you know how do you fix this we don’t know we can just give you kind of things to think about right another part of a problem we keep hearing is it’s not my problem I’m just a researcher I am just a techie I’m just be holding a data set I’m not part of a problem I’m part of this foundation that’s far enough away that I can imagine that I’m not part of this right but you know if you’re creating image net and you want it to be successful you want lots of people to use it you want lots of people to build products on it let’s people to do research on top of it if you’re trying to create something that people are using you want them to use then please try to make it something that won’t cause massive amounts of harm and doesn’t have 
massive amounts of bias. And it can actually come back and bite you in the ass. The Volkswagen engineer who ended up actually encoding the thing that made them systematically cheat on their diesel emissions tests, on their pollution tests, ended up in jail. Not because it was their decision to cheat on the tests, but because their manager told them to write the code, and they wrote the code, and therefore they were the ones that ended up being criminally responsible, and they were the ones that were jailed. So if you do, in some way, a shitty thing that ends up causing trouble, that can absolutely come back around and get you in trouble as well.

Sometimes it can cause huge amounts of trouble. If we go back to World War Two, this was one of the first great opportunities for IBM to show off their amazing tabulating system, and they had a huge client in Nazi Germany. Nazi Germany used this amazing new tabulating system to encode all of the different types of Jews that they had in the country and all the different types of problem people. So Jews were 8, Gypsies were 12. Then different outcomes were coded: executions were 4, death in a gas chamber was 6. A Swiss judge ruled that IBM was actively involved in facilitating the commission of these crimes against humanity. So there are absolutely plenty of examples of people building data processing technology that is directly causing deaths, sometimes millions of deaths. We don’t want to be one of those people. You might have thought, “Oh, I’m just creating some data processing software,” and somebody else is thinking, “I’m just a salesperson,” and somebody else is thinking, “I’m just the biz dev person opening new markets.” But it all comes together. So we need to care.

One of the things we need to care about is getting humans back in the loop. When we pull humans out of the loop is one of the first times that trouble happens. I don’t know if you remember, I
remember this very clearly, when I first heard that Facebook was firing the human editors that were responsible for basically curating the news that ended up on the Facebook pages. And I’ve got to say, at the time I thought that’s a recipe for disaster, because I’ve seen again and again that humans can be the person in the loop that can realize “this isn’t right”. It’s very hard to create an algorithm that can recognize “this isn’t right”, whereas humans are very good at that. And we saw that’s what happened: right after Facebook fired the human editors, the nature of stories on Facebook dramatically changed, and you started seeing this proliferation of conspiracy theories. The algorithms went crazy with recommending more and more controversial topics, and of course that changed people’s consumption behavior, causing them to want more and more controversial topics.

One of the really interesting places this comes in, as Cathy O’Neil (who’s got a great book called Weapons of Math Destruction, thank you Rachel) and many others have pointed out, is that what happens to algorithms is that they end up impacting people. For example, COMPAS sentencing guidelines go to a judge. Now you can say the algorithm is very good. I mean, in COMPAS’s case it isn’t; it actually turned out to be about as bad as random, because it’s a black box and all that. But even if it was very good, you could then say, “Well, the judge is getting the algorithm; otherwise they’d just be getting a person, and people also give bad advice. So what?” Humans respond differently to algorithms. It’s very common, particularly for a human that is not very familiar with the technology themselves, like a judge, to see it like, “Oh, that’s what the computer says. The computer looked it up and it figured this out.” It’s extremely difficult to get a non-technical audience to look at a computer recommendation and come up with a nuanced decision-making process. So what we see is that algorithms are often put into place with no appeals process, they’re often used to massively scale up decision-making systems because they’re cheap, and then the people that are using
the outputs of those algorithms tend to give them more credence than they deserve, because very often they’re being used by people that don’t have the technical competence to judge them themselves.

So a great example: here’s an example of somebody who lost their health care, and they lost their health care because of an error in a new algorithm that was systematically failing to recognize that there are many people that need help with (was it Alzheimer’s?) cerebral palsy and diabetes. Thanks, Rachel. And so this system, which had this error that was later discovered, was cutting off these people from the home care that they needed, so that cerebral palsy victims no longer had the care they needed. Their life was destroyed, basically. And when the person that created that algorithm with the error was asked about this, and was specifically asked whether they should have found a better way to communicate the system, the strengths, the failures, and so forth, he said, “Yeah, I should probably also dust under my bed.” That was the level of interest they had. And this is extremely common; I hear this all the time. It’s much easier to see it from afar and say, “OK, after the problem’s happened, I can see that that’s a really shitty thing to say,” but it can be very difficult when you’re in the middle of it.

I just want to say one more thing about that example, and that’s that this was a case where it was separate: there was someone who created the algorithm, then I think different people implemented the software, and this is now in use in over half of the 50 states, and then there were also the particular policy decisions made by that state. So this is one of those situations where nobody felt responsible, because the algorithm creators are like, “Oh no, it’s the policy decisions of the state that were bad,” and the state can be like, “Oh no, it’s the ones who implemented the software,” and so everyone’s just pointing fingers and not taking responsibility.
And you know, in some ways maybe it’s unfair, but I would argue the person who is creating the data set and the person who is implementing the algorithm is the person best placed to get out there and say, “Hey, here are the things you need to be careful of,” and make sure that they’re a part of the implementation process.

We’ve also seen this with YouTube. It’s similar to what happened with Facebook. We’ve heard examples of students watching the fast.ai courses who say, “Hey Jeremy and Rachel, watching the fast.ai courses, really enjoyed them, and at the end of one of them the YouTube autoplay fed me across to a conspiracy theory.” What happens is that once the system decides that you like the conspiracy theories, it’s going to just feed you more and more.

Just briefly: you don’t even have to like conspiracy theories. The goal is to get as many people hooked on conspiracy theories as possible; that’s what the algorithm is trying to do, whether or not you’ve expressed interest.

And the interesting thing, again, is that I know plenty of people involved in YouTube recommendation systems. None of them are wanting to promote conspiracy theories. But people click on them, and people share them, and what tends to happen is also that people that are into conspiracy theories consume a lot more YouTube media. So it actually is very good at finding a market that watches a lot of hours of YouTube, and then it makes that market watch even more. So this is an example of a feedback loop, and the New York Times is now describing YouTube as perhaps the most powerful radicalizing instrument of the 21st century. I can tell you, my friends that worked on the YouTube recommendation system did not think they were creating the most powerful radicalizing instrument of the 21st century. And to be honest, most of them today, when I talk to them, still think they’re not. Not all of them, but a lot of them now are at the point where they just feel like they’re the victims here: people are unfair, they don’t get it, they don’t understand what we’re trying to do. It’s very, very difficult when you’re right out there in the heart of it. So you’ve got to be thinking from right at the start: what are the possible unintended consequences of what you’re working on? And as the technical people involved, how can you get out in front and make sure that people are aware of them?

And I just also need to say that in particular many of these conspiracy theories are promoting white supremacy; they’re
you know, kind of far-right ethno-nationalism, anti-science. And I think maybe five or ten years ago I would have thought conspiracy theories are more of a fringe thing, but we’re seeing the huge societal impact it can have for many people to believe these. Yeah, and partly it’s that when you see them on YouTube all the time, it starts to feel a lot more normal.

So one of the things that people are doing to try to figure out how to fix this problem is to explicitly get involved in talking to the people who might or will be impacted by the kind of decision-making processes that you’re enabling. For example, there was a really cool thing recently where literally statisticians and data scientists got together with people who had been inside the criminal justice system, i.e. had gone through the bail and sentencing process themselves, and with the lawyers who worked with them, and actually put together a timeline of how exactly it works, where exactly the places are that there are inputs, how people respond to them, and who’s involved. This is really cool. This is the only way for you, as a data product developer, to actually know how your data product is going to be working.

A really great example of somebody who did a great job here was Evan Estola at Meetup, who said, “Hey, a lot of men are going to our tech meetups, and if we use a recommendation system naively, it’s going to recommend more tech meetups to men, which is going to cause more men to go to them, and then when women do try to go, they’ll be like, ‘Oh my god, there are so many men here,’ which is just going to cause more men to go to the tech meetups.” Yeah, so showing recommendations to men and therefore not showing them to women. Yes. So what Evan and Meetup decided was to make an explicit product decision: this would not even be representing the actual true preferences of people, it would be creating a runaway feedback loop,
so let’s explicitly stop it right before it happens, and not recommend fewer tech meetups to women and more tech meetups to men. I think that’s really cool. It’s saying we don’t have to be slaves to the algorithm; we actually get to decide.

Another thing that people can do to help is regulation. Normally when we talk about regulation, there’s a natural reaction of, “How do you regulate these things? That’s ridiculous, you can’t regulate AI.” But actually, when you look at it again and again (and this fantastic paper called Datasheets for Datasets has lots of examples of this), there are many, many examples of industries where people thought they couldn’t be regulated, people thought that’s just how it was. Like cars: people died in cars all the time because they literally had sharp metal knobs on dashboards, steering columns weren’t collapsible, and all of the discussion in the community was, “That’s just how cars are, and when people die in cars it’s because of the people.” But then eventually the regulations did come in, and today driving is dramatically safer, like dozens and dozens of times safer than it was before. So often there are things we can do through policy.

So to summarize: we are part of the 0.3 to 0.5 percent of the world that knows how to code. We have a skill that very few other people do. Not only that, we now know how to code deep learning algorithms, which is like the most powerful kind of code I know. So I’m hoping that we can explicitly think about at least not making the world worse, and perhaps explicitly making it better.

So why is this interesting to you as an audience in particular? That’s because fast.ai in particular is trying to make it easy for domain experts to use deep learning. This picture of the goats here is an example of one of our international fellows from a previous course, who is a goat dairy farmer and told us that they were going to use deep learning on their remote Canadian island to help study udder disease in goats. To me this is a great example of a domain expert’s problem which nobody else even knows about, let alone knows that it’s a computer vision problem that can be solved with deep learning. So in your field, whatever it is, you probably know a lot more now about the opportunities in your field to make it a hell of a lot better than it was before. You’ll probably be able to come up with all kinds of cool product ideas, maybe build a startup or create a new product group in your company or whatever. But also, please be thinking about what that’s going to mean in practice, and think about where you can put humans in the loop, where you can put those pressure
release valves, who are the people you can talk to who could be impacted, who could help you understand, and get the humanities folks involved to understand history and psychology and sociology and so forth. So that’s our plea to you. If you’ve got this far, you’re definitely at a point now where you’re ready to make a serious impact on the world, and I hope we can make sure that that’s a positive impact. See you next week. [Applause]
