# Lecture 7: Introduction to TensorFlow


So let's actually see how this works in code, using TensorFlow mathematical operations as opposed to NumPy operations. We're going to do three things: create the weights, including initialization; create a placeholder variable for our input x; and then build our flow graph.

So how does this look in code? We import the TensorFlow package. We build a Python variable b that is a TensorFlow variable, taking initial zeros of size 100 — a vector of 100 values. Our W is going to be a TensorFlow variable taking uniformly distributed values between -1 and 1, of shape 784 by 100. Then we create a placeholder for our input data; it doesn't take any initial values, just a data type — 32-bit floats — as well as a shape.

Now we're in a position to actually build our flow graph. We express h as the TensorFlow ReLU of the TensorFlow matrix multiplication of x and W, plus b. You can see that the form of that line, when we build h, looks essentially the same as it would in NumPy, except we're calling TensorFlow mathematical operations. And that is absolutely essential, because up to this point we are not actually manipulating any data; we are only building symbols inside our graph. No data is moving through our system yet. You cannot print h and see the value it expresses. First of all, x is just a placeholder, so it doesn't have any real data in it yet. But even if x weren't a placeholder, you could not print h until we run the graph in a session. We are just building the backbone of our model. But you might wonder now: where is the graph?
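To see the arithmetic that the graph will eventually perform, here is a plain-Python sketch of h = ReLU(xW + b) — this is an illustration with made-up, shrunken shapes, not TensorFlow code (the lecture's W is 784 by 100):

```python
def matmul(x, W):
    """Multiply a batch of row vectors x (n x d) by a matrix W (d x k)."""
    return [[sum(xi[j] * W[j][c] for j in range(len(W)))
             for c in range(len(W[0]))] for xi in x]

def relu(rows):
    """Elementwise max(0, v) — the same nonlinearity as the ReLU node."""
    return [[max(0.0, v) for v in row] for row in rows]

# Tiny stand-ins for the lecture's shapes.
W = [[1.0, -1.0], [0.5, 2.0]]   # 2 x 2 weight matrix
b = [0.0, 0.0]                   # bias vector of zeros
x = [[1.0, 1.0]]                 # one input row

pre = [[v + bi for v, bi in zip(row, b)] for row in matmul(x, W)]
h = relu(pre)                    # h = ReLU(xW + b)
print(h)                         # [[1.5, 1.0]]
```

In TensorFlow the same line builds symbolic nodes instead of computing values immediately; here the arithmetic runs right away, which is exactly the difference the lecture is pointing at.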
If you look at the slide earlier, I didn't build a separate node for the matrix multiplication, a different node for the add, and a different node for the ReLU — well, the ReLU is h. We've only written one line, but I claim that we have all of these nodes in our graph. So if you actually try to analyze what's happening in the graph — and there are not too many reasons for you to do this when you're programming TensorFlow — you can take the default graph and call get_operations on it, and you see all of the nodes in your graph. There is a lot going on here. You can see in the top three lines that we have three separate nodes just to define this concept of zeros. No values are assigned to our b yet, but the graph is getting ready to take them in. We have all of these other nodes just to define what the random uniform distribution is, and in the right column we have another node, Variable_1, which is presumably our W. At the bottom four lines we actually see the nodes as they appear in our figure: the placeholder, the matrix multiplication, the addition, and the ReLU.

So in fact, the figure we're presenting on the board is a simplification of what TensorFlow graphs look like. There is a lot going on behind the scenes that you don't really need to interface with as a programmer, but it is extremely important to keep in mind that this is the level of abstraction TensorFlow is working with, above the Python code; this is what is actually computed in your graph. It is also interesting that if you look at the last node, the ReLU, it points to the same object in memory as the h variable we defined above: both are operations referring to the same thing. So in the code before, what h actually stands for is the last node of the graph that we built. Great. We've defined — question?
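The key idea — that one line of code registers several nodes in a graph, and h is just a handle to the last one — can be mimicked with a toy graph class in plain Python. This is an illustration of the concept only, not TensorFlow's actual machinery:

```python
class Graph:
    def __init__(self):
        self.operations = []      # nodes, in creation order

    def add_op(self, name):
        """Register a node and hand back a reference to it."""
        self.operations.append(name)
        return name

graph = Graph()

# One line like h = relu(matmul(x, W) + b) registers several nodes:
graph.add_op("Placeholder")       # x
graph.add_op("MatMul")            # x @ W
graph.add_op("Add")               # ... + b
h = graph.add_op("Relu")          # the handle we keep

# Like calling get_operations() on the default graph, list every node:
print(graph.operations)           # ['Placeholder', 'MatMul', 'Add', 'Relu']

# h is just the last node of the graph, not a computed value:
print(h)                          # Relu
```

Printing h gives you a description of a node, never a number — which is why you cannot inspect values until the graph is actually run.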
So the question was about how we're deciding what the values and types are. That's a purely arbitrary choice; it's just part of our example.

Okay, great, so we've defined a graph. The next question is: how do we actually run it? The way you run graphs in TensorFlow is you deploy them in something called a session. A session is a binding to a particular execution context, like a CPU or a GPU. So we're going to take the graph that we built and deploy it onto a CPU or a GPU. You might be interested to know that Google has developed its own integrated circuit, called a tensor processing unit, just to make tensor computation extremely fast — in fact, orders of magnitude faster than even a GPU — and they used it in the AlphaGo match against Lee Sedol. So the session is any hardware environment that supports the execution of all the operations in your graph. That's how you deploy a graph, great. So let's see how this runs in code: we build a session object, and we call run with two arguments, fetches and feeds.

And so, these are going to be our batch inputs, and we define a placeholder here. But in this case, since we just have integers, we can define it with int32, and the shape is going to be batch_size, and nothing else. We can avoid naming here, because we're not going to use multiple variable scopes; that will be fine. Then we create our batch labels, which is again a tf.placeholder of int32, with the same shape as before. Finally, we create a constant for our validation set, because that is not going to change anytime soon. It's defined by the val_data we previously loaded, and we have to say what type it is — int32 again, just like our training set. All right, now that we have defined — yes?
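A placeholder carries only a dtype and a shape until the session feeds it real data. Here is a toy sketch of that contract in plain Python — the names (`placeholder`, `check_feed`) are hypothetical stand-ins for illustration, not TensorFlow's API:

```python
def placeholder(dtype, shape):
    """A placeholder is just a spec: no data yet, only dtype and shape."""
    return {"dtype": dtype, "shape": shape}

def check_feed(spec, values):
    """At run time, the fed batch must match the placeholder's spec."""
    assert len(values) == spec["shape"][0], "batch size mismatch"
    assert all(isinstance(v, spec["dtype"]) for v in values), "dtype mismatch"
    return values

batch_size = 4
batch_inputs = placeholder(int, (batch_size,))   # integer word ids
batch_labels = placeholder(int, (batch_size,))

# Feeding four word ids satisfies the spec; a wrong count would raise.
fed = check_feed(batch_inputs, [7, 42, 3, 19])
print(fed)
```

The real session does this binding for you: fetches name the nodes whose values you want back, and feeds supply data for exactly these placeholder specs.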
So, since I'll be applying transposes later, I just wanted to make sure that it's one; it doesn't really make that big of a difference. In this case I'll be calling transpose on the labels, which is why I wanted to make sure it transposes fine. I just want to make it absolutely clear that it's a row vector, not a column vector. >> [Question] >> Yeah, exactly.

All right, so now we can start creating our scope. This is where we'll define our model. First, we create our embeddings, as we all did in our assignment. That's going to be a huge variable, initialized randomly with a uniform distribution, and it's going to take vocabulary_size, which we previously defined at the top, by embedding_size. So this is the number of words in your dictionary times the size of your embedding. Since it's a random uniform distribution, we also give the parameters for that. So far so good?
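A plain-Python sketch of that initialization — the sizes here are tiny stand-ins for illustration; the real matrix is vocabulary_size by embedding_size with a much larger vocabulary:

```python
import random

random.seed(0)          # fixed seed so the example is reproducible

vocabulary_size = 5     # stand-in; the lecture's vocabulary is far larger
embedding_size = 3

# One row of uniformly random values in [-1, 1] per word in the vocabulary,
# mirroring a variable initialized from a uniform distribution.
embeddings = [[random.uniform(-1.0, 1.0) for _ in range(embedding_size)]
              for _ in range(vocabulary_size)]

print(len(embeddings), len(embeddings[0]))   # 5 3
```

Each row will end up being the learned word vector for the word with that index.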
All right, so we've just created our embeddings. Now, since we want to index into them with our batch, we create batch embeddings, and we use a function that's actually going to be pretty handy for the current assignment: we do an embedding lookup into the embeddings, passing in the batch inputs.

Next, we create our weights, calling tf.Variable again. We use a truncated normal distribution here, which is just a normal distribution cut off at two standard deviations instead of going off to infinity. This is also going to be vocabulary_size by embedding_size, because it interacts with our input directly. Since it's a truncated normal, we need to say what the standard deviation is, and that's given by one over the square root of the embedding size itself. [BLANK AUDIO] Finally, we create our biases, which are also variables, initialized with zeros of size vocabulary_size.

Now that we have all our variables, we define our loss function. In our assignment we used the softmax cross entropy — the negative log likelihood. Here you'll be using something similar, and this is where TensorFlow really shines: it has a lot of loss functions built in. We're going to use one based on noise-contrastive estimation — I forget the exact name — but it is similar in the sense that the words that should come up with higher probability are emphasized, and the words that should appear with lower probability are pushed down. So we call into tf.nn — nn is the neural network module in TensorFlow — and this takes a couple of parameters; you can look up the API. Yes? Embeddings?
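Embedding lookup is conceptually just row indexing: each word id in the batch selects its row of the embedding matrix. A plain-Python sketch of that idea (tiny made-up matrix for illustration):

```python
# A tiny embedding matrix: one row (word vector) per vocabulary word.
embeddings = [
    [0.1, 0.2],   # word id 0
    [0.3, 0.4],   # word id 1
    [0.5, 0.6],   # word id 2
    [0.7, 0.8],   # word id 3
]

batch_inputs = [2, 0, 3]   # word ids in the current batch

# Conceptually what an embedding lookup returns: the rows for those ids.
batch_embeddings = [embeddings[i] for i in batch_inputs]

print(batch_embeddings)    # [[0.5, 0.6], [0.1, 0.2], [0.7, 0.8]]
```

The library version does the same selection efficiently inside the graph, so the gradient only touches the rows that actually appear in the batch.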
All right. That's the word vector representation, which is what you're trying to learn. No — W is a weight matrix that is a parameter you're also trying to learn, but it interacts with those word representations. Effectively, you can think of these embeddings as semantic representations of the words, right? Yes?

Right, so imagine: our embeddings matrix is defined by the vocabulary size. Let's say we have 10,000 words in our dictionary; each row is then the word vector that goes with the index of that word. And since our batch is only a subset of the vocabulary, we need to index into that embedding matrix with our batch, which is why we used the embedding lookup function, okay.

All right, so we're going to use this API — obviously everyone will need to look it up on the TensorFlow website itself — but what it does is take the weights, the biases, and the labels, which I defined as batch_labels, and also the inputs, which are batch_inputs. And here's where TensorFlow really shines again: num_sampled. In our data set we only have positive samples, in the sense that we had the context words and the target word. We also need noise words to contrast with the context words, and this is where num_sampled comes into use. We defined num_sampled to be 64 earlier, and what it essentially does is look up 64 words that are not in our training pair — noise words. These serve as a sort of negative example, so that our network learns which words actually are context words and which are not. Finally, num_classes is given by our vocabulary size again.

All right, with that we have defined our loss function. Now we have to take the mean, because the loss comes out with the size of the batch itself, and we get that with reduce_mean. This is going to be slightly nasty: the loss is given for that particular batch — and yes, exactly, it's given for multiple samples — and since we have multiple samples in a batch, we want to take
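The negative-sampling idea behind num_sampled can be sketched in plain Python: draw k noise word ids that differ from the true target. This is a simplified illustration — the real sampler draws from a frequency-based distribution rather than uniformly:

```python
import random

random.seed(1)   # fixed seed so the example is reproducible

def sample_noise_words(target_id, vocabulary_size, num_sampled):
    """Draw num_sampled distinct word ids, none equal to the true target."""
    noise = set()
    while len(noise) < num_sampled:
        candidate = random.randrange(vocabulary_size)
        if candidate != target_id:
            noise.add(candidate)
    return sorted(noise)

# The lecture uses 64 negatives; a handful here for illustration.
negatives = sample_noise_words(target_id=7, vocabulary_size=100, num_sampled=5)
print(negatives)
```

The loss then pushes the target word's score up and these noise words' scores down, which is what lets the network separate real context words from random ones.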
the average of those — exactly. Okay, great. Now we have completely defined our loss function.

Now, if you remember from the assignment, we take the norm of these word vectors, so let's do that first. That's just API calling, which is very well documented on the TensorFlow website itself, with keep_dims=True. I've added this argument, keep_dims, because when you sum over a dimension you don't want it to disappear; you just leave it as 1. And now we divide the embeddings by the norm to get the normalized embeddings: embeddings / norm. Great. And now we return batch_inputs and batch_labels, because these will be our feed, along with the normalized embeddings and the loss. All right, with this done — there's a small part missing, which we'll get back to later. Yes? Thank you.

All right, so now we define our run function. How are we doing on time? Okay, we have 20 minutes, great. We make an object of our model, just by calling our function — which was called word2vec, or skipgram, rather — to get the embeddings and the loss. Okay, and now we initialize the session. And over here, again, I forgot to initialize our variables; we just initialize all of our variables to their default values, as Barak mentioned.

Now we're going to loop over our data to see if we can actually train our model. So let's do that first step, with batch_data and train_data. For each iteration of this for loop, we obtain a batch, which has its input data as well as the labels. So we have the inputs and labels from our batch, and we can define our feed dictionary accordingly: batch_inputs maps to the inputs — so this is a dictionary — and batch_labels maps to the labels. Any questions so far?
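Normalizing each embedding row to unit length — the embeddings / norm step — is what sets up cosine similarity later. A plain-Python sketch with made-up numbers:

```python
import math

embeddings = [[3.0, 4.0], [0.0, 2.0]]   # two toy word vectors

# Per-row L2 norm, kept alongside its row (the keep_dims idea: the summed
# dimension stays as a size-1 slot instead of disappearing).
norms = [math.sqrt(sum(v * v for v in row)) for row in embeddings]

# Divide every row by its norm so each word vector has length 1.
normalized_embeddings = [[v / n for v in row]
                         for row, n in zip(embeddings, norms)]

print(normalized_embeddings)   # [[0.6, 0.8], [0.0, 1.0]]
```

After this, a dot product between two rows is exactly their cosine similarity, since the magnitudes are all 1.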
Okay. We go ahead and call our loss function again, by calling session.run, where we fetch the optimizer and the loss, and we pass in the feed dictionary we already defined above. We get the loss back, and since we're trying to compute an average, we add it up first and then divide by the number of examples we've seen. All right, so we'll put in a couple of print statements now, just to see whether our model actually trains: print the loss, the step [INAUDIBLE], and the average loss. Since the loss will be zero at the first step, we can just [INAUDIBLE]. All right, so now we have our average loss, and we reset it again, so that it doesn't accumulate across every iteration of the loop.

Okay, so we've almost finished our implementation here. However — yes? I forgot to define that, good call. We can define that at the beginning of the run step: a gradient descent optimizer, and we take a learning rate and minimize the loss. All right, thanks for that, okay.

One thing we're still missing is that we haven't really dealt with our validation set. Although we're training on the training set, we want to make sure that it actually generalizes to the validation set, and that's the last part that's missing; we're going to do that now. But before we do, there's one step remaining: once we have the validation set, we still need to see how similar our word vectors are to it, and we do that in our flow graph itself. So let's go back to our skip-gram function; here is where we can implement that, okay.

So, our val_embeddings are again an index into the embeddings matrix, to get the embeddings that correspond to the validation words. We use the embedding lookup function here — embedding_lookup — passing in the embeddings and the validation data set. We'll actually use the normalized_embeddings, because we care about the cosine similarity and not necessarily
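The loop's bookkeeping — accumulate the loss, print a running average every so often, then reset it — can be sketched in plain Python with a stand-in for session.run. The names and the fake shrinking loss below are illustrative only:

```python
def run_step(step):
    """Stand-in for fetching (optimizer, loss) from session.run:
    pretend the loss shrinks as training proceeds."""
    return 100.0 / (step + 1)

report_every = 4       # the lecture prints every so many steps
average_loss = 0.0
reports = []

for step in range(12):
    loss_value = run_step(step)
    average_loss += loss_value
    if step % report_every == 0 and step > 0:
        reports.append(average_loss / report_every)
        average_loss = 0.0      # reset so it doesn't accumulate forever

print(reports)                  # two reports, each smaller than the last
```

The reset is the detail the lecture calls out: without it, the "average" would silently keep averaging over the entire history rather than the recent window.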
about the magnitude of the similarity. Okay, and we use val_dataset here. Okay, and the similarity is essentially just a cosine similarity. The way this works is we matrix-multiply the val_embeddings, which we just obtained, with the normalized_embeddings. And since you can't just matrix-multiply the two as-is, because of dimensional incompatibility, we have to pass transpose_b — that's just another flag.

All right, since we also return this from our function — again, this is just a part of the graph, and we need to actually execute the graph in order to obtain values from it — here we have the similarity. Okay, and since this is a computationally expensive process, let's do it only every 5,000 iterations. The way we do this is by calling eval on the similarity matrix; what this does, since we have this node, is go and evaluate it — it's equivalent to calling session.run on similarity and fetching that, okay. So we call that and get the similarity, and for every word in our validation set, we find the top_k words that are closest to it. We can define k to be 8 here, and we'll now get the nearest words. So let's do that: we sort them according to their magnitude, and since the first closest word will be the word itself, we want the other eight words and not the first eight, which is given by top_k+1. Any questions so far? Yes?
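With unit-length rows, cosine similarity is just a dot product, and the nearest-neighbor step — skip slot 0, which is the word itself, then take the next k — can be sketched in plain Python (toy vectors for illustration):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Unit-length word vectors (normalized_embeddings), one row per word id.
normalized_embeddings = [
    [1.0, 0.0],    # word 0
    [0.0, 1.0],    # word 1
    [0.8, 0.6],    # word 2
    [0.6, 0.8],    # word 3
]

query_id = 2
# Cosine similarity of the query word against every word in the vocabulary.
sims = [dot(normalized_embeddings[query_id], row)
        for row in normalized_embeddings]

top_k = 2
# Rank word ids by similarity, descending. Slot 0 is the query word itself
# (similarity 1 with itself), so take ids 1..top_k+1 — the "top_k + 1" trick.
ranked = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
nearest = ranked[1:top_k + 1]
print(nearest)
```

The transpose flag in the graph version serves the same purpose as lining up the dot products here: it makes the query rows multiply against every vocabulary row.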
Right, so your embeddings matrix is the number of words you have in your vocabulary times the size of your word embedding for each word. It's a huge matrix, and since the batch you're currently working with is only a subset of that vocabulary, this function, embedding_lookup, indexes into that matrix for you and obtains the word vectors. It's equivalent to some complicated Python slicing you could do with matrices yourself, but it's just good syntax to have.

Okay, all right, almost there. So, we have our nearest k words. I have a helper function in my utils — you can check it on the GitHub repo that we'll post after class is over, and play with it as you wish — and we pass in the nearest ids and a reverse_dictionary, so that we actually see the words and not just numbers, all right. Finally, we obtain our final_embeddings, which will be the normalized_embeddings at the end of training, and we just call eval on that again, which is equal to calling session.run and fetching this.

All right, we're done with the coding, and we can actually run it and see how it performs. Okay: python word2vec — oops, I must have missed something. I missed a bracket again. So, we first load up our data set, then iterate over it, and we use our skip-gram model. Oops, let's see — all right, where did this go wrong?
Ah, you know why — please tell me why that had to be there, okay. Perfect. So, as you can see here, we have 30,000 batches, each with a batch size of 128. Ahh, man, [LAUGH] let's see. All right, so as we see, the loss started off at 259, ends up at 145, and then keeps decreasing — I think it goes somewhere to around 6. Here we can also see, as it's printing, the nearest words for each validation word; this gets better with time and with more training data. We only used around 30,000 examples, so there's a lot of potential to do better, and in the interest of time I limited it to around 30 epochs, yes.

TensorFlow also comes with TensorBoard, which I didn't show in the interest of time. Essentially, you can go to your localhost and then see the entire graph and how it's organized. So that will actually be a huge debugging help, and you can use it for your final project. TensorBoard, yeah. All right, well, thank you for your time. >> [APPLAUSE]