What Cloud AI can do for you (Google Cloud AI Huddle)


[MUSIC PLAYING]

PUNEITH KAUL: I would like to introduce Robbie. Robbie's a tech lead on the Google Cloud AI Platform, and one of the founding members of the team. He built the first [INAUDIBLE] back-end of the Cloud Machine Learning Engine service, and currently he's focused on ML prediction and related services; he also likes to work on other areas of machine learning and Cloud Platform. Before his time at Google, he studied statistical natural language processing, culminating in a dissertation in the area of machine learning known as active learning. Robbie, in his time outside work, likes to spend time with his kids. He has five lovely kids, and he's very passionate about that. I've known Robbie for the last 2 and 1/2 years, and you couldn't be in better hands than Robbie's. So Robbie, please take it away.

[APPLAUSE]

ROBBIE HAERTEL: All right. Thanks for that introduction, Puneith. My name's Robbie Haertel, like he said, and I'm really excited to be here. Just by way of– I know you're going to fill out– we keep saying, hey, we want feedback, blah, blah, blah, blah, blah. A couple of questions along that line. Just informally, how many here would consider their role to be a data scientist? Their primary role at their company. That's a good number, but might not even be half, or just over half. What about data engineer? Do you consider that? Some of the same hands went up. If you'd told me before, I would've chosen the other one, right? You've got some data engineers. What about just software engineer? So that's a good lot of people. Those of you that are software engineers, are you here because machine learning is maybe a hobby, or something that you're looking to get into so you can go down that path in your career? Is that why you're here? Anyone here for other reasons? Just because of pizza? Is someone here for the pizza? That's why I came. They said, you can have free pizza, but you've got to give a talk. I said, worth it! So here we are.

AUDIENCE: I'm doing a startup.

ROBBIE HAERTEL: You're doing a startup? And so you're here to figure out what's on offer for that, right?

AUDIENCE: Yes.

ROBBIE HAERTEL: Very cool.

AUDIENCE: And can you show us how to implement a Google Duplex?

[LAUGHTER]

ROBBIE HAERTEL: Well, speaking of that, let's set expectations. As was mentioned, I'm one of the leads on the Cloud AI Platform team, so our job is to provide a platform for you guys to do machine learning. So if you came here thinking that I'm going to show you how to train really fancy models to get the highest accuracy, and here's the tricks you do in sklearn, and here's the hyperparameters you tweak in XGBoost, you're probably not at the right talk. I hope you'll stay anyway. What I'm here to show you is what the Cloud AI– are you seeing all my emails come up and stuff?– what the Cloud AI Platform can do for you folks in your roles: as a hobbyist trying to learn, as a data scientist trying to get your job done, or as a startup. I think the cloud has a lot of benefits for startups, actually. How many here would consider yourself to be working for a startup? I'll raise my hand. You know, like what? Google's not a startup. But you know, Cloud AI Platform– this is a relatively new area. I know that things like it have existed. At Google, actually, I worked an internship on what was known as the Prediction API. We recently turned that down, but it was one of the first public cloud machine learning offerings. And I was actually sad to see it turned down, not just because I worked on it. I think it was actually ahead of its time, quite a bit. And it's still probably ahead of its time. We still probably have a few more years to go until we can really realize the promise of: here's a CSV, here's your trained model, go for it.

But anyway, thank you for entertaining that. Thank you, Puneith, for setting this up. I'd like to think that I was capable of doing that myself. As I begin here, a special thanks to the people that helped me with the slides and the demos that are here. I couldn't have put it together without the following people– Noah Negrey, Younghee Kwon, Bhupesh Chandra, Kathy Tang– who's in the audience right now– and the rest of the Cloud AI Platform team. I really appreciate all the help they gave me with the demos and stuff. And then of course, special thanks to those that organized this meeting, and special thanks to all of you that are attending.

Back to the topic of providing feedback. I think Puneith mentioned that Cloud AI Platform started about three years ago. So we do have some experience working with users, getting their needs and requirements and trying to understand those. That said, we believe that we're at the very beginning stages, and that there's so much more that we can and should do. In fact, we believe that we're in a hypothesis-testing phase. We've made certain assumptions, and as we go through this talk with you, if you notice some assumptions that we've made that are wrong, I'd like you to come up to one of us afterwards and say, hey, what about this? Did you think of it this way? Or hey, you know what? Our requirements are actually a little bit different. Have you thought about providing this? We'd really like to understand those things,
because if we build a platform that you can't use, I've just wasted the last three years of my life. And I'd like to think that I was trying to do something useful. Even if I failed the last three years, I'd like to continue on this path of building something that you guys can use.

So the topic of the talk, as has been mentioned, is– well, I kind of changed the topic slightly– but it's production scikit-learn and XGBoost, specifically in the cloud. We'll talk just briefly about off-the-cloud use cases. But you guys– most of you here, about half of you, said you're data scientists, and some of you are hobbyists or did ML in grad school, so you're very familiar with this. Is my mic still on? I feel like it went off. We're still good? OK. You guys could probably preach to me about this. This is my view. I won't spend a lot of time on it because I believe you're all familiar with it. The model development cycle is more or less like this. You have a problem to solve. You collect data that will help you solve that problem. You almost always have to clean the data. Then you decide what model you want to use. If it's an image classification or some sort of perception problem– audio, maybe some text– you might use a neural network. But if it's clickthrough rate prediction, you might use logistic regression. If it's structured learning, maybe you'll use a gradient-boosted decision tree, something to that effect. Those of you that have been in the field long enough kind of already know this model works well with this type of problem. Then you need to go through and take this raw data that you've cleaned, and analyzed, and collected, and you need to do some feature engineering. Especially if it's logistic regression or some sort of linear model, you're going to need possibly quantization, one-hot encoding. You're going to need to do feature crosses, all sorts of logs of certain types of features, all those types of things.
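As a rough, purely illustrative sketch of the kinds of transforms just mentioned– a log of a skewed numeric feature, a simple feature cross, and one-hot encoding– with invented column names that are not from the talk:

```python
# Toy illustration of the feature engineering described above: a log transform
# of a heavy-tailed numeric column, a feature cross of two categorical columns,
# and one-hot encoding. Column names and values are made up for this sketch.
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "ad_position": ["top", "side", "top", "bottom"],
    "device": ["mobile", "desktop", "mobile", "mobile"],
    "price": [0.5, 120.0, 3.2, 15.0],
})

df["log_price"] = np.log1p(df["price"])                            # log of a skewed feature
df["position_x_device"] = df["ad_position"] + "_" + df["device"]   # feature cross

encoder = OneHotEncoder(handle_unknown="ignore")
one_hot = encoder.fit_transform(df[["ad_position", "device", "position_x_device"]])
print(one_hot.shape)   # sparse one-hot matrix, ready for a linear model
```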
And once you've done that, you can now take your data set and you can train a model. And most models have hyperparameters. Your goal is not to get a model, it's to get the best possible model given the model selection and the feature set. So you'll train that, you'll do some hyperparameter tuning, and at the end of the day, you have a model. And you'll take that model and you'll analyze the results. Hey, what's this model doing well? Where does it succeed? Where does it fail? And what can I do to get a better model out of it? Depending on what the errors are that you see, you may decide, ah, I didn't have enough data. Or my data wasn't clean enough. Or I need to add more features or different features. Or maybe I need a different model class altogether. Anyway, you go through this feature engineering cycle, or model development cycle, and then what? You end up with a model, right? You have a model lying on your desk. Now what?

So training a model is simply a means to an end. It's not an end in and of itself. We train models because we want to do something interesting. The problem is– let me see how many of you can relate to this tweet. "The story of enterprise Machine Learning: 'It took me 3 weeks to develop the model.'" That's not too bad. That's probably about right. She's probably skipping how long it took to get the data ready, but we'll spot her that one. "'It's been over 11 months, and it's still not deployed.'" How many of you guys can relate to this? Maybe instead of 11 months, it's six months. Maybe it's 20 months. But at any rate, ignoring the data part– because we all know that that's hard, very hard oftentimes– getting the model into production often takes much longer than training the model. And how many of you think that the fun part is getting a model into production? My guess is that the reason you're here is because the fun part is making the model, right?

This is a screenshot– and it's very hard to see– of a Jupyter notebook. What if you could take the code that you ran to train the model, click a button, and you're serving your model in production? Well, I'm here to tell you, obviously, that you can. And yes, I'm running a Jupyter notebook on a Pixelbook, so that's a plug for those that want to run Chromebooks here. The running example that I'm going to use is training a clickthrough model– sorry, a clickthrough rate prediction model. I'm going to use XGBoost as the underlying learner, and I'm going to use some sklearn pipelines. Again, if you came to see some secret sauce on how to do this, you're going to be sorely, sorely disappointed, because you guys are going to come back and say, why didn't you do this? Why didn't you do that? Why didn't you do this? And in fact, I hope you will, because these examples are going to be in GitHub. Go ahead and augment them to your heart's desire, and we'll accept reasonable submissions in that sense. The training data that we're using is the Criteo sample. I think it's 11 gigabytes, and it's seven days' worth of clickthrough data that's publicly available. It was used in a Kaggle challenge.

So if you look at the code here, we're just doing very standard stuff. I tried to keep the example overly simple for the sake of illustration. We parse a few arguments, and just because they'll show up again– I don't want to spend too much time here– the base directory is just so you know where the logs are, and where to export the model, that type of thing. And then the event date is– what I'm going to do is, given a date, I'm going to look back for seven days, use the past seven days to train a new model, and then go from there. OK? This here is like cheating, I guess. Max samples is 7,000. Recall that I'm on a Pixelbook. I'm not going to be able to train on seven days' worth of data, and I'm going to emphasize that point on purpose a little bit later. This little beautiful block of code is just reading in the data. And you're like, why are you using pandas? I won't get into those details, but there is a reason: it's to get the right data format so we can use a dict vectorizer. Here's the XGBoost regressor. I'm training an XGBoost model, obviously. It's kind of like logistic regression, I suppose. And then we create an sklearn pipeline where we have a dictionary vectorizer, which will take the categorical features and do the one-hot encoding for us. I call pipeline.fit, we cross our fingers that it works, and we won't necessarily wait for it to finish, but it won't take very long because I used so little data and all that stuff. It says it's done. So with, essentially, one line of code, you save the model out. I'm sure you guys are all familiar with how to pickle a model. And then this command here– gsutil– it's just a file copy command that you use to go to the cloud, because if you try to use cp, obviously it won't know how to get it to the cloud. So we save the model, we copied it to the cloud, and now we need to actually deploy the model as a service on the cloud.

What we're going to do, with these two commands here– that's going to create a REST API for you that you can then send prediction requests to. And it does all the load balancing, it does all the auto-scaling, it has the web server, authentication, authorization, prediction, et cetera. We'll talk more about that, although I might condense that part of the slides, in a few minutes. So this first line here– we have a notion in the current service called a model, which is essentially a collection of versions. We'll see how this, in the context of training new models every day, allows you to keep various versions and switch between them– like, I'm serving yesterday's model, and I now need to serve today's model– but for the time being, think of it as a container. We create this model. If I do it again, I'll get an error, because they have to be unique, so I'm not going to run it. Now, this mess that is cURL will not be there in about a week or two. We just got the command-line tool updated to accept framework as a flag, and it wasn't there yet. So it's essentially very similar to the models create command, but it's versions create. We're saying, create a version of this model. And since I bumped up the version to v4, when I run this, it's actually going to the cloud. And you can imagine, because it's doing all the things I talked about– load balancing, web servers, blah, blah, blah, blah, blah– it does take a little while. But ours is actually very fast. I think the median time is 90 seconds. Your very first model may take a little bit more time, more on the order of five minutes, but it takes about a minute and a half.
So while we're doing that, let's switch back to the slide deck. Now I know we have to do a context switch– for those that are engineers, the software engineers, you understand that reference. But the question is, why is production such a long pole? Remember, one of the pain points we're addressing in this talk is that it's hard to get models into production. I wanted to open that question up to you guys. What are some of the reasons that it takes so long for models to get into production, in your experience? Getting the data? Well, what about once you have the model, though? So you've got the data, you train the model. All you really want is to show impact, right? To get better performance review scores or something like that. Or validation? Well, that's a good point, actually. Yeah. I don't talk too much– I hint at that on one of the slides– but you're not going to put something into production that's junky and brings down your system. Any other reasons?

AUDIENCE: Performance evaluation.

ROBBIE HAERTEL: Performance evaluation, absolutely.

AUDIENCE: You said also you want to be sure that you can roll back if something went wrong.

ROBBIE HAERTEL: Yeah.

AUDIENCE: You have to prepare for that as well.

ROBBIE HAERTEL: Yeah. You've got to be able to do a safe rollback. Very good. Back here?

AUDIENCE: Managing the infrastructure [INAUDIBLE]

ROBBIE HAERTEL: Yeah. How many of you consider managing infrastructure to be your favorite thing to do? That's why I have a job here, right? Doing this. Exactly– I may be the only one in the room. Here's a few other things. You guys said some really great things. I didn't even think very hard about this, and these are the ones that came off the tip of my tongue, or fingers. You need to get approvals. You need to do capacity planning. Some people– I actually had a conversation, I can't recall right now who it was– you have to convert your model from one format to another format, and then do validation on it, that type of thing. There are some technical challenges. Somebody mentioned that just creating a serving system is a very complex stack. You have your load balancing. You have your web server. You have authentication. You have authorization. Oftentimes you have a cache. You have logging agents. You have monitoring agents. They all have to work together. One thing is to get all of those up. Another thing is to keep all of them up, right? Because if any one of those pieces goes down, it's your pager that goes off, if you're the production engineer. And you probably don't want that to happen. And certainly, those of you that are data scientists– even if you're capable of doing this, that's probably the last thing on earth you'd want to do. Otherwise, you would be a production engineer and not a data scientist.

So here's what Cloud ML Engine Prediction offers. It's click-to-deploy. It has horizontal auto-scaling, so as you send more traffic, it scales up; you send less traffic, it scales down. And it's serverless, so it scales all the way down to zero, which is really fantastic when, say, you're a startup and your traffic patterns are very sparse in the beginning phases– you get some traffic and you're happy, it scales down really low and you're not so happy– or with very spiky traffic. We scale up very quickly. Our servers are able to scale– it's world-class, really. We give you encryption, authentication, authorization, logging, monitoring. You can organize your models with labels. You can do versioning rollouts and things of that nature.

So let us go back here. Because it's a Jupyter notebook, you don't get the nice spinny indicator– I don't have a nice extension to do that– but we can go ahead and just start sending traffic to it. You can see that this piece of code really boils down to these four lines, which is essentially one line that I broke up because a really long line is hard to read. Or two lines, really. You create a connection to the service, and you call execute, and you pass it the data. This example– it's not clear from the way I've organized the Jupyter notebook, but it's literally the same data structure we used for the local prediction. OK. That's going to get translated, obviously, to JSON, because we've set up a REST API to do that. And you can see– if I'm lucky; in the previous run they were both the same, and I'll keep my fingers crossed with the live demo– I get a warning about credentials, but otherwise, we get the same value. I'm comparing the local prediction– pipeline.predict, recall that I have an sklearn pipeline– with calling Cloud Prediction and grabbing that out of the cloud. Very fast, low-latency response times. Well, it happened to still be up– if it had scaled down to zero, that request would have taken a little bit longer to bring up the servers– but it's very fast. OK. So let's move on to the next section of the talk.

So that's prediction. Like I said, the premise was, you have a model. We combined sklearn features like pipelines and XGBoost as the underlying model. You can use any of the models that sklearn has to offer, any of the transforms that sklearn has to offer. Essentially a click of a button. And we're working on a GUI– I think it's close to being implemented– where it's literally a click of a button. You point to the model and you're off serving in production, at production scale. And you don't wear the pager– I do.
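The four lines of client code mentioned above aren't reproduced in the transcript; as a minimal sketch of that online prediction call, using the google-api-python-client discovery interface against the ML Engine REST API, with made-up project, model, and version names:

```python
# Sketch of the online prediction call described above. The project, model,
# version, and feature values are placeholders, not the talk's actual ones.
from googleapiclient import discovery

# The same list-of-dicts data structure used for the local pipeline.predict call.
instances = [{"ad_position": "top", "device": "mobile", "price": 3.2}]

service = discovery.build("ml", "v1")
name = "projects/my-project/models/ctr/versions/v4"
response = service.projects().predict(name=name, body={"instances": instances}).execute()

print(response["predictions"])   # compare against pipeline.predict(instances) locally
```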
So let's talk a little bit about MLOps. This asterisk-ops, if you will, is sort of a buzzword in the industry. We've got DevOps. You've got DataOps. You've got PSYOPs. You've got– I don't know what other ops. So of course there's MLOps, because why not, right? What do we mean by that, though? I like this diagram. If I hadn't stolen it from a paper by some folks that work here at Google, I probably would have chosen slightly different boxes, but I think it serves to illustrate the point very, very well about what MLOps is. If I can direct your attention to that teeny, teeny, teeny, teeny box in the center of the screen– that's the ML code. That's the code that you guys write as part of your job to get a model running. All of the other stuff is what it takes to serve– and not just serve, but to train and use ML in a production scenario. Sure, if you're just fooling around, like for a research paper, and you've got to get a more accurate model than the last guy that published a paper in this domain, you don't need all this stuff. You can do it in a Jupyter notebook. Go use Colab, or Kaggle Kernels, or your favorite Jupyter notebook and get it done. But if you're going to do something in production, there's a lot of work that goes into it. You guys are probably very familiar with the data collection phase, maybe configuration as well. Obviously, feature extraction in a production system has to be running– it's like a service that's running on a continual basis. You have your serving infrastructure, which we already went into in slightly more detail– how complex that can be. You've got to have resource monitoring, blah, blah, blah, blah, blah. There's a lot that goes into serving in production. When you define the term MLOps– when you define the term DevOps, it's merging software development processes with software operations. I actually think that that definition does not hold up for MLOps. Why? Because I feel like we should be taking the ops out of ML, so that our job in serving production systems
is more like the small black box, and less like everything around it. Let's let the infrastructure– let's let somebody else have that job. Let's let somebody else carry the pager, type of thing. And so that's really our goal. Let me give you an overview of what I think the steps should be for MLOps– with most of the ops taken out of the MLOps. The first step is to develop your model. I would hope that that would be 99% of your time. You'd spend the three weeks to get the best darned model you can out of your data, and then you're ready to use it in production. Of course, this is the nice GUI that I said is not quite there yet, but it will be in a few weeks. You click a few buttons and you deploy it. You can go– and I forgot to do this; if we have a chance, we can go back and look at it– but we have monitoring and logging for all those models that we just deployed. Single click, and it's all there. So you're looking at it, you're looking at your latencies, and you say, oh, wait. Why is my latency so high? And then you say, oh, I get it. I didn't have enough cores, or I didn't have enough nodes, to handle the spikes that we had. You make a few tweaks, you click a button, and you're off and going. And ideally– ideally you don't even have to do this, but the reality is we're not quite there yet, so you have to do a little bit of tweaking. But you can even do it yourself. That's how easy it is. And that's how I think MLOps should be done.

So why cloud? Now, I probably should've asked a question. I don't know how many of you– well, let's just do it. How many of your companies currently use the cloud for any of their ML? We have a few, but not many. And we find that to be true. I'm not here to sell you on cloud. It might sound like it, and I apologize if that's true– obviously that's what I do for a living– but let's just have an open conversation here. The cloud isn't necessarily for everybody. I had a few conversations offline with some people. But here's where it does shine, and if you fit these needs, you ought to consider it. We have world-class networking and infrastructure here at Google. We really pride ourselves on that. So you can count on lower latencies, faster startup times, and things like that than you'd get by rolling your own solutions. Elasticity and scaling– those are buzzwords as well, but if you think about it, if you purchase a cluster of computers to put in your own office or your own data center, you have a fixed cost. If you use more than that, you're going to get really bad latencies when you're serving, because you don't have enough capacity. If you use less than that, you're wasting money on servers that are sitting idle, or worse, using power. In the cloud, you're paying for what you use. You scale up, you pay for what you use. You scale down and you're not using any, you don't pay for anything. High availability– Google prides itself on this, and part of that's due to the next line, which is Google SREs. SRE means site reliability engineer; you may have heard of this. Google hires the very best engineers that know how to run and maintain production systems. They help design the system so that there are fewer outages to begin with, and when there are outages, it's their pager that goes off. They know how to respond and quickly mitigate any problems that are there. And that's part of the reason why Google Cloud has high availability for its services– because of these folks here. Speed of deployment– you saw, in 90 seconds, we went from a model that I trained on my laptop to one that was serving at production grade. And then you'll probably see this line– maybe I overemphasized this, but I really believe that the total cost of ownership and paying for what you use is a real benefit of the cloud.

So what if you're not ready yet? There are many reasons why. Maybe these use cases don't fit your scenario. Maybe your company's really slow at adopting things like cloud or other technologies. Or maybe you guys are just onboarding, but you still have a transition time. I'd like to refer you to Kubeflow, which is built by our team– the Cloud AI Platform. What it is– I like to call it the Anaconda of ML. It's a bunch of packages that bring you ML, that make ML very easy to do on Kubernetes. So if you guys are already using Kubernetes or thinking about Kubernetes: you can run it on your laptop with Minikube, you can run it on-premise if you have a private cloud or on-prem servers, and you can run it on the cloud. Amazon just released their Kubernetes offering, Microsoft has their Kubernetes, but obviously I want you to use Google Kubernetes Engine. And there's the hybrid case, where maybe you do have a data center and it's at capacity, but when it goes over capacity you can spill over to the cloud. And of course, there's the Cloud ML Engine that I'm focusing on more or less today– I'd like to think of Cloud ML Engine as managed Kubeflow. So if you don't want to manage this stuff yourself, you can pay for the managed service.

All right. So I focused a lot on prediction, but the MLOps stuff is a lot more than just prediction, as we saw in the diagram. So now I want to talk a little bit about non-prediction cases. There's a lot I can say, and I can tell you, just as a teaser, expect a lot from Google in the coming months– a lot of really cool stuff– and that's all I can say.
and that’s all I can say But when might you want to train in the cloud? These may seem obvious, but let’s take a second to talk about them In the cloud you can get really large machine types How many of you guys have 96 cores sitting under your desk? If you do, I want your machine I can’t remember what we cap out at, but I think it’s somewhere like a terabyte of RAM We have machines that have that much memory And you know what I found? I actually had an intern do this last summer I said, hey Go grab the Criteo data set Go get as much RAM as you need Get a VM with as much RAM as you need and train it, and see what happens It took three hours to train You needed a lot of RAM, hundreds of gigabytes, I think Took three hours to train and he was done He didn’t have to go figure out how to do distributed training He just had to pay for three hours of a server, a fairly reasonable price with that much RAM, shut it down when he was done, and he had a really great model because it was trained on all that data Number of cores Imagine you’re doing a hyper-parameter sweep, grid search, or something like that, and you want to do it in parallel, but you don’t want to set up a SPARC cluster, or you don’t want to maintain any of that Go get a machine with 96 cores You can run 96– well, it depends on how many cores each job’s running, but you can do a lot of stuff in parallel with 96 cores Of course, you don’t want to go get a VM and do all your work on that because you’d be paying for it the whole time It’s expensive But this is what the cloud brings to you I talked about distributed training It is non-trivial If anybody’s had to set up XGBoost to do parallel training on SPARC, you can do it There’s some sample code online to try and do it, but it’s actually non-trivial And we don’t yet support XGBoost distributed training, but for TensorFlow, it’s very simple to do distributed training If you’re using tf.estimator, you literally have no code to change You submit it to the cloud and it scales up to the number of servers you tell it to scale up to And hopefully one day we can bring something like that to XGBoost We have a hyper-parameter service called HyperTune, which is available for XGBoost and scikit-learn models And it gives you much better results than doing a grid search or random search It’s state-of-the-art algorithms, so you’ll get best-in-class performance there, and it’s all in parallel You don’t have to wait So I talked to a researcher here at Google and he says what he does is he plays around during the day on small data sets, gets things how he likes them, sends off a job at night, comes back in the morning and checks on how well his model did Keeps his fingers crossed that there was no errors halfway through, right? Can anybody relate to that? 
I know that’s what I did during my PhD, too Repeatability So what I mean by this is– there’s more examples, but the best example of this is with continuous training So we’re talking about training a clickthrough rate model where you get new logs, and tons of them, every single day And every single day you’re going to train another model And so you’re going to use the same code, and you’re going to use it every single day, or actually a lot of people train like every hour, so they’re training very frequently You’re not going to want to go to your Jupyter notebook and hit a button at 4:00 PM every day to train another model This is a repeatable process that needs to be productionized It needs to be robust And so the cloud is really great for this And in fact, if you’re taking advantage of those large machines and those large cores, it sends off a job It uses all those things for a short period of time It shuts it down, and you only pay for that time that you’re doing it So let’s come back here to see what that might look like in the cloud So there is a price to pay when you’re switching from your local machine to the cloud You do need to package up your code in a way that the service can understand And there are multiple ways to do that We’ll be supporting containers, which is becoming an industry-standard way to do it For now, we use Python packages So you just run python setup.py sdist You see the second command then uploads it to a bucket in the cloud And I think that’s finished And then I’ll go ahead and hit Enter, and then you can see right here all I’m doing is saying, submit a training job called CTR, for clickthrough rate, with the time stamp so it’s unique I’m saying that it’s trainer.train This package is what we just uploaded The runtime version tells us which version of– now I wish that was more obvious that this was like scikit-learn whatever version, but this does translate into scikit-learn version This is Cloud ML Engine’s runtime version 1.8 We’re going to do it in US central It’s with this project, and a few command-line parameters So just since I forgot to say this– this train.py is that exact code we saw at the beginning The only thing is you would probably take off that max sample lines that I had, because my Pixelbook couldn’t handle it, and I won’t admit how long it took me to figure out that’s what was crashing on my notebook because it didn’t tell me And you know, you might change the max depth,

You can see that the job was queued. Again, we don't have a first-class integration, so we have to go over here to our list of jobs. And let's see– that one. You see the spinny thing, which means it's still spinning up the machines. Our prediction service actually spins up the machines very, very fast; the training takes a little bit longer. I think if it's a single-machine job, it's usually around one or two minutes. So we can go to this job here and take a look at the logs. Nothing super interesting– just to show you that there's this integration here with the logs, if I'm lucky enough to get it to come up. You can see that all the output's captured in the logs, in case you need to go back and figure out something that's wrong, or something that's right. And you can see that the job completed successfully.

So that's how you send off a single job. But the use case that I was talking about is really about productionizing this and doing it on a daily basis. How many of you are familiar with Airflow, by chance? A couple of you are– probably the data engineers, I think. I mentioned to somebody before we started that I'd hint at a little bit of pipelines here. So what Airflow is– and there are lots of orchestration systems– is something that allows you to chain together various processes that you can then run on a repeated basis. In this case, on a daily basis. You can see that right here. And I'll get into what this actually means, but you can see that every day for the last two weeks, this code that we just submitted to the service to train a model has run on the previous seven days' worth of data, produced a model, and then pushed that model to serving. So let's go ahead– and you can see that in a graph view here. Very, very simple DAG. I did that for the sake of this demo. It just does training and create version, but some of you guys have already pointed out an obvious problem with that: if something went wrong during training and I pushed it to production, I'd be in a world of hurt. I could even get fired for that. So what you would normally do– and it's actually not that difficult– is put another node in the graph that runs evaluation, and another node in the graph that checks that the quality of that model is better than the previous one, or the model that's currently serving, before you push it, and then go ahead and push it to serving. But for simplicity's sake, I just trained and pushed. And you can look at the code to do that– it's actually very, very reasonable. Again, you have the preliminaries– which binary do I use, et cetera. These DAG args are just necessary for Airflow to work. A little bit of overhead– copy and paste is what I like to think of it as. And then here's where you say, hey, with DAG– in other words, with my pipeline that I'm trying to define– you have this training op, which looks very close to the gcloud operation, but in Python code. It says, hey, use this binary for training. And then you have the create version op, which again looks like the gcloud operation for pushing a version. So I don't know how many lines of code that is, but the core of this is, what, 20 lines of code, maybe? And we've defined this pipeline that we're able to run in the cloud on a daily basis that pushes the models to production.
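The DAG from the demo isn't reproduced in the transcript; here is a minimal sketch of the same two-step shape, using plain BashOperators around gcloud (the talk's DAG used ML Engine-specific Airflow operators, so the names, dates, paths, and flags here are placeholders– and as noted above, a real pipeline would insert evaluation and validation steps before pushing a new version):

```python
# Minimal sketch of a daily train-then-push DAG like the one described above.
# Uses plain BashOperators wrapping gcloud; the talk's demo used ML Engine-specific
# Airflow operators. Bucket, model, project, and flag values are placeholders,
# and a real pipeline would add evaluation/validation steps before create_version.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "ctr-team",
    "start_date": datetime(2018, 6, 1),
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG("ctr_daily", default_args=default_args, schedule_interval="@daily") as dag:
    train = BashOperator(
        task_id="train",
        bash_command=(
            "gcloud ml-engine jobs submit training ctr_{{ ds_nodash }} "
            "--package-path trainer --module-name trainer.train "
            "--staging-bucket gs://my-bucket --region us-central1 "
            "--runtime-version 1.8 --stream-logs "   # --stream-logs blocks until the job finishes
            "-- --event-date {{ ds }} --base-dir gs://my-bucket/ctr/{{ ds_nodash }}"
        ),
    )

    create_version = BashOperator(
        task_id="create_version",
        bash_command=(
            "gcloud ml-engine versions create v{{ ds_nodash }} --model ctrdaily "
            "--origin gs://my-bucket/ctr/{{ ds_nodash }}/model/ "
            "--runtime-version 1.8 --framework scikit-learn"
        ),
    )

    train >> create_version   # push the new version only after training succeeds
```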
Let's see if I have this. So you can see, here's the model, ctrdaily. Remember how I said a model is like a container for versions? And here are all the versions of that model that got pushed on a daily basis. Now, on a production system you'd probably also be garbage-collecting these, so you don't have so many lying around. You can see that this is the default one, so if I send traffic to this model, it's actually going to go to that version. You'd want to update that on a daily basis, at least after you'd validated it.

So, yeah. Any questions? I forgot to say this– it was kind of difficult with all the interruptions we had– but my preference was going to be to have questions in-line, and I'm almost at the end of my talk now. Any questions about these pipelines or the training? Yeah.

AUDIENCE: The logging that you showed–

ROBBIE HAERTEL: Yeah.

AUDIENCE: Is that something that's spit out of the cloud box itself, or is that something that, if I had logging in my code, I could spit out to?

ROBBIE HAERTEL: Yep, yeah. So if you do your print statements– your logging.info statements, if you're a Python programmer– those things show up here. And I think it tells you which one's which. The standard error shows up, unfortunately, as orange warning signs. There's nothing to be worried about– that's just XGBoost telling us that it's doing work. Any other questions?

So this is really the end of my talk. Just to summarize in a few points: productionization is hard, cloud is easy, and you pay for what you use on the cloud. All right. And a few teasers, I guess, before we move on to see if there are any other questions about the entirety of the talk. Our team is working on an increased focus on reducing MLOps. We saw probably 10 boxes on that page, and I addressed a few of them, like the continuous training case and the serving case. But there are actually a lot more boxes on there, and you'd better believe that we're spending time making this easier and easier as time goes on. Increased cost-efficiency work– that's a continuous focus for us, because we're able to amortize the cost of the infrastructure across everybody, so we're actually able to get you lower costs than you might get on your own. And then we're also working to allow data scientists like yourself– and even data engineers as well– to share their reusable work with others. We're working on ways to do that.

[MUSIC PLAYING]