Ian Huston – Massively Parallel Processing with Procedural Python


I'm going to talk today about how to do massively parallel processing with something called procedural Python, and we'll get to why it's called procedural Python later on. The idea is that I'll show you a few things in a little demo with an IPython notebook, and hopefully it all runs, because it's actually live, talking to a database somewhere in America.

As was said, I'm a data scientist at Pivotal. Pivotal is a relatively new company, formed last year out of parts of EMC and VMware. If you haven't got these links: the IPython notebooks are in a GitHub repository, and there's a rendered version as well. I'll put these slides up on the web later on.

So what is Pivotal? Pivotal was formed out of lots of different parts of two companies, EMC and VMware, and the idea is to bring together the big data components of those companies. There may be a lot of things you've heard of before but never connected with Pivotal: Pivotal Labs, our agile development team; Cloud Foundry, our platform for cloud-independent apps; and if you're a Java programmer you've hopefully heard of the Spring framework, which millions of Java programmers use every day. Then we've got our data layer, or data fabric as we call it, which includes the Greenplum analytics database, a Hadoop distribution like everyone else, and our in-memory database systems GemFire and SQLFire.

Just briefly, to tell you where I'm coming from: our customers are normally large enterprises with lots of data who maybe aren't using it to the best of their abilities. They have terabytes to petabytes of data, structured and unstructured, and they want to do something more with it. Maybe they want to reduce the fraud they're seeing in their banking transactions, maybe they want to understand their customers better, maybe they want to sell more clothes or understand how fashion is changing week to week and month to month. What they're not able to do at the moment is get what they want out of the data, because, unlike a smaller and more agile company, they're lumbered with large and relatively old legacy technologies that have high cost and pretty limited flexibility.

One of the main things we see when we go into a company is that the response times for data science or interactive data analysis are really long: you send off a query and it comes back in six hours, or the next day, or something like that. What we really want to help our customers do is get down to interactive data science: I put in a query and I get something out, so I can try the next query, fail quickly and move on fast.

A few other things you might have heard of that Pivotal is part of, contributes to, or provides support for: Redis, RabbitMQ, Grails and Groovy, and in fact we now have one of the main Hadoop committers on our team. In the middle of all this is the MADlib library, which I'm going to talk a little more about, and if you want to interface with MADlib through Python there's a wrapper called PyMADlib that one of my colleagues wrote.

MADlib is an open-source machine learning library for use on large-scale, massively parallel databases. It works with PostgreSQL, it works with Greenplum, and it now also works on some Hadoop frameworks like Impala and HAWQ.

So when we go into a customer engagement, what sort of technology do we have on our side? Normally, because I work for Pivotal, it's something like the Greenplum database or our Hadoop distribution with HAWQ, which is a SQL-on-Hadoop solution. There are open-source options for all of these: there's a Greenplum Community Edition and a Pivotal HD Community Edition, and as I said, MADlib is completely open source.

Where does Python fit into all of this? Well, we use Python in the same way a lot of you do, for exploratory analysis in IPython using NumPy, SciPy, pandas and matplotlib.

But we also use this other thing called PL/Python, which is a way of leveraging Python inside a database. You already have your massively parallel database — it cost a lot of money, it's very large, it does lots of computation very fast — and we want to enable you to use Python in there, not just SQL or a SQL-like language. The benefit of using PL/Python is that we get to use all the great packages like NLTK or scikit-learn inside the database, so we don't have to bring the data back to a laptop to do those analyses. That's one of the big themes for us: bringing the code to the data, not the data to the code, and obviously that gets more important the larger the data sets you're looking at.

OK, let's switch to the IPython notebook and see if it works. We've skipped a little bit of the intro, but basically the idea is this: you have a Postgres database and you think about how to make it work with larger amounts of data. If I had 20 machines, or 100 machines, I could put Postgres databases on all of them and then have a master node that connects to all of those and tells each of them what to do. When you submit a query, you submit it to the master node; it parses the query, figures out where the data lives on that distributed system, sends the work down to the individual nodes, gathers the results back together and gives you the answer. So the simple way to think about something like the Greenplum database is Postgres living on a cluster, controlled by a master node.

Where PL/Python comes in is that it allows you to write functions for your database that aren't in SQL. In SQL databases you normally have things called user-defined functions (UDFs): a way of writing some query in SQL that you then wrap in a function and can call repeatedly. PL/Python is a way of writing a Python function inside that SQL wrapper. In fact you can do this with other languages as well — there's PL/R, and I think there's even PL/Java — and there's another thing called PL/pgSQL, which is Postgres's procedural language that looks like SQL but not quite; you can assign variables and that sort of thing.

Obviously we won't get the interactive bit here, but you can see what's going to happen. First of all, this whole thing is a little confusing, because what I'm going to be doing is using an IPython notebook to call a SQL database using SQL — not pure SQL, because in the middle of it somewhere we'll have a little bit of Python — so there are layers: Python to SQL to Python again. The way we're able to do this is with a really handy little magic command called ipython-sql, by Catherine Devlin; it's on GitHub so you can go and get it. What you need to do first of all is load the magic command and then connect to the database. If you work on large databases you normally create a schema — it's just like a sandbox, a namespace really, to keep all your work together.
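As a rough sketch, the first notebook cells look something like this — the connection URL, credentials and schema name here are placeholders, not the ones used in the talk:

    %load_ext sql
    %sql postgresql://user:password@gpdb-master.example.com:5432/demo_db

    %%sql
    -- a sandbox schema to keep the demo objects together
    CREATE SCHEMA plpy_demo;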
First of all, let's look at what a normal user-defined function looks like. The syntax highlighting doesn't quite work here, but the idea is that you have a simple SQL command, CREATE FUNCTION: you give it a name, you give it the type of the argument that goes in — here an integer — and you tell it what it's going to return, again an integer. Then in between the two dollar signs is the actual body of the function. Here it's super simple: select two times my first argument. As you might imagine, when you run it — because we're in SQL you have to use a SELECT, so you select the function with some argument — the answer you get out is 20. So it works.
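A minimal sketch of that kind of plain SQL UDF — the function name is just illustrative, but the body matches what was described:

    -- plain SQL user-defined function: takes an integer, returns twice its value
    CREATE OR REPLACE FUNCTION times_two(x integer)
    RETURNS integer
    AS $$
        SELECT 2 * $1;
    $$ LANGUAGE sql;

    SELECT times_two(10);   -- returns 20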

Going on to PL/Python is really simple: between the dollar signs, instead of writing SQL, you write actual Python. Here we have something really simple: if a is greater than b, return a, otherwise return b. The outside looks quite similar — you've got CREATE FUNCTION, you have to give the arguments types, you have to say it returns an integer — and at the bottom you specify which language the body is written in. The language is called plpythonu; the "u" stands for untrusted. To install it on your database you have to have admin privileges, because with Python you can do anything — you can in fact access the file system of the machine your database is sitting on while it runs the query. If you're an administrator that probably sets off a few alarm bells, but that's the way it is, unfortunately. I think there are efforts to make a trusted version, but that would obviously limit the capabilities of the Python interpreter inside it.

So we've created this function, plpy_max: give it two integers and it returns the larger one. We test it with a simple SELECT of the function with two integers, and we get the result. If you've ever worked with SQL-style databases before, you can start thinking about how to make this more complex. The function call can live in a SELECT statement, so you can do things like selecting the function applied to a column name and it gets whatever comes into that column from your table; you can put it inside another UDF; and you can also return composite types that your database will understand as rows.

Here we want to return what is, in some sense, a tuple of a name and a value. Again you have to type them, so you give them a text type and an integer type. Now when we create the function, we can return a list with the name and the value in it — inside Python the types don't matter too much, so you can just return a list. When I call make_pair with a name and a value, say 1, the output I get is that pair coming back to me. Inside the Python part you don't need to worry too much about what sort of container it's in: it can be a tuple, it can be a list, it can be a dictionary whose keys are the names of the returned values (name and value here), and it can in fact be an object that has attributes .name and .value. You can imagine doing something inside scikit-learn and getting some object or result set back. Yes, you have to do some extra legwork to create this type, but once you've done that, your database actually understands the parts of the Python object you're returning, is able to put them in the right place and knows the right types for them. So we tried out this function and we got this pair back, which in the database looks like a result set.

The next thing to note is that UDFs and PL/Python functions are allowed to have multiple function signatures. Previously my function had a name and a value, and down here it just takes a name. Sometimes people use this to overload a function to do things like provide help: SQL isn't a nice programming language like Python where you can introspect things or find comments or docstrings inside a function, so people do interesting things like having the version with no arguments give you back the help documentation.
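A sketch of those two pieces — the PL/Python max function and the composite-type version — assuming the plpythonu language is installed; the type name and the exact test values are reconstructions, not taken verbatim from the slides:

    -- simplest PL/Python UDF: return the larger of two integers
    CREATE OR REPLACE FUNCTION plpy_max(a integer, b integer)
    RETURNS integer
    AS $$
        if a > b:
            return a
        return b
    $$ LANGUAGE plpythonu;

    SELECT plpy_max(3, 7);   -- returns 7

    -- a composite type the database understands as a row of (name, value)
    CREATE TYPE named_value AS (
        name  text,
        value integer
    );

    CREATE OR REPLACE FUNCTION make_pair(name text, value integer)
    RETURNS named_value
    AS $$
        # a tuple, a dict {"name": ..., "value": ...} or an object with
        # .name and .value attributes would work just as well here
        return [name, value]
    $$ LANGUAGE plpythonu;

    SELECT make_pair('Horatio', 1);   -- returns the composite value (Horatio,1)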
Apart from that, we can also, as I said, leverage the power of Python packages inside these PL/Python calls. To do this you have to have installed the package and all its dependencies on each of the nodes of your database, so I'd recommend doing that with some kind of package manager, maybe conda or something like that. The way we do it on Greenplum is with a parallel-SSH kind of system: I log in once and I'm actually logged in to the ten, twenty or thirty machines all at the same time, so that lets you install these things quite quickly.
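Once a package is installed everywhere, one quick sanity check (this is a hypothetical helper, not something from the talk) is a tiny PL/Python function that just reports which module version the database's Python can see:

    -- hypothetical check: can the database's Python interpreter import numpy?
    CREATE OR REPLACE FUNCTION numpy_version()
    RETURNS text
    AS $$
        import numpy
        return numpy.__version__
    $$ LANGUAGE plpythonu;

    SELECT numpy_version();   -- e.g. '1.7.1' if the install worked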

Once I've done that — once I have numpy on all the nodes and all the segments of the database — I can use it inside these PL/Python functions. Again this one is really simple: inside the body, with numpy, I just make an array and then return the name I was given along with some value from that array, and again I'm returning that named-value type from before. When we run it, we get back the name and one of the elements of that array.

One thing to note about SQL databases is that the calling semantics for these functions — whether they sit in the SELECT list or in the FROM clause — actually affects how the results come back to you. Before, we had select make_pair('Horatio') or whatever, and we got the result back as a single composite value, a tuple in some ways. If we instead call the function in the FROM clause and select * from it, we get back separate columns, which you can then reference in a further outer query, or put straight into a table, and all those sorts of things. The database now understands that there are two values, name and value, and we can do whatever we want with them after that.

So far I've only returned one result: I send one query and I get one row back. Can I get more than that? If I do a complicated numerical calculation I'll probably have a lot of results that I want to get back and store in my database. How do I do that? The easiest way is to return a set of something — RETURNS SETOF — which tells the database that there's going to be more than one row, so be ready for however many rows come out of this query. Here again, importing numpy, I return three different results, and if I do select * again they come back as three different rows. So we're playing within the rules of the database semantics, and this allows you to put these things together in a sub-query, save the results into a table — everything you would do with SQL results anyway, you can do with these Python functions.

OK, so that's all great — we can use it to run some Python — but why do we want to do this? The real reason is to leverage the parallelization that's possible when you have a large distributed system. At Pivotal we have the Data Computing Appliance, with something like 16 blades each with four cores, and once you start leveraging that you can get quite large speed-ups in your queries. One of the main things to think about, though, is that your data is going to live on this database in different places, and how you distribute that data is very important. Whether you distribute by one column or another can make a large difference. For example, say we have customers and orders: customer information in one table, order information in another, and we want to join those two tables together.
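A sketch of the set-returning version described above, reusing the named_value type from the earlier sketch — the function name and the array values are illustrative, but the shape matches the demo:

    -- RETURNS SETOF tells the database to expect several rows from one call
    CREATE OR REPLACE FUNCTION make_pairs(name text)
    RETURNS SETOF named_value
    AS $$
        import numpy as np
        arr = np.arange(3) * 100              # three illustrative values
        # one (name, value) pair per array element becomes one output row each
        return [(name, int(v)) for v in arr]
    $$ LANGUAGE plpythonu;

    -- calling it in the FROM clause expands the composite type into columns,
    -- one row per returned element
    SELECT * FROM make_pairs('Horatio');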
If we keep all the customers spread across the system alphabetically, but the orders are distributed by order number, then each time I want to join those two tables I have to go and find which order refers to which customer — maybe going to a different machine, with all the network traffic that entails — to find the customer's name and get their information. Whereas if I distribute my data across the system based on, say, a customer ID value — all the customers with IDs 1 to 10 on compute node 1, 10 to 20 on node 2, and so on — and then distribute the orders in the same way, so an order that went to customer ID 1 lives on the first box, then doing those joins becomes much quicker: the data lives in the same place, there's no network traffic, everything is good. So one of the main things we discuss when we go to customers and talk about their data is how they distribute it.
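In Greenplum that choice is made with a DISTRIBUTED BY clause when the table is created. A hedged sketch of the co-located layout just described — table and column names are illustrative:

    -- both tables hashed on customer_id, so matching rows land on the same segment
    CREATE TABLE customers (
        customer_id integer,
        name        text
    ) DISTRIBUTED BY (customer_id);

    CREATE TABLE orders (
        order_id    integer,
        customer_id integer,
        amount      numeric
    ) DISTRIBUTED BY (customer_id);

    -- this join can now run segment-locally, without shuffling data over the network
    SELECT c.name, sum(o.amount)
    FROM customers c
    JOIN orders o USING (customer_id)
    GROUP BY c.name;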

Do we need to change it? Do we understand the kinds of queries we're going to be making, so that we can make an informed decision about where the data should live and enable this sort of speed-up from parallelization? Obviously there's still some speed-up if the data has to move, because you have multiple processing units, but to speed up the calculation as much as possible it makes sense to keep related data in the same place, and when you're joining two petabyte-sized tables that makes a big difference.

So let's see what this looks like in PL/Python. This bit is just generating some data: I have three series here, a, b and c. The first one is just floating-point values from 0 up to about 10,000, the next one is the sine of that, and c is the same sine but boosted by a hundred. You can see that I've said I want to distribute this data by the column called name. When I tell the database to distribute by name, what I'm saying is: look at the column name, take the values that are there, do something — probably a hash of those values — and put the data in the right place so that rows with the same hash live on the same box. You have to be careful how you do this. If you've got a petabyte-sized database with 60 nodes to store it on, and the column you distribute by is a gender column with male and female, then you only have two places to put the data: you put your two petabytes onto two boxes and leave the other 58 empty. So you have to make sure the cardinality of the column you distribute by is large enough to make full use of the nodes available. In this case we only have three distinct names, so it's not quite true here, but in a bigger example you'd have to think about it.

OK, so I now have lots of data, three million lines or so, and if I just naively create a function — say I want to find the mean of all those numbers; I know I could do that in SQL, but let's do it in numpy for fun — what I have to do is give the PL/Python function an argument that is an array of all the numbers I want to use. So here I've said there's some value_array, and it's a double precision array, an array of floating-point numbers. It only returns one float, but you give it an array of lots of floats. Inside, it's straightforward: we just return the mean of the array. But the way I call this changes the semantics slightly. If the column y holds the numbers we want the mean of, I want to take it out of this test_data table, but I don't want to hand it to the function one value at a time: if I called it as before, say np_mean(y), then what I'd get is the mean of one number — I give the function one number, I get the mean of that number back — repeated for every single row in the table. Instead we want to give it all the data at once, so you use a Postgres function called array_agg, which takes a column of results, rolls them up into one array object, and hands that to the function.
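A sketch of that aggregate-style call, assuming a test_data table with columns name and y as in the demo (the exact table and function definitions in the notebook may differ slightly):

    -- takes a whole column as one array argument and returns a single float
    CREATE OR REPLACE FUNCTION np_mean(value_array double precision[])
    RETURNS double precision
    AS $$
        import numpy as np
        return float(np.mean(value_array))   # cast to a plain Python float for the database
    $$ LANGUAGE plpythonu;

    -- array_agg rolls the whole y column up into one array before calling the function
    SELECT np_mean(array_agg(y)) FROM test_data;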
So I've got this function, I give it an array of all the values in the y column of my data, and what we get back is the mean of all those numbers together. But really what we want is a separate model for each of the types of data we have. It's obviously a contrived example, but maybe we have customers in lots of different European countries, or in lots of different US states, and we want a customer clustering model for each country rather than one overall model — or, as an earlier speaker mentioned when talking about hierarchical classification, a model for each product category. To do that we just use the SQL GROUP BY semantics: we put the name column at the front of the SELECT statement, and down at the end we say GROUP BY name, so that for each value of name all the corresponding y values are pulled together.
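As a sketch, the grouped version of the same call looks something like this (again assuming the test_data table from above):

    -- one mean per series: the y values are aggregated separately for each name
    SELECT name,
           np_mean(array_agg(y)) AS series_mean
    FROM test_data
    GROUP BY name
    ORDER BY name;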

The ORDER BY just means we get the results back in alphabetical order. Then we call the function, and y now isn't the whole table's worth of results: it's the y values for each group labelled by name. We had a, b and c, so we give all the y values that correspond to name a to the function and get something back, then do the same for b, then for c, and in this way you've naturally parallelized your problem, because all the data — which, as I said before, we distributed — lives on different boxes, and those boxes can go ahead and do the calculation and give us back their results. What you get back is something like this, which matches our contrived example from earlier: one series has a mean of 5,000, that's the one that just goes up in steps; this one is the sine around zero, so the mean should be around zero, and it is; and this one is the sine boosted by a hundred. So the results look right.

You can imagine how to do this for more complex examples: maybe we want to fit machine learning models with scikit-learn, and we want to do it for all these different categories at the same time. So let's do something slightly more complicated: a linear regression — OK, not very much more complicated — on those arrays of numbers. We're now giving the function two arguments, the x array and the y array, it's going to return an array of floats, and we're just going to use SciPy's stats module and a simple linregress and see what happens. Again, if I run it over all the data, I get back a single linear regression model fitted across everything, which gives me the slope, the intercept and the standard error. But I know I want this model not for all my customers across Europe; I want an individual model for each country, or for each product line, or for each type of bank transaction. Can I leverage the parallelization to fit all those models in parallel? Yes I can: as before, I simply use the GROUP BY semantics — select name, then the function, then GROUP BY name at the bottom — and it goes off and runs the SciPy linear regression on each of those arrays. What you get back are the three regression results: in the first case it's pretty much a straight line with the expected slope; in the second case it has picked out the horizontal line that goes through the mean of the sine values; and in the third case it has picked out that the y-intercept is 100, just as we put into the data.

OK, so that was really the end of my little demo. Hopefully we've used up enough time that I don't need to try to get my slides working for the rest. If you want to see what I was supposed to talk about from my PowerPoint, come up and talk to me afterwards, or connect with me on Twitter: it's just @ianhuston (h-u-s-t-o-n). I'm going to take a few questions now, so thank you.
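For reference, here is a sketch of the grouped regression described above, using scipy.stats.linregress; the function name and the x column of test_data are assumptions on my part, since the notebook itself isn't reproduced here:

    -- two array arguments in, an array of floats out: slope, intercept, r, p, stderr
    CREATE OR REPLACE FUNCTION np_linregress(x double precision[],
                                             y double precision[])
    RETURNS double precision[]
    AS $$
        from scipy import stats
        slope, intercept, r, p, stderr = stats.linregress(x, y)
        return [float(v) for v in (slope, intercept, r, p, stderr)]
    $$ LANGUAGE plpythonu;

    -- one regression per series, each fitted where its data lives
    SELECT name,
           np_linregress(array_agg(x), array_agg(y)) AS fit
    FROM test_data
    GROUP BY name
    ORDER BY name;

In a real pipeline you would also want to make sure the two array_agg calls see the rows in the same order, so that each x stays paired with its y.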
So the question is: how does this affect the execution plan that the query analyzer in the database comes up with? Yes, exactly — if you have a large database with a query planner that looks at the query you're trying to run and tries to figure out the best way to do it, it does not know about Python. It knows there's some function it has to call, but it doesn't know how long that call is going to take, and if you think about it, it can't really know, unless it understood all the ins and outs of Python and all the packages, numpy and so on. So no, the query planner doesn't understand the costs associated with these functions. Basically you're given the power — you've got a full Python interpreter inside each node of your database now, you could do anything, you could start writing files to disk, all that sort of stuff — and with that power comes a lot of responsibility.

A follow-up: once you've run it once or twice, would it try to collect statistics? I'm not sure, I can check, but I don't think so. I think it treats the function as a black box, and you can imagine there being difficulties, because it's not concrete what the runtime is going to be each time you call it — sometimes it might be very quick, sometimes very slow, depending on the inputs and on what's actually happening inside. So I have a feeling it just takes a step back and says: you're responsible for running this query and knowing how long it's going to take. It was a good question.

The next question was about query execution and throughput. I suppose, yes, you're going to have to have some way of handling that, and there is a queuing system for clients connecting to the database. Basically everything is handled by the master node: the master node dictates to everyone who's trying to connect what's going to run, and you can have a proper scheduling system with higher-priority queries versus lower-priority queries, because obviously you want some queries to run immediately and others you may want to slow down. Taking a broader view, the way we think about doing analytics is that you should have an analytics environment that enables interactive queries, and it probably shouldn't be your production batch environment that's doing the payroll at the end of the week. You don't want your data scientist to feel he can't run a query because he's worried he'll stop the payslips being delivered the next day. It's an important thing to figure out, and I think the benefit you get from being able to bring the code to the data — the speed-up in the whole process of data analysis and data science — is worth it, compared with trying to bring everything back to your laptop, which will obviously crash at some point. Sometimes there's a bit of chat on the email list — who's running that query, why did you let it keep going, please can you cancel it, it's run amok, can you shut it down — that sort of thing.

Next question: on your own machine, sitting at your laptop, when you type "import numpy" or "import matplotlib" you might count to three, four or five before it finishes. The goal of an analytics platform like this is true interactivity, the type of query that comes straight back. Is there a way to have a PL/Python interpreter sitting there with everything already loaded, so that when it gets the code it already knows numpy, scikit-learn and all the other libraries? It seems like that should be possible if you tell it ahead of time. So the question was: can I have a Python environment inside the database, on each of the segments, that's ready to go with all the libraries loaded? That brings up a good point: each time you run one of these functions you're creating an interpreter that has to load everything, and if there are loading costs they are incurred each time.
Exactly — that's just not the way it's set up. Each time it runs one of these procedural functions, whether it's Python or R or Java or whatever, it spins up a process to run the associated interpreter; it doesn't keep a long-running interpreter in the back end. There is a different paradigm, of course: you could just have a compute cluster and use something like an IPython cluster, where you have IPython kernels sitting out on the cluster waiting for you to send them work, but then you're sending data around — unless you've managed to make the data live in the right place already.

Next question: one of the problems with running distributed systems is debugging them remotely — what is the debugging support like for an incorrect or broken query? That's a very good question. It's probably not as good as what you'd have just running something locally.

If something goes wrong, I think you do get a stack trace back — but if something goes wrong on all the nodes, you'll get 64 stack traces back, and being able to work through that is hard. SQL is not known for informative error messages, so sometimes you might just get back "something's wrong here somewhere", and it points at the first place that actually caught the syntax error or whatever. I don't think we have a great way of doing it yet; it's something we need to work on.

And on the last question, about triggers: I haven't used them recently, but I think there is a way — in the Greenplum database there is definitely a way of defining triggers. Maybe come and talk to me afterwards and I can look it up; it's not something I've used recently.