Kunle Olukotun – Heterogeneous Computing [2/2]



Thank you. So I'll just cover some of our research and then we'll move on to the hands-on session. The last thing I really want to talk about is optimizing for a NUMA machine, and as with all my examples I'll start with a problem. If you look at machine learning, there is a very large class of problems that essentially boil down to: I have a bunch of data points, the y_i's, and I want to find the x that minimizes some function over that data, where x is essentially my output model. The way you do this is called stochastic gradient descent: because there are so many y's, you randomly sample a y, compute an approximate gradient from it, take a step that gives you the next estimate of the model, and repeat, and repeat, until x stops changing (a minimal sketch of this loop appears below). So essentially what you have is billions of very tiny iterations, and the question is how to do this efficiently.

Now think about how modern big machines actually look under the hood. I know they tell you it's all shared memory and communication just works magically, and while it does work, it is not very fast. In reality, in a multi-socket machine every socket has some number of cores and some piece of the total system RAM directly connected to it; across sockets you can only communicate through QPI, and there is a cache-coherence mechanism that maintains the caches and the RAM across all the sockets. The moral of the story is that when you cross sockets, things get slow.

One more thing before we look at solutions: if you want the algorithm I just showed you to run quickly, there are two things you can optimize. One is what we call hardware efficiency, which is the time per iteration. To make time per iteration go down I can parallelize — have a lot of cores working at the same time — but for that to actually scale I need very infrequent communication; if my cores communicate frequently, things stop scaling. Conversely, there is what we call statistical efficiency, which boils down to the number of iterations it actually takes to converge, and this unfortunately is in direct conflict with the first one: the more often you communicate, the faster you converge, and the less often, the slower. Time to converge is the product of the two — time per iteration times number of iterations — and I've just told you that making one better makes the other worse, so finding the actual optimum is very hard.

So let's look at some of the trade-offs in how we could run this. One option we call per-machine, which is the very simple shared-memory approach: you have one version of your model, it sits in some piece of RAM (you just randomly pick which one), and every core in the machine operates on that same mutable model. They are all writing to the model in parallel and the machine's cache-coherence mechanism magically makes this work for you — but writes in socket one invalidate cache lines in socket two and vice versa, and there is a lot of hardware overhead that goes into making this work, so you get stalls.
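To make the loop concrete, here is a minimal plain-Scala sketch of sequential SGD as just described — this is not the Delite or DimmWitted implementation, and the gradient and convergence test are placeholder choices:

```scala
object SgdSketch {
  // Placeholder gradient: least squares on a single sample, d/dx (x - y)^2.
  def grad(x: Double, y: Double): Double = 2.0 * (x - y)

  def sgd(data: Array[Double], x0: Double, step: Double, eps: Double): Double = {
    val rng = new scala.util.Random(42)
    var x = x0
    var delta = Double.MaxValue
    while (delta > eps) {                       // repeat until x stops changing
      val y    = data(rng.nextInt(data.length)) // randomly sample one data point
      val g    = grad(x, y)                     // approximate gradient from that one sample
      val xNew = x - step * g                   // take a small step
      delta = math.abs(xNew - x)
      x = xNew
    }
    x
  }
}
```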
The other extreme is per-core: just make a separate model for every core. Now of course you have no contention, because every core has its own private copy of the model; the parallelization is going to be beautiful, and no synchronization is even required. But because the models never communicate — they aren't updating each other — you are not going to converge very quickly. As I said, infrequent communication means slow convergence.

There is a trade-off we can make here: communication among the cores within one socket is actually pretty fast — there is a shared cache, and a conflicting write doesn't have to travel very far. So the hybrid approach is per-socket, where all the cores within a socket share one model but we keep separate models across sockets (a conceptual sketch of this strategy appears below). That's our trade-off space, so let's run some apps and see what happens. And this is what happens: unfortunately there is not one answer — it depends on your app, and it can even depend somewhat on your data size.
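Here is a conceptual sketch of the per-socket strategy in plain Scala. It only illustrates the structure (one replica shared per socket, infrequent averaging across sockets); a real implementation like the one in the talk also pins threads to sockets and places each replica in that socket's local memory, which plain JVM code cannot express.

```scala
object PerSocketSketch {
  val numSockets = 4
  val modelSize  = 1000

  // One model replica per socket, shared by all the cores on that socket.
  val replicas: Array[Array[Double]] =
    Array.fill(numSockets)(Array.fill(modelSize)(0.0))

  // Placeholder: run a batch of SGD updates against one replica (cheap, on-socket traffic).
  def updateWithSGD(model: Array[Double]): Unit = { /* see the SGD sketch above */ }

  // Each worker touches only its own socket's replica.
  def worker(socket: Int): Unit = updateWithSGD(replicas(socket))

  // Infrequent cross-socket communication: average the replicas occasionally,
  // which is what keeps the statistical efficiency acceptable.
  def averageReplicas(): Unit =
    for (j <- 0 until modelSize) {
      val avg = replicas.map(_(j)).sum / numSockets
      replicas.foreach(r => r(j) = avg)
    }
}
```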

This is the convergence time for several apps. For Gibbs sampling, per-core and per-socket are similar, but per-core is ultimately the winner; for SVM it's per-socket; and for the regression app it's per-machine. What you see here is basically that if you want performance, optimizing your algorithm really matters.

Now I want to show you what kind of performance you can get once some smart people have come up with the right algorithm and you actually write it in code. We compare here against an implementation called DimmWitted, which is basically an implementation of what I just described, written in C++. They got a lot of press because they were 100x faster than everyone who had done this before, and the reason is that they did the very smart things I talked about — limiting communication across sockets — and got very good scalability. They are the blue bars, and they get basically 4x on four sockets, which is essentially linear speedup. But now look at the absolute numbers, and you see that we are even faster than they are. The reason is that while we are equally scalable, our single-thread performance is much better than theirs, because we have this metaprogramming, generative approach that generates very efficient code, whereas they wrote C++ that they actually want to be able to read, and that cost them something in performance. So even with the optimal algorithm, the implementation really matters — we're talking about a difference of over 50x here.

Finally, I want to direct your attention to the last bar on the right, the GPU. If you do the back-of-the-envelope peak calculations, you'd think the GPU should have done way better than this: if you compare the memory bandwidth this machine has to the bandwidth the GPU has, then in a perfect magic world the GPU should have been really close to the 48-thread CPU bar — it should have been way up there. But Gibbs sampling is essentially a graph application with a lot of random reads, and as Kunle mentioned earlier, random reads do not go well on the GPU; it is very hard to get anywhere near peak, and correspondingly the GPU suffers. So Kunle showed you an app where the GPU destroyed the CPU, and now I'm showing you one where the GPU does not do so well. The takeaway here is that the right hardware matters a lot. You can't just say GPUs are the way to go, let's write code for that; you can't just go for CPUs either. You have to pick the right algorithm, get the right implementation, and find the right hardware for your program — and this is why we think Delite is important, because you don't want to rewrite everything every time to do all of this.

So now I'm going to go into some scalability results, but first a fun takeaway about potential performance: scaling is easy when you start out in a hole. Never trust anyone's scaling results that don't show you where they started from. There has been some very interesting work lately showing that a lot of distributed frameworks out there scale out to hundreds of machines and get amazing speedups, but what they don't tell you is how much faster or slower they are than sequential C code. Why should I use a hundred machines if I can write a C program that doesn't even need threading and do it on one machine, maybe even better?
This slide is basically to prove to you that with Delite we can start in the right place. This is sequential Delite versus sequential C++, where the C++ is as hand-optimized as we could think to make it, and it shows that we are in the right ballpark: sequential Delite, after we generate code and run our optimizations, is in fact very close to what a smart C++ programmer would write (I use the term "smart" loosely, of course). On average we are a little bit slower, and the main difference currently is that the imperative C++ very aggressively reuses memory, while because we are in a functional paradigm we still allocate a little more memory than we strictly need, and that costs you something.

All right, so now that I've convinced you we start in the right place, let me show you that we keep going. The first two apps here are triangle counting and PageRank, compared against PowerGraph, which is one of the more popular graph-analytics engines out there today and has reasonably good performance compared to a lot of its competitors. On one thread the performance is pretty similar, but because they don't account for the NUMA issues I talked about, they stop scaling once you get off one socket, whereas since we account for NUMA, we keep scaling.

Once you're up to four sockets — the 48-thread count there — the difference is around 5 to 10x, and of course it's much less than that at one thread; the difference here is really scalability. Similarly, I have some machine learning and data-querying apps that we compared against Spark, which is the thing everyone talks about and does decently well on machine learning applications. At one thread — where I showed you Spark results earlier — we were in the 2 to 3x range over Spark. But when you really look at the scalability, we start at 2 to 3x and, as you go to big machines, the gap gets ridiculously large: on the data-querying apps we're getting something like 30 to 45x, which is in part due to the fact that we have a better data representation than they do, in addition to just lower-level code. On the machine learning apps it's down to about 10x, because there the data layout is essentially the same, so it's only the code generation that gives us the advantage over them at one thread.

[Audience question about the bars.] Yes, thank you for clarifying that: there are three bars of ours and one bar of our competitor's. The blue and the red are Delite/OptiML without the NUMA optimizations, and you'll see that in some cases they just stop scaling and in some cases on the right they actually get worse. The difference between pinning and no pinning is that with no pinning we literally just launch a bunch of threads and let them run, whereas with pinning we have the OS tie each thread to a core, but without any smart memory tricks. Finally, "NUMA" is OptiML with the extra NUMA support we've added, and that's the configuration that scales well. Spark is purple and PowerGraph is light blue.

[Audience question about whether these apps are compute-bound.] Unfortunately none of these are really compute-bound; k-means is kind of on the border. PageRank and a lot of these apps do very little compute per memory operation; k-means is one of the more compute-intensive ones, but as a group they are pretty memory-bound.

[Audience comment about the baseline.] Yes, I agree the notion of "manually optimized code" is fuzzy, but I have to show you something. And yes, I agree that's a good way of presenting the data.

I can tell you just from experience that a lot of these apps — say this one, this one, and this one — are essentially streaming through linear memory; nothing really fits in cache, so you're basically limited by your main-memory bandwidth, and we are getting pretty close to peak on that. And as Jarek said, there is some validity to the objection: you can argue that their code is not very good, but it is what they wrote, and we assume they're relatively smart people, so it's at least the effort people are willing to put in without spending months and months on it. All right, point taken. Were there more questions about this graph? OK.

So that is the end, and I would just like to wrap up. At the PPL we have a few faculty members — Kunle, of course, and Hanrahan — and we collaborate with Martin, who does great Scala work for us. All the ML material I showed you related to NUMA was originally done by Christopher Ré — that was the DimmWitted work — and we ported it to Delite. And then there are a bunch of students (Jarek isn't on this slide, but close enough; when it was made he was still at EPFL).

Finally, to echo some of Kunle's points: if you're going to be power-limited, you can't just rely on your CPU getting faster and faster anymore — you need parallelism and specialization; that's the only way you're going to keep scaling. As has come up in many talks here, one of the good ways to actually make productive DSLs is through staged embedding: with DSLs we can get high productivity for the user and high performance by using this common infrastructure. Because of staging and because of Delite, building a DSL becomes incremental, rather than "oh crap, I have to build a whole other compiler from scratch." And you get high-performance execution through staging away abstraction, through the fancy rewrite rules I talked about, through domain-specific optimizations, and finally through architecture-dependent mappings for GPUs and so on. If you do all of that, you actually have a chance of running high-level programs really well on multiple architectures. So thank you — any questions?

OK, so that concludes the formal part. Quick show of hands: who got some horrible error trying to install Delite? OK, that's fewer than I feared, so that's good. Did you raise your hand? OK. Yes, questions? It's Python 2.7, I believe — yes, thank you.

[Audience comment about needing Python 2 for the setup.] Yeah, sorry about that. Any other pressing issues before we start? OK.

Can anyone read that? Anyone who's not in the first few rows? OK, sorry — let me fix the display so it stops doing that... better. A couple more font-size bumps; I don't know why that shortcut doesn't work. All right.

In this demo I want to show you how you can build DSLs in Delite, how you can run applications and debug them, and then we can go into any burning questions you have. Let me motivate the first part. You've seen a lot of LMS code this week, and while I think it all makes sense, there is this issue that staging has a price, and that price is the amount of stuff you have to type. As an example, I've opened up BooleanOps, which is just a part of the standard LMS library that lifts boolean operations. This file only defines three things — negation, and, and or — but you'll also see that it's a hundred lines long. Let's look at what you have to do, most of which should be familiar by now. The first piece is the syntax you want: I want to call "and" using infix notation with &&, and that forwards to a method; that method is abstract and returns a Rep, and since the abstract Rep layer is separate from the Exp layer, you need two different traits. In the second trait we're in the Exp world, so negate is defined to create an IR node that represents negation — which means I also have to define that IR node — and then, to make transformers and other optimizations work properly, I have to explain to LMS how to make a copy of that node; that's what the mirror method is. Next, I might want to write some optimizations, so there's very standard pattern matching: the negation of a negation is the original value, "and" with false is definitely false — reasonable things. And finally I have to emit code for all these operations: for Scala I have a code generator and for C I have a code generator, and I've helped myself a little by saying that CUDA, OpenCL, and C are all C-like, so they inherit the same codegen. The point is that, once again, we defined three things, and this took a while to write. It's also very boilerplate, and once you learn the pattern you can do it easily — so obviously the solution is to add another layer of abstraction, the solution to all problems. We have a meta-DSL called Forge, which essentially lets you declare your DSL with much less syntactic overhead, and we auto-generate this file.

Now, if you're following along, please open up the file called SimpleVectorDSL.scala in the Forge source examples. OK, this is it — you may not be able to read it from the back.
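For contrast with the Forge version discussed next, here is a condensed sketch of the LMS boilerplate just described, reduced to negation only. Signatures are simplified relative to the real scala-lms BooleanOps (implicit SourceContext parameters and the &&/|| operations are omitted), so treat the details as approximate.

```scala
// 1. Interface trait: the syntax the user sees, written against an abstract Rep[T].
trait BooleanOps extends Base {
  def infix_unary_!(x: Rep[Boolean]): Rep[Boolean] = boolean_negate(x)
  def boolean_negate(x: Rep[Boolean]): Rep[Boolean]   // abstract, returns a Rep
}

// 2. Exp trait: Rep = Exp, one IR node per operation, plus mirror so transformers
//    know how to copy the node.
trait BooleanOpsExp extends BooleanOps with BaseExp {
  case class BooleanNegate(x: Exp[Boolean]) extends Def[Boolean]
  def boolean_negate(x: Exp[Boolean]): Exp[Boolean] = BooleanNegate(x)

  override def mirror[A: Manifest](e: Def[A], f: Transformer): Exp[A] = (e match {
    case BooleanNegate(x) => boolean_negate(f(x))
    case _                => super.mirror(e, f)
  }).asInstanceOf[Exp[A]]
}

// 3. Optional rewrites, as plain pattern matching on the IR.
trait BooleanOpsExpOpt extends BooleanOpsExp {
  override def boolean_negate(x: Exp[Boolean]): Exp[Boolean] = x match {
    case Def(BooleanNegate(y)) => y          // !(!y) == y
    case Const(b)              => Const(!b)  // constant folding
    case _                     => super.boolean_negate(x)
  }
}

// 4. One code generator per target; the C-like targets (C, CUDA, OpenCL) can share.
trait ScalaGenBooleanOps extends ScalaGenBase {
  val IR: BooleanOpsExp; import IR._
  override def emitNode(sym: Sym[Any], rhs: Def[Any]) = rhs match {
    case BooleanNegate(x) => emitValDef(sym, "!" + quote(x))
    case _                => super.emitNode(sym, rhs)
  }
}
```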

The Forge version is definitely shorter. Here I've defined the infix operations — negation, and, or — their type signatures over Booleans, and then for my codegen some strings that represent each one; and I've metaprogrammed the fact that the Scala, CUDA, and C versions are all the same string, so I just loop over the targets. [Audience question.] That part is not important. Forge generates essentially the API — everything but the optimizations — and then you add those afterwards. So, going back, the one piece that is not covered here is the rewrites; everything else I showed you in that file was automatically generated. After you generate the file, you open up a new file and write the rewrites manually, and the reason we did it that way is basically that we couldn't figure out a more succinct way of doing pattern matching than what Scala already has, so we figured we'd just do the Scala thing.

One thing I want to emphasize is that there are a lot of ways in which we want to make DSLs easy to write and easy to compose. In a perfect world, DSLs would be as easy to use as libraries: if you want a bunch of them, you just import them all, mix them together, and things work out. In Forge and Delite we do have a limited version of that. My DSL here is called SimpleVector, and right now it imports ScalaOps — primitive arithmetic, booleans, the basic features of Scala you probably want — plus the actual thing it adds, which is vector operations. Now I want to extend this DSL to do matrix operations, so I'm going to say that this DSL is not just vectors, it's vectors and matrices, and here I will define what a matrix is. My strong recommendation for using Forge is, as I said, to steal and cheat heavily: I have VectorOps, so why not just see what it does and copy it? That's how we all do coding, right? And for those of you who get really lost, you'll see that in the same directory I've given you the solution, so if things go horribly wrong you know how to copy-paste, I hope.

All right, the first thing we need to define is our type. Types have two components. One is the user-facing type that everyone sees — it's called Vector there, and in our case Matrix — and it's parameterized over a single type T. The other half is how you actually implement that type: data types in Delite/Forge are limited to essentially structs of arrays and primitives, and data is how you declare such a struct. So I declare Matrix and say a Matrix is a struct that has a numRows field which is an Int, a numCols field which is an Int, and a data field which is an array.

Now we need some operations on this data structure, and the first thing that would be nice is to actually be able to create one, so I continue to steal. Let's look at what this line says. There are three main kinds of methods: static, which means you call the method on the object — so Vector's static apply is like a constructor: you type Vector, open paren, and pass things in; infix, which is an infix method; and compiler, which is essentially a method private to the compiler — something you need but don't want to expose to your users. Here I'm saying that I can create a Matrix by giving it, in this case, two Ints — numRows and numCols — and it will return me a Matrix[T]. And then you see this interesting little annotation here, which brings us to effects.
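Here is a rough sketch of that part of the spec, modeled on the Forge SimpleVector example; treat the exact helper names (tpe, data, static, allocates, MInt, MArray) as approximate rather than definitive.

```scala
// Sketch of the Forge declarations described above (names approximate).
val T = tpePar("T")

// The user-facing type, parameterized over T...
val Matrix = tpe("Matrix", T)

// ...and its backing representation: a struct of two Ints and a flat array.
data(Matrix, ("_numRows", MInt), ("_numCols", MInt), ("_data", MArray(T)))

// Constructor: Matrix[T](numRows, numCols) allocates an empty matrix.
// 'effect = mutable' declares the result mutable so it can be written to later;
// $0 and $1 are the two Int arguments, and the array's size is their product.
static (Matrix) ("apply", T, (MInt, MInt) :: Matrix(T), effect = mutable) implements
  allocates(Matrix, ${ $0 }, ${ $1 }, ${ array_empty[T]($0 * $1) })
```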

As has been hinted in the previous talks, in LMS everything is considered pure unless you declare otherwise, and similarly in Forge every operation has no effects unless you declare them. The thing to be careful of is that if you write something with effects but don't declare the effects, bad things will happen. In this case I'm declaring that I want to create an empty matrix and I want it to be mutable, so that I can write to it. The body allocates a matrix — it's a constructor — where $0 is the first argument, $1 is the second, and finally we fill in the data field with a newly allocated empty array whose size is the product of the two.

So now I can allocate matrices, but I can't do anything with them yet; the next thing I'd like is some basic ways of accessing the fields. First, a quick bit of syntactic sugar: everything you define inside this scope is kind of like defining a method in a class instead of in an object. Outside the scope, the first argument to every function is explicitly the matrix; inside it, the "self" is implicit — essentially everything you put inside the matrix ops scope acts like a method on class Matrix, and there is always a $self available to you.

So let's write basic getters for the fields. The first one I label compiler, which, as I said, means that I want to be able to get the array out of the matrix but I don't want anyone else to be able to; it is a function from no arguments to an Array[T], and getter is a special method available on structs that accesses one of the fields you declared (it must be one of those fields or you'll get an error). Then let's make a public method that lets you read the number of rows — again a function from no arguments, implemented as a getter — and similarly for the number of columns. So now this is part of my API: users can ask for numRows and numCols, and the way I implement that is just by reading the fields of the backing data structure. [Audience correction.] Yes, thank you — this is why I have people watching me; that is exactly right, these should return Int. This is also why the solution is available, so when I screw up we can cheat.
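Concretely, those accessor declarations look roughly like this (again Forge-style, following the SimpleVector example; the helper names are approximate):

```scala
// Sketch of the field accessors described above.
val MatrixOps = withTpe (Matrix)
MatrixOps {
  // compiler: visible to the DSL implementation only -- pulls out the backing array.
  compiler ("matrix_raw_data") (Nil :: MArray(T)) implements getter(0, "_data")

  // infix: public API -- read the dimension fields (note: these return Int).
  infix ("numRows") (Nil :: MInt) implements getter(0, "_numRows")
  infix ("numCols") (Nil :: MInt) implements getter(0, "_numCols")
}
```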

All right, now I want to define apply and update methods that let me access data elements. apply takes indices and returns a T, and here you see a new kind of thing: composite. A composite is basically a node that I want to implement in terms of other nodes that already exist — the other side of that would be a codegen node, which is something I implement as a string. When I say apply is a composite, I just mean it's very much like a method that calls other methods: I can implement it by calling array_apply on matrix_raw_data($self). Similarly for update — I know, I know, writing things is considered bad, but we're going to do it anyway. Oh, and no one caught that one: matrices are usually indexed by two Ints. These are exactly the mistakes I don't want to have to write. The implementation just says that I've flattened the matrix onto a one-dimensional array on the back end, so you do some index arithmetic to access the logical element, and update is the same with a write at the end.

What's interesting here is that because update is effectful, I need to add the effect annotation that says I'm actually writing to the matrix. A question we sometimes get is what happens when you forget these annotations. A few things: first, if you forget the mutable annotation and then try to write to the structure, LMS will complain that you're writing to something that isn't declared mutable, so that one is relatively easy to catch. A missing write-effect annotation is a little harder: in general there's no way to prove that what you implemented performs a write, but we have a trick which says that anything returning Unit is probably effectful (otherwise it obviously does nothing). So if you follow the convention that every mutable operation returns Unit, we can catch these too, and you'll get a warning that says, roughly, "you have something that returns nothing and has no effect — what were you thinking?" On the other hand, if you happen to have update return a non-Unit value, there's really no way we can check that and you're just going to get bad results.

All right, so apply and update all sound good; now let's do something that's actually parallel and gives good performance: matrix addition. It takes a Matrix[T], another Matrix[T], and returns a Matrix[T], and there are actually two ways of writing it; I'll quickly show how each looks. The first, shorthand version says that plus on a Matrix[T] — where I've limited T to a numeric type so we actually know how to add — implements this concept called zip, which of course maps directly to Delite's zipWith on the back end; the function signature is (T, T) => T, and the combining function you give it is plus. The other thing you can write, which gives you the exact same code at the end of the day, is a composite operation built from things that already exist: zipWith already exists on arrays, so you could read the arrays out of the two matrices, zip them, put the resulting array into a new matrix, and return that. We've debated this a lot and I think it's ultimately a matter of taste. The zip version is certainly less verbose, but it does require you to implement some extra API that tells Delite how to access your collection: there's this ParallelCollection interface, and if you want to use the parallel operators — map, zip, filter, flatMap, whatever — directly over matrices, your matrix type has to implement it. It basically tells us how to allocate a matrix, how to read from it, and how to write to it, and that's how the really short zip version can actually compile. If you write the addition manually as a composite, you don't need that API.
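Put together, the element access and the addition look roughly like this in Forge syntax (a sketch only; the argument indexing and helper names follow the SimpleVector example and may not match exactly):

```scala
// apply/update flatten the (row, col) pair onto the 1-D backing array.
infix ("apply") ((MInt, MInt) :: T) implements composite ${
  array_apply(matrix_raw_data($self), $1 * $self.numCols + $2)
}
// update performs a write, so the write effect on the receiver (arg 0) must be declared.
infix ("update") ((MInt, MInt, T) :: MUnit, effect = write(0)) implements composite ${
  array_update(matrix_raw_data($self), $1 * $self.numCols + $2, $3)
}

// Addition, the short way: delegate to the zip parallel pattern (this is what
// requires the ParallelCollection API mentioned above).
infix ("+") (Matrix(T) :: Matrix(T), TNumeric(T)) implements zip((T, T, T), (0, 1), ${
  (a, b) => a + b
})
```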

Any questions so far? OK — yes. So, Forge itself is an LMS DSL: using these constructs we essentially build a graph describing the language, and then using LMS codegen we emit Scala code that represents that language. It's very meta in that way. One of the things this gives you is the ability to generate multiple back ends from the same specification. So not only do you write a lot less code to get something that works — we illustrated that with the boolean example — but here is what gets generated from that Forge spec for VectorOps, and it actually continues on into this file, and then there's more. This is the implementation for the back end that uses Delite, and there's a second back end called the library back end, which uses just plain Scala and is mostly used for debugging — testing your programs for correctness before you go through the full compilation phase. In there you'll see yet another implementation of Vector which has all the same operations but translates them directly to Scala classes and Scala functions instead of staged things.

Right — you can't read that either, can you. Let me just ask a question here: I can talk more about building DSLs in Forge, or we can say that's the example and talk about something else — is there a preference in the room? The next thing I was going to talk about was what applications and debugging look like. Excellent, let's move on. As I said, at any time you can just tell me what parts of Delite interest you and I'd be happy to walk you through them — obviously this is less interesting for me than, hopefully, for you.

All right. If you ran the Forge build command on SimpleVector, you should have a directory here called published, and inside published a directory called SimpleVector — this is essentially the DSL that Forge generated — and I'm going to show you what an app looks like. This is hello world in SimpleVector, and you'll see we have three pieces. The trait here is what you've seen previously: the place where Rep[T] is abstract. The benefit, of course, is that you write against the API but you haven't really committed to anything, and you can't introspect on anything the compiler does.

Then there's the interpreter object, where Rep[T] = T — we heard a little about this yesterday, the idea that if you want to abstract over the embedding process you can always just remove it and fall back to plain Scala. This is the library version I was talking about: if you run this version it does no magic whatsoever, it just runs as a Scala library, so it's good for testing the correctness of your code but not much else. And finally there's the compiler version, where Rep[T] = Exp[T] — this is the one that invokes Delite, does the staging, and generates code (a sketch of this three-object structure follows below).

Because the library version is just a library, it is fully interoperable with Java and Scala libraries, the REPL, and so on. So if I type sbt console in my DSL project and wait a minute for the REPL to boot, I can just try things out: I have a Vector — that was wrong, OK — now I have a vector of a few elements, another vector, I can add them, I can... no, I can't do that either, it's a very simple DSL. Anyway, you get the idea: if you're not familiar with a DSL — say you want to use one of our real DSLs like OptiML — you can open the library version in the REPL and just start typing things and see what happens.

Now let's run a real application, where by "real" I mean hello world. This is the interpreter version, and it finishes pretty quickly. Next I want to generate code: you see it says "generating program", it finishes, and if I open up my generated directory I see a folder called scala with a bunch of generated code in it. Here's something a little more interesting: this is what one of our parallel pattern loops turns into in generated code — a giant mess that you don't want to read — and correspondingly you have generated data structures. You always get Scala; if you want CUDA you ask for it, if you want C++ you ask for it, and if you want profiling support — which basically means we add extra instrumentation into the code to enable online profiling — you ask for that too. Now my generated directory has more stuff in it: I've still got the Scala, but I also have some CUDA code and some C++ code. What you'll notice, particularly for CUDA, is that there are a lot fewer files than for Scala and C++, and that's because CUDA can't run everything. The reason we break programs down into all these kernel files is that we want to be able to run pieces of your program in different places: we break the program into kernels, and for every kernel we try to emit every possible target, but we fail in a lot of cases. Once we know, for every kernel, the possible places it could run, the runtime picks where it actually goes and manages all the communication to make that work out.
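For reference, here is a rough sketch of that three-object scaffolding for a SimpleVector app. The trait and object names follow the pattern described in the talk but are illustrative rather than the exact generated names.

```scala
// The application is written once, against the abstract API (Rep[T] abstract).
trait HelloSimpleVector extends SimpleVectorApplication {
  def main() = {
    val v = Vector[Int](10)   // hypothetical SimpleVector constructor
    // ... DSL code ...
  }
}

// Library ("interpreter") version: Rep[T] = T. No staging; runs as a plain
// Scala library, which is handy for checking correctness in the REPL.
object HelloInterpreter extends SimpleVectorApplicationInterpreter with HelloSimpleVector

// Compiler version: Rep[T] = Exp[T]. Invokes Delite, stages the program, and
// generates Scala/C++/CUDA kernels plus an execution graph for the runtime.
object HelloCompiler extends SimpleVectorApplicationCompiler with HelloSimpleVector
```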

[Audience question about scheduling.] Yes, right — so we have a graph that says, for every kernel, where it can run, and then the scheduling policy is basically: if you asked for the GPU, we put as much on the GPU as possible, and everything that won't go to the GPU runs in C++ or Scala. I wouldn't say an optimal assignment is guaranteed — the scheduler is obviously trying to do something smart, but I won't promise it always finds the right solution, and as Kunle said, you may need some manual intervention in some cases to control which kernels run where.

[Audience question about kernel granularity.] It is very, very simple: essentially every top-level statement in the program is a kernel, so you'll see a lot of kernels that do nothing and a lot of kernels that do a lot. All the sequential kernels get packaged together onto one thread and essentially inlined, so there isn't much overhead in calling them — but to get performance in Delite you want to be spending your time in the parallel things, not in sequential random code. There isn't any magic heuristic: every parallel pattern at the top level of your program becomes a kernel, and then we schedule the kernels.

[Audience question about data transfer.] Part of the runtime, once we decide where things are going, generates a little piece of code that manages launching the kernels, and by using the CUDA streams API we can overlap copying data with running things.

All right, now I want to run this. Oh — something bad just happened. If you look at the source code, some lazy programmer read the input args without actually checking that input args were provided. In an unstaged program you'd get a nice Java exception that says "error at this place" and you'd figure it out pretty quickly, but here the error is now in your generated code, not your source code, and there's a fundamental issue in generative programming about how on earth you map errors back to what the user wrote. We have a facility for this, built on what's called SourceContext in Scala, and using it you get a slightly better exception: it says the real Java exception was an index out of bounds in this generated kernel, but with a little magic on top it tells you that the code actually came from the HelloWorld file at line 12. Hopefully that is useful — it's one of my favorite features. This works in any backend: because the translation goes from source to generated code, if you get an exception at runtime we essentially trap it and try to figure out where it came from. I will warn you that, in order to be fast, we don't emit a lot of error checking, so in Scala you tend to get exceptions and in C you tend to get segfaults; that is an issue, and I don't know what to tell you about it. OK, back to an actual input — let's make it a little bit bigger.

You'll notice that the first time you run a Delite program it's actually compiling all the code we generated, so it takes a few seconds to boot up; the second time you run it that code is cached, so it's much faster. Then we can increase the number of threads we run on with this command. And... that was kind of anticlimactic, right? Let's see what went wrong. To do that I need to run with profiling on. This is the performance profiler — it won't get any bigger, and I don't know why it's rendering that way — but you can see what went wrong: each of these lines is a thread, and it didn't get faster with more threads because nothing actually ran in parallel. If we click on these things it jumps to the source code where the op that took forever came from, and we can see where we're spending our time. From this we can say that maybe printing everything everywhere is not a great idea for scalability, so let's just comment out some of these prints — and a few more.

Since we changed the application we need to go through the full compilation process again — it's faster, which is a good sign — and you see we're still spending all our time in one place, and that place is this while loop where we're sequentially filling in all our vectors. The key takeaway is that this is not a good way to write code in Delite: as you may have noticed, we very much encourage the functional style. So I'm just going to replace this "allocate a vector and fill it in sequentially" pattern with a bulk constructor — index i maps to i*2 for v1, and to i for the other one. Let's try one more time.
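The change looks roughly like this (illustrative DSL-flavored code; the actual constructor name in SimpleVector may differ):

```scala
// Before: allocate a mutable vector and fill it in with a sequential while loop.
// This is one serial kernel -- nothing for Delite to parallelize or fuse.
val v1 = Vector[Int](n)
var i = 0
while (i < n) {
  v1(i) = i * 2
  i += 1
}

// After: a bulk constructor expressed as a parallel pattern. This becomes a
// Delite op that can run on all cores and fuse with downstream maps and reduces.
val v1b = Vector.fromFunction(n) { i => i * 2 }
val v2b = Vector.fromFunction(n) { i => i }
```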

[Audience question: can you take the generated code, move it elsewhere, and run it later?] Right — sure, there is a way. The trick is that you don't always have to stage main as your entry point: you can stage an arbitrary function by calling registerFunction with it — call it foo, say — and define a method foo with some type signature. If you do that, the entry point that gets staged is foo, and we emit a normal Scala function or C function as the output, so you can take that generated code, run scalac or gcc on it, and then it's just a binary you can do whatever you want with.

OK, progress — performance has changed somewhat drastically, which is impressive, right? Notice that in this file all the big operations — plus, times, summations, maps, reduces, group-bys — are parallel patterns, and as we talked about, we can do pretty aggressive fusion. When it was written the other way, with the mutable fill-in of the vector, fusion couldn't prove it was safe to fuse; now that it's written in the functional style, this whole program has fused down into two loops. So not only have you made things more parallel, you've also very significantly improved your absolute performance.

The last odd thing is this: does anyone have thoughts on why the parallel part starts so much later than the rest of the run? It's kind of non-obvious unless you really understand JVMs. The answer is that I'm running a really short program, and JITs have to warm up. For example, I run it once and it takes — yeah, we really need to make this bigger; a little bigger, all right — and now if I run it, say, five times, it gets faster. Let's go back to one thread for a comparison point. Now good things have happened: single-threaded takes a little over two seconds, four threads was taking about 0.8 seconds, which is roughly a 2.5x speedup, and if you let the JIT warm up it takes 0.6, which is a bit over 3x. There are two things you can do: one is to write a real program that takes longer, and the JVM will do pretty well on it; the other is to just run it repeatedly, because it needs the JIT — whoa, that did not happen before; sorry, some of the profiler code is new and needs some robustification. First run, 0.6 seconds — so now we've got over a 3x speedup on four cores, clearly something good has happened, and you can see that, yep, it's parallel the whole time. All right, so now we've made a fast application and I'm ready to take questions.
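As an aside, the warm-up effect is easy to see with a tiny timing harness like the following plain-Scala sketch (not part of Delite): run the same work several times and compare the cold first run with the best warm run.

```scala
object WarmupSketch {
  // Time a block of code once, returning (result, seconds).
  def time[A](body: => A): (A, Double) = {
    val t0 = System.nanoTime
    val result = body
    (result, (System.nanoTime - t0) / 1e9)
  }

  // Run the same block several times so the JIT can compile the hot paths,
  // then report the cold time versus the best warm time.
  def benchmark[A](runs: Int)(body: => A): Unit = {
    val times = (1 to runs).map(_ => time(body)._2)
    println(f"cold: ${times.head}%.3f s, warm (best of $runs): ${times.min}%.3f s")
  }
}
```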

[Audience question.] Yeah — one thing I'm not fully accounting for is page faults: the reason the JVM speeds up on later runs is the JIT, plus the fact that all the page faults have already happened because the memory is already allocated. But I don't expect significant differences. All right, that is the end of my demo.

[Audience question about what gets profiled.] We do profile the whole program: part of it is based on the kernels — we wrap every kernel in a timer — and part of it is sampling-based. Actually, thank you for bringing that up; let me open the profiler again — oh, I forgot the profile flag again. All right. The spaces are where a thread is idle, and the little lines are kernels that were too small to render. This bar at the bottom represents the start and stop of your program, so you can see that my program started very quickly, the first thread started, nothing happened for a while, then it went parallel, and then everything shut down. This part here in blue is the memory usage: you see it goes up and up and then stabilizes. We get that by using callbacks into the JVM to periodically poll the state of the heap — that part is sampling-based — and we report it in this graph. One of the performance issues you'll often encounter when you write functional programs — say, because something didn't fuse properly — is that you allocate a lot of memory, and here you can really see that, wow, a ridiculous amount of garbage collection happened in this kernel, and that's why it's slow.

[Audience question about whether the profiler is an existing tool.] No, we wrote it in-house — I didn't personally write it. The pretty part, the JavaScript, is an external piece: our tool does all the profiling and puts it into a format the JavaScript engine understands, and that renders it into the cool-looking view. Yes, that part is d3.