Using the NDK Performantly (Big Android BBQ 2015)


DAN GALPIN: Android Internals: Writing Performant Native Code. If you haven't heard enough about me already, I have spent five-plus years talking to developers just like you around the world, and it is awesome to be here in Hurst. I spent 15 years as a software developer before that, so I have a little bit of street cred. I actually started developing Android commercially right around Android 1.1, and I'd been working with it ever since 1.0. I wear a lot of hats; this is one of the smallest hats that I wear, and I have no shame. And I think I'm kind of funny sometimes, especially with lack of sleep, like now. All right. "Performant" is an entirely invented word. It is not a real word. How many of you knew that performant wasn't a real word? OK, good, I've got a bunch of English majors here, that's awesome. According to Urban Dictionary and most of the research that I did, because I do extensive research for a talk like this, performant was actually invented by software developers, and there are some theories behind that. Urban Dictionary defines it as "having adequate performance." But really, this would not be nearly as cool a talk: "native code having adequate performance." So this is why the word was created, OK?
It just doesn't have the ring to it. We don't like "adequate" in our industry; we like awesome. So rather than title it that, I used the invented word. And we're going to be talking about really, really tiny benchmarks. For your app to actually perform well, you have to make sure everything happens within 16.67 milliseconds; that is how you get 60 frames per second. But most of the benchmarks I'm going to be talking about in this lecture are in nanoseconds. That's a lot of nanoseconds, so this stuff is really, really fast. Don't worry when I tell you it takes five times as long to do something on one version as on the other, because it's still really fast. But it's good for you to know. All right, so again, do not panic. Remember, the clock speed of the Nexus 5, which is what I used for most of my testing because I wanted something that could run KitKat as well as Lollipop and Marshmallow, is about 2.3 gigahertz, tops, with four cores. That's billions and billions of instructions; I know, I'm sounding like Carl Sagan here. So we're talking in very small micro-benchmark terms. Now, Android Internals is this kind of thing I'm working up. I want to see what you guys think of it, so afterwards we'll have a quiz. It really is about the unique voyages of discovery we can take in an open source platform like Android. The idea is not just to understand how to code Android, but to understand how it works, so that when you run into problems you have a better idea of what you're actually dealing with. And we're going to do it this way: we're going to test our assumptions, we're going to benchmark, we're going to look at source code, and we're going to debug, even in native. Voyage one takes us into the land of optimizations in ART. If you saw my talk yesterday, I got a bit into that, but this time we're going to be a bit more pragmatic, because we're really going to talk about how the world of JNI has
changed as we've moved from a world of Dalvik to a world of ART. But I am getting ahead of myself. So let's talk about native code. Most of you have actually written native code, but I'm talking about code written using the NDK: primarily C and C++ code that interfaces with the Android runtime using JNI. Here is a really, really abbreviated architecture diagram of what this looks like. Applications written with the SDK take the form of dex classes that execute on the Android Runtime. They interact with system libraries via the SDK framework classes, and SDK application code is written in a language like Java that the runtime can support. The Linux kernel was written primarily in C and C++, and so are the system libraries. The framework and the Java runtime call into these libraries using the Java Native Interface, or JNI. Now, the NDK essentially allows you to write a dynamically linked native library. But it can't run directly against the system libraries, because those ABIs and APIs aren't stable. The purpose of the NDK is to give you a stable application binary interface to run your own compiled code against, one that provides access to only the most critical OS features, so the platform can continue to grow and expand, change how it implements things, and be awesome, while your application code talks to it through this ABI. It's all important stuff. And that's what it looks like: boom, your application code now talks to your library, which

is going straight to native. Let me tell you a little bit about the history of the NDK. The original first versions of Android did not even have it; we got it in Cupcake, and we've been slowly expanding it ever since. In the first versions you got a C runtime, really minimal C++ support, zlib compression, logging, networking, dynamic linking, some math: not a lot, but enough. We then added graphics; the first version couldn't even talk to OpenGL, but we added graphics there. And you know what took the longest about this slide? Actually trying to find all these images again. It's like, when have I used a slide with these images? Then Gingerbread really expanded things. Gingerbread got much more serious about gaming and multimedia, so we added our native application API. That was the first version of Android where you could actually build a native application without needing to use any Java whatsoever. And also sound, which is really cool; OpenSL ES was really nice to have. We continued to evolve it. In Ice Cream Sandwich, we added the OpenMAX AL media layer. Not many people know this, but you can actually access RenderScript directly from the NDK as of KitKat, and it's pretty cool stuff; it was a long-standing request. We also did a bunch of graphics stuff here; that's why these aren't in order. In Jelly Bean MR2 we added OpenGL ES 3.0, and in Lollipop we added ES 3.1, as well as 64-bit support. So that's pretty cool. Now let's talk about some assumptions. Our assumptions are basically to follow the suggestions of the Perf-JNI article. If you have not read that article, it is the gospel for how to deal with JNI on Android. But do the suggestions still make sense today? We haven't updated the article since we shipped ART. So here are the basic things you have to do, OK?
These are absolutely critical when you're trying to make JNI performant. One, you're going to cache field and method IDs, and you're going to do it intelligently. Two, you're going to get strings in a reasonable way. And three, you're going to copy things in native. These are the only three real tips we gave, but I'll go into more detail. So how did we benchmark this? We actually used something called Caliper. How many of you have ever heard of Caliper in this room? No one, that's good. Oh, one person, sorry. I had never heard of it before doing this, but I was interested in doing benchmarking. It turns out, if you actually look at AOSP, we have Caliper tests checked in. This is actually how we benchmark our VM ourselves, using this thing called Vogar. And if you look at what's checked into Vogar, it's a really ancient version of Caliper. I'm hoping some day we actually update that; it would have made my life a little easier. But Caliper is a really cool framework for running micro benchmarks. All right, so let's get to the first thing. This, if you haven't used the NDK before, is how you access a class field from native code. Once you have the class (I'm passing this class in from Java; you can see the jclass type, which is the class information), you call GetFieldID with the name of the field and the type of the field, in this case an integer. Then finally I can call GetIntField to actually pull the value. That's how we access an integer that's inside of a Java class from native code. All right, so the first suggestion, which is a really, really good one, is to cache field and method IDs. And here's why: those field and method IDs are just numbers, and they don't actually change once the class has been loaded. And if you want to be really, really good about when you grab them, inside the static initializer of a class in Java you can call some JNI code; in this case, I'm calling nativeInit. And inside of nativeInit, which really shouldn't
have been named nativeInit, now that I'm looking at this slide, but that's OK, you can see I'm getting that field ID. That field ID will be good as long as this class is loaded. So that's pretty awesome: I don't have to think about it; I'm just storing it in a little variable there that's associated with my native code. All right, let's talk about performance. Here is how it benchmarks on a Nexus 5 running KitKat and Marshmallow, and you'll notice something: ART takes longer. That's going to be a common theme in general; ART is more complicated than Dalvik. So it's even more important today than it was initially to cache these things, because devices are running faster and ART is already faster at doing most things, so you'll notice this maybe even more than these benchmarks suggest. So let's look at the code and try to figure out why this is faster (or slower, I should say).
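The caching itself happens in JNI C code, which can't run standalone here, but the same look-up-once, reuse-forever pattern can be sketched from the Java side with reflection, which has an analogous per-lookup cost. This is my own illustration, not the talk's benchmark, and the class and field names are made up:

```java
import java.lang.reflect.Field;

// Sketch: the "look up once in a static initializer, reuse forever" pattern
// from JNI field IDs, shown with Java reflection. Target and its counter
// field are hypothetical names invented for this example.
public class FieldCacheSketch {
    static class Target {
        int counter = 42;
    }

    // Looked up exactly once, analogous to caching a jfieldID when the
    // class is loaded.
    private static final Field COUNTER_FIELD;
    static {
        try {
            COUNTER_FIELD = Target.class.getDeclaredField("counter");
        } catch (NoSuchFieldException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    static int readCounter(Target t) throws IllegalAccessException {
        // Reuses the cached lookup instead of calling getDeclaredField()
        // on every single access.
        return COUNTER_FIELD.getInt(t);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(readCounter(new Target())); // prints 42
    }
}
```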

And the really key thing is this thing here, this ScopedJNIThreadState and ScopedObjectAccess. This is why JNI does not actually run at lightning warp speed. Every single thread in Android can be in one of two states (actually it can be in more than that, but there are two states we care about). It can be in the running state: that's when we're in the Java virtual machine and actually executing stuff, and it has access to all that great... sorry, I should say the runtime. We do not have a Java virtual machine in Android; you can strike that from your memory. The Android runtime: that's when the thread is in there. Or it can be in the non-running, or native, state. So when we're accessing a variable like this int field, all of these variables live inside the runtime, and we actually have to switch the state of our thread in order to do that. That means we're doing a whole bunch of synchronization, and that synchronization is expensive: on the order of about 300 nanoseconds. Now, to give you some context, because 300 nanoseconds is a really small number: an average function call in ART is about five nanoseconds; in Dalvik, it's more like 10. So once again, we're talking about something that is a really tiny number, but it's still about 60 times longer than a standard function call, so it's still something to think about. So let's look at our first scorecard. And yes, based upon our benchmark, caching field and method IDs is great for Dalvik and ART; it's even better in ART. All right, let's look at suggestion two, which was: use GetStringChars. Now, this was kind of interesting. As you probably all know, the standard for Java, which ART also follows, is to treat all strings as double-byte character strings, UCS-2. And this is important because we're in a world that's highly international; single-byte strings are kind of passe, et cetera, et cetera.
Not to mention, as it turns out, the VM doesn't have particularly great instructions for dealing with bytes, so it's actually kind of nice to have these things as two-byte characters. The suggestion here is that rather than calling GetStringUTFChars, like we have there at the bottom, we call GetStringChars, which takes our string and gets us the closest thing to a native representation of it that you could imagine. And we would expect this to always outperform the UTF equivalent, where it actually has to do a copy of memory. All right, so let's look at this. I took a 15-character string and ran some benchmarks on it, and I was actually really astonished to see two things. One, as we expected, ART is slower, but it's actually much slower. And two, and this was a real shock: if you look at those two blue lines, on a 15-character string, GetStringUTFChars actually performs faster than GetStringChars. How can that possibly happen? We already said GetStringChars doesn't have to copy the string; it doesn't have to translate the string to UTF-8. So something is happening that makes it faster, on this very short string, to do all that copying and translation. Let's try a longer string, just to see if I'm crazy here. This is a string that's 100 characters long, and we see more of what we would expect: GetStringUTFChars is now slower than GetStringChars. So the question is, why was GetStringUTFChars ever faster under ART? Let's look at some source code. You can see here that GetStringUTFChars always has to copy: it goes through and just does that copy operation. GetStringChars, on the other hand, has to go and check: huh, can I actually avoid this copy?
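To make the copy-versus-translate trade-off concrete, here is a rough Java-side sketch of the two representations involved. GetStringChars hands back the string's UTF-16 code units (what toCharArray() holds), while GetStringUTFChars produces a (modified) UTF-8 copy, approximated here with getBytes(). This is my own illustration of the sizes involved, not the talk's benchmark, and it says nothing about the JNI cost itself:

```java
import java.nio.charset.StandardCharsets;

// Sketch: the two representations JNI can hand back, seen from the Java
// side. For a 15-character ASCII string, the UTF-16 form is 30 bytes and
// the UTF-8 form is 15 bytes.
public class StringShapes {
    public static void main(String[] args) {
        String s = "fifteen chars!!";                       // 15 characters
        char[] utf16 = s.toCharArray();                     // 15 two-byte code units
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);   // 15 bytes (pure ASCII)
        System.out.println(utf16.length * 2);  // prints 30
        System.out.println(utf8.length);       // prints 15
    }
}
```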
So it actually goes and looks at the heap, and checks to see if that's a movable object, and it turns out that's a somewhat expensive call. You can see here, there is this FindContinuousSpaceFromObject. That just sounds dangerous, and what is it actually doing in real life? Well, it calls this, which has a for loop in it that searches through the contiguous spaces. So you can see already: even though the VM is doing all of this work to try to avoid this little tiny 15-character, 30-byte memcpy, it's actually

failing to run this particular case optimally. So somewhere between 15 and 100 characters is the break-even point. What does this really, really mean? It means that unless you're passing very, very large strings around, do whatever is most convenient for you, honestly; it's not a big deal. You have to do a whole bunch of crazy stuff in native code to make your code handle two-byte strings (two-byte characters, rather), and it may or may not be worth it; you'll probably want to profile it. So here's our scorecard. In general, yes, GetStringChars is going to be faster for large strings, but not always, so I'll give it three-quarters of a star for ART. Here's another suggestion that came out of that article, which is: use GetStringRegion. This is kind of interesting. Here's what that looks like. Normally, if you want to copy a string into a native buffer (in this case just literally a buffer of characters), you're going to call GetStringChars, then memcpy it, et cetera, et cetera. And you'll see I'm also doing some memory deallocation here, just to be fair on both sides. You can see it's actually several lines of code and several more accesses, because every time you do something like GetStringChars, you're talking to the VM as well... sorry, you're talking to native code as well as to the VM. So this is kind of cool: you can use GetStringRegion instead, and GetStringRegion does the copy for you. That's kind of nice. Also, one thing I'm doing here, which is a nice little optimization, is passing the length of the string in as a parameter. And that's kind of cool, because as it turns out, passing extra parameters into JNI is almost free; it takes literally on the order of a couple of nanoseconds for every additional parameter you want to use. So that's awesome. Whereas if I were going to query the string and say, give me
the size of the string, that would be another 300-nanosecond round trip through the machinery. So adding additional parameters is a great way of optimizing your JNI. I thought I'd point this out; it's a minor optimization, but these are the things you should be thinking about. Again, you're trying to avoid round trips on both sides: extra calls into the runtime from native code, and extra calls into native code from the runtime. All right, so what does this really look like after all of this? Well, it's kind of as you'd expect: GetStringRegion is way faster. You're avoiding an extra allocation, and that's going to be good in general. You can also see that ART is actually a lot slower here than Dalvik, and "a lot" is all relative; again, these are all tiny things. You might think after this talk that ART isn't very fast, and I don't want to give that impression at all. In fact, ART is scary fast at doing almost anything but this; in almost any other way it is going to blow away Dalvik. So do not take this as any kind of indictment of ART. When I was asking one of the internal guys about why this is the case: ART was written in a time when we had multiple processor cores in a system, so when they started designing and writing it, they were thinking the entire time about deadlock problems. I would say that ART takes an incredibly conservative approach to make sure you're not going to have deadlocks. If you look at the list of bugs on AOSP, you will find deadlock bugs in Dalvik, most of which have been fixed. I think part of what you're seeing is that the ART team wanted this to be incredibly robust, and that's why you're seeing a little bit of this. Maybe in the future we can figure out how to bring these even closer together, but that's what it's like today. All right, so another big win
on ART, and a big win on Dalvik: use GetStringRegion. All right, let's talk about a problem that a lot of people have, which is sharing raw data with native code. If you haven't figured this out from the talk yet, JNI calls are relatively expensive. And again, this is relative: we're talking about five nanoseconds for a regular call versus about 300 nanoseconds for a JNI call, on a Nexus 5, to be fair. So what are we really talking about as the overhead of a one-way call? Or, I'm sorry, a two-way call; this is a two-way call. On Dalvik our overhead was a little less than 130

nanoseconds. On ART, it's almost twice that. Good thing that devices are getting faster: you can see I've also benchmarked a Nexus 6P and a Nexus 9, both in 64-bit mode, and they're actually pretty fast. But even the Nexus 9 doesn't outscore Dalvik running on a Nexus 5 for these kinds of things. So JNI is expensive, and if there is one takeaway from this entire lecture, it is this: avoid chattiness. Every bit of chattiness you add adds extra time, and a lot of that is stuff you don't even think of. So for example, let's say you decide to avoid writing a whole bunch of code. If you've ever played with Unity (how many people here have played with Unity?), one of the ways in which you talk to Android from Unity is to use something called AndroidJavaProxy. AndroidJavaProxy is really cool, because it takes an interface and creates a dynamic class, essentially on the fly, that fills out that interface, which you can then use to talk to a whole bunch of internal systems. What few people realize is that by doing that, you are getting the chattiest possible interface into Android. So if you're trying to do something over and over and over again, that's going to impact your performance. For example, if you're trying to read bytes out of some class in Java one at a time, you'll realize this is going to very, very quickly exhaust all of your CPU time on the main thread. So you really do have to be careful with what you do here, and think about the interfaces you have between your native code and the VM. All right, let's go back over to this thing. How do we actually deal with sending big chunks of data between native code and the runtime?
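Before we get to that, the byte-at-a-time chattiness just described can be sketched in pure Java, with a stream read standing in for each boundary crossing. This is my own rough illustration, not the talk's benchmark; the point is only the call count, not the timing:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;

// Sketch: "one byte per call" chattiness, with an InputStream standing in
// for a JNI boundary. Each single-byte read() is one round trip; the bulk
// read amortizes that cost over the whole buffer in one crossing.
public class Chattiness {
    public static void main(String[] args) throws IOException {
        byte[] data = new byte[4096];

        // Chatty: 4096 boundary crossings, one per byte.
        ByteArrayInputStream chatty = new ByteArrayInputStream(data);
        int calls = 0;
        while (chatty.read() != -1) {
            calls++;
        }

        // Batched: a single crossing moves everything at once.
        ByteArrayInputStream batched = new ByteArrayInputStream(data);
        byte[] out = new byte[4096];
        int got = batched.read(out, 0, out.length);

        System.out.println(calls); // prints 4096
        System.out.println(got);   // prints 4096
    }
}
```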
And there's this cool thing called a direct byte buffer. I don't know how many of you have played with direct byte buffers before. You pretty much only ever want to deal with a direct byte buffer if you're working in native code; there's really no other reason for them to exist, as far as I can tell. Although the VM might choose not to allocate this memory out of its normal page pool (on some VMs it might get you memory you don't normally have access to), in our runtimes, it does not. You get this nice allocateDirect, and then when you're inside native code, you can just get an address for that chunk of memory and start writing to it, which is really cool. And there's no "I want to free this," there's no release-address; it's one call. So that's nice and fast, right? In theory. This is what it looks like when you're using a direct byte buffer. So let's look at the performance. Again, we're looking at benchmarks on these things all the time (they keep me up sleepless nights), and here's what that actually looks like. As you can see, once again, this is a very, very slow access, because DirectBuffers actually involve even more synchronization; on a Nexus 5 we're talking almost 600 nanoseconds. So once you actually grab this, the answer is you really want to use it for something. If you're using it to pass an integer, not a good idea; you want to use it to pass lots of data. All right, but there's another side to this. Once you're inside of code that's running on the runtime, what's the performance of a direct byte buffer versus a regular byte buffer? So let's take a look at that. What?
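As an aside, the Java half of the direct-buffer pattern just described looks roughly like this. The native half (GetDirectBufferAddress plus raw pointer writes) can't be shown in pure Java, so this sketch of mine substitutes putInt for the native-side writes:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch, Java side only: allocate a direct buffer and read/write it.
// In real JNI use, native code would obtain the backing address with
// GetDirectBufferAddress and write through the pointer instead.
public class DirectBufferSketch {
    public static void main(String[] args) {
        ByteBuffer direct = ByteBuffer.allocateDirect(64)
                                      .order(ByteOrder.nativeOrder());
        direct.putInt(0, 1234);                // absolute put: no position fiddling
        System.out.println(direct.isDirect()); // prints true
        System.out.println(direct.getInt(0));  // prints 1234
    }
}
```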
OK, now let me just back up a little bit here, because you're seeing something really, really strange. First of all, Dalvik on a Nexus 5 with a direct byte buffer is the slowest call by a substantial amount compared to all of these other calls, OK? It takes 300 nanoseconds. The other thing you're noticing is that a direct byte buffer is way slower than the standard one, which is backed by a standard byte array in Java. So once again, two things that are kind of weird. And wherever we see really weird stuff like this, other than scratching my head, it is time to go explore some code and try to figure out why that's the case. All right, so here is what actually happens when you call allocate and allocateDirect: you actually get a different class. We're using polymorphism here, it's awesome. You either get a ByteArrayBuffer or a DirectByteBuffer, one of the two, OK? And as you can see, ByteArrayBuffer is backed by an array, and DirectByteBuffer is backed by this class called MemoryBlock. And here's how we read an integer: we use the call getInt. In ByteArrayBuffer, it's pretty standard; it goes into another class and calls Memory.peekInt. Inside of MemoryBlock, we have an extra bit of indirection: we actually have to call into the block

class, which calls into Memory.peekInt as well, but a different overload: that peekInt takes a backing array, and this one takes an address plus an offset. And yes, you are essentially looking at pointer arithmetic inside of the runtime right here, which is not something you see very often. So what does this mean? Well, when you look at how this is implemented, if you try to find the source code, this is what you'll see. Inside the ByteArray class, you'll see probably the most classic implementation of how to pull data from an array and pack it into an integer. And inside of MemoryBlock, it actually calls into JNI. All right, all right. So now, remember, let's go back to this graph. We saw that ART is way, way faster than Dalvik at doing this, and yet we just demonstrated, looking at the source code, that it's actually calling into JNI. So that's really weird: why is it so much faster? Once again, here's what actually happens inside of that native code. But that really doesn't matter, because we've shown that almost all of the cost of this operation is in synchronizing the thread states between the two worlds. It turns out that ART is doing a little trick: when it declares the method, it declares it with this little exclamation point on it, which is a flag. To the person coding it, it's a flag that this is a dangerous function; to the VM, it's a flag saying this is a very non-dangerous function: it's not going to try to do anything in Java, and it's not going to last very long, so let's not go through and change the state of the thread at all, let's just run this code as quickly as possible. So once again, this is how long it actually takes to read that integer from a ByteBuffer. It's still about half the speed, even on ART, of our standard ByteBuffer call, even with all of that, even with this fast switching.
And that's because, if you actually go and throw this into a debugger, you realize that that whole routine with lots and lots of shifts is not getting run at all; it's actually an intrinsic. So that's how this is being sped up. And even if it were running that code, ART is just really, really fast. It also turns out there is some overhead even in this fast JNI call, because it still has to set up the call stack and everything else needed to switch from running in the runtime to running native, and that takes about 50 to 60 nanoseconds in fast JNI, according to my benchmarks. All right, so is there anything we can do to avoid making a JNI call for every single int we want to read? It turns out there is. We can get it all at once using something like this: we get the buffer, we allocate an array, we wrap that in a new ByteBuffer, and then we do a bulk get. And then we have to fiddle with the position, because otherwise there's an overflow; there's no fast call that will just hand me the contents of that buffer. So believe it or not, this is what you have to do. And what does that look like if you do all of that, allocations included (and this is even including deallocations and stuff like that)? The answer is, of course, it's pretty slow; it's really, really slow on Dalvik. This is where you start getting into multiple levels of optimization. But if you're going to be moving a lot of data around, big, big chunks of data, and you're going to be accessing it from inside the runtime, then yes, this is a strategy that might make sense for you. For example, one thing you might want to do is use FlatBuffers to move big C structures to the runtime. How many of you are familiar with FlatBuffers, first of all?
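The wrap-and-bulk-get dance just described can be sketched in plain Java. This is my own illustration of the shape of the technique, not the talk's benchmark code:

```java
import java.nio.ByteBuffer;

// Sketch: drain a direct buffer into a heap array with one bulk get()
// instead of one JNI-backed getInt() per element, then read cheaply from
// the heap-backed wrapper.
public class BulkCopy {
    public static void main(String[] args) {
        ByteBuffer direct = ByteBuffer.allocateDirect(16);
        for (int i = 0; i < 4; i++) {
            direct.putInt(i + 1);          // pretend native code filled this
        }
        direct.flip();                     // fiddle with the position so the
                                           // copy starts at the beginning

        byte[] backing = new byte[direct.remaining()];
        direct.get(backing);               // one bulk copy out of the direct buffer
        ByteBuffer heap = ByteBuffer.wrap(backing);

        int sum = 0;
        while (heap.remaining() >= 4) {
            sum += heap.getInt();          // cheap array-backed reads
        }
        System.out.println(sum); // prints 10
    }
}
```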
All right, so FlatBuffers are really cool. They're an open source project that my team created, and they basically allow you to do really, really efficient translation from data coming in from disk or from the network into structures that you can use. It is about as efficient as you can get, given the amount of flexibility it has. It's actually very similar to protobufs, if any of you have used those, except that it's designed from the start to run on mobile and to run really, really fast. So if you're doing something like that, you might actually get some performance out of this approach. All right, so now, since we have a little time, I wanted to show you just a little bit of how

you use JNI in Android Studio. So once again, how many of you have actually tried doing this in Android Studio? OK, so that's not an enormous number of people, but that's OK, because this is actually really cool. This made my life so much easier than trying to deal with the various things that go on in JNI. Here's a whole bunch of native declarations that are inside of my JNI benchmark class, and you can see the kinds of things you'd expect, like these byte array calls and these string calls. So let's say I wanted to add another native method. I'm going to type "native", and let's have it return an int. I don't have to prefix it with "jni", but just for consistency, I'm going to call it jniPassABunchOfStuffToNative. We're going to pass, let's say, a string, a ByteBuffer, an integer, a long, et cetera, et cetera. And you see a couple of things have happened here. Probably the most useful thing is that we are now compiling the native code and the Java code all in one Gradle build, which is really awesome, because we can do stuff like say, hey, this function isn't found; we can't resolve it. You see, it shows up red: it knows that it's not in my native code. So here's the really, really cool trick, and for anyone who's done a lot of JNI code, the ability to do this is awesome. I can click on this, do "create function" here, and now I have a native function that's been created inside of that C file. And this is really cool. First of all, it's also done some helper things for me: it thinks I might want to get this string, interestingly enough, as UTF rather than double-byte characters, but hey, that's probably what your code wants. Then it's also gotten the ByteArrayElements for me, and released them at the end, because it's assuming that you're actually going to want to use these things, so it puts in that code for you. This is really, really cool. And the best part is, when I go back to my benchmark class here,
you'll see it's no longer red. It's actually done the compile and we are golden; we are ready to run that inside of this class. So if you haven't had a chance to play around with Android Studio and its support for the NDK, I highly recommend it. It's still a little bit of work to get your Gradle project up and running, because you've still got to use the experimental version of Gradle, but you don't have to use an experimental version of Android Studio; it is now in the mainline. So go check it out, play with it, and make sure that you're not making your applications that use the NDK very chatty. If there's anything you can take back from this entire lecture, it's that. All right, so I'm going to switch back into non-mirroring mode, so I can finish all of the exciting slides that are left in my presentation, which is really just this: if you need to get in touch with me, this is how you do it. I hope you have enjoyed the talk and learned a little bit. I have time for some questions; if anyone wants to stump me, this is a really good chance to do it, because you most likely will. But other than that, again, it's not that it's scary to use the NDK. It is really cool to use the new Android Studio stuff; you just have to be cognizant of the kinds of performance problems you can create with it, and I hope you've gotten a little bit of that from this talk. And once again, what's wonderful about an open source platform like Android is that you can go and explore the code. You can actually understand how we solved these very difficult problems, in many cases, and you can learn something, take it back with you in your engineering career, and use it again. That to me is half the fun of working on an open source project. I mean, wouldn't it be awesome if everyone could simply say, here's the reason why that doesn't perform well, let's go look at the source code? I think everyone should be able to do that. So I'm super excited to be able
to work on a development project that actually does have an open source backend So that being said, thank you very much for coming this morning [APPLAUSE] And I will take questions now OK? AUDIENCE: [INAUDIBLE] DAN GALPIN: I do not know That is a really good question You’ve now stumped me I’m so embarrassed That’s OK And yes? AUDIENCE: [INAUDIBLE] DAN GALPIN: Uh huh
AUDIENCE: [INAUDIBLE] DAN GALPIN: Mm hm AUDIENCE: [INAUDIBLE] DAN GALPIN: So if it is, like let’s say you’re doing JNI’s GetStringCritical, which would mark that as being in use That will never be moved That is fixed in memory DirectByteBuffers are also fixed in memory, they can’t be moved There is a little bit of weirdness around that Because if you look at the way they are allocated, there is a little bit of code that checks around moving them But once you’re actually accessing them, getting that DirectByteBuffer address, it is fixed in memory It can be moved, however, outside of that call So once that call goes away, my understanding is that it can be moved So again, it’s protected for the lifetime of that call, I think That’s a really good question I think that’s what I remember, and don’t quote me on that one I might be wrong It might be always protected But in looking at the allocator, there are actually two different kinds of allocation that can happen And for very, very small allocations– like less than three pages, it goes into the movable allocation pool And for things that are larger than that, at least in the current implementation, it’s not movable ever So yeah, kind of yes and no Yeah? AUDIENCE: [INAUDIBLE] DAN GALPIN: It depends on whether or not– so the question is, if you’re using a backend to deal with this data, and you’re talking to C++ code, ultimately you need to get that data into your C++ code, is it better to just use the networking services that are built into the NDK, or is it better to actually do everything in Java? And there’s kind of two questions I have about this Is the performance of your networking something you actually even care that much about?
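The direct-buffer pinning described a moment ago can be seen from the Java side with a small desktop sketch (the class name and buffer sizes here are illustrative, not from the talk): a direct buffer’s backing storage lives outside the managed heap, which is why native code can hold a stable pointer to it via JNI’s GetDirectBufferAddress.

```java
import java.nio.ByteBuffer;

// Minimal sketch: a direct ByteBuffer is backed by memory outside the
// garbage-collected heap, unlike a normal heap-backed ByteBuffer.
public class DirectBufferDemo {
    public static void main(String[] args) {
        ByteBuffer direct = ByteBuffer.allocateDirect(1024); // native-backed
        ByteBuffer heap = ByteBuffer.allocate(1024);         // managed-heap-backed

        System.out.println("direct.isDirect() = " + direct.isDirect()); // true
        System.out.println("heap.isDirect()   = " + heap.isDirect());   // false
    }
}
```

On the native side it is this native-backed storage that GetDirectBufferAddress hands back, which is why it does not need the copy-or-pin dance that array and string accessors go through.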
That’s the first thing If you’re not on the main thread, and you’re processing some stuff in native code, you might not care that it’s a little bit more expensive Because you’re not actually affecting the frame rate of your application And you might be saving an enormous amount of time by actually using the implementations that are in Java So as a general rule, you really want to look to see whether or not you actually care about that particular performance loss, and then weigh it Yes, for pure performance you’re going to do way better if you parse something completely in native code– especially if you’re not using it on the Java side of things Then yeah, that would make sense But the real question you have to ask is what’s the cost of that? What’s the cost in terms of opportunity? How much more time is it going to take me? Is it really worthwhile? And that’s– with all of these things, that’s what I say If it’s an easy optimization, like let’s throw a couple parameters into a JNI call, by all means do it Don’t waste more time, don’t waste more battery But if it’s going to mean rewriting an entire library, then really look closely at it and say, how much am I really gaining out of this? Mm hm? AUDIENCE: [INAUDIBLE] DAN GALPIN: Mm hm AUDIENCE: [INAUDIBLE] DAN GALPIN: OK, so the question is about JNI versus using Renderscript When does it make sense?
So what’s really cool about Renderscript first of all, is that Renderscript is actually LLVM byte code that gets compiled on the device And there’s some beautiful things that you get from that One is that it can be actually optimized for that particular CPU that’s running on that device, to some degree There are certain kinds of optimizations you can’t do for LLVM But there’s a whole bunch that you can There’s intrinsics that you can actually swap in and out There’s like peephole optimizations that are specific to actually how that particular device works So one of the secrets of Renderscript is that it actually can generate better code than the compiler can, in some cases The second thing is it’s also running in a kernel It’s actually running in its own little tiny machine, that is used to run massively parallel stuff And it’s really set up to do that very, very well So if your problem space falls into Renderscript, something that’s really helped by parallelization, and something that is also helped by using these intrinsics that you get from the LLVM byte code, then by all means use it But as a general rule, I would say that again you’re looking at opportunity, time and cost If you’re not seeing that it’s a performance issue that’s
impacting you, it may not make sense to go through with that Part of the reason we have JNI is to be able to reuse all this crazy amount of C and C++ code that’s out there And so for me, you always have to balance these things But from a true performance standpoint, it is very possible that Renderscript will be the highest performing way to do certain kinds of operations, because it can just do a whole bunch of things that the compiler can’t do because it just doesn’t know enough about the system architecture And it really depends on how well the individual OEMs have actually managed to– or chip providers have actually managed to optimize the Renderscript compiler on their particular chipsets So there are a lot of variables here I wish there was a cut-and-dried answer But what’s great about Renderscript– the really cool reason you might want to use it anyway– even despite all that, is because as I said, the LLVM byte code gets compiled on the individual system So you only have to ship one copy of the byte code You don’t have to use a dependency on the NDK Or you don’t have to worry about it bloating the size of your build with a bunch of different executables And that by itself might be worth investigating Renderscript, just for that one reason Now with 64-bit, I believe you actually do need to ship 64-bit byte code so it’s not completely transparent to architecture, I think I haven’t actually tried this But I vaguely remember reading that somewhere AUDIENCE: [INAUDIBLE] DAN GALPIN: Sure AUDIENCE: [INAUDIBLE] DAN GALPIN: So if you do mallocs and frees, it’s separate It’s actually using a different allocator It’s using jemalloc when you’re doing stuff from the NDK, and you’re using rosalloc when you’re in the virtual machine And the reason is that if you went to the talk yesterday, rosalloc is really, really good at garbage collecting in the background And we’re trying to avoid heap fragmentation by bucketing all of our memory allocations Jemalloc is not trying to have everything cleaned up in the background It doesn’t have to be as parallelized So it’s a slightly faster allocator than rosalloc, when you’re in native So yeah, they don’t share space It’s been a long-standing thing in Android that if you desperately, desperately need to run something that couldn’t run inside of the heap space that we give you in the runtime, you can add native code There are other ways to do that too You can run multiple VMs Like by launching each activity into a different process There’s all sorts of ways of getting around this But even using ashmem is a last-ditch resort, if you actually are completely out of all the memory, we allow you to do that But realistically, yes, they’re entirely separate heaps Yes? AUDIENCE: [INAUDIBLE] DAN GALPIN: Oh, it’s not that complicated It’s just if you are– so the reason I say– complicated was probably not the right term to use there It’s mostly that we’ve changed the structure of the way the Gradle files look So if you actually look at what we’ve done, we’ve added the concept of model into the experimental version of Gradle Oh sorry, you can’t see Let me mirror, let me mirror the display I just did like the dumb Californian thing here All right So now you can see what I’m seeing So if you take a look at the build.gradle here, you’ll notice that we’ve added this concept of model So now, instead of android being at the top, model is at the top So basically you need to go through and restructure your Gradle build a little bit, in order to take advantage of this There’s some pretty good stuff online You also see all the kind of standard things you’d expect to see in the old NDK build are there You can actually add libraries here And also turn on– sorry, static libraries as well as dynamic libraries here So pretty basic stuff You can see I’m not using any of this in that This is also how you build different product flavors I’m building x86, ARM v7, and ARM v8 Actually I’m building all,
because I have all of these here It’s hilarious But in any case, this is how you would do product flavors, and dependencies Again just like normal Gradle stuff So it’s a little different in structure but it’s really not hard to set up once you’ve actually done it once If you want, I can even show you debugging, it’s really cool I think actually– I may not be able to show you debugging We’re basically out of time But if you want I’ll show you debugging If anyone wants to come to a table out there, I can show you how the debugger works AUDIENCE: [INAUDIBLE] DAN GALPIN: So if [AUDIO OUT] two
options [AUDIO OUT] Google Play You can either upload each individual variant as a separate multi-APK chunk Basically those are all separated by version codes Or your other option is, you can put them all into one APK, and it will do the right thing when it actually launches the applications You’ve got two options It really depends on how much native code you have, and what percentage of your APK size it is For some people, even having six flavors of their NDK libraries will only be a negligible amount of their space For others, let’s say you’re running something really big and heavy like Unity– you know, it has its own runtime and all sorts of stuff You’re definitely going to have to seriously consider [AUDIO OUT] distribute it on Play So a multi-APK is really the way to go if you want [AUDIO OUT] lots of different versions And [AUDIO OUT] native machine like that, and I highly recommend doing it [AUDIO OUT] use our translator to actually run ARM code It’s pretty fast, but it’s not nearly as battery-efficient as x86 code So I highly recommend doing an x86 build as well And I think one of the big things I hope we do is make multi-APK even easier to use Because right now, there’s sort of a partitioning scheme we suggest And it’s a little bit more of a challenge to walk through the first time on the Play Store So I’m hoping we actually make that better I think we’re out of time though So I can totally take questions afterwards But thank you all for coming I hope this was fun [APPLAUSE] And enjoy the rest of your barbecue [MUSIC PLAYING]
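The two Play Store options just described– one APK per ABI, separated by version codes, or a single fat APK– can be sketched with the stable Android Gradle plugin’s ABI-split DSL. The ABI list, the abiCodes map, and the version-code scheme below are illustrative assumptions, not taken from the talk:

```groovy
// build.gradle (app module) — hypothetical sketch of per-ABI APK splits
android {
    splits {
        abi {
            enable true
            reset()
            include 'armeabi-v7a', 'arm64-v8a', 'x86'  // one APK per ABI
            universalApk true  // also build a single fat APK with every ABI
        }
    }
}

// Illustrative version-code scheme: each ABI gets its own offset so Google
// Play can tell the uploaded APKs apart and prefer the best match.
ext.abiCodes = ['armeabi-v7a': 1, 'arm64-v8a': 2, 'x86': 3]
android.applicationVariants.all { variant ->
    variant.outputs.each { output ->
        def abi = output.getFilter(com.android.build.OutputFile.ABI)
        if (abi != null) {
            output.versionCodeOverride =
                    abiCodes.get(abi) * 1000 + variant.versionCode
        }
    }
}
```

With universalApk enabled, one build produces both options at once: the per-ABI APKs for multi-APK distribution, plus the single fat APK that does the right thing at install time.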