HUG Meetup Jan 2015: Using HBase Co-Processors to Build a Distributed, Transactional RDBMS

Our speaker tonight, from Splice Machine, is the company's CEO. Thank you very much; I appreciate you having me here tonight. The talks have been really interesting, and this fits nicely into the flow. We're a little different from some of the databases and data-manipulation systems you've heard about tonight, in that we're really trying to be a general-purpose database. In fact, in some cases we're finding that we can replace Oracle to power applications in real time on the Hadoop stack (I'll tell you some stories about that), and in some cases we're finding a 10x price/performance improvement. It's an affordable, scale-out database with a shared-nothing architecture, it is fully ACID compliant, and it essentially allows you to use your existing ANSI SQL with very few rewrites.

The best way to set context is to tell you a couple of stories about what customers do with it. Our first customer that's been vocal is Harte-Hanks, a direct marketing services company, and they had a real problem with a packaged application. They had built a campaign-management application from a number of components: Unica, which is now owned by IBM, as the standard campaign-management piece; reporting on IBM's Cognos; data manipulation in an ETL tool; and all of it powered by Oracle RAC. We met them about a year and a half ago, and they were out in the market trying to figure out what to do, because the Oracle implementation behind this application was struggling: they weren't meeting SLAs and they were at risk of losing customers. They were looking at a variety of solutions, from other database platforms to perhaps rewriting on NoSQL or replacing the whole stack. We said that we were trying to build a database that could replace Oracle, and they engaged us and said, "Prove it." They gave us a set of queries that they had spent a lot of time performance-tuning on RAC; there was a tremendous amount of partitioning and other tuning, which we stripped out. We just ran the queries on our platform, and on that set of five queries we were basically three to seven times faster, on a platform that would be about one quarter of the cost. We were lucky enough to get them as a customer, and they are now replacing that entire database layer: keeping the application layer, taking out Oracle RAC, and putting in Splice Machine. That's one example. The other kinds of applications we see are people, just like in that example, simply wanting to power real-time concurrent applications, and others trying to use us in operational data lakes, where it's really important to be able to change individual records rather than doing big batch appends or batch updates.

I think the best way to characterize the product is to tell you how it's built and give you a feel for what we think we do well. If I were to do a quick survey: how many of you are Hive users? How many of you are Spark users? A couple. But a lot of MapReduce programmers, obviously; I imagine most of you have written big MapReduce programs. Our purpose in life is to be able to do the kinds of things that a traditional database would do for OLTP applications. What does that mean? All of those other SQL-on-Hadoop systems, from Impala all the way through to Hive, are very good at analytical computations; they're very good at long-running queries; they're very good at big table scans and large aggregations, typically when you have to touch every record.

But what happens when you want to find one particular record, or you have to do millions of updates in a very short period of time, not just appends but updates or deletes? What if you have to do small range scans? And what if you have to manage concurrency, when you have many connections to the database doing reads and writes at the same time and you need to keep that database consistent? Those are the features where Splice Machine fits well compared with some of the other SQL-on-Hadoop systems out there.

The way we built it is the following architecture. We came to market quickly because we started with some open-source components, namely Apache Derby, which is a fairly old relational database system; it's ANSI SQL-99, and it's a centralized, file-based system. We essentially married it to HBase. We take advantage of the Derby parser and its planner, we updated its optimizer and a bit of its executor, and we replaced its storage layer; I'll talk more about this as we go a little deeper.

So, Derby. Derby is a hundred-percent Java relational database that started in 1996 as Cloudscape, was acquired by Informix and then IBM, and was contributed to Apache back in 2004. It has many of the features you'd expect: it's typically used as an embedded database, it can do stored procedures, it has triggers, and it has isolation support. It's a pretty interesting piece of software, but the problem with it is that it's centralized.

If you take a look at how we process SQL: we take a prepared statement, we look to see if it has been processed before and is in the cache, we parse it using Derby, Derby creates a plan, that plan gets compiled into bytecode, and that bytecode instance is distributed to a bunch of servers. Basically, when you compile the SQL into bytecode, it gets serialized to each HBase node and is executed locally on that HBase node utilizing HBase's coprocessors.

How many people in here know what an HBase coprocessor is? A couple; let me give a quick bit of background. It's a feature that was added fairly recently to HBase, essentially the analog of a traditional database's triggers and stored procedures. There are endpoint coprocessors, which let you run a bit of custom code on every HBase node, in the same memory space as the region server, with local access to the data on that particular region server. And then there are observer coprocessors, which are essentially little pieces of code that you attach to the database's mutation or access operations, so on every get you run something, or on every put, or delete, or scan, you can attach your own custom code. This allowed us to execute full SQL processing utilizing HBase, pushing computation down close to the data.

So essentially what happens is: you connect to any one of the HBase nodes, which is also a Splice SQL engine; you submit a prepared statement; it's compiled into a plan; the plan gets compiled into bytecode; the bytecode is then sent to each of these HBase nodes; each one executes its piece in parallel; and then of course the results get spliced back together again, hence the name of our company, Splice Machine. That's essentially how it executes a SQL plan. So in our architecture there is no MapReduce, and there is no central computation being done at the top level; it's all being done locally, where the HBase nodes are.
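To make the observer coprocessor idea concrete, here is a minimal sketch using the HBase 0.98/1.x-era RegionObserver API. This is not Splice Machine's code: the "txn" column family and the rule of rejecting writes that carry no transaction metadata are illustrative assumptions, standing in for whatever checks a transactional engine would actually run on each put.

```java
// Minimal sketch of an HBase observer coprocessor (0.98/1.x-era API), not
// Splice Machine's actual code: a RegionObserver hook that runs on every Put
// against a region, in the same JVM as the region server, before the mutation
// is applied.
import java.io.IOException;

import org.apache.hadoop.hbase.DoNotRetryIOException;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

public class ExampleWriteObserver extends BaseRegionObserver {

  // Hypothetical column family and qualifier carrying per-row transaction metadata.
  private static final byte[] TXN_FAMILY = Bytes.toBytes("txn");
  private static final byte[] TXN_ID = Bytes.toBytes("id");

  @Override
  public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                     Put put, WALEdit edit, Durability durability) throws IOException {
    // Illustrative rule: reject writes that arrive without transaction metadata.
    // A real engine would consult its transaction manager here instead.
    if (!put.has(TXN_FAMILY, TXN_ID)) {
      throw new DoNotRetryIOException("write rejected: no transaction id attached");
    }
  }
}
```

A coprocessor like this is attached to a table through its table descriptor or through hbase-site.xml; endpoint coprocessors are the complementary mechanism for explicitly invoking custom server-side code next to the data.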

We've done a few things to Derby in order to accomplish this. First, Derby had a very flexible and modular storage component; the storage module can be ripped out, and that's exactly what we did. It had a B-tree-based storage system, we ripped that out, and we put HBase in. It handled concurrency in a very old way, with traditional lock-based concurrency; we implemented MVCC, a multi-version concurrency control method using traditional snapshot isolation (more on that in a minute), in order to get full ACID compliance on the database. It also had a few join algorithms; we replaced those with distributed join algorithms. Right now we have a distributed broadcast join, a sort-merge join, a merge join, and a number of other approaches, as well as nested-loop joins and what you would classically have to do for Cartesian-product types of operations. And we built a resource-management component over this system that takes the tasks being distributed to each HBase node and manages the computational resources of those nodes, so you don't overload them and you can control which tasks are running.

We've done some interesting things inside our system. One: we don't do a straightforward encoding of a SQL table in HBase. In other words, if you have five columns in your SQL table, we don't have five columns in the HBase table, because of the bloat that comes with the way HBase represents things. We do a very compact encoding: we have only one column for data and one column for transactional information, and we pack all the data into that efficient representation. It's a sparse representation, so you don't have to represent a column if there is no value for it, but it is compact. The reason we did that is that in our first version, when we didn't, we found that somebody's ten-terabyte database would turn into fifteen terabytes of storage, given HBase's inefficiency in how it represents data, plus indexes, plus of course Hadoop's replication factor. By doing this very compact encoding, a ten-terabyte data set on, say, Oracle is roughly the same size once it's migrated over to Splice Machine. We have full SQL-99 coverage, and we also did some extensions with window functions that our customers needed; we don't do all of the SQL:2003 window functions, but we do some of the important ones.
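To give a feel for the sparse encoding idea just described, where only the columns that are present get written and each value is tagged with its position, here is a toy sketch. It is not Splice Machine's actual on-disk format; the (column index, length, bytes) layout is simply one way to make the idea concrete.

```java
// Toy sketch of a sparse row encoding (not Splice Machine's actual format):
// only non-null columns are written, each tagged with its column index, so
// 400 nulls in a 500-column row cost nothing.
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public final class SparseRowEncoder {

  /** Encode a row as (columnIndex, length, bytes) triples, skipping nulls. */
  static byte[] encode(String[] row) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(buf);
    for (int col = 0; col < row.length; col++) {
      if (row[col] == null) {
        continue;                       // nulls are simply absent from the encoding
      }
      byte[] value = row[col].getBytes(StandardCharsets.UTF_8);
      out.writeInt(col);                // which column this value belongs to
      out.writeInt(value.length);
      out.write(value);
    }
    return buf.toByteArray();           // stored as a single HBase column value
  }
}
```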
Like I mentioned, we do full ACID transactions, and the way this works is very straightforward. For those of you who don't know what snapshot isolation is: essentially, any mutator of the database is given a timestamp, and a transaction is opened at that timestamp. You can make any changes you like to the database, but no readers will be able to see those changes until that transaction is committed. In order to create consensus on what the current timestamp is, we use ZooKeeper as the consensus manager, which is a standard component of the HBase ecosystem. This work extends some of what Google did with Percolator, some of the work done at Yahoo Labs, and some of the work done at various universities. The cool thing about this is that it adds multi-row, multi-table transactional capabilities to HBase, and it has a full rollback system, so no matter how many updates have been made, those updates can all be rolled back.
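Here is an illustrative sketch of the core visibility rule behind the snapshot isolation just described: a reader that started at a given timestamp sees only versions written by transactions that committed before it started. The transaction lookup table and the field names are assumptions made for the example, not the actual implementation.

```java
// Illustrative sketch (not Splice Machine's implementation) of the core
// snapshot-isolation visibility rule: a reader started at readTs only sees
// versions written by transactions that committed at or before readTs.
import java.util.Map;

public final class SnapshotVisibility {

  enum TxnState { ACTIVE, COMMITTED, ROLLED_BACK }

  // Hypothetical record of a transaction: its state and commit timestamp.
  static final class Txn {
    final TxnState state;
    final long commitTs; // meaningful only when state == COMMITTED
    Txn(TxnState state, long commitTs) { this.state = state; this.commitTs = commitTs; }
  }

  /**
   * @param writerTxnId transaction id stamped on the cell version
   * @param readTs      timestamp at which the reading transaction started
   * @param txnTable    lookup of transaction states (in practice a transaction store)
   */
  static boolean isVisible(long writerTxnId, long readTs, Map<Long, Txn> txnTable) {
    Txn writer = txnTable.get(writerTxnId);
    if (writer == null || writer.state != TxnState.COMMITTED) {
      return false;                     // uncommitted or rolled-back writes are never visible
    }
    return writer.commitTs <= readTs;   // only commits that happened before the read began
  }
}
```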

It also operates on indexes at the same time as the data: if you're updating data, the indexes for that data, including secondary indexes, are maintained with transactional integrity. If you commit your data updates, the index updates are committed too; if you roll back, both are kept consistent.

We're just a standard ODBC/JDBC database, leveraging some of the Derby components, so any customer can use their front-end BI tools with us; we look like SQL-99 to anybody. If you think about the SQL database ecosystem and where things lie, we are best utilized in circumstances where there is substantial concurrency, where people need to read and write at the same time, where you need to access small numbers of records very quickly, and where you need to do all of that with transactional integrity. We kind of see the world like this, although the lines are blurring these days: there's SQL-on-Hadoop, which is very analytically oriented; there's MPP, which is also analytically oriented; and then there's the NewSQL space. Some of the NewSQL systems target in-memory, big-analytics workloads, kind of like what MemSQL does, and some are really trying to be transactional, perhaps like NuoDB. I think we're more in that transactional world, but we benefit from being on the Hadoop stack. We think we're the only transactional Hadoop relational database management system out there, and we're building the components you'd expect an OLTP system to require: backup and restore, replication, explain facilities, management consoles, all of the things you typically need to run a mission-critical application.

So our goal is really to take Hadoop into a new class of workload. It has been extremely well deployed for building models and doing a great deal of analytics, a lot of batch data manipulation and data transformation, but, other than in some raw HBase implementations, it hasn't really been used as a mechanism to power real-time applications. We think there's a grand opportunity there, because people are really tired of their legacy relational systems, either because they don't perform well or because they cost so much money, and we want to take advantage of that. A lot of people are excited about what we're trying to do. It's early days for us: version 1.0, about 50 people in the company, and just a few customers out there. So we're really at our early stages, but that's basically our mission. I've tried to keep this fairly high-level and architectural, but I'm happy to take questions from the audience.

So the first question is: what do we do about PL/SQL? Harte-Hanks, for example, had thousands of PL/SQL stored procedures coming out of that application, and when we first saw that we thought it was going to be hellish for us. But it turns out there are PL/SQL translation and compilation tools out there, and what we literally did was take their PL/SQL, put it through a compiler, and generate our own Java-based stored procedures. We have full stored-procedure capability, but of course it's a different syntax than PL/SQL, so an implementation that's heavy in PL/SQL will require that stage of migration. It's work, but it's weeks of work rather than months upon months, and it's been affordable for the clients that have done it with us.
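Since Splice Machine's stored procedures are Java methods registered through Derby-derived DDL, here is a minimal sketch of what the target of such a migration can look like. The table, procedure, and class names are hypothetical, and the registration DDL shown in the comment is standard Derby syntax rather than anything specific to the migration described above.

```java
// Illustrative sketch of a Derby-style Java stored procedure (names are
// hypothetical). The procedure body is just a public static Java method that
// uses the default connection of the session that invokes it.
//
// It would be registered with DDL along these lines (Derby syntax):
//   CREATE PROCEDURE RAISE_PRICES(IN PCT DOUBLE)
//     LANGUAGE JAVA PARAMETER STYLE JAVA MODIFIES SQL DATA
//     EXTERNAL NAME 'com.example.procs.PricingProcs.raisePrices';
package com.example.procs;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class PricingProcs {

  public static void raisePrices(double pct) throws SQLException {
    // "jdbc:default:connection" gives a server-side procedure the caller's connection.
    try (Connection conn = DriverManager.getConnection("jdbc:default:connection");
         PreparedStatement ps = conn.prepareStatement(
             "UPDATE PRODUCTS SET PRICE = PRICE * (1 + ? / 100.0)")) {
      ps.setDouble(1, pct);
      ps.executeUpdate();
    }
  }
}
```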
The next question was, out of curiosity, why we chose Derby and not Postgres or MySQL. The simple answer is that we wanted to stay within the Java environment: Hadoop is Java, HBase is Java, and we just thought it would be more efficient for us (probably not more efficient for the code, but more efficient for us) to stay within one programming language environment.

I think that was a good decision across many different dimensions, but of course we hit some of the same kinds of performance challenges that you heard about in the previous talks tonight. We have to manage our heap very carefully, and we have to be very careful about garbage creation and tuning the garbage collectors. Being a Java-based relational database has its challenges, but we'd do it again.

There was a question about what I meant by saying it does not store nulls in that compact encoding we talked about. Say you had a table with 500 columns in it. Sometimes when we tell customers we use one column for data, they think: if I had 500 columns and 400 of them were null, this is going to be so much wasted storage. All we mean is that our encoding doesn't make you store 400 explicit nulls. We have delimiters that indicate exactly which columns are present in the encoding, so it's just an efficient encoding.

The next question was what we do for loading data, and whether you can use bulk load or some other mechanism. This became incredibly important to us, because our customers had terabytes and terabytes of data they wanted to use in a POC with us, and it was taking a very long time to get that data into our system. So what we did was write a parallel batch-write pipeline. What that means is that we take, say, a huge HDFS file, break it into batches, and push those out to each HBase node, and each HBase node can then process the ingestion with full transactional integrity, so we get parallel ingestion. We built our own batch-import mechanism in order to get the maximum performance we can, and we'll be putting out metrics on this. Our objective is to be able to do millions of records a second; the question is how many nodes you need to do that, and we're doing those benchmarks now and will publish them with the next release.
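Here is a simplified sketch of the shape of such a parallel batch-import pipeline: read a large file, cut it into fixed-size batches, and hand each batch to a pool of workers. The RowSink callback, the batch size, and the thread pool are assumptions standing in for the per-node transactional ingestion step; this is not the actual Splice Machine importer.

```java
// Simplified sketch of a parallel batch-import pipeline: read a large file,
// cut it into batches, and hand each batch to a pool of workers. The RowSink
// callback stands in for the per-node transactional ingestion step.
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public final class ParallelBatchImporter {

  /** Callback that writes one batch of rows (for example, as HBase puts). */
  interface RowSink {
    void writeBatch(List<String> rows) throws IOException;
  }

  static void importFile(String path, int batchSize, int workers, RowSink sink)
      throws IOException, InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(workers);
    try (BufferedReader reader = Files.newBufferedReader(Paths.get(path))) {
      List<String> batch = new ArrayList<>(batchSize);
      String line;
      while ((line = reader.readLine()) != null) {
        batch.add(line);
        if (batch.size() == batchSize) {
          final List<String> toWrite = batch;
          // Submitted as a Callable so the sink may throw IOException;
          // error handling via the returned Futures is omitted for brevity.
          pool.submit(() -> { sink.writeBatch(toWrite); return null; });
          batch = new ArrayList<>(batchSize);
        }
      }
      if (!batch.isEmpty()) {
        final List<String> toWrite = batch;
        pool.submit(() -> { sink.writeBatch(toWrite); return null; });
      }
    } finally {
      pool.shutdown();
      pool.awaitTermination(1, TimeUnit.HOURS);   // wait for all batches to flush
    }
  }
}
```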
The next couple of questions were about a competing SQL-on-HBase project and about what benchmarks we have. On the first: yes, we do know about it, and we think it's great that there are other people out there trying to make a real-time, transactional, Hadoop-based system. I don't personally know a lot about that architecture, but my architects tell me it's based on concurrency models that are much more lock-oriented rather than snapshot isolation, and that it therefore wouldn't be able to power real-time concurrent applications the way we would. I don't know if that's really true; it's what I'm told, I haven't used it, and I'd love to learn more about it, but that's our current take. The second question was what benchmarks we have comparing us to other SQL systems. We're just getting underway on competitive benchmarking, as well as TPC-C and TPC-H; we think we're going to be the only vendor that can run TPC-C on Hadoop. We're putting those numbers together now: my team has objectives for what our TPC-C and TPC-H metrics should be, we're tuning to them, and I expect those metrics to be out within a couple of months at most. So I don't have great numbers to compare yet, but I do know that we're a lot faster than Oracle, on a much cheaper platform, for most things.

One place where we're not faster than Oracle yet is the native OCI client that Oracle has for bringing data in; right now our ODBC connectors are slower, but that's something we'll quickly remedy. So stay tuned for benchmarks and comparisons from us. We think we can beat anybody on fast lookups, range scans, and updates, the TPC-C types of operations. I think others will be faster than us on TPC-H-style benchmarks that are full table scans, but in the use cases we're targeting, I think our TPC-H and full-table-scan analytical performance will be totally good enough. I suspect we'd be slower than something like Impala if we were running a huge TPC-H-style aggregation or join; I don't know, but we'll put out those numbers and find out.

The next question was whether it's possible to use us alongside other engines. It's a great question, and the answer is absolutely yes. We're HCatalog-enabled, and we're doing lots of things with Spark, where some of the work is on Spark and some is on us or some other platform. In our lab we're even prototyping some of our internal join computation engines operating on Spark, which will have a completely different OLAP profile; that's a future item for us. But to answer succinctly: you can do some work outside our system and some work inside it. The compiler system we just heard about really intrigued me from that perspective, and I need to learn a lot more about it. You can certainly push analytical operations that may be better on a columnar representation, or on some other MPP box, into that world and join the results with us. That's totally okay.

The next question was about primary keys. Right, we use HBase's row keys as our primary keys; we leverage that mechanism. And yes, we do support secondary indexes: we implemented them as dense indexes stored as additional HBase tables, kind of like an alternate representation, and our optimizer is smart enough to pick the index and the join algorithm so that we get the benefit of that secondary index inside joins.

I think the next question was how we handle splits and compactions in HBase when we're utilizing secondary indexes and doing join operations. This is probably where the most significant amount of innovation went in our development, because as we're executing the SQL plan and our tasks are coming off the queue and starting to operate, at any particular point HBase might split a table right out from underneath us, or a compaction might be triggered. So we have many mechanisms within our task-execution framework that can recognize these conditions, some through timeouts and some by explicitly knowing what's going on, and trigger retries, so that once the HFiles have been rewritten and the region is available again, the task just kicks in again.
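Going back to the secondary-index answer a moment ago, here is a toy sketch of the general technique of keeping a dense index as a separate HBase table: the index row key is the indexed column's value followed by the base table's row key, so a scan over a value range becomes an ordinary HBase range scan. The key layout and the zero-byte separator are assumptions for illustration, not Splice Machine's actual format.

```java
// Toy sketch of a dense secondary index kept as a separate HBase table: the
// index row key is indexedValue + 0x00 + baseRowKey. Assumes indexed values
// contain no zero bytes; not Splice Machine's actual key format.
import java.util.Arrays;

public final class DenseIndexKeys {

  private static final byte SEPARATOR = 0x00;

  /** Build the index-table row key from an indexed value and the base row key. */
  static byte[] indexRowKey(byte[] indexedValue, byte[] baseRowKey) {
    byte[] key = new byte[indexedValue.length + 1 + baseRowKey.length];
    System.arraycopy(indexedValue, 0, key, 0, indexedValue.length);
    key[indexedValue.length] = SEPARATOR;
    System.arraycopy(baseRowKey, 0, key, indexedValue.length + 1, baseRowKey.length);
    return key;
  }

  /** Recover the base row key so the base row can be fetched after an index scan. */
  static byte[] baseRowKey(byte[] indexRowKey, int indexedValueLength) {
    return Arrays.copyOfRange(indexRowKey, indexedValueLength + 1, indexRowKey.length);
  }
}
```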

Let me also mention something about resiliency. In many of the architectures out there, like Impala, if a node goes down in the middle of a long-running join, it kills the whole query. We have a very resilient mechanism: if a node goes down in the middle of a query and the region comes up on another region server, our joins and long-running queries simply continue on that other node. They're resilient. We do materialize some of our intermediate results in HBase, which has a performance impact, but it gives us essentially this resiliency.

The next two questions were, first, how we handle write conflicts between concurrent transactions, and second, what happens with compaction, when data gets compacted and moved around and we still need to keep the transactional bookkeeping straight. On the first: with snapshot isolation, reads have no conflicts to worry about. Readers aren't locked; they can just do whatever they want, because they only read values whose timestamps are earlier than their own. Writes, however, have to detect write-write conflicts. We've utilized the observer coprocessors in HBase so that on any write we can introspect the transaction state, and if you're trying to write over a record whose transaction is still in transition in the transactional state diagram, not committed or rolled back yet, then we detect a write-write conflict, in which case we simply fail the second write. It's optimistic concurrency, I think it's called: the first transaction gets control, and the second transaction loses. On the second question, compaction really doesn't impact us at all, other than from a performance perspective. The reason is that every data value in HBase has a timestamp associated with it, and we leverage those timestamps, so they follow wherever the files go. Compaction is really just taking two smaller HFiles and merging them together: the timestamps are preserved, the transactional integrity is preserved, and the data just physically changes where it lives. HBase will still be able to find it just as it did across the two HFiles, only more efficiently. Does that answer your question?
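Here is a sketch of the write-write conflict rule just described; it is not the actual Splice Machine code. A new write is rejected if the newest existing version of the row was written by a different transaction that is still in flight. Full snapshot isolation also rejects the write if that version committed after the new writer began, so both cases are shown.

```java
// Sketch of write-write conflict detection under snapshot isolation (not the
// actual Splice Machine code): fail a write if the row's newest version was
// written by a different transaction that is still active, or that committed
// after the new writer's begin timestamp.
public final class WriteWriteConflict {

  enum TxnState { ACTIVE, COMMITTED, ROLLED_BACK }

  /**
   * @param priorState    state of the transaction that wrote the row's newest version
   * @param priorTxnId    id of that transaction
   * @param priorCommitTs its commit timestamp (meaningful only if COMMITTED)
   * @param writerTxnId   id of the transaction attempting the new write
   * @param writerBeginTs begin timestamp of the new writer
   */
  static boolean conflicts(TxnState priorState, long priorTxnId, long priorCommitTs,
                           long writerTxnId, long writerBeginTs) {
    if (priorTxnId == writerTxnId) {
      return false;                           // a transaction never conflicts with itself
    }
    if (priorState == TxnState.ACTIVE) {
      return true;                            // first writer wins; fail the second write
    }
    if (priorState == TxnState.COMMITTED) {
      return priorCommitTs > writerBeginTs;   // committed after we started: conflict
    }
    return false;                             // rolled back: that write never took effect
  }
}
```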
The next question was whether we do anything with JSON, such as storing JSON and querying into it. We have customers that will store JSON in VARCHARs or some other representation, but my goal is to actually extend our SQL processing with a JSON query extension, so you can do some wildcarding with dot notation. We're not there yet; it's not in our current version, it's on our roadmap, but I suspect it will make us very attractive to those who have been enamored with the ease of use of MongoDB but maybe disappointed by its scalability. I think that's a market that's ripe for the taking, and hopefully we'll get there in the coming year or so. Right now we're very focused on traditional SQL applications and powering those, but we will extend to support JSON as a first-class data type in our system. The neat thing is that Derby already parses XML, so we're pretty close to being able to do things like that, and we'll probably leverage some of those extensions. And if a SQL JSON standard emerges, we might just adopt that.

The next question was how we prioritize queries, for example reporting queries versus application queries. Essentially we have a task scheduler within our system. Every time a SQL statement is executed, when that bytecode runs, it turns into tasks; the tasks are enqueued, those queues have priorities, and we just execute through them. It's a fairly simple mechanism. In the future we'll let that queue logic reason about the user, group, and role attached to the tasks, so that IT can manage their workloads and not let, say, a weekly report or an ad-hoc query written by an intern interfere with OLTP queries. Right now it's very simplistic: it doesn't do user, group, or role-based priority queuing.

The last question was whether we do something like looking at the NameNode and being sensitive to where the data is, so that queries are sent out to where the data lives and can be massively parallel. Right now, since we're based on HBase's data representation, the data is automatically auto-sharded across every region server, so the computations take place on every node if it's a whole-table type of computation. Alternatively, if it's a very specific operation on a filtered set of records, HBase is smart enough to only execute that task on the nodes within the range of the index you're working with. So it intersects based on knowing how the data is already sharded. An important future area for us is getting really smart about assignment: HBase's assignment algorithms are fairly simplistic, and I think there's a lot of research we can do to make them smarter, making the sharding introspect the SQL that's going to run and being smart about how to distribute the data, so that if you're doing certain joins, whether or not you have indexes on those join columns, the data is partitioned in a way that ensures locality. We're not doing that fanciness yet, but it's a future item as well. Any other questions? Thanks, this was fun. I appreciate it.
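To make the task-queue answer above a bit more concrete, here is a minimal sketch of a prioritized task queue of the general kind described. The task type, the numeric priorities, and the single worker loop are illustrative assumptions, not the actual Splice Machine scheduler.

```java
// Minimal sketch of a prioritized task queue of the general kind described in
// the scheduler answer above; the task type, priorities, and single worker
// loop are illustrative assumptions.
import java.util.concurrent.PriorityBlockingQueue;

public final class SimpleTaskScheduler {

  /** A unit of work with a priority; lower number means higher priority. */
  static final class PrioritizedTask implements Comparable<PrioritizedTask> {
    final int priority;
    final Runnable work;
    PrioritizedTask(int priority, Runnable work) { this.priority = priority; this.work = work; }
    @Override public int compareTo(PrioritizedTask other) {
      return Integer.compare(this.priority, other.priority);
    }
  }

  private final PriorityBlockingQueue<PrioritizedTask> queue = new PriorityBlockingQueue<>();

  /** Enqueue work, e.g. OLTP statements at priority 0, ad-hoc reports at priority 10. */
  public void submit(int priority, Runnable work) {
    queue.put(new PrioritizedTask(priority, work));
  }

  /** Worker loop: always takes the highest-priority pending task next. */
  public void runWorker() throws InterruptedException {
    while (!Thread.currentThread().isInterrupted()) {
      queue.take().work.run();
    }
  }
}
```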