Martin Kleppmann – Samza and the Unix philosophy of distributed systems


Before I get started, I'm trying a little new thing today: I actually published a transcript of what I'm about to say as a blog post, before having actually said it, so we'll see how the two match up. If you go to martinkl.com/unix you can follow along (in case I'm talking too fast or something like that), and feel free to tweet it or whatever you like as well; maybe that makes more sense than listening to me in real time.

So, I'm writing a book at the moment on the architecture of data systems, and in the process of doing research for this book I've been going back into a lot of the old papers from the 1970s and 80s and seeing what we can learn from them, realizing that, actually, you know, they were not stupid back then; they had a lot of really, really good ideas. In our field, because technology is so fast-moving, we sometimes have a tendency to assume that everything more than two years old cannot possibly be good, which is a big mistake. So what I'm going to try to do here is draw on some of the lessons from the development of Unix back in the 1970s and bring them into what we've been doing with 21st-century distributed data systems. I'm a committer on Samza, so I've worked mostly with that, and this talk is about Kafka and Samza and the interplay between the two.

So, about Unix tools. You're probably familiar with this, but let me give a concrete example so that we have a common language to talk about. Say you have a log file; this would be one line from an nginx log, and it's telling me various things: the IP address of the client who connected, the time at which they made a request, what the request URL was, what the response status was, and so on. If you want to figure out what this actually means, you have to look at the nginx documentation, which gives you the format string defining what one line of the log should look like. It's got these various variables, and you can see how they match up with the fields of the log line.

For the sake of the exercise, let's now see how we'd analyze this ourselves. You could of course take a premade tool, just dump the log into it, and get some nice graphs out; but what if you want to do the analysis yourself, using just basic tools? Say we want to find the five most popular pages on our website. How would you go about doing that? Well, first you have to take the log and somehow pick out the URL of the page that was requested, and then you can count how many times each one was requested. So we go to awk, our trusty Unix tool. What awk does is take its input line by line; within each line it splits on whitespace (by default) and gives you variables: $1 is the first whitespace-separated component of the line, $2 is the second one, and so on. So if we want this /css/typography.css, we can count the fields and see that it's $7, the seventh whitespace-separated component. So awk '{print $7}' will pick out that particular field from each line of the log file.
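As a minimal sketch of that step, assuming the log file is called access.log and that, as in this nginx format, the request path is the seventh whitespace-separated field, it would look something like this:

```sh
# Print the requested URL (the 7th whitespace-separated field) of every line.
# Assumes an nginx-style access log where the request part of the line reads
#   "GET /css/typography.css HTTP/1.1"
awk '{print $7}' access.log
```

The output is just one URL per request, for example /css/typography.css or /favicon.ico, one line per line of the input log.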
Now, if we want to do this analysis, we can chain a bunch of commands together. Maybe you've written this kind of thing before; if you haven't, it's really worth learning, because it's super empowering to be able to just faff around with a few files and get some results out with a couple of commands chained together. You start with the access log, which looks something like the line I showed. The first thing we do is extract the URL, which, as I was saying, gives you just the URL that was requested, for each request in the log file. The next thing you can do is sort them, so that all the occurrences of the same URL are next to each other in the file. Once you've done that, you can put it through the uniq tool, which looks at two adjacent lines in the file and, if they are the same, collapses them down; the -c option adds a counter in front of each, so here there are two occurrences of /css/typography.css, two occurrences of /favicon.ico, two occurrences of /index.html, and so on. Now, if we want the five most popular pages, we have to rank them by how many requests occurred, so we sort again, this time by the number at the start of each line: the -n option gives us numeric sorting, and -r reverses it, so we have the highest number at the top. Then we just pick off the top five results, and we've got the five most popular pages on the website.
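Putting the whole chain together, again just as a sketch assuming the same access.log file, the pipeline described above looks something like this:

```sh
awk '{print $7}' access.log |   # extract the requested URL (field 7) from each line
    sort |                      # bring identical URLs next to one another
    uniq -c |                   # collapse runs of identical lines, prefixing each with a count
    sort -rn |                  # sort numerically by that count, highest first
    head -n 5                   # keep only the top five
```

Each stage is an off-the-shelf tool that knows nothing about the others; the shell just wires their inputs and outputs together.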

OK, so this is the kind of style of Unix that I'm talking about, and this is what they figured out back in the 1970s: you can have these small tools, like sed and awk and grep and sort and uniq and xargs and so on, which each do one thing and do it really well, and you can combine them, compose them together, into larger systems. The people figuring this out went about it experimentally, but after a while they realized that what they were doing was building a system based on certain principles, and they then post hoc rationalized those principles and tried to phrase exactly what it was they were trying to do. They explained this philosophy in a nice document, a report from McIlroy and colleagues in which they described their philosophy of building the system. They said: make each program do one thing and do it well. This gets quoted all the time, but this is where the quote originally comes from. So something like grep only does searching; it doesn't do any other transformations, it just does that one thing. Or sorting: the sort program is actually really cool if you look at it; it can handle datasets much bigger than memory without a problem. And then the crucial bit: expect the output of every program to become the input to another program. That's what this chaining is about. The key thing here is that one program doesn't care who is going to consume its output; it can be a completely unknown program, and that gives us composability, which is really nice.

The way we compose these little programs that each do one thing is that we join them together with the shell. With the shell we can write pipelines, something like the one above: we pipe the output of one thing into the input of the next, and the shell is the thing that does this wiring up. What the operating system itself actually provides is the pipe: a way of getting the output of one process into the input of another process. It's just a syscall, a pretty simple thing actually, but it's really the key to making this work.

A little background trivia here: this is a scanned page from an internal Bell Labs memo from 1964, in which McIlroy is basically credited with the invention of the pipe. He described it back then (they didn't use the word "pipe" at that point) by saying we should have some way of coupling programs "like a garden hose: screw in another segment when it becomes necessary to massage data in another way". This is 1964; this is brilliant. He also said "this is the way of IO also", so what he's talking about there is IO redirection: he's already realizing that you're sending the output of a program somewhere, but that somewhere can be another program, that somewhere can be a file, that somewhere can be something completely different; it really doesn't matter. What we have is these little tools which take standard input and produce standard output, and maybe a standard error stream on the side. It's a really simple interface, so simple that anybody can implement it. So you can have a tool like awk which implements exactly the same interface, and that makes it super flexible. When you build these pipelines, tools like grep or sort or uniq are all provided by the operating system, but you can inject your own code at any point in the pipeline where you want. So now, staying with this web log processing example, say you want stats of requests by country: how many people from each country have viewed your site?
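Just to make that point concrete, that the "somewhere" can be anything, here is a tiny sketch: the same command's output can be wired to another process, to a file, or to a device, and the program itself neither knows nor cares which.

```sh
awk '{print $7}' access.log | sort        # output goes to another process, via a pipe
awk '{print $7}' access.log > urls.txt    # output is redirected to a file
awk '{print $7}' access.log > /dev/null   # output goes to a device that discards everything
```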
Well, the nginx log itself doesn't contain country information, but it does contain the IP address of the client, and you can use a GeoIP database to go from an IP address to a country. There's no built-in Unix tool for taking an IP address and geolocating it, but you could take one of those databases, write your own little tool, and just insert it into this pipeline. I mean, this may seem painfully obvious if you work with this kind of thing all the time, but actually it's amazing: what we've done here is take bits of the operating system and put our own code somewhere in the middle of them, and the operating system is totally fine with that. Your own code runs on equal terms with the modules provided by the operating system. You don't get this kind of composability with other types of applications; you can't just take Gmail and pipe it into Elasticsearch, it just doesn't work like that, but with Unix tools we chain things together like this all the time. So this gives us wonderful extensibility: the system is wonderfully extensible, because you can take whichever bits of the operating system you want to use, put your own code where you want it, and plug these things together in a very modular fashion.
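As a sketch of what that could look like: geoip-lookup here is a hypothetical little script you would write yourself against one of those IP-to-country databases, reading IP addresses on stdin and writing country codes on stdout. It is not a standard Unix tool.

```sh
# Hypothetical example: count requests per country.
# ./geoip-lookup is your own code (not a built-in tool), dropped into the
# middle of a pipeline otherwise made entirely of operating-system tools.
awk '{print $1}' access.log |   # $1 is the client IP address in this log format
    ./geoip-lookup |            # your code: IP addresses in, country codes out
    sort | uniq -c | sort -rn   # the same counting trick as before, now per country
```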

OK, let's switch gears for a minute. That's enough talking about Unix; what about databases? Databases came out at about a similar time: the idea of the relational model came up in 1970, so this is roughly concurrent with Unix, and a lot of database systems run on Unix. So you might be thinking that databases follow a similar design philosophy to Unix, and that would be completely wrong; they're actually totally different. With a database, as you probably know, you have a server and you have clients, and these are two very different roles. The database server just sits there all the time, waiting for requests from clients to come in; the client sends a request to the database server and at some point gets a response back, so it's a very transactional kind of relationship between the two. They are two very different things: the client runs in your application code, whereas the server is this totally separate thing which you operate but whose code you don't change.

So the extensibility of these database servers is very limited. You get kind of janky stored procedure languages, you know, like Oracle PL/SQL and things like that, with which you can write triggers and so on. Some recent databases will let you write JavaScript as well, which runs inside the server, but it's heavily limited in what it can do; typically you can't just make arbitrary requests to someone on the internet from inside your database server, whereas with Unix tools there's no limit to what a tool can do. So here you're operating within a database that you can extend, but only in very specific, constrained ways. There is some stuff like custom storage engines, basically plugin APIs, where the database server has foreseen extension points, where the database authors have granted you permission to extend the server in restricted ways, and you can do that; but it's very different from the kind of composability of little components, where you can pick and choose, that we saw in the Unix world. Why can I not just take a database and pipe it into a search engine? There's no fundamental reason why I shouldn't be able to do that; it's just that nobody has written the code to make this kind of thing work.

So I feel that the design of databases is very self-centered: every database thinks that it's the most important component of your infrastructure, and why would you possibly want to use any other data store? It will send responses back to your requests, but it will not cooperate; you have to force these things if you want them to cooperate. They're not composable in any meaningful way at all. Moreover, databases, at least the typical relational ones, try to do everything: they try to have all of these different features in one monolithic application. So on the one side you have the Unix philosophy, which says each tool should do one thing and do it well; on the other side you have these databases which try to do everything. They try to do transaction processing and full-text search and maybe replication and so on, and all of these things are crammed into one big application, one big database server. If this were designed according to the Unix philosophy, these would all be separate things, and they would be very loosely coupled: you would just pipe one into the other and they would do the right thing. So we have two very different philosophies here: the Unix philosophy, with its very simple interface of standard in and standard out, and the databases, which are really deeply integrated monoliths. They give you very nice things, like high-level query languages (SQL is a wonderful thing), but they're not really extensible.

So let's talk a bit about what makes this extensibility work. The key to it is a uniform interface: simply, files. In Unix, a file is simply a stream of bytes; that's it, and it may or may not have an end. Because this is such a simple thing, almost anything can be a file. Obviously an actual file on the filesystem can be a file (a file descriptor, maybe I should say), but a file descriptor could also be a pipe to another process, as I was talking about, or it could be a Unix socket, or it could be a device driver, talking to your GPU or your disk or some other special device, and it just looks like a file. If you think about it, that's actually amazing: all of these different things can be mapped onto the same abstraction.

You can even talk to kernel APIs through the /proc filesystem on Linux, which again just looks like files, so you can pump some kernel API output into grep and it just does the right thing. To some extent TCP connections look like files as well, although arguably the BSD sockets API departs a bit from that model. So everything is a file, or I should say a byte stream, and this is really key, because the output of one process is the same kind of thing as the input to another process: they speak the same language. It's only because we have that uniformity that you can chain things together; if they spoke different languages it simply wouldn't work. We have to agree on the way that data gets out of one process and into another, and what we've agreed is: OK, a file descriptor is going to be an ordered sequence of bytes. Don't take that for granted; it could have been a bit stream rather than a byte stream, or a stream of 16-bit words or whatever, or it could have been some kind of structured records, but anyway, what they chose was bytes.

You then often need to give these byte streams some kind of structure. If you have gzip, for example, it doesn't care what the structure of the thing is; it simply takes a stream of bytes and compresses it for you. But other tools, like awk and sort and so on, in order to do something useful, have to at least break up the file into records, and the convention on Unix is that a newline is your record separator, so each line is a separate record. It could have been something else, but anyway. And often whitespace is the field separator, but there it gets kind of vague, so you start having to add a lot of input-parsing options to tools: I looked at the xargs man page and counted about six different command-line options related only to how its input should be parsed, which is kind of an indication that something is a bit wrong there.
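Two quick, concrete illustrations of what I've just described: kernel state showing up as ordinary files, and the newline/whitespace conventions plus the knobs you need when they don't quite fit (the paths here are Linux-specific, and this is just a sketch):

```sh
# Kernel APIs exposed as files: /proc entries can be piped straight into grep.
grep 'model name' /proc/cpuinfo

# Newline is still the record separator, but the fields in /etc/passwd are
# separated by ':' rather than whitespace, so we have to tell awk with -F.
awk -F ':' '{print $1}' /etc/passwd
```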
So I think there are a lot of great things about the design of Unix tools, but they also made some mistakes. What would it look like if we took the best of those ideas, brought them forward into the 21st century of distributed systems, and tried to fix the mistakes as well?

What I think Unix did well: the composability thing, really well. You can have these bits of your own code which run on equal terms with the operating system, which is really great, and it allows you to build loosely coupled systems that you can evolve gradually over time. It uses data streams, so it doesn't require you to have the entire result of one computation before you can start the next computation; you can pipeline things, which is good for concurrency. And it has this really good idea of simple, uniform interfaces, so that you can plug the output of one thing into the input of another.

There are some problems, though. Unix was designed for running on a single machine; this was before distributed systems were a mainstream thing, and nowadays most of our interesting applications have to run across a cluster of multiple machines, so we have to make this thing distributed. Another problem I see with Unix pipes is that they are a one-to-one communication mechanism: you've got exactly one process producing output and exactly one process consuming it as input. You can't broadcast something to multiple processes, and you can't collect input from multiple sources; but if you're running your application servers, your web servers, on a whole bunch of machines, you really do need to collect the events from all of them. Also, as I said, the byte stream is a very low-level abstraction; it would save a whole bunch of pain related to parsing input if there were some kind of higher-level data model there. And finally, a problem I see with Unix pipes is that there's no fault tolerance, really. If you write one of these pipelines, it's expected to run for a few seconds and then quit, and you don't expect that one process in the middle can crash and the pipeline will recover and keep working. That's fine if your pipeline runs for a few seconds, but our applications nowadays are assumed to run forever, basically, and in that case you really do need the ability for bits of the pipeline to go down, come back up again, and recover after a while.

And this is basically what Kafka and Samza do. So I've spent all this time talking about Unix, but actually all of the stuff I've talked about carries over directly to the world of Kafka and Samza, and I'm just going to quickly show how that analogy works. If you build these kinds of pipelines, chaining things together, you can zoom in on one bit of it, and what I see there is basically message brokers and stream processing jobs.
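Just to make the shape of that analogy explicit before going through the details: a stream processing job is, morally, a Unix filter whose standard input and output happen to be Kafka topics. This is purely a conceptual sketch, not how a Samza job is actually deployed (Samza manages its own consumers, state, and recovery inside the framework), and the topic names are made up; the console tools that ship with Kafka can nevertheless be chained in exactly this way, although flag names vary between Kafka versions (older releases used --zookeeper and --broker-list rather than --bootstrap-server).

```sh
# Conceptual sketch only: a "stream processor" as a plain Unix filter wired
# between two Kafka topics using Kafka's console consumer and producer tools.
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic page-views |
    awk '{print $1}' |   # stand-in for your processing logic: any stdin/stdout program
    kafka-console-producer.sh --bootstrap-server localhost:9092 --topic derived-stream
```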

So these Unix pipes, taken to a distributed setup across multiple machines, look somewhat like a message broker: it lets you get messages from one side over to the other side, it ships messages around; it's a transport mechanism, basically. And these commands that you run are somewhat like stream processing jobs: they take events in on one side and produce events on the other side, and maybe they do something stateful, or update some state, or make requests to a database, whatever. I think this analogy works reasonably well.

But what about the downsides I listed? I said Unix pipes are single-machine; well, with something like Kafka we can take this to a distributed setup, and scaling across multiple machines works really well: millions of messages per second, no problem. The issue of one-to-one communication: in Kafka, a topic is not a one-to-one communication channel, it's a publish-subscribe channel, so you can have many producers, like all of your web servers, pushing messages to the same stream, and their messages simply get interleaved in the way you'd expect; and on the subscriber side you can have various different applications consuming the same stream of messages, and they don't interfere with each other. You can just spin up a new consumer listening in on an existing stream, and it'll sit there quite comfortably; it doesn't interfere with anyone else who is also consuming the stream. Fault tolerance: yes, that's solved. Kafka will replicate messages across multiple machines, so it can survive the failure of individual machines; and with stream processing jobs, like with Samza, there are recovery semantics built in, so if you have a job running on a machine and the machine dies, the framework provides the tools to bring the job back up again on another machine without losing any work, without missing any input messages, and so on.

And then finally there's the issue of input parsing. Kafka itself has only a fairly low-level data model: it's not a stream of bytes, as Unix pipes are, but a stream of messages, and each message is just an array of bytes, so it's only moved up slightly. It's basically saying: OK, in Unix they were separating records with newlines, and pretty much everyone has to split a stream of bytes into some kind of records, delimited by newlines for example, so why don't we just build that into the transport channel? By the way, that also gives us a nice way of interleaving messages from multiple publishers: if you tried to interleave bytes from multiple senders in a byte stream, you'd get a right old mess, but with messages you can do it in a sensible way. On top of Kafka, people can then use whatever serialization they want, like JSON or whatever. Many people recommend using Avro in that context, which works really well with schema management; I won't go into too much detail on that, but basically just that little bit of extra abstraction really helps there.

So I think this analogy between Unix pipes and Kafka works reasonably well, but I should just highlight what the differences are, to make that really clear. As I mentioned, it's message streams rather than byte streams. Another difference is that Unix pipes just have a small in-memory buffer: the output of one process gets put in a little buffer before it gets shipped over to the other, but that's just a few kilobytes, typically. In Kafka, everything actually gets written out to disk, and so the amount of data that you can buffer is limited only by your disk space, which means you can buffer much, much more. The fact that it's written to disk is also useful for durability, so Kafka can survive server restarts, but that's only part of it; the other part is just being able to buffer a lot. If you have one consumer that goes slow, what happens in the Unix world? If the consumer of a pipe goes slow, it has to block the producer, because there's only a small fixed-size buffer: say you're piping a cat of a file into gzip, the gzip will probably be slower than the cat, so gzip simply consumes data at the rate that it can, and you've got this backpressure mechanism there. But backpressure in multi-subscriber streams really doesn't work very well, because one slow consumer would disrupt the flow of messages for everyone else. So there's no backpressure in Kafka; instead, every consumer of a stream just consumes messages at its own rate, and because Kafka can buffer a lot of messages, that's no problem.
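As a rough sketch of that multi-subscriber behavior, using the console tools that ship with the Kafka distribution (the topic name here is made up, and the exact flags differ between Kafka versions, so treat this as illustrative):

```sh
# Publish messages to a topic: the producer reads one message per line from stdin.
awk '{print $7}' access.log |
    kafka-console-producer.sh --bootstrap-server localhost:9092 --topic page-views

# Consumer A: read the whole topic from the beginning.
kafka-console-consumer.sh --bootstrap-server localhost:9092 \
    --topic page-views --from-beginning

# Consumer B: a second, completely independent consumer of the same topic.
# It can be started later (even days later, as long as the messages are still
# retained) without disturbing consumer A in any way.
kafka-console-consumer.sh --bootstrap-server localhost:9092 \
    --topic page-views --from-beginning
```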

If you buffer, say, a week's worth of messages, that means a consumer can go down for five days, you can bring it back up, and in the remaining two days it can still catch up; that's quite nice. And the final comparison point, as I said: multi-subscriber versus point-to-point.

So, apart from these differences, I think the analogy between Kafka and Unix pipes works quite well. One detail which is similar between the two, for example, is the ordering aspect. With a Unix pipe it seems kind of obvious, but I'll say it anyway: if you put the bytes in in a certain order, you get the bytes back out again in the same order. This is a very useful property for something like, say, database replication, or a write-ahead log, where you want to be sure that the writes in the log appear in the right order: if you set some value to one and then set it to two, you want to make sure that the two happens after the one. So ordering is an important factor, and Kafka has the same strong ordering semantics as a pipe does, which is different from, say, AMQP- or JMS-style message brokers, which will quite happily deliver messages out of order to consumers. There are good reasons for ordering or not ordering, but in the case where you're thinking about your system as these Unix-like pipes and jobs joined together, the ordering semantics are actually really important.

So this is basically the analogy I'm making: you've got a Unix tool which takes standard input and produces standard output, and you've got a stream processor, such as a job running in Samza. You'll hear more about Samza in the next two talks, so I won't talk about it too much, but basically it takes as input a Kafka topic, or maybe multiple topics, even joining them together, and produces as output maybe one or more other Kafka topics, or maybe it writes to a database, or whatever it wants to do. The general pattern here is really that you've got some stream of data coming in and some function operating on it to produce some derived version of that data. And this aspect is quite important: if you think about grep, for example, grep doesn't modify its input file. It just produces output which you can send somewhere, maybe redirect it to another file or wherever else, but it's not going to modify the input file. That's really important, because it allows you to run grep as many times as you like without destroying anything in the input. You get the same property with stream processors: you can run something experimental and send its output to a temporary location, and you're not going to mess anything up for anyone else. It's not like that with databases, typically. With a database, you've got an application which does some transaction involving a mixture of reading and writing, and if it writes, you've made a change; you don't have this kind of functional property in the system. So that's something I find very appealing in stream processing: these streams of data are much more immutable, and you're just thinking about deriving some data from some other data, which makes it scale much better to a large organization.

You can think of Kafka almost as a transport mechanism between various teams within an organization. There's one team which is, I don't know, handling web requests for a particular type of transaction from customers, and whenever something interesting happens they just publish it to Kafka; they don't care who consumes it, they just dump a message in there.
And then maybe there's this other team that picks it up from Kafka, and maybe joins it together with some other data, maybe a database, something like that, and maybe produces some derived streams which are interesting to someone else; but it doesn't care who's going to consume those streams either. It just publishes them out there, and anyone who wants them can pick them up. So it has the same loose-coupling property as we get on the Unix side, where the job that's producing something doesn't need to care who consumes it. That's different from making a service call, for example: if you want to receive a call from some upstream service every time something happens, you have to change that service, whereas here, subscribing to a stream is a totally passive thing.

So we've come back to what we started with: this idea from Unix that you've got different programs which each do one thing and do it really well. I think this is quite similar to the Samza jobs we'll see in the next two talks, where each job basically specializes in doing one particular part of the pipeline, and then something downstream takes the output from that and does the next step. You break a complex problem down into smaller parts, which makes it easier to understand, and you join them up through a uniform interface, which means you can start branching the pipeline in unanticipated ways. For example, if you want to add some monitoring or auditing somewhere on the side, you can do that without having to change any of the existing code; it'll just happily sit there on the side, consuming a stream that's being produced anyway.

And this really helps scale applications into large, complex systems where you've got various teams needing to collaborate in some way. So I think there are lots of good lessons to be learned from the 1970s here. Here's a list of things to read about this; the top one in particular is just a really nice description of the Unix philosophy. And a quick plug for my book: if you're interested in this sort of thing, you can find a draft of the first eight chapters. It's not finished yet, it's still a work in progress, but hopefully I'll finish it at some point.