How Parse Built a Mobile Backend as a Service on AWS (MBL307) | AWS re:Invent 2013

My name is Charity Majors and I'm the tech lead for infrastructure and operations at Parse. I've been at Parse for a while, basically since we first opened it up to developers in beta, and I'm here to tell you how we built a mobile backend as a service entirely on AWS. How many of you are familiar with Parse? Have used Parse? Nice. How many of you are mobile developers? A couple. How many of you are more infrastructure-y folks? Okay, cool. This is really a talk for you. If you're a mobile developer and you're curious about Parse, there's stuff in here you might find interesting about our stack, but this is really a story for the infrastructure nerds about how we built a robust and scalable platform on top of AWS.

So what is Parse? Parse is a platform for mobile developers. We support iOS, Android, and Windows devices, to the extent that there are Windows device developers out there. We have a REST API and native SDKs for a bunch of different environments and languages, and we have a lot of different products and features in our platform. A very partial list: client analytics, Cloud Code, file storage, web hosting, push notifications, and so on. That's half the pitch: we have great features that will make your life so easy. The other half of the pitch is that we really do everything for you on the back end. We handle the databases, we generate your schemas, we handle all of your indexing and all of your performance issues, your scaling, your data modeling, your storage, your user management, your third-party integrations. We do it all.

Basically, if you're an app developer, Parse is like magic. All of these really sticky scaling problems, the things you need to hire an ops team for, the things that go boom in the night: that's our bread and butter. If your app gets featured in the iTunes Store in the middle of the night and you wake up with literally a hundred times more traffic than when you went to bed, that's a great thing to have happen to you, but you can't really plan for it. We deal with this every day so app developers don't have to, and that frees mobile developers up to work on whatever it is that makes their app awesome and different: the user experience, the UI, the fancy features. I don't even really understand what it is that mobile developers do, but you get to do that well, and I get to do what I do, which is work on the generic back-end problems that I find interesting.

Parse is completely hosted on AWS; we have never touched bare metal. Our trajectory over time has been to integrate more closely with AWS's higher-level services, not less closely. And yes, we were acquired by Facebook in May, and no, we get asked this a lot, we don't have any current plans to move off of AWS. The other big Facebook acquisition, Instagram, is currently in the middle of trying to migrate onto Facebook's infrastructure, so maybe someday; there are arguments for and arguments against. But we love AWS, and it's really not on our radar at the moment.

One of the reasons AWS is so good for us is that Parse is growing insanely fast by pretty much any metric. We've been growing by about five hundred percent per year for the past two years. Here's our number of apps and developers using Parse: we're up to 180,000 now, and you can see how that curve has been increasing ever since we got bought. Here's the number of EC2 compute units over the past year, the number of HTTP 200 responses, and the number of established Android device connections at any given time, which is about 25 million devices phoning home. So, graphs, graphs, whatever; this is a sample. But in addition to just getting plain larger and running more traffic, Parse is getting rapidly more complex. Every few weeks we come out with some big new feature or backing capability. This is a copy of our adorable little infrastructure diagram from about a year ago, with every host drawn out, and here is the current diagram. Honestly, this is current as of about a month ago when I drew it up, and there are already five or six things that should be on it.

Given how quickly we're scaling, elastic provisioning is totally key for us. If you had asked me a year ago to guess how many servers we would need a year from now, which services we would have deployed, and which databases we would be running, I can't tell you what I would have guessed, but I can tell you I wouldn't have been right. And if you ask me the same thing now, about what the hell we're going to be doing a year from now, I can spitball some, but I don't know.

Like I said, ops is kind of the key pillar of our service. We're a platform; we have a hundred and eighty thousand apps relying on us to stay up, and we can't just take the site down to run a migration. Honestly, Parse is kind of the ops problem from hell, because of all the shared-resource and multi-tenancy issues, and because we're letting random people off the internet run queries on our databases and arbitrary code in our stack. It would be super easy to let things drift into a state where we're firefighting all the time. We have a pretty small team, only three systems engineers, and we're all full-stack generalists: we do everything from the DBA work to writing code to being on call. We like to take vacations, and we also like to go home at 6pm, so it's super important to us culturally to do things right. We feel like we're doing a good job if we're spending about twenty percent of our time reacting to problems and eighty percent of our time looking around and deciding what would be cool to build next, what would make our lives easier a month or two from now.

It has not always been this way. A year ago we were really suffering. We spent anywhere from fifty to a hundred percent of our time just reacting to incidents; we were working really long nights and getting woken up all the time; we were all pretty worn out, and we were like, god damn it, this really blows. So over the past year we've consciously iterated toward infrastructure that cares about services, not individual nodes, that can survive most minor incidents and some major ones, and that degrades gracefully instead of being all up or all down. To do this, we've shifted a lot of that manual work into AWS and let it do the work for us. Our infrastructure is at least five times as large and complicated as it was a year ago, but we get paged about a quarter as much, and many weeks we never get woken up at all.

I want to do two things in this talk. First, I'm going to walk you through the whole back-end architecture of Parse: the API path, the push infrastructure, the rest of our architecture diagram, so you can see what kinds of services we've built, what kinds of back ends we deal with, and what kinds of databases we use, and I'll talk some about the technical decisions involved in building and growing Parse. After that, I'm going to tell you the story of how we grew from an architecture that was very fragile and took a lot of moving parts and manual labor to a much more scalable and resilient service.

There are a couple of things I hope you will come away with at the end of this talk. Number one, you're going to think that auto scaling groups are amazing. In my opinion, you can either say, "I'm going to hire a couple of extra ops monkeys and they're going to do nothing but push buttons all day," or you can really commit to auto scaling groups. Also, I want you to think about why your automation should be reusable. It's not enough to say "we'll just automate everything"; it's not enough to just write scripts to do the same stuff you were doing before. If you have a really complex infrastructure and you're constantly adding new services, you need to be able to reuse the work you've already done to automate other services. And the last thing I want you to come away with is an appreciation for how important it is to choose your source of truth carefully. We have had three different sources of truth over the course of the last year: git, chef, and ZooKeeper. There is not necessarily a right or wrong answer here (the right answer is kind of ZooKeeper, but it depends on your size), but you should definitely understand that each of them comes with its own set of compromises and trade-offs.

So let's talk about the Parse stack. Again, this is our only mildly out-of-date architecture diagram. Before we really dig into it, I want to run through some top-level architecture goals and design choices that we made from the beginning. Number one: we use chef, and we love chef.

As a team, I would say we are not generally what you would call dogmatic or religious about tools, because they all kind of suck in their own special way, but we really value the chef community ethos. There's a really strong focus on collaboratively building shared, reusable cookbooks that work across lots of environments. It's great because it means the community tools are robust, lots of people have run them, and you have a whole tribe of open source folks working on improving them at all times. I really think the chef community is its single biggest asset.

We also believe in Route 53 and using real host names. I personally don't ever want to manage DNS again, not because it's hard but because it's boring. How many DNS systems have we all managed over the course of our careers? It's boring, it's a solved problem, I don't want to do it anymore; just give me an API and let me be done with it. I also don't want to get paged or emailed about instance i-05-whatever. I want to know what a host does; I want to know what I'm being alerted about, if I'm being alerted about something. So I want human-readable host names.

Obviously, and this is a no-brainer, we believe in distributing across availability zones and doing automatic failover for every back end possible. We are currently only in one region, and we have no plans to do cross-region replication, for reasons I can go into later if you want, but we do spread across availability zones. We also believe in having a single source of truth about our systems at all times.

A couple of EC2 design choices: we've found it's best to standardize on a few instance types. We use primarily m1.large, m1.xlarge, and m2.4xlarge. Since we use chef, we can't actually use any instance type that doesn't have at least two cores, because chef will peg one core every time it runs. Limiting the number of instance types also makes it easier for us to deal with reserved instances, although I am super excited about that new feature that lets you convert reserved instances within the same family. You guys heard about that? So sweet. And we prefer, conceptually, to use lots of smaller disposable instances (if you can call an m1.large small, and I guess you kind of can these days) instead of a smaller number of large instances.

We are also devoted users of security groups. We are still on EC2 Classic. Nothing from the internet talks directly to our instances; traffic is only allowed to ingress through ELBs. Each host role gets its own security group with minimal ports opened, and we also have an automated check that periodically verifies that the running security group config actually matches what's checked in to git.

So, starting at the top of our API path: we've got an ELB that handles all inbound traffic and hands off to nginx, which serves some static files and routes to the Ruby on Rails servers. The Ruby app servers run Unicorn, which is a pure-Ruby HTTP server. I'm not going to rag on Ruby; it has treated us very well through our baby startup years. But we are really at a point where we need real threads and real asynchronous behavior, because if a single back end gets slow, every Unicorn worker is going to fill up with requests to that back end in no time flat, and that's really our single biggest threat to reliability at this point. So we're rewriting the API server from the ground up in Go, starting with the drivers that talk to the databases. Currently it exists in kind of a limbo state where some requests get proxied from the Ruby API servers to the Go API servers and on to the back end.
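To make that limbo state a little more concrete, here is a minimal sketch (not Parse's actual code) of what routing a handful of endpoints to a new Go service, while everything else stays on the old app servers, can look like using only the Go standard library. The internal hostnames and the paths are invented for illustration.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// mustProxy builds a reverse proxy to one backend, bailing out if the URL is bad.
func mustProxy(rawurl string) *httputil.ReverseProxy {
	u, err := url.Parse(rawurl)
	if err != nil {
		log.Fatal(err)
	}
	return httputil.NewSingleHostReverseProxy(u)
}

func main() {
	// Hypothetical internal addresses; the real routing involves more
	// services and real service discovery.
	goAPI := mustProxy("http://go-api.internal:8080")
	rubyAPI := mustProxy("http://ruby-api.internal:3000")

	mux := http.NewServeMux()
	// Endpoints that have already been ported go to the new Go service...
	mux.Handle("/1/push", goAPI)
	mux.Handle("/1/events/", goAPI)
	// ...and everything else falls through to the existing Ruby/Unicorn app.
	mux.Handle("/", rubyAPI)

	log.Fatal(http.ListenAndServe(":8000", mux))
}
```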
We also pipe some logging events into a Go service that proxies logs over into Facebook's Scribe ecosystem, for log aggregation, query-time analysis, real-time data-loss prevention, performance analysis, that sort of thing. We have another service written in Go that provides the Parse hosting product; it's effectively a fancy wrapper around Cloud Code and S3, which together let you serve a fully featured website off of Parse. We also have a couple of Elastic IPs for this. I don't know if you know this or not, but most registrars won't actually let you CNAME the apex domain, so we set up a couple of Elastic IPs, and people can give their apex domain an A record that points at those Elastic IPs, and those hosts will just redirect the request to the Parse-hosted servers. It's kind of a cute hack.
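As a rough illustration of that cute hack (this is a guess at the mechanism, not Parse's code), a host sitting behind one of those Elastic IPs only needs to look at the Host header and issue a redirect to the corresponding Parse-hosted site. The lookup table and domain names below are invented placeholders.

```go
package main

import (
	"log"
	"net/http"
	"strings"
)

// Hypothetical mapping from a customer's apex domain to their Parse-hosted
// site; in reality this would be a real lookup, not a hard-coded map.
var hostedSite = map[string]string{
	"example.com": "https://example.parseapp.com",
}

func redirect(w http.ResponseWriter, r *http.Request) {
	// Strip any port from the Host header before looking it up.
	host := strings.ToLower(strings.Split(r.Host, ":")[0])
	target, ok := hostedSite[host]
	if !ok {
		http.NotFound(w, r)
		return
	}
	// Preserve the path and query so deep links keep working after the redirect.
	http.Redirect(w, r, target+r.URL.RequestURI(), http.StatusMovedPermanently)
}

func main() {
	log.Fatal(http.ListenAndServe(":80", http.HandlerFunc(redirect)))
}
```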

Cloud Code is server-side JavaScript running in a V8 virtual machine, and it's a pretty cool feature. This is the kind of thing that is really hard to do right, because you're letting people run random stuff on your servers, so it's inherently kind of dangerous. With Cloud Code you can upload snippets of JavaScript and then call those snippets with an API request, including triggers that execute snippets before save or after save, that sort of thing. We have lots of third-party modules here, for Stripe, Twilio, Mailgun, and so on, and we just added long-running jobs: normal Cloud Code snippets have to run in under 15 seconds, while long-running jobs can run continuously, so that's a parallel bank of stuff.

And then there's our whole push ecosystem. We handle push notifications for iOS, Android, and Windows. We send literally billions of pushes per month; an individual app will sometimes send 10 million pushes in a day. We run around 700 pushes per second steady state, and it spikes about 15x, to around 10,000 per second. Our push infrastructure is Resque on Redis (Resque is just a Ruby queuing library), and we're planning on substantially rewriting this whole stack in Go. The way push basically works is that the expansion tier takes push jobs from the API server, explodes them out into iOS, Android, and Windows device jobs, and places those on queues. There's a certain amount of added complexity there for different QoS requirements, fast queues, slow queues, isolated jail queues, and whatnot, but that's the meat and potatoes of it.

We also have this interesting beast called PPNS, the Parse Push Notification Service. It's like Apple's APNS, but for Android devices. We'll be augmenting this with GCM soon, but GCM has not been around for long and not all devices use it, so we still have to do this. What the service does is literally hold open a socket to every Android device that is phoning home for push notifications. We currently have about 25 million simultaneous Android device connections and 200 million Android outboxes. We have to run this on a lot of servers; unfortunately, the Ruby EventMachine library can only handle about 250,000 connections on an m1.large. This is not a kernel limitation or anything; it's just the Ruby library's limitation. We benchmarked this exact same service written in Go, and we could hold open up to 1.5 million connections on a single m1.large with minor kernel tuning, so we're pretty excited about that.
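For a feel of the connection-holding pattern, here is a toy sketch, not the real PPNS: accept a TCP connection per device, park it in a registry keyed by device ID, and write to that socket when a push arrives. The device-ID handshake is invented, and real kernel tuning (file descriptor limits, TCP settings) and the actual Parse protocol are omitted.

```go
package main

import (
	"bufio"
	"log"
	"net"
	"strings"
	"sync"
)

// registry maps a device ID to its open, long-lived connection.
var (
	mu      sync.Mutex
	devices = map[string]net.Conn{}
)

// Push writes a payload down the held-open socket for one device, if present.
func Push(deviceID, payload string) {
	mu.Lock()
	conn := devices[deviceID]
	mu.Unlock()
	if conn != nil {
		conn.Write([]byte(payload + "\n"))
	}
}

func handle(conn net.Conn) {
	r := bufio.NewReader(conn)
	// Invented handshake: the device sends its ID on the first line.
	id, err := r.ReadString('\n')
	if err != nil {
		conn.Close()
		return
	}
	id = strings.TrimSpace(id)

	mu.Lock()
	devices[id] = conn
	mu.Unlock()

	// Block until the device goes away, then deregister it.
	for {
		if _, err := r.ReadByte(); err != nil {
			break
		}
	}
	mu.Lock()
	delete(devices, id)
	mu.Unlock()
	conn.Close()
}

func main() {
	ln, err := net.Listen("tcp", ":8253")
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			continue
		}
		go handle(conn) // one goroutine per connected device
	}
}
```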
We also use PowerDNS for PPNS, and this is the only place where we serve our own DNS. We used to have Route 53 just round-robin all of these IPs, but we literally outgrew the UDP packet size once we got over about 25 hosts. When the UDP reply is truncated, clients are supposed to retry over TCP, but not all clients do this correctly, including a lot in Russia and Thailand; for some reason, all of our customers in Russia and Thailand were suddenly unable to resolve our push hostname. So instead we delegate just that single A record to our PowerDNS server, and it returns a randomized subset of the addresses. So that's the highlight reel of our services.

Now let's talk about some back ends. We have this running joke at Parse that we run all the databases. We actually only run MongoDB, MySQL, Cassandra, Redis, and Hive, so, no big deal. Usually when you talk to people about best practices, they say you should really choose one database and stick to it, and we would love to, but we do a lot of different things with data and there's just no one-size-fits-all solution for everything we want to do. And honestly, you all know this: dealing with databases on AWS is still kind of hard. The service-oriented-architecture idea that every node should be disposable is great for your services, that's absolutely true, but the whole cloud philosophy of elasticity and disposability is a lot harder to follow when you actually care about your data. What this means for us is that instrumenting your databases, your backups, your restores, your provisioning, everything, is more important than ever.

So let's talk about MongoDB, our workhorse. We don't have the largest deployment in the world, but we definitely have one of the most complicated. We have about 15 replica sets, three to five nodes per replica set, and 2 to 4 terabytes per replica set. Most of these replica sets are storing structured application data for our mobile apps; we also use Mongo for some other stuff, but that's kind of irrelevant because those use cases are not as interesting. Because we have 180,000 apps, we have over 180,000 schemas in user data collections alone, which works out to over a million collections. If you talk to the 10gen engineers, that is really not the way Mongo was designed to be used, but then again, there really isn't a database that was designed to do this. So we've done some pretty cool stuff with Mongo to let us manage all this data. We do some intelligent auto-indexing of keys based on the entropy of their values, and we compute and generate compound indexes by analyzing real API query traffic.
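The talk doesn't spell out the auto-indexing heuristic, so this is only a guess at the general idea: sample the values seen for each key, compute the Shannon entropy of the sample, and prefer indexing high-entropy (selective) keys over low-entropy ones. The sampling and the threshold here are invented.

```go
package main

import (
	"fmt"
	"math"
)

// entropy returns the Shannon entropy (in bits) of the sampled values for one
// key. High entropy means many distinct values, which makes the key a good
// index candidate; a low-entropy field like a boolean benefits far less.
func entropy(samples []string) float64 {
	counts := map[string]int{}
	for _, v := range samples {
		counts[v]++
	}
	var h float64
	n := float64(len(samples))
	for _, c := range counts {
		p := float64(c) / n
		h -= p * math.Log2(p)
	}
	return h
}

func main() {
	// Toy samples for two keys in one app's collection.
	keys := map[string][]string{
		"userId":   {"u1", "u2", "u3", "u4", "u5", "u6", "u7", "u8"},
		"isBanned": {"false", "false", "false", "true", "false", "false", "false", "false"},
	}

	const threshold = 2.0 // invented cutoff, in bits
	for name, vals := range keys {
		h := entropy(vals)
		fmt.Printf("%-9s entropy=%.2f index=%v\n", name, h, h >= threshold)
	}
}
```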

We implemented our own application-level sharding, because the built-in sharding would not work for us, for additionally complicated reasons I can discuss later if you're interested. We also use arbiters to provide stability. Like I said, we use Mongo in a lot of edge-case ways, so we have run into scenarios where all the secondaries die at once. After that happened once or twice, we started using arbiters to manage our votes so that it doesn't happen anymore.

In terms of resources, we use primarily m2.4xlarges with striped provisioned-IOPS volumes. The main scaling constraint is that the working set must fit into memory, so we just go for big-memory instances, and we're looking forward to trying out some of the new instance types, which is pretty exciting. Our lives were pretty miserable back when we were running on classic EBS, but provisioned IOPS has been rock solid for us. We love it; PIOPS really made databases feasible on AWS, in my opinion. I don't know if you saw Ilya's keynote this morning, where he showed that awesome graph of what our latency looked like on classic EBS, with all the spikes; it was terrible. And then the same graph after we moved to PIOPS: average latency dropped in half and the spikes all smoothed out. It's been amazing.

And again, going back to the subject of instrumentation, we've done a lot of work on the AWS chef cookbooks. We wrote some stuff for the AWS cookbooks that lets you provision and assemble EBS RAID volumes, either from scratch or from snapshot, and we implemented provisioned IOPS and EBS-optimized support. On the Mongo side we have fully automated snapshotting and provisioning, so we can basically bring up a new node from snapshot in five minutes, and that has saved us I don't even know how many hundreds of hours over the past two years.
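For a sense of the EC2 calls underneath that kind of provisioning, here is a minimal sketch (using the modern aws-sdk-go rather than the chef cookbooks Parse actually used) of creating one provisioned-IOPS volume from a snapshot and attaching it to an instance. The IDs, sizes, and device name are placeholders, and a real RAID setup would repeat this for each stripe member and then assemble the array on the host.

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	svc := ec2.New(session.Must(session.NewSession()))

	// Create a provisioned-IOPS (io1) volume from an existing Mongo snapshot.
	vol, err := svc.CreateVolume(&ec2.CreateVolumeInput{
		AvailabilityZone: aws.String("us-east-1a"),   // placeholder AZ
		SnapshotId:       aws.String("snap-0123456"), // placeholder snapshot
		VolumeType:       aws.String("io1"),
		Iops:             aws.Int64(2000),
		Size:             aws.Int64(500), // GiB
	})
	if err != nil {
		log.Fatal(err)
	}

	// Wait until the volume is available, then attach it to the new node.
	if err := svc.WaitUntilVolumeAvailable(&ec2.DescribeVolumesInput{
		VolumeIds: []*string{vol.VolumeId},
	}); err != nil {
		log.Fatal(err)
	}
	if _, err := svc.AttachVolume(&ec2.AttachVolumeInput{
		VolumeId:   vol.VolumeId,
		InstanceId: aws.String("i-0abcdef"), // placeholder instance
		Device:     aws.String("/dev/sdf"),
	}); err != nil {
		log.Fatal(err)
	}
	log.Printf("attached %s", *vol.VolumeId)
}
```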
We use memcached. This is not really a database, but I thought I'd throw it in here because it's an example of a design choice I would not make again; I would definitely use ElastiCache next time. We also use Redis, like I said, in the push infrastructure, mostly just as a back end for queuing with Resque. Redis is somewhat limited in its applications, but it has really been amazing for us. It does one thing and it does it so well; only two or three times a year do we even have to think about our Redis setup, usually when we run up against the single-threaded CPU constraint and have to figure out how to logically separate the workload. We are not using ElastiCache for Redis, but I am very intrigued by it. Redis and MySQL are obviously our two back ends that don't have a solid automatic failover scheme, so we're still working on them. Another really interesting option here is twemproxy, also known as nutcracker. Do you guys use that? It's really cool: it sits in front of a whole bunch of Redis instances and does the proxying for you, and it helps with gracefully failing things over. Unfortunately we can't use it, because we depend on resque-scheduler, which uses the MULTI command, but maybe you can.

And then there's MySQL. We do use MySQL, and we would love to get rid of it, but we can't get rid of it until we get rid of Rails, because ActiveRecord just freaks out if MySQL goes away for two seconds. We've waffled back and forth on whether to go to RDS or not. The fact that there's no chained replication is honestly kind of a killer, and it's also kind of a black box: if you can't get a MySQL shell, it's hard to see what's going on, and our whole team has been running MySQL for god knows how many years, so it's challenging to give up that control. The current plan is basically that we get rid of Rails, we get completely onto Go, and then we move MySQL to RDS so we get automatic failover, because once we don't have ActiveRecord our stack is going to be a lot more tolerant of service blips. Not that those ever happen.

Lastly, we also run a 12-node Cassandra ring, and this powers the Parse developer analytics product. Cassandra is a columnar store, super fast and efficient for writes and increments, pretty slow for reads, so it's the exact polar opposite of Mongo, but that is exactly what you want for an analytics product. We use ephemeral storage here, which is great, it's free. For instrumenting Cassandra we took a slightly different tack: instead of going the chef route, we use Netflix's Priam, which runs Cassandra inside of an auto scaling group and manages some of the more painful things about Cassandra, like initial token assignment, backups, and incremental backups to S3.

Priam is not super easy to set up and not super well documented, but it's totally worth it. If you have an underperforming node, you look at your graphs, you see that read latency is really, really slow on this one node, and you just kill it; Priam restores all the data and adds Cassandra back to the ring. We actually did a full rolling upgrade of all 12 nodes from m1.xlarge to m2.4xlarge in just a day; we just killed one node after the other. It was awesome.

So that's the Parse architecture. Now you know what our services look like and all the various back ends, which brings me to the third section of my talk. This is my favorite part: the story of how we went from an infrastructure that was managed in kind of the old-fashioned, traditional way, slowly and painfully, with hand-managed lists of servers, to an infrastructure where nodes are pretty much disposable, self-registering, and self-removing when they're sick. I'm going to illustrate this process by showing, step by step, how we dropped the length of time it takes to scale up any individual service from about two and a half hours to about five minutes.

When we started off as a little baby startup, Parse was not building primarily for scalability or reusability or the best possible operations practices, because that would be dumb, right? You're a startup, most startups fail, and one of the worst things you can do is waste your time over-engineering every little thing. Over-engineering at this stage is a waste of your time, it's not going to get you where you want to go, and you don't actually know at this point whether you're ever going to have any scaling problems. So our first-generation infrastructure was a lot of Ruby and Rails: we used chef to build base AMIs, Capistrano to deploy code (I'm not going to say Capistrano is the greatest or cleanest deployment tool in the world, but it fits really well with the Ruby world), and for registering services and lists of nodes we just used YAML files checked in to git. This is a world where we had maybe 20 to 50 total instances, and maybe once a month we needed to add or remove something, and it was a pain in the ass.

These are all the steps involved in bringing up 20 new hosts for any one service under a primitive infrastructure like this. You run 20 knife ec2 commands to bootstrap the new hosts; you can parallelize this with some stupid shell tricks, but it still takes a while, and you have to manually figure out and assign the roles and the host names and distribute the instances across availability zones. You add all the new host names to the Capistrano deploy file, using the crazy EC2 domain names so you can deploy from outside, because we were still using /etc/hosts at this point. Then you add all the new hosts to the YAML files, push again, and wait for the nodes to come up. You do a cold deploy, which involves pulling down the entire GitHub repo on all your nodes, which takes an hour or two depending on how big your repo is, and GitHub will also throttle you if you have a lot of nodes. You run chef-client to pick up the changes to /etc/hosts, and then finally you construct a mental list of all the services that need to talk to these new nodes and you do a full deploy and restart. It's exhausting just listing all the things you have to do, and this process takes an hour and a half to two and a half hours, depending on the size of your repository, and that sucks.
Especially since, at least half the time, if you're bringing up 20 new servers, it's because you need them right now; you're probably already out of capacity. And besides the super glaring, obvious issue of how much time this takes, there are some other, more subtle landmines here. Someone has to babysit this process. It would be bad enough if you were just kicking off a single command, walking away to do other stuff, and coming back when it finished, but here you're interacting with several different systems that can't easily communicate with each other, so you have to sit in the middle and wait for thing one to finish so you can start thing two, and so on. This is a terrible use of an engineer's time. Another problem is that you're maintaining lists of machines by hand, and the same lists in multiple places. This lends itself to typos, and it lends itself to lists getting out of sync with each other. You're using the long, non-human-readable EC2 host names, so you can't even tell at a glance what you've added, what you've removed, or what you've deployed to. It's also a big mistake to require a full code deploy just to change the list of hosts you're trying to talk to. Deployment best practice is to make as few changes as possible at once, so that you can easily isolate any problem.

If you're adding a few hosts, you want to just add the hosts. If you're deploying a feature, just deploy the feature. And if you're doing a MySQL migration, for god's sake don't do anything but that migration. This terrible process also requires humans to remember things and make decisions, like which set of services needs to know about these new hosts, and that in general is a terrible mistake, because humans are really bad at this. So this old, original system breaks all of these best practices, which is fine when you're young, you have a small set of hosts, and you're not doing this very often. But if you're doing well and you're a little bit lucky, you grow out of that stage pretty fast, and that is basically where we found ourselves at Parse about a year ago, a little more.

The second generation of our infrastructure consisted primarily of moving our source of truth about the universe out of GitHub and into chef. Instead of maintaining lists of hosts by hand in YAML files, we generated those YAML files from chef using roles, so they got updated automatically every time chef ran. We did the same thing for haproxy configs, anything with lists of hosts: chef generates the list, and it excludes any hosts that we have specifically excluded, so we do have a way of removing nodes from service. We also started registering hosts with Route 53 from chef, so every time chef runs, it registers the host name in a special internal domain.
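For flavor, here is a minimal sketch of that kind of DNS self-registration (using the modern aws-sdk-go rather than the chef code Parse actually used): upsert an A record for the host in an internal zone, with the instance's private IP pulled from EC2 instance metadata. The zone ID and naming scheme are placeholders.

```go
package main

import (
	"io/ioutil"
	"log"
	"net/http"
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/route53"
)

// metadata fetches a value from the EC2 instance metadata service.
func metadata(path string) string {
	resp, err := http.Get("http://169.254.169.254/latest/meta-data/" + path)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	b, _ := ioutil.ReadAll(resp.Body)
	return strings.TrimSpace(string(b))
}

func main() {
	ip := metadata("local-ipv4")
	name := "api42.parse-internal.example." // placeholder host name and zone

	svc := route53.New(session.Must(session.NewSession()))
	_, err := svc.ChangeResourceRecordSets(&route53.ChangeResourceRecordSetsInput{
		HostedZoneId: aws.String("Z0PLACEHOLDER"), // placeholder hosted zone
		ChangeBatch: &route53.ChangeBatch{
			Changes: []*route53.Change{{
				Action: aws.String("UPSERT"), // create or update the record
				ResourceRecordSet: &route53.ResourceRecordSet{
					Name:            aws.String(name),
					Type:            aws.String("A"),
					TTL:             aws.Int64(60),
					ResourceRecords: []*route53.ResourceRecord{{Value: aws.String(ip)}},
				},
			}},
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("registered %s -> %s", name, ip)
}
```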
This infrastructure was a big step in the right direction. We didn't have to do full deploys to add services, we were generating the YAML files, and we only had one set of files to maintain by hand, the Capistrano files. To bootstrap new hosts, you run the knife ec2 commands, you add the host names to the cap deploy file, you generate the YAML files, and then you just restart the new services. This is better; we reduced the complexity and the list of steps, but it still took a while. We were growing pretty fast and we really hate doing things by hand, so this is the part where we sat down and asked, all right, what are our goals here? This is what we came up with.

First of all, we needed to be able to scale up any single class of services in five minutes, ten minutes tops. That's our criterion. Now, for some people ten minutes is going to be too long; they have super bursty traffic and they can't wait that long, and it is possible to optimize this process down to two or three minutes, but that's not really where our priority lies at this point. With 180,000 apps, my traffic is not that spiky, and we maintain a certain amount of overprovisioning overhead to absorb unexpected things. We don't want it to take 30 minutes to scale up a service, because that's annoying and it interferes with our workflow, but the five-to-ten-minute range is totally fine. Next, scaling up should be one command. We really wanted to take human judgment out of the loop; it's so easy, even for good engineers, to mess up a series of commands and judgment calls, and it's pretty hard to mess up one command. Also, we never, ever, ever want to have to maintain another list of hosts again: we want to automatically detect when new nodes have been added, and automatically remove dead nodes from service. And we needed to be able to deploy fast, including deploying from master. This is kind of a key constraint for us. We're still pretty startup-y; sometimes we need to get a fix out fast, and we don't always have time to build a new AMI or even to run tests. There's obviously a big trade-off here: if you build an AMI for every deploy, you can scale up incredibly fast and get your nodes into service in just a minute or two, but the process of building that AMI and preparing the deploy can take a couple of hours. We decided this is a trade-off we are totally willing to make. We would rather be able to deploy fast and deploy from master, so we'll eat the cost of doing a few extra minutes of configuration every time we bring up a bank of new nodes. And finally, we knew we were going to be moving from the Ruby world to the Go world, so we decided to design a new deploy process from scratch, something that really makes sense for Go. Go produces statically linked binaries that are really small and ridiculously easy to distribute, so it makes no sense to pull down the entire git repo onto every machine, and building an AMI every time seemed like overkill when there's just one binary to deploy.

So those were our goals. We knew we wanted to use auto scaling groups; I came to re:Invent last year and attended a couple of the Netflix talks where they were talking about their auto scaling groups, and I was just like, oh my god, I want that so bad. So we decided to use a generic AMI, and all the role-specific stuff still gets laid down by chef when the instance bootstraps.

Each ASG is named after a chef role, so an instance can infer how to bootstrap itself, and what to put in its chef client.rb, from the role name. We already use Jenkins for continuous integration and we love it, so we decided to just have Jenkins generate a tarball artifact after every successful build and upload it to S3; it gets tagged with the build number and dumped in a bucket named after its service. Then we wrote two pretty awesome utilities; my coworker Mike wrote them, and we are, I think, working on open-sourcing them. The first one is called auto-bootstrap. It's a script that runs out of init on first boot. It guesses the chef role from the ASG name, it generates the client.rb and an initial run list, it sets the host name by looking up the base name in ZooKeeper and appending the next available integer, and then it registers itself with Route 53. It grabs a lock from ZooKeeper while it does this, so that the DNS registration is atomic and it won't collide with other hosts trying to register at the same time. Once auto-bootstrap has run, chef configures the box, and that's where the extra four to six minutes goes: it's that initial chef run that does all of the system configuration and installs a bunch of packages and so on. When auto-bootstrap is done running chef, it runs the auto-deploy script, and auto-deploy just pulls down the latest build artifact from S3, unpacks it, copies the config files into place, and starts the service.

So we had now basically taken care of how to scale up services really fast and how to deploy code really fast. What about service registration and discovery? This is where ZooKeeper comes in. ZooKeeper is a distributed coordination system; a bunch of you probably know what it is. It's not always super trivial to set up, it can be a little confusing and opaque, and the quality of the client libraries varies wildly, but it is pretty much the only way, well, one of the only ways, to get a consistent source of truth that reflects the state of your system in real time at any given point. One of the cool things you can do is have clients come up and register an ephemeral node with ZooKeeper; then you can just query the list of those ephemeral nodes and it gives you the list of all the active clients. It also does things like distributed locking, allocation of unique IDs, and so forth. We really love ZooKeeper. Wherever possible, we bake it into the service itself: the service starts up and registers itself with ZooKeeper, and if the client dies, the ephemeral node goes away and it deregisters automatically.
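The general pattern of baking registration into a service looks roughly like the sketch below. This uses the github.com/samuel/go-zookeeper client and is not Parse's actual code; the ensemble address and paths are made up, and a real service would also handle session expiration and re-registration.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"time"

	"github.com/samuel/go-zookeeper/zk"
)

func main() {
	// Placeholder ensemble address; the parent path /services/api is assumed
	// to already exist.
	conn, _, err := zk.Connect([]string{"zk1.internal:2181"}, 10*time.Second)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	host, _ := os.Hostname()
	path := "/services/api/" + host

	// An ephemeral znode lives only as long as this session does, so if the
	// process dies, the node disappears and the host deregisters itself.
	if _, err := conn.Create(path, []byte("up"), zk.FlagEphemeral, zk.WorldACL(zk.PermAll)); err != nil {
		log.Fatal(err)
	}

	// Anything that needs the current member list just reads the children.
	children, _, err := conn.Children("/services/api")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("active api hosts:", children)

	select {} // the real service's work would go here
}
```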
In the places where we can't bake it into the client, we've written scripts to perform these functions. We have a ZooKeeper node-registry service that detects whether the local service is up and maintains the ephemeral node in ZooKeeper, and we also have scripts that generate YAML files from the ZooKeeper node information, plus a watcher script that will kick the service if a YAML file has changed. We also use ZooKeeper for lock coordination, so that, for example, only N indexing jobs or badge increments can run on a given shard at any given time.

I should also point out that we migrated from the YAML files to ZooKeeper very carefully, very slowly and incrementally. The first thing we did was, instead of generating the YAML files from chef alone, we generated them from both chef and ZooKeeper and performed a bunch of sanity checks, and after a while, once we were confident that the ZooKeeper information was reliable, we switched to using the ones generated by ZooKeeper. We also wrote some code to integrate ZooKeeper and the ASGs with Capistrano, so instead of having to maintain that list of host names in the deploy file, we just set an environment variable that tells it to query either ZooKeeper or the auto scaling group for the list of nodes. That was our last hand-maintained thing.

All right, so to sum up, the third and current generation of our infrastructure has these characteristics: we have some Go services and some Ruby services; we still use chef to build the AMI and maintain state; we have one ASG per role; we use Capistrano plus ZooKeeper plus Jenkins plus S3 to build and deploy code; and our single source of truth for the state of the world is ZooKeeper. We have no lists of hosts to maintain by hand, and instances get added to and removed from service automatically. So let's go back to our example. To bring up 20 new nodes, all we do is adjust the size of the auto scaling group, and then we sit back and have a drink while the machines do all the work. Five to ten minutes later, our instances are in service and running the latest good build. Yay.
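The "one command" in question is just an auto scaling API call. Here is a minimal sketch of it with the modern aws-sdk-go (the talk predates this SDK, and the group name and numbers are placeholders); the equivalent CLI or console change works just as well.

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	svc := autoscaling.New(session.Must(session.NewSession()))

	// Bump the desired capacity of the API server group and let the ASG,
	// the generic AMI, and auto-bootstrap handle the rest.
	_, err := svc.SetDesiredCapacity(&autoscaling.SetDesiredCapacityInput{
		AutoScalingGroupName: aws.String("api-server"), // placeholder ASG name
		DesiredCapacity:      aws.Int64(60),            // e.g. from 40 up to 60
		HonorCooldown:        aws.Bool(false),
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Println("scale-up requested; new nodes will bootstrap themselves")
}
```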

So are we exactly where we want to be now? We'd still like to set up some CloudWatch triggers on things like available app server workers, so that we don't even have to type the single ASG command to scale up. That will be lazy-ops nirvana; it's going to be amazing. We haven't done it yet because, honestly, ASGs have not been particularly useful for us for dealing with bursts. Our bursts tend to come and go in two to three minutes, which is not enough time for an ASG to respond and fulfill those requests, so we mainly use ASGs to respond to lengthier trends, and we rely on a certain amount of slack capacity and application burst limits to protect the API from short-term bursts. We also don't really have any significant periodicity to our traffic, either within a day or week to week, so it hasn't been a high priority. We also need to do some more tooling around downsizing the ASGs; it's a single command to scale up, but it's half a dozen commands to scale down, so we don't do that very often. It's also possible we may decide to optimize our initial chef run. We really haven't tried or cared yet; we could build AMIs for each role that already have the packages installed and all that, but it just hasn't been a priority. We've clearly built a system that addresses our current needs and constraints really well, but in the future we may have slightly different needs and constraints.

Other remaining issues: we've stuck with Capistrano for as long as we have, but once we're mostly on Go, we will probably trigger deploys by just updating a value in ZooKeeper; then we'll have a canary deploy off of that, and if it succeeds, all the other hosts will auto-deploy. We're not super big fans of Capistrano, and I'm not going to say any more about that. We also badly need to get rid of the MySQL and Redis single points of failure; we don't have automatic failover for either back end, and it hasn't bitten us yet, but it will eventually, so hopefully we'll have a solution in place by then. And our next big project is migrating from EC2 Classic into VPC. ASGs are really going to help with this, and then we'll be able to use internal ELBs, which the ASGs can auto-register with, so we can get haproxy out of the mix. It's going to be amazing.

So, final thoughts. You should know what your source of truth is, and make sure you only have one: you either have a single source of truth or you have multiple sources of lies. The more real-time your source of truth is, the faster and more automatic your response can be; you can see that as our source of truth moved from git to chef to ZooKeeper, everything just got easier and more responsive. And like I said, auto scaling groups are amazing. It's a bit of work up front, but it's super worth it. You reach a point, around 100 servers maybe, where you really don't want to have to think about nodes, and you shouldn't have to think about nodes; you should only have to think about healthy services. If you're doing it right, it should be just as easy to manage 5,000 nodes as 15 nodes. Well, service nodes; not necessarily database nodes. But we actually spend far less time managing a thousand nodes than we did managing a hundred. So: thank you, auto scaling groups, chef, and ZooKeeper. And that's about it. I think we have about 15 minutes left if there are any questions.