Unlocking the Power of Google BigQuery (Cloud Next '19)



[MUSIC PLAYING] ELLIOTT BROSSARD: So first, a bit about who your presenters are. My name is Elliott. I'm an engineer on the BigQuery team, and I'm also a tech lead and manager. My specialty is SQL. I'm not the star of this show, though, honestly. I'm joined here by Subra, from The Home Depot, and if you're going to listen to any of it, listen to what he's saying, because he's going to talk about data management at Home Depot, and you'll get a lot of real-life experience about what Home Depot actually does. So thank you all for coming, and I hope it's a good session.

The first thing I want to talk about is this topic of big metadata. Everybody throws around the term big data, which is cool in itself; BigQuery excels at processing huge volumes of data. But what I want to talk about here is the data around that data, or as I like to call it, big metadata.

For a bit of an overview, BigQuery's architecture looks something like this. There are a lot of pieces I omitted, but you can generally think of the system as being roughly this shape. You can get data in through streaming ingestion and bulk loading, and we also have a data transfer service, which I didn't even picture here, for getting data in. We have interfaces like the UI, client libraries, et cetera. Some of the cool pieces behind the scenes are things like our highly available compute cluster, which is where we actually run the queries that you give to us, and our storage system. What ties this all together is the really fast network that Google built. It's hard to imagine building this system anywhere else, just because we have such great infrastructure that makes this possible.

What I want to focus on today is the data that we actually manage behind the scenes for you and what goes into that. I want to start by talking about what makes a table. A table has a variety of attributes. There are things like
the table ID itself; this is the name of it, so to speak. We have the schema of the table, and various attributes like whether it's partitioned, what the partitioning column is, and whether it's clustered, that sort of thing. We have the amount of data being stored, so the number of bytes, the number of rows, that kind of thing. We have the description of the table, which you can edit yourself, for example. And then we have the actual data itself.

That's great, but what really makes a table? What is the actual representation of tables behind the scenes in BigQuery? We use a format that we call Capacitor. There's a great blog post about this, actually. I don't have enough time to go into the details, but Capacitor is a format that lets us really efficiently represent nested and repeated data, and this is what we use in BigQuery. We developed this format ourselves. It's sort of the secret sauce of BigQuery that makes queries really fast. So we store your data in this format. And when we store these Capacitor files, all of them are encrypted at rest. They're durable, meaning that if one copy of a file on a disk ends up getting deleted, or there's a problem with the hardware, that kind of thing, we're not going to lose it. It's also highly available, so if part of a data center is not operational, we're still able to serve you that data, and there's essentially no chance of it being lost.

We have a concept that builds on these files called storage sets. Storage sets are a grouping of these Capacitor files, and a storage set is a combination of a few things. It's a path to the files in our distributed storage. It has a partition key, for partitioned tables. It has a timestamp of when we actually committed the data, when we wrote it to our metadata. It has a UUID, some unique identifier for that storage set. And then it has a variety of stats, including things like column sizes, for example. So when
you run a query against a table, we fetch these column sizes from the storage sets to figure out how much data will actually be scanned as part of that query. Then it has a state: it can be in a pending state, it can be committed, or it can be in a garbage state, meaning that it was deleted or is no longer applicable. To give you a sense of scale, we have on the order of billions of storage sets in BigQuery. I just ran a query last night; this is a current snapshot of the system, so it tends to change over time, but currently it's on the order of billions. That's a lot of metadata.

So why do we have these storage sets in BigQuery? The reason is that the storage set is the unit of atomicity. What I mean by that is that it gives us properties like this: when I select from a table at a particular timestamp, I ignore any storage sets that are in a pending or a garbage state. I only read the ones that were committed as of the timestamp at which I'm reading the data. If they weren't committed at that point, if they were committed later, then I ignore them. So when you perform operations like loading data or DML queries, all that sort of thing, these operations either succeed or they don't, and when you run a query, you never see partially-written data. The property we provide is known as snapshot isolation. Basically, you never have to worry that when you run a query you'll see a partial state of the table. Because we have this unit of atomicity called storage sets, when we persist data, either you're seeing all of it or none of it.
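To make the bookkeeping concrete, here is a rough sketch of a storage-set record as just described, written as SQL DDL, plus the kind of lookup that gives you snapshot isolation. The table and column names are invented for illustration; BigQuery's actual internal Spanner schema is not public.

```sql
-- Hypothetical sketch of the per-storage-set metadata described above.
CREATE TABLE storage_sets (
  storage_set_id STRING NOT NULL,  -- the UUID
  table_id STRING NOT NULL,        -- which table these files belong to
  file_path STRING,                -- path to the Capacitor files in distributed storage
  partition_key STRING,            -- set for partitioned tables
  commit_timestamp TIMESTAMP,      -- when the data was committed to metadata
  column_sizes ARRAY<STRUCT<column_name STRING, size_bytes INT64>>,  -- stats used for scan estimates
  state STRING                     -- 'PENDING', 'COMMITTED', or 'GARBAGE'
);

-- Snapshot isolation falls out of a read like this: only storage sets
-- committed as of the read timestamp are visible to a query.
SELECT storage_set_id, file_path, column_sizes
FROM storage_sets
WHERE table_id = 'mydataset.mytable'
  AND state = 'COMMITTED'
  AND commit_timestamp <= @read_timestamp;
```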

So what creates storage sets? When you load data into the system, we create a storage set per partition. When you stream data in, we actually buffer the data you're sending us and then periodically flush it into our managed storage. When you run a CREATE TABLE AS SELECT statement, we take the results of that query and write them to storage sets behind the scenes. And when you run DML statements, UPDATE, INSERT, et cetera, we create storage sets as part of those operations, as well.

So let's look at how this works in practice. Imagine I have a query where I'm selecting a couple of columns, x and y, from a table. What do we do? First, we list the storage sets in our Spanner metadata. We use Spanner; Spanner is awesome. This is how we keep track of all of our metadata behind the scenes. So we list the storage sets for these tables from our Spanner metadata, and then we send a request to our query engine that describes the mapping of these tables to the storage sets we're going to read as part of the query.

As a brief architecture overview, there are a lot of pieces missing here, but you can roughly think of it like this. As a client, you send a request with a query to BigQuery. There are a couple of steps here. We have a router behind the scenes that figures out where to actually send the query when we receive it. It goes to a job server. We have Spanner tables for job metadata, table metadata, et cetera, and we look up data for the tables that you reference. Then we use that as part of execution in the query engine itself. And you can see down here, we have our distributed storage system with the Capacitor files that are actually on disk. When we go to read a table, we read the Capacitor files relevant to the particular table being referenced. So how does this work for partitioned tables, though?
For partitioned tables, we actually have one or more storage sets for each day for which you have data. In this example, I have data for the 7th, the 8th, the 10th, and so on, and I can have one or more storage sets for each of those days. Storage sets can't span partitions. What happens when I run a query against a partitioned table is that I match the storage sets for the days that are referenced. In this example, I'm selecting data for the 10th. So as BigQuery, when I'm fetching the storage sets relevant to the query at hand, I filter out the ones outside of the 10th. Because the filter restricted the data I was reading to the 10th, I could ignore any storage sets outside of that.

You may be familiar with the concept of clustering in BigQuery. Clustering lets you lay out the data in your tables according to some clustering keys. When you do this, we're actually laying out the files based on the clustering keys that you provided us; the storage set files are sorted by the clustering keys, essentially. So what happens when you select from a table that's both partitioned and clustered, with a query that looks like this? I'm restricting the data that I'm reading to the 10th, and then I'm restricting further that the price is between two values. If my table is clustered by price, first I apply the date filter that you gave me, so I restrict the data being read to the 10th. Then, because you gave me a filter on the price column, I can filter out more of the data behind the scenes. This is passed along to you in terms of reduced query time, and also a lower cost for the query itself. So that was selecting from a table. But what happens when I actually update a table?
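The partition-plus-clustering example just described might look like the following; the table and column names here are assumptions for illustration.

```sql
-- Assumes a hypothetical table partitioned on a DATE column `sale_date`
-- and clustered by `price`.
SELECT x, y
FROM mydataset.sales
WHERE sale_date = '2019-04-10'       -- partition pruning: only the 10th's storage sets are read
  AND price BETWEEN 50.0 AND 100.0;  -- clustering on `price` prunes sorted file ranges further
```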
When I go and update a partitioned table, we list the storage sets that match whatever filter you gave us, if any. In this case, I have a filter on the 8th or the 10th. Then I run the query and rewrite the affected storage sets. What this looks like is that I've restricted the data being modified to the particular storage sets matching that filter, and based on running the query, I end up modifying those storage sets. I replace them, essentially, marking the old ones as now being in a garbage state, and now I have new data that I've committed for this table. Again, all of this is atomic, so if somebody else is running a query at the same time, they're not going to see a partially-modified state of the storage while it's in transit.

We also allow simultaneous updates on partitioned tables. I can actually have two updates running at the same time that modify different dates within a table. In this example, I have one query that's modifying both the 8th and the 10th, and another one that's modifying the 7th. These are OK as long as they don't have conflicts between them. So I have one query that rewrites the storage sets for the 8th and the 10th, and one that's modifying the 7th, and there's no conflict, because the queries are reading and modifying storage sets in different partitions. So we also provide this really cool feature,
I think, called Time Travel. The notion of Time Travel is that because we keep these storage sets for your tables, we let you travel back in time and see the historical version of a table as it was, up to seven days in the past. When you run a query with a clause that says FOR SYSTEM_TIME AS OF some timestamp, we can go and fetch the storage sets that were in a committed state as of that timestamp. That means you can read the data as it was prior to a load job or prior to some DML query, for example. So if you run a query and realize, oops, I deleted all the rows in my table and I'd really like that data back, you can use this Time Travel clause to read the data as it was before you made that operation. So where can this go wrong? How can these storage sets end up in a bad state? Or are things just always happy?
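The Time Travel clause just described looks like this in BigQuery standard SQL; the table names are assumptions for illustration.

```sql
-- Read the table as it was one hour ago, e.g. before an accidental DELETE.
SELECT *
FROM mydataset.mytable
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);

-- One way to recover: materialize the historical snapshot into a new table.
CREATE OR REPLACE TABLE mydataset.mytable_restored AS
SELECT *
FROM mydataset.mytable
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);
```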
Storage sets can end up in a bad state for a couple of different reasons. You can end up with a really large number of very small storage sets, and essentially have way more metadata than you do data, if you do things like microbatch your load jobs. If you load really small files, say one every minute with only a few rows at a time, we end up with more metadata than data for that table. You can also do things like stream at really low QPS. If you're streaming one row every minute, then when we flush that data, we're periodically writing very small storage sets to that table. Or you can perform very small DML queries, at least for now. This is a snapshot in time: the way DML currently works is that we commit a storage set for each DML query you perform, and I say it's a snapshot in time because this is subject to change in the future.

So what's the actual symptom of this? For one thing, queries can take a long time fetching metadata before executing. I mentioned that we read all those storage sets from Spanner. If you create, say, one per minute with these loads, we can end up with more than 1,000 storage sets over the course of a day, and we have to go and fetch all of them, including things like the column sizes and so on. That's a lot to fit in memory and pass around the system, so your queries can end up taking a bit longer than you would expect, even though they may be processing a small amount of data. Also, because we use the number of files and storage sets as a hint for parallelism, we can end up using more workers to process your data than we normally would. So what can you do instead?
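As an illustration of the small-DML anti-pattern above (the table and column names are hypothetical):

```sql
-- Anti-pattern: many single-row DML statements, each committing its own tiny
-- storage set and piling up metadata.
DELETE FROM mydataset.events WHERE event_id = 101;
DELETE FROM mydataset.events WHERE event_id = 102;
DELETE FROM mydataset.events WHERE event_id = 103;

-- Better: one statement, one pass, one commit.
DELETE FROM mydataset.events WHERE event_id BETWEEN 101 AND 103;
```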
Well, as you may have guessed, you can perform less frequent, larger load jobs. We provide the ability to specify a glob pattern, so you can load a bunch of data at a time by matching a pattern on Cloud Storage, for example. Or you can stream at a higher QPS. Basically, use the right API for the job: if you have a lot of data flowing continuously over time, you probably want the streaming API rather than the load API; on the flip side, if you're writing data infrequently in larger batches, use the load API. Or execute larger DML queries: if you have a lot of rows to modify, instead of running 10 DML queries, just run one that modifies all of them in one pass.

But even if you don't, there's a silver lining here: BigQuery manages your data for you. What I mean is that even if you have storage sets in a bad state, BigQuery has a background process that fixes that for you, so you don't need to worry about this, to some extent. Even if you don't follow these best practices, we have an optimization process running behind the scenes that fixes up your data and gets it into a better state, so that your queries are faster as a result.

So let's transition a bit to the topic of querying metadata. We recently released something cool called INFORMATION_SCHEMA views. If you're familiar with the SQL standard (I've read it probably more than I would like), there's a chapter called Schemata that describes these INFORMATION_SCHEMA views. They give you a structured way of referencing metadata in the system. So far in BigQuery, this includes the SCHEMATA view, which gives you dataset metadata; table-related views; a view about views themselves; columns information; and a BigQuery-specific extension called COLUMN_FIELD_PATHS that gives you an overview of the nested data that you have. And we're going to keep adding more
over time. As a very quick overview: we have the SCHEMATA view, which gives you dataset metadata. In this example, I'm running a query that gives me the locations of all the datasets in the bigquery-public-data project, broken down by region. I can see that I have four datasets in the EU region, and the rest appear to be in the US region. We have table metadata, so I'm able to run this query over the San Francisco dataset, for example, and see the tables that
have the prefix "bike," in this case. There is a views metadata table, which gives me information about the actual definitions of these views, so I can run a query and pull the definition of every view I've stored in BigQuery. There is the COLUMNS view, which gives me information about the column names and their types, which ones are used for partitioning, which ones are used for clustering, that sort of thing, and I can run a query to figure out essentially anything I want about these columns. We also have the BigQuery-specific extension called COLUMN_FIELD_PATHS, which lets you analyze the paths to the data themselves. In BigQuery you can have structs, you can have arrays, that kind of thing, and this gives you a way of understanding the paths to data that you can use. In this example, I have an inputs column and an outputs column, and both of these have fields underneath them called value, so I'm able to figure out that both of those have numeric type and analyze them at that nested field level.

I have a short example on this, where I'm going to tie all these concepts together in a query. If we could switch to the demo, that would be great. In this demo, I'm rebuilding the CREATE TABLE statements that I can use for my tables. I have a variety of pieces here: a function for making the PARTITION BY expression for my tables (here's the actual PARTITION BY text itself), a CLUSTER BY clause for any clustered tables, the column list itself, and any options associated with the table. Then I take these various components and combine them all together to generate this CREATE OR REPLACE TABLE statement. What you can see here is that I've generated these results. Let me zoom out a bit, actually, so you can see better. I've generated all these CREATE OR REPLACE TABLE statements
that recreate the definitions of my tables from these columns, and I have the PARTITION BY attribute, options list, et cetera. So that's pretty cool. You can also find this example in our public documentation; if you search for INFORMATION_SCHEMA, we show you how to do this yourself. Can we switch back to the slide deck, please?

Something else I'd like to talk about is that we're announcing a new feature around encrypting PII. When you manage PII in BigQuery, there are a couple of things to mention first. Keep in mind that data in BigQuery is always encrypted at rest: whenever we store your data, those Capacitor files on disk are kept in an encrypted format. Our API endpoints use HTTPS, so clients encrypt requests, as well. You can also use CMEK to encrypt specific tables with a key that you manage yourself, if you want to. And we provide an API called Data Loss Prevention that can help you classify and redact PII. These are just some things to keep in mind as I'm talking about this.

Here's the problem statement, though. Tracking PII in your data across many tables and datasets is hard. You don't necessarily know which columns or which tables have PII; it's kind of hard to track them all down. And when you want to remove data for a specific user, it can be hard to run updates and deletes across all the tables: it's hard to figure out, first of all, what you actually need to run those queries over, and it can also be expensive to go and modify all of it.

So what I'm announcing today is that encryption functions in BigQuery are now generally available. This is a pretty powerful feature, but it's also very low-level, and I want to talk a bit about how you can leverage it in your applications. These are functions for creating keysets and performing encryption and decryption, and the core use case for them is crypto-deletion of PII. You can keep any PII in an encrypted
format, and then you can perform what's called crypto-deletion. I'm going to show you a bit how that works. So how does crypto-deletion help? What you can do is encrypt each of your users' data with a different key. Then, when you want to delete a user's data, instead of tracking it down across all the tables you may have, you just throw away that key. You delete the key, and now the data is no longer recoverable.

To set this up, imagine you have this customers table, for example, and you want to generate a key for each of your users. You can add a new column to the table and then run an UPDATE statement to populate the keysets for each of your customers. Here, I'm using the new KEYS.NEW_KEYSET function to generate a new keyset for each of my customers. Now, when I want to encrypt data using these keysets, I can run a query like this, where I use the AEAD.ENCRYPT function to generate ciphertext from the input data I have, using the key
set for each of my customers. So now each customer's data is encrypted using their own key. I can give somebody access to this customer data table, for example, and all the data in it is kept in encrypted form, so even if somebody had access to it, without the keysets it's not readable; there's nothing you could do with that information. I can also decrypt data using keysets. I can use the AEAD.DECRYPT_STRING function, for example, to recover the data from the customer data table using the keysets I have stored in my customers table. And then, as I mentioned, when I want to crypto-delete data, say for a number of inactive customers, or because somebody asked me to delete their data, et cetera, I can go ahead and delete those keysets. Once those keysets are gone, I don't have a way of getting that encrypted data back: I have to use the decryption functions to decrypt the data, and once the keysets are no longer available to me, I can't recover it.

What's cool about this is that you can perform encryption outside of BigQuery, too. We have interoperability thanks to the library we use, an open-source library called Tink. There's a saying in the industry that you should never roll your own crypto, and we didn't make that mistake in BigQuery: Tink is developed by security engineers at Google, so we have pretty high confidence in it. What this lets you do is use these functions not only within BigQuery itself, but also in whatever applications you build outside of BigQuery. You can create keysets and perform encryption and decryption either within your own code or within BigQuery directly. So imagine a scenario where you encrypt using Dataflow, for example, and then decrypt using BigQuery. Any data stored in your tables is
stored in encrypted form, but you encrypted it using Dataflow, and you can do that because Tink is available as an open-source library; you can use it from Dataflow or whatever other application you're building. Alternatively, maybe you store your keysets on Cloud Storage and your encrypted data in BigQuery. Maybe you don't want to keep the keys in BigQuery itself, for example; you want to keep them in files on Cloud Storage. In that case, you can define an external table over your Cloud Storage keysets and then run a query that joins the keyset table with the encrypted data in BigQuery. That gives you a way of performing decryption in BigQuery without ever keeping the keysets themselves in BigQuery metadata.

Finally, I've got a couple of cool things to announce. These are alpha announcements, and as you may know, alpha features are not available for everyone to use; we have specific programs around them. If you want access to one of these, please talk to your Google contact, whether it's a salesperson or a customer engineer, et cetera, and they can help set you up with access. This is an early preview of some upcoming features that we think you'll really like.

The first of these is our scripting alpha. Scripting gives you the ability to send multiple queries in one request to BigQuery. It also lets you use variables, and furthermore, you can use conditions and loops to manage the flow of execution of the script you send to BigQuery. This is a really powerful feature. Historically, we've only supported running a single query at a time in BigQuery, so this lets you build a lot more functionality into BigQuery directly, where you just hand us a script of all the things you want to accomplish and we go and run it for you. And again, if you want access to this, please talk to whoever your Google contact
is about enrolling. Secondly, we're announcing the alpha of persistent UDFs. There was just a comment on our public issue tracker this morning: it's been two and a half years since somebody requested this feature; why haven't you done it yet? Well, here it is. We have this alpha now, and you can come and try it if you want to. It's been a long time in the making, but it's finally here. If you've ever wanted to reuse functions across different queries, or maybe somebody has created a bunch of functions and wants to share them with you, you can now create them and share them with anybody you want, instead of having to put them all at the top of your queries.

And with that, that's the end of my section. Please welcome Subra to the stage, from Home Depot. [APPLAUSE]

SUBRA DHURJATI: Thanks, Elliott. Most of the features mentioned are really game-changing here. Good morning, all. I'm super excited to be here before you. My aim is to talk about some of our experiences implementing an enterprise data warehouse on the Google platform. I have an agenda here: talk a little bit about The Home Depot, some cool, interesting facts; switch to the enterprise data warehouse, which is my group; and then touch upon facts and dimensions before I speak about the challenges.

I'm very sure everyone has shopped at Home Depot. We are big: we are the number one home improvement retailer in the world, we have 153 distribution centers, and we have around $100 billion in sales. My group, specifically, is the enterprise data warehouse. We started the journey of implementing our enterprise data warehouse on BigQuery sometime around 2016. Today, we have close to 16 petabytes of data, and as we speak, we are still growing. We have 40,000 users across the stores, corporate centers, and distribution centers, and we also do a lot of reporting and analytics on mobile devices.

I'm very sure everybody knows the star schema. The intent is to show what it means to us at Home Depot. With all the 5,000 tables that I mentioned, if I drew this diagram fully, I'm very sure it would cover most of the wall behind me. But for simplicity, what I'm trying to show is: we have sales, we have orders and supply chain, customer orders, and the classic retail dimensions like product, vendor, stores, and, obviously, the customer.

Let me take a minute to talk about what facts are. Facts mean metrics, or the measures that some [INAUDIBLE] tools talk about. They are usually transactional in nature; they could be streamed or they could come in batch. The other important thing is that they can be additive. For example, sales can be rolled up across weeks or time, while the classic counterexample is inventory, which is a snapshot: you cannot roll it up over time.

Next slide: master data. Comparatively, dimensions are smaller in size. They have attributes; an example is a screw size. A nail or a screw in Home Depot could be 3/8 size or 1/14 size. They are slowly changing; they change over time. And they are hierarchical. For example, the organization hierarchy could have a location rolling up to the district, all the way up to the division. The primary purpose of all the dimensions here is to slice and dice your facts, and then to drill up and down: the roll-up functions, as we call them.

Challenges. [INAUDIBLE] I know that as we are
implementing our enterprise data warehouse, as with any platform for that matter, there are tons of challenges, and I could spend pretty much the whole day talking about them. But in the interest of time, I'll narrow the scope to just the dimensions, and touch upon a few of these. Whenever we implement a warehouse, especially in the data integration space, we want to find out, hey, what has changed in the source, and do delta processing. That's one. And it's a big deal in pretty much every business, I guess, where from a business perspective we want to do as-is as well as as-was reporting; I'll explain what that is in the next slides. The other thing I want to touch upon is versioning: how do we store history? By the way, most of these points are interlinked.

One key thing I want to mention here is surrogate keys. One of the lessons learned, with the advent of data scientists and analysts who can write their own queries, is that it's better to avoid surrogate keys. They're very 1999-ish to early 2000s. They confuse people, because you now have to use surrogate keys to do all the joins. They're good for operational reporting, but not for deep analytics. That's the lesson we learned.

Change data capture. I'm very sure every company has a lot of sources from which we extract data. Some of the sources, especially legacy ones, don't have a method to know what has changed; there's no timestamp column. That's one scenario. There are other cases where the source has a last-update timestamp, which you can use for delta extraction. And third, there are sophisticated systems where, for example, with database replication, they can tell you, hey, these are all the changes that you have.
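The second CDC scenario above, a source with a last-update timestamp, reduces to a simple delta extract; the table and column names here are hypothetical:

```sql
-- Pull only rows changed since the previous extract; @last_extract_ts is a
-- watermark recorded by the prior run.
SELECT *
FROM source_system.orders
WHERE last_update_ts > @last_extract_ts;
```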
And that makes life easier in the data integration space.

As-is as well as as-was reporting: it's a big deal, actually, if you look from a business perspective, especially at Home Depot. As I mentioned in the earlier slides, the dimensions are hierarchical, and they change over time; they could still be changing. Most of the business wants to see all the metrics analyzed to the current point in time, the most recent hierarchy. For example, when something moves to a different department, they want to see all the sales rolled up to that department. What that means is that, as I'm very sure a lot of companies do, we build aggregates for performance, and to keep a single version of the truth, all the aggregates need to be restated. We call that internally, at Home Depot, a recast: every day, we recast all of our metrics to the most recent hierarchy. But there are business analysts and data scientists, especially in the legal department and in marketing, who want to see all the metrics as of a particular point in time. You have to go back: for example, give me my sales as of the hierarchy in 2007, or give me the address of the customer as of 2010. To do that, we have to enable as-was reporting, and to do all of this, you need to keep the history. That's the point here, I guess.

Next slide. This slide is multifold. I put it up as one of the methodologies for versioning, to implement our history, but I'm also going to talk about our Home Depot methodology of how we do data integration with the ELT approach. The third point I want to make here is that it's adaptable to the use case where the source does not tell me what has changed, so I'm going to go through that use case, too, in this slide. Going from left to right, we have the source table. It could be a mainframe; it doesn't have a last-update timestamp. Our integration methodology in the data ingestion is to extract that into a raw dataset. When I say raw, it means source-similar: it's pretty much close to
what source is And then we have to scrub the data So we scrub it We’re using [INAUDIBLE] bunch of BigQuery functions that we can use to scrub the data and then land it into a staging database Again, this is an ELTV loading to BigQuery And then do all our massaging of data there And then at that point, now the data is pretty much what we need, I guess But as I said, the source didn’t tell me what has changed It has all the rules So now, I’m going to go to my yesterday’s copy, or the previous run’s copy of my target [INAUDIBLE] full outer join figure out what has changed If it is in the source but not in target, that means there are new rules If it is in target but not in the source, that means something else got deleted If it is in both and the columns have some different, that means they’re updated We take up all these changes, stack them up in a table which I’m calling it as Change Capture Table And if you see, that’s in the data data set So we have a raw, which is so similar We have a stage where it is temporarily massaged And then the final is all in the data data set The other thing is, now we have to do the as-is was is as-was So to do the as-is, we use BigQuery functions like rollover right at the analytical functions to get the most recent row and put it in the target, which is our– we expose to business I didn’t represent the as-was case, but you could always use the analytical functions like lead, lag, all those functions, and get to effective begin and end dates and do the as-was reporting One thing to point out here is, from our legacy warehouse to which we moved to BigQuery, we do a one-time conversion And I have to put that data into the target The same thing, now What I did is, I explained using the data here
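The full-outer-join change detection described above can be sketched in BigQuery Standard SQL roughly like this. The dataset, table, and column names (`stage.customer`, `target.customer`, `customer_id`, and so on) are illustrative assumptions, not Home Depot's actual schema, and the sketch assumes the staging and target copies share the same column layout:

```sql
-- Compare today's scrubbed staging copy against the previous target copy.
-- Rows only in staging are inserts; rows only in the target are deletes;
-- rows in both with any differing column are updates.
SELECT
  COALESCE(s.customer_id, t.customer_id) AS customer_id,
  s.name,
  s.address,
  CASE
    WHEN t.customer_id IS NULL THEN 'I'  -- new row in the source
    WHEN s.customer_id IS NULL THEN 'D'  -- row no longer in the source
    ELSE 'U'                             -- row exists in both but changed
  END AS change_type,
  CURRENT_TIMESTAMP() AS captured_at
FROM stage.customer AS s
FULL OUTER JOIN target.customer AS t
  ON s.customer_id = t.customer_id
WHERE s.customer_id IS NULL
   OR t.customer_id IS NULL
   -- Whole-row comparison via JSON; assumes identical column layouts.
   OR TO_JSON_STRING(s) != TO_JSON_STRING(t);
```

Comparing whole rows with `TO_JSON_STRING` avoids listing every column; with wide tables you may prefer an explicit column-by-column comparison so that ignorable columns (load timestamps, for instance) don't trigger spurious updates.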

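As a sketch of the two reads of that stacked change-capture table (again with made-up names), the as-is view keeps only the latest version of each key, while LEAD can derive effective begin and end timestamps for as-was reporting:

```sql
-- "As-is": latest version of each key from the stacked change-capture
-- table; rows whose latest change was a delete surface as inactive.
SELECT * EXCEPT (rn)
FROM (
  SELECT
    customer_id, name, address, captured_at,
    change_type != 'D' AS active_flag,
    ROW_NUMBER() OVER (PARTITION BY customer_id
                       ORDER BY captured_at DESC) AS rn
  FROM data.customer_change_capture
)
WHERE rn = 1;

-- "As-was": derive effective begin/end from the same history; a NULL
-- eff_end marks the currently effective version of each key.
SELECT
  customer_id, name, address,
  captured_at AS eff_begin,
  LEAD(captured_at) OVER (PARTITION BY customer_id
                          ORDER BY captured_at) AS eff_end
FROM data.customer_change_capture;
```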
And one thing I want to point out is, there's an active flag. If a record got deleted, we set the active flag to false. Moving on. Pros of this approach: it's very easy to implement, and it's self-healing. You basically stack up all the changes. Say some transaction comes in late: it still goes in and self-heals, because there's not much we are doing; we stack up the changes and pull the latest. It even works for streamed data. Though my example said I am doing a full extract, this approach works beautifully well for streaming.

Cons. One of the cons I heard from business users who are not that SQL-savvy is that you have to use some analytic functions to get the as-was. But to mitigate it, we could always pre-create those tables and export them. Then, if the tables are huge (the customer dimension and similar dimensions could be multi-million to billions of records), there might be some performance issues if you're joining to a really multi-billion-row fact table. That's one thing to keep in mind.

We have the next approach. Again, this slide is also multifold. Here I've used a streaming use case, but it could work even for batch; just for representation, I put the streaming. Moving from left to right, in this case the source is sophisticated. And by the way, these are live examples that we are running in our warehouse. So it is sophisticated: it uses a data replication technology to identify all the changes and say, these are inserts, these are updates, and these are deletes. We use a transport process to put them onto Pub/Sub. From this point, we want [INAUDIBLE] this data now. So again, sticking to our Home Depot methodology and [INAUDIBLE] methodology here, we put it in raw. We use Dataflow to put it into our raw database with all these changes. From that point, we do a bunch of transformations, because we are following the ELT approach here. We could do it other ways, but for these we are especially using ELT, the reason being our technical staff are really familiar with SQL. Our earlier legacy warehouse had a bunch of SQL to do the transformations, so it was seamless for faster development. We put it into staging, and then we use the BigQuery MERGE DML statements, which I believe were launched around a year ago, I think. So once this was done, most of our tables now use these MERGE statements, with the effective begin and end dates already in the table. Next slide. It's the same thing I explained, as an example with the data here.

Pros and cons. Compared to the other approach, the pro is that the effective begin and end dates are already in the table, and you could do the as-was and as-is types of reporting easily using them. What we have found is that this approach works well for large tables: I think we have a couple of dimensions with a billion records and 80 columns, and it still works well. The cons: it's slightly complex. With the stacking approach, we just stack it up and pull the latest, whereas here you have to take on some level of programming. And the second one: when we are doing streaming, it adds a different level of complexity, the reason being you now have to use effective begin and end timestamps rather than dates. It adds a slight layer of complexity.

What we learned. Obviously, there are a lot of challenges when you're doing data integration in a big enterprise data warehouse space. We are using the ELT methodology because BigQuery has all the rich functions.
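The MERGE-based pattern described above might look roughly like this in BigQuery; the names are illustrative assumptions. Note that a single MERGE can close out the prior version of an updated row, but inserting that row's replacement version typically takes a second statement (or the common trick of duplicating update rows in the source with a non-matching key):

```sql
-- Apply a batch of captured changes: close out the currently open
-- version (eff_end_date IS NULL) on update/delete, insert brand-new keys.
MERGE data.customer_dim AS t
USING stage.customer_changes AS s
ON t.customer_id = s.customer_id
   AND t.eff_end_date IS NULL
WHEN MATCHED AND s.change_type = 'D' THEN
  UPDATE SET eff_end_date = s.change_date, active_flag = FALSE
WHEN MATCHED AND s.change_type = 'U' THEN
  UPDATE SET eff_end_date = s.change_date
WHEN NOT MATCHED THEN
  INSERT (customer_id, name, address,
          eff_begin_date, eff_end_date, active_flag)
  VALUES (s.customer_id, s.name, s.address,
          s.change_date, NULL, TRUE);
```

For a streaming pipeline, the same shape applies with TIMESTAMP columns instead of DATE, which is the extra complexity mentioned above.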

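With effective begin and end dates in the table, the "give me my sales as of the hierarchy in 2007" style of question mentioned earlier reduces to a range predicate. Again, a sketch with assumed table and column names:

```sql
-- As-was reporting: join facts to the dimension version that was
-- effective on the requested date; open-ended versions use NULL eff_end.
SELECT
  d.department_name,
  SUM(f.sales_amount) AS sales
FROM data.sales_fact AS f
JOIN data.department_dim AS d
  ON f.department_id = d.department_id
 AND DATE '2007-12-31' BETWEEN d.eff_begin_date
                           AND IFNULL(d.eff_end_date, DATE '9999-12-31')
GROUP BY d.department_name;
```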
And still, I haven't found anything we need to cleanse our data that is not available in BigQuery, pretty much. We use this approach to do our data integration, and so far it is working well. We migrated and integrated close to 5,000 tables in less than a year. Hey, spring is finally here. It's time for yard work and mulch and– [LAUGHTER] I would encourage you, stop by your local Home Depot store. [LAUGHTER] [INAUDIBLE] one of my coworkers there will help you all. Thank you. Thank you all. [MUSIC PLAYING]