Cloud Data Fusion: Data Integration at Google Cloud (Cloud Next '19)

[MUSIC PLAYING] NITIN MOTGI: Hello, everyone My name is Nitin Motgi So I’m one of the group product managers in data analytics space You guys have seen me on the keynote demo there But if you have not, I’m going to show you what data fusion is all about So before I get started, I wanted to give you guys enough context of where it’s being used I wanted to introduce my colleague Robert here who’s from Telus Digital So he’s going to basically give you guys the context in terms of how they are looking to use data fusion So with that, I will hand it over to Robert See you guys in a few minutes ROBERT MEDEIROS: Thank you, everyone All right So I’m Robert I’m a software architect with Telus Digital If you’re not familiar with Telus, it’s a Canadian telecommunications firm that operates in the traditional telecom verticals We offer phone, internet, mobile phone, television services It won’t surprise you to learn that in the course of conducting our business we collect a very large amount and we generate a very large amount of data that we need to carefully and responsibly manage That data also needs to be sifted for whatever insights we can glean in order to better serve our customers So we’re a growing business And we’re moving into new lines of business all the time Some of those areas come replete with their own data specific challenges In some instances, we’re confronted with a particular security strictness that we have to contend with In other areas we have compliance and regulatory regimes that we have to deal with In still other areas we have particular data volume and velocity challenges that we need to contend with So all of this is to say that over the course of our 100 year history and a number of mergers and acquisitions where every new member of the family has come with new data standards, new data processes, systems, our landscape has grown extremely complex This complexity has been very hard to contend with It’s hard just to grapple with it, to understand 
it, let alone to try and merge it all, fuse it into a unified whole So to try and understand where some of the places we were falling short are, we recently conducted a survey of all the participants in our data ecosystem from which we learned that the average score people were willing to give to our data and our data systems was a rather middling 2.74 on a five point scale Not as good as it could be, unfortunately The data that we received was rich enough to reveal an interesting pattern, which was that the closer you are to generating data, particularly if you’re a human that only deals with a single database or a single data generating process, the happier you tended to be with your lot in life If you’re a person that was responsible for taking data and transforming it, somehow relocating it or handing it off, you were generally a little less content with your lot And if you were a downstream consumer of data, you were generally somewhat unhappy This is particularly if you needed to rely on a synthesis of data, cross-functional data from a lot of different upstream data sources So generally speaking, the picture was that the higher you went in the organization and the more data you had to touch, the more unhappy you were And since that describes senior level decision makers, we realized that we had a bit of a problem that we had to solve The implication for us was that we needed to walk before we could run We’re very interested in exploring some sophisticated capabilities like machine learning We wanted to build out an AI program But we realized that before we could do any of that we needed to fix our data problems So the answer for us was to explore the principles of supply chain management and to undertake a data supply chain program So building a data supply chain is replete with many challenges Those challenges include the need to integrate data from a great number of sources, integrating meaning clean, validate, refine, reconcile data through an often 
complex chain of lengthy transformations We also have to handle the entire data lifecycle Our data is born, it lives and is in active use, and it retires and eventually becomes obsolete And during the entire course of these events, we need to avoid breakage We need to provide a consistent and canonical view of data to our users And in particular, we want to compute key business metrics just once and in just one way if possible We needed to better understand our data We need an ontology to ascribe shared meaning to data And in particular we needed comprehensive metadata and lineage information about all of our data down to the field level for every field We wanted to build a unitary system, something that supported many roles from less to more technical without causing an explosion in the number of tools that we had to build or support And we wanted something that was ubiquitous, something that lived in all the places where our data lived and made data location transparency pivotal Ultimately, we needed safe and secure delivery of our data to our downstream consumers

We wanted to break out of all the data silos that existed in the enterprise All of this has to happen in the context of a world where cloud is an important part of what we do And we’ve learned that building hybrid cloud architectures is challenging Some of those challenges include the need for portability We need to build things locally, develop artifacts that we can run locally but also run on prem in our data lake for example, as well as to moving into the cloud for scaling and other benefits that cloud offers We need to be able to distribute our data pipelines so they seamlessly bridge from on prem to cloud so that our data can flow back and forth without any friction We need extensibility So we need standard hooks, well-defined places to add our own business logic, to add data transformations, to add connectors to various data sources, ideally without bespoke integration Every bit of bespoke code that we have to write to support these things means an additional burden in terms of technical debt that we’d rather avoid We have the problem with affinity Some applications just resist hybridization Some data sets have to be pinned to a certain location for various reasons Conversely, some data has to be duplicated And that comes with its attendant challenges of having a unified view of your data, maintaining data lineage, and so on And the problem of testability is key In as much as things vary from environment to environment, we want to isolate those things, test them once We want to test once not test once per environment Right We want sufficient abstraction that we can build our pipelines, build our data transformations, build our logic and test it once and know that it’s going to run in all the places where we need it to So taking GCP as our example of a single cloud– we’re here, why not– we realize that we suffer from an embarrassment of riches The great number of services on offer mean that we are faced with service integration challenges We’re faced with a 
fairly steep learning curve Building even a relatively simple pipeline– imagine landing data on GCS, modifying it with Dataproc, pushing it into BigQuery or Bigtable, and ultimately surfacing it in Data Studio– recall what it was like to be a newbie to the platform and how daunting all of those things seemed, and then imagine having to stitch them all together This is something that we needed to come up with a way to address All of this impacts our speed to productivity The more things that we’re touching and trying to stitch together, the greater our challenge We wanted to find a way to make easy things easy In spite of the fact that digging into specific tools yields benefits, greater performance, lower cost, we still need a way to tie things together without having to be deep experts in every single part of the data pipelines that we build So multi-cloud We live in a multi-cloud world and there is no escaping it This comes with the challenge of finding people that are sufficiently skilled Finding folks that are knowledgeable about one cloud platform is hard Finding unicorns that know and have deep expertise across multiple clouds is correspondingly more difficult.
When things inevitably go wrong, finding the root cause of the problems is made much more complicated when your applications are parceled out across multiple clouds And with a greater mix of tools and services available across clouds, finding the right mix of things to tailor for precisely your application is more difficult And of course, there may come a time when you want to move on to or off of a particular cloud And to the extent that you’ve tailored your applications to the interfaces of those clouds, you’ll find that you have a higher switching cost to pay So the question before us is, can we find or build a tool to tie a broad array of services together near seamlessly across our on prem and cloud infrastructure, multi-cloud infrastructure in fact, in a way that minimizes the cognitive load of taking advantage of all this power? We want a platform that supports the end to end data lifecycle, something that helps us manage data from the moment of its capture till its eventual retirement We want something that helps us better understand our data and processes, something that explains the origin and the lineage of all the data that flows through the system We want something that supports a broad mix of less technical and more technical users, giving all of these folks a home, something within which they can easily access the data that they need and build flows to shape it in the way that they want And to add on a few extra bullet points, if we could find something that was open source and that has a community, so much the better If we can find something that we know scales from micro jobs all the way up to massive, and if we could find something that’s sufficiently flexible that we can mold and shape it to our needs, we’d be that much happier So about a year ago I was introduced to something called CDAP CDAP is an open source tool for big data processing And having dug into it, I became very excited It seemed to address a lot of these points that I’ve mentioned
And it wasn’t so long after first encountering the tool and getting excited I learned that it had joined the Google
Cloud family of services and had become what you’re going to hear about today, Data Fusion Since then, Data Fusion has become an important part of our data integration efforts as we move our data supply chain into a hybrid and multi-cloud world And Data Fusion solves a number of the pain points that we suffer And I’d like to introduce Nitin to tell you a little bit more about it NITIN MOTGI: Thank you Thank you Thank you, Robert ROBERT MEDEIROS: Thank you NITIN MOTGI: So let’s talk about Data Fusion So Data Fusion What is Data Fusion? So Data Fusion is a fully managed cloud native data integration service It’s basically helping you to efficiently build and manage data pipelines With a graphical interface, and a broad collection of open source transformations, and hundreds of out of box connectors, it helps organizations shift the focus from code and integration to insights and actions It’s actually built on an open source technology called CDAP CDAP has been in existence for quite some time now This is a managed version of it So we can talk more about CDAP in general But CDAP as a platform allows you to build data analytical applications And as part of data analytical applications, there are a few analytical applications that are also included as part of CDAP And one of them is ability to build data pipelines The second one is more about how do you transform the data without having to write any code We call them Wrangler, data prep The name has been constantly changing But it’s an ability where if you are specifying transformations, you’re applying data quality checks, you don’t need to be writing any code And that is another accelerative [INAUDIBLE] application that is built on top of CDAP So we basically– there are more other things like that like rules engine, a bunch of things like that, which we will look at it later But what we have done now is taken those two parts with ability to build pipelines and add ability to do transformations without having to write 
code And along with the CDAP platform, we have turned that into a managed offering on GCP So when you look at the kinds of use cases that we can use to solve this, and the problems that it is trying to address, the first thing that it is trying to address, like Robert mentioned, is making it extremely easy to move data around Right So right now, if you’re doing it all by yourself, it’s very error prone It’s very time consuming So if I have to move data from, let’s say GCS to BigQuery, I would have to spend a lot of time writing code and ensuring that that code actually works every single time And that is to say nothing of when I have to do another such point A to point B transition with an added transformation– I start all over again So essentially, at the end of it, it increases your TCO Right So just for moving, you’re spending way too much time and then you’re actually getting into situations where you might not be able to address all of the business requirements that you have Right And also the skills gap is another thing So it’s like you need to have the right set of skills and have good expertise in the systems that you’re stitching together, because there are performance requirements There are a bunch of things related to how-tos, best practices you need to be adhering to So all that is very difficult Right And the last one is hybrid And hybrid is something that is very interesting actually During our process through EAP, which is the early access program on GCP, we learned that the actual journey to cloud starts on prem It’s not like lift and shift and everything is done It’s kind of hard You need to be able to start your journey slowly from an on prem environment So just the combination of CDAP and Data Fusion allows you to do so And you don’t need to build it twice You can build it once and be able to run them in two different environments So who are the users that we are targeting with Data Fusion?
So we are leveling– we are moving the level up in terms of being able to adopt GCP and being able to apply transformations When you look at– when there is a need for developer data scientist or business analyst to cleanse, transform, blend, transfer data, or standardize, canonicalize, automate your data operations, data ops, we want you to be using Data Fusion Data Fusion provides a graphical interface With that, it also provides ability for you to test, debug, as well as deploy You can deploy it at scale on GCP And there we can essentially scale to your data levels Right So it essentially can scale to petabytes of data there Now in addition to that, we also collect a lot of metadata, whether it is business, technical, or operational

All of this metadata is aggregated within Data Fusion And we are exposing that in terms of lineage So you will be able to do things like root cause analysis, impact analysis, be able to find provenance, be able to associate a lot of metadata for the data pipelines that you’re building as well as data sets that you’re creating with it And that’s a very important part of having data operations anywhere So when you look at the kinds of use cases that Data Fusion is trying to solve, Data Fusion– these are different business imperatives that translate into IT initiatives and then maps to how Data Fusion can help solve So when you are looking to build warehouses, you’re actually trying to get data from many different sources, let’s say into BigQuery as an example You should make it extremely easy to do that It should not take you years, six months, not that long You should be able to operationalize these things pretty quickly When you are actually looking to retire legacy systems, Data Fusion can help you migrate that data over to GCP it doesn’t need to be just BigQuery You can put it in Spanner You can put it in Cloud SQL, data store, Bigtable You can pretty much be able to connect to anything that is available on GCP to bring that data in And it’s not just limited from a source’s perspective to on prem You can also read from the same set of sources that you have written to in GCP Data consolidation is another aspect So you’re trying to either migrate some data or essentially retire a bunch of data that exists there Master data management is something that has a bunch of capabilities but we are going to be adding a lot more in this year to essentially help create a much more consistent, high quality environment for your data And the last one is extend or migrate in cases on a hybrid environment that you want to load shed into cloud If you are still on prem, you will be able to do those kind of things The thing is the complexity of it being able to run on cloud, on prem, 
and in other clouds That’s the biggest part of all this You should be able to run that across anywhere So just to put it all together, the way I like to think about this is Data Fusion is providing a fabric which allows you to fuse a lot of different technologies and products that are available on GCP in a much more easy, accessible, secure, performant and intelligent manner It just makes the entire process extremely easy So the things that would take me six weeks to build, I can literally build them in two minutes, deploy them on the third minute, and be able to operationalize Operationalizing is not quite that fast, but it takes, let’s say, a day or so Right So basically, this is bringing a lot of different things together, making it very simple It’s raising the bar and lowering the barrier to entry, essentially, for GCP when you talk in terms of data So with that brief introduction, I’m going to show you guys a demo of Data Fusion So let’s get started So Data Fusion, as I said, is a managed offering So it will be available through Cloud console Starting today, you will be able to see Data Fusion show up on the Cloud console So now what we have done is we have basically taken Data Fusion and made two different editions available to you all One is the basic edition The second one is an enterprise edition You should be able to pick depending on your needs And we recommend the basic edition more for dev and, I would say, not-so-highly-available QA environments But the enterprise edition is more for mission critical stuff So you should be able to provision either of those two editions to start with So here I have already created a few of them I mean, as you can see, these are Next demo stuff, so a bunch of different things that have been created, all in enterprise But I can go into Create Instance and be able to specify And here are the zones and locations that we are starting with So we have Asia
East, Europe West, Central, East, and West, all of these So as we go along, we’ll be adding more locations in this year So once I create the instance, I can jump into Data Fusion entirely So this is Data Fusion It looks a lot like a cockpit where essentially you have a lot of things that you have interacted with This is called the control center The control center is the place where you are able to monitor and manage all of your data sets and data pipelines This is a central place for doing that You also have an ability to work with the pipelines This is the list of pipelines that you are interested in Here we have put a few pipelines that have been scheduled to run and have been running Additionally, you also have a studio where you can build different pipelines It’s a visual way of building all of these pipelines and being able to deploy them And also, we have a place where we are collecting metadata I skipped one, Wrangler I’ll come back to that This is where we are collecting all of the pipeline metadata as well as all of the different data sets The data gets collected here So the interesting part Now let’s imagine you have data sitting in an Oracle system So it’s an on prem Oracle system and you want to bring data in And it’s not just limited to Oracle actually We have hundreds of out of box connectors that allow you to connect to various systems, starting from mainframe to being able to connect to a bunch of different cloud sources Right So to start with, we have started providing a very simple way of bringing data in Right So the first thing you’re seeing here is I have pulled in the data from an Oracle table Now let’s see how we got here Right So Wrangler allows you to connect to various different data sources These are the popular ones We are going to be adding a lot more in coming months actually So I have already preconfigured an Oracle table here I can actually add more databases Databases that support JDBC can be easily added here We also support Kafka, S3, GCS, BigQuery, and such So as soon as your instance
gets created, it automatically attaches to your projects in GCS, BigQuery, and Spanner So now let’s say I have data sitting in Oracle here So this is now live listing all of the Oracle tables And I’m interested in bringing in one table only, not to say that you have ability to also bring the entire database over So you don’t have to do one table at a time You can bring the entire database over So let me click on the employees table As soon as I click, the data gets sampled There are many different ways the data can be sampled But the data gets sampled All of the data types of the source data system gets translated into an intermediate, which is essentially it’s AVRO, where it gets translated into AVRO types So it’s basically an AVRO spec, but it is an extended AVRO spec to specify various other types So now once I’m here, I can apply transformations So now the first thing I see is commission percentages being highlighted for like there’s some data that’s actually missing here Now in order to process, I want to make sure that I can specify some default values So I’m going to fill 0.0 here So there’s 0% commission I can now see it’s 100% complete Right I can also see that the job ID is a combination of two things, department and title So I want to be able to specify a way to split those So I can go in here I say extract fields Like delimiters So now we generated two columns Right So now I don’t need this specific column from here And I’m good Right But you want to get a little bit of insight of how your data is organized So we basically make it easy to get a little bit of insights on your data So there is a way where you can essentially slice and dice this data in and have a visual way of looking at how the different dimensions are organized, how many different records of a certain type So you can do drill down on these and look at and compare that with other dimensions, right? 
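The three cleanup steps just narrated (filling the missing commission with 0.0, splitting the job ID on a delimiter, and dropping the original column) can be sketched in plain Python; the column names here are assumptions based on the narration, not the actual Oracle schema, and Wrangler itself applies these as declarative directives rather than code:

```python
# Sketch of the three Wrangler directives from the demo, applied to
# sampled rows represented as a list of dicts. Column names are
# illustrative, mirroring the narration.

def fill_null(rows, column, default):
    """fill-null-or-empty: replace missing values with a default."""
    for row in rows:
        if row.get(column) in (None, ""):
            row[column] = default
    return rows

def extract_fields(rows, column, delimiter):
    """split-to-columns: split one column into two on a delimiter."""
    for row in rows:
        left, _, right = row[column].partition(delimiter)
        row[column + "_1"], row[column + "_2"] = left, right
    return rows

def drop(rows, column):
    """drop: remove a column that is no longer needed."""
    for row in rows:
        row.pop(column, None)
    return rows

sample = [{"EMPLOYEE_ID": 7, "COMMISSION_PCT": None, "JOB_ID": "IT_PROG"}]
sample = fill_null(sample, "COMMISSION_PCT", 0.0)        # 0% commission
sample = extract_fields(sample, "JOB_ID", "_")           # dept + title
sample = drop(sample, "JOB_ID")                          # original no longer needed
```

Each step corresponds to one directive in the Wrangler recipe, applied first to the sample and later, unchanged, to the full table.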
So this is just to give you a feel of how the data that you have pulled in– how the sample that has been pulled– is organized So once you are comfortable with this, I will say go create a pipeline, because my source right now is Oracle It’s not a real time source, right? So it’s a batch source right now, because we are pulling it in batch, but you can also do a real time data pull from Oracle, if you are doing things like CDC and such So I want to create a batch pipeline So what that does is, it transfers all the work that I did in Wrangler into my studio, where I can build out the rest of the pipeline My goal was to get this transformation Now, I started with Wrangler to create this, but you can go the other way, too You can actually go from being in this canvas sort of Studio to being able to wrangle the data So just by dropping in one of the transforms, like the Wrangler transform here, you should be able to do that seamlessly For simplicity I’m just using a very simple pipeline Now, this is not going to be operating on just a sample, it’s going to be operating on the entire table that you’re trying to pull So then I can go ahead and figure out that I would need to put this into BigQuery for analytics I also want to make sure that I can store this data or archive this data into GCS Right, so it’s as easy as that Now I configure these Configurations are also pretty easy here So you just define the configuration, which is your data set, your table name, that’s all you need to specify Those are the mandatory fields, you’re done That’s how you essentially bring data into BigQuery The same holds true from the GCS perspective You specify how your bucket needs to be there, what your Suffix Path is, and you can also define the format in which you want to write that data It’s not limited to that So now the Wrangler that I showed you guys has more than 1,000 functions, which include a lot of data quality checks which I’m going to show you guys But the most important thing with that is it is extendable, you can write your own data transformations– we call them directives You can write your own data transformation directives And you can just drop them in and use them in any pipeline That gives you shareability So to show that, let me see, after doing this I feel like, oh man, I missed something So I want to go back and do more operations here Right, so I want to be able to say, for example, for this Employee ID, change my type to, let’s say, long
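The extensibility described above, where user-written directives are dropped in and reused across pipelines, might look like this in spirit; the registry, names, and recipe shape here are illustrative, not CDAP's actual plugin API:

```python
# Illustrative plugin registry: custom transformation "directives" are
# registered by name and invoked declaratively, the way Wrangler
# resolves directive names in a recipe. Not the real CDAP API.

DIRECTIVES = {}

def directive(name):
    """Register a function under a directive name."""
    def register(fn):
        DIRECTIVES[name] = fn
        return fn
    return register

@directive("uppercase")
def uppercase(row, column):
    row[column] = row[column].upper()
    return row

@directive("set-type-long")
def set_type_long(row, column):
    # e.g. the demo's "change Employee ID type to long"
    row[column] = int(row[column])
    return row

def apply_recipe(row, recipe):
    """A recipe is a list of (directive-name, column) steps."""
    for name, column in recipe:
        row = DIRECTIVES[name](row, column)
    return row

row = apply_recipe({"id": "42", "dept": "it"},
                   [("set-type-long", "id"), ("uppercase", "dept")])
```

Because directives are resolved by name at run time, a directive shared through the hub becomes usable in any pipeline without changes to the pipeline itself.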
So I should be able to do those types of things So I go back and I make those transformations and I am done with it So, let’s see, I go back OK so all of the transformations are here, so now whatever I have done can be configured So right now– we provide two different ways of being able to execute the pipeline You can run them as a Spark job, you can run them as a MapReduce job, but we are also going to be, in the future, adding an ability for you to run that as a Dataflow job So you build a pipeline, and you’ll be able to run it in any of these three things without having to make any change to your business logic Let that sink in for a bit You then have the ability to specify resources So, like, this is actually for MapReduce, OK, that’s not that interesting, it’s the old one So let’s go look at Spark So you can change how Spark behaves So you can specify configurations, you can specify something very specific– We don’t recommend, actually, for you to do this, I’m just showing you that these are available if you guys are interested in fine tuning your particular pipeline We also do support alerts So again, these alerts– we do provide a developer’s kit that allows you to extend these and add your own thing For example, one of our users actually wrote a Slack connector, so every time a pipeline finishes it sends a Slack message I think someone wrote a HipChat one and there was a Twilio one A bunch of things like that So you can make the notifications pretty easy with these So, in addition to this, you also have the ability to schedule your pipelines from here You can define them here or you can define them once they’re deployed Now the interesting thing here is you can actually debug the pipeline As soon as you go into the preview mode it provides you an ability to test your pipeline And it actually is going to go to the original source and bring that data in So right now it supports N rows, where N is defaulted to 100 rows, but if you want to run a
larger sample, you should be able to do that, it just takes a little bit more time when you are looking to preview terabytes of data But we are looking to improve that and essentially be able to do this constantly in real time so that you can look at the data as you are building the pipeline So all of this actually translates– one question that I always get asked is, Is this generating Spark or MapReduce code Is it a code generator? And my answer to that is no, it’s not a code generator I don’t know how many of you guys know, but there’s a word that was coined a long time ago called code weaving So it weaves all of the different components that are being built for one execution engine in an optimized way So there is actually a planner that picks up all the bits and pieces and translates that into an execution plan So it weaves all of the stuff that you build into a pack and it figures out what is the optimal way of transferring the data from one node to the other In most cases it’s amazing because, instead of going from one machine to the other, you’re actually doing in memory transfers and that’s how it gets composed into the execution paradigm And that’s the same thing we can do for running it on Spark, and we can do the same thing for running it on MapReduce And with that, the ability for us to push down some of those optimizations– basically leverage the optimizations that are available in the underlying system– is extremely critical And we’re able to do that without having to make major architectural changes Now once a pipeline is built, you can actually see it generates a configuration, right?
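The code weaving idea just described (composing prebuilt components into one in-memory execution plan rather than generating source code) can be sketched as follows; the stages here are toy functions standing in for real plugins:

```python
# Minimal sketch of weaving: pipeline stages are prebuilt callables,
# and a planner fuses them into a single function so records pass
# between stages in memory instead of crossing machine boundaries.

def weave(stages):
    """Fuse a list of per-record functions into one function."""
    def fused(record):
        for stage in stages:
            record = stage(record)
        return record
    return fused

# Toy components, analogous to plugins dropped onto the canvas.
parse = lambda line: line.split(",")
clean = lambda fields: [f.strip() for f in fields]
keep2 = lambda fields: fields[:2]

plan = weave([parse, clean, keep2])   # one fused in-memory pass
result = plan(" a , b , c ")
```

Nothing is code-generated: the same prebuilt components are woven into whichever execution engine runs the plan, which is why switching engines does not change the business logic.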
It’s a simple JSON configuration, you can build it all by yourself by hand It’s not very complicated It has just a few sections And for everything there, in addition to being able to define the graph, it actually tracks, for every individual node, the artifact version So essentially every component that is being used is being versioned, actually So you’re able to version every part of it So if new versions of connectors come out, or they are updated, or if there was a bug and it was fixed, you should be able to use that seamlessly And the system allows you to migrate from version A to version B, making sure all of those checks have been handled carefully So with that, this is essentially the ability to build a batch pipeline, but you can also do the same thing with real time pipelines, actually So you can build a bunch of real-time pipelines Looks like there are a few– there are not that many plugins available here, but that’s the reason we have added Hub So Hub is a place where we will be adding more connectors, more transformations, more reusable components that we’ll be making available here And the best part is you can have your own organizational internal hub So you can share some of the connectors or plugins that you’re building with the other users within your organization It makes it extremely easy for you to do that It’s a spec that we ask you– it’s actually a documented spec So if you follow that spec, you can get exactly the same marketplace for your organization, in addition to what we do provide And in there, where we put stuff, there are a lot of things that we have built over the years, over a period of seven years But we’re also getting a lot of contributions from within Google, our open source community, as well as working with a lot of partners to build this And there’s one important distinction that we’re making, right?
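The configuration just mentioned is roughly shaped like the following sketch: per-stage plugin artifacts carry their own versions, and a connections section defines the DAG edges. The key names follow the CDAP pipeline spec from memory and the version strings are placeholders, so verify against a real exported pipeline:

```python
# Hedged sketch of a pipeline config, mirroring the structure CDAP
# exports: each stage names its plugin and a versioned artifact, and
# "connections" lists the graph edges. Keys and versions are
# illustrative; compare with an actual export before relying on them.
import json

pipeline = {
    "name": "oracle_to_bq",
    "artifact": {"name": "cdap-data-pipeline", "scope": "SYSTEM"},
    "config": {
        "stages": [
            {"name": "Oracle",
             "plugin": {"name": "Database", "type": "batchsource",
                        "artifact": {"name": "database-plugins",
                                     "version": "1.0.0"}}},
            {"name": "BigQuery",
             "plugin": {"name": "BigQueryTable", "type": "batchsink",
                        "artifact": {"name": "google-cloud",
                                     "version": "0.12.0"}}},
        ],
        "connections": [{"from": "Oracle", "to": "BigQuery"}],
    },
}
doc = json.dumps(pipeline, indent=2)
```

Because every node records its artifact version, upgrading a single connector is a matter of bumping that one version field rather than rebuilding the pipeline.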
When I talk about Data Fusion, Data Fusion is actually not– it’s unlimited use For every instance there is no limit on the number of users who can access it That’s number one Between basic and enterprise we are not distinguishing connectors Every single connector will be available in both of them Unlike how connectors are generally done elsewhere, we didn’t want to give you that choice Essentially, we wanted to keep it plain and simple, so every connector of the things that we built would be available So from here, if you want the DB2 plugin, you can just deploy it, it gets added to your instance That’s it With that you are now able to use the DB2 plugin to talk to DB2 instances So the same thing holds true There are a lot of different real time connectors that are available Transforms get shared Analytics of different kinds, where you’re actually doing joins or you’re aggregating data, or you’re profiling data All that stuff is readily available And the real-time pipelines actually, right now when they run in GCP, they run as Spark streaming pipelines OK, with that, let me show you a few things that you can do with this So this is a very simple example of a pipeline that was deployed here, where you’re essentially taking data from various different sources, joining them together, and writing it into BigQuery This pipeline, I think, has been scheduled to run You get operational metrics on this pipeline You’re able to see how this pipeline is behaving over a period of time All of this data that I’m showing you, all of the stuff that I just did from, sorry, from the UI, you are able to do using REST APIs Every single aspect of this is available through publicly
available REST APIs So if you’re looking for something on this dashboard, it’s actually available as a REST API for you to pull so that you can integrate with your internal systems if you would like to So, now, if you see here, what’s happening is, when you ran this pipeline, actually, this pipeline is using a Dataproc profile that is automatically created for you So there is this notion of provisioners and profiles and profiles are based off of provisioners So here is a profile that the system automatically creates for you And you can see, using this profile, all the different pipelines that are being executed and how much time they take from an execution perspective So you can define different profiles for different kinds of workloads You know, if you’re doing large migration, you want to be able to say that you’re doing heavy ETL transformations You should be able to specify a profile for that So something that can be controlled by administrators, but also, if needed, that users themselves can change them if they have the right permissions So you have also the ability to look at every run of the pipelines So this is just one execution of the pipeline So it can go back in time So we do keep, for long periods of time, all of this data in Data Fusion And one important thing, so this pipeline could be scheduled, right? 
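Since Data Fusion instances expose the open source CDAP REST API, the run history shown on that dashboard can be pulled programmatically. A hedged sketch: the path shape below follows the CDAP v3 API, but the default workflow name and exact routes are assumptions you would verify against your own instance:

```python
def pipeline_runs_path(namespace: str, pipeline: str) -> str:
    """Build a CDAP-style REST path that lists run records for a deployed
    batch pipeline (workflow name assumed to be the usual default,
    'DataPipelineWorkflow')."""
    return (f"/v3/namespaces/{namespace}/apps/{pipeline}"
            f"/workflows/DataPipelineWorkflow/runs")

# Usage sketch (not executed here): given your instance's API endpoint and
# an OAuth token, you could fetch run history for your own dashboards, e.g.
#   requests.get(api_endpoint + pipeline_runs_path("default", "gcs-to-bq"),
#                headers={"Authorization": f"Bearer {token}"})
```

The same pattern covers metrics and logs: anything visible in the UI maps to a REST resource you can poll from internal systems.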
This pipeline can also be triggered by events generated from the execution of a previous pipeline. So I can say: trigger this run of the pipeline only when the previous pipeline successfully completes, or when it has just completed, or when it has failed. You have the ability to do that. And think about use cases where many different teams are involved in building a much larger, organization-wide pipeline: one team is responsible for bringing data in from external sources, standardizing it, and landing it in a staging area; from there, an analytics team picks that data up, runs their analytics, does some aggregations, and writes the results into BQ; another team could be taking it from there. Now, information needs to be passed from the execution of one pipeline to the next, for example, where you landed the data in the staging area. With tokens that are passed between these pipelines, you’re able to do that, and you can pick and choose which tokens you want to pass. In this example, you can pass plugin properties, you can pass runtime arguments, and you can even pass compute configs. So this is just a simple example where you’re doing all of the different joins; you can build far more complex pipelines. I want to show you guys one other pipeline. This is a real-time pipeline, which is not running at the moment, though it should be. It’s a very simple real-time pipeline that watches for data arriving on GCS. As soon as the data arrives, it parses and cleanses the data and sends it to data profiling, which profiles that data for whatever window you have specified, right?
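The hand-off the speaker describes, a downstream pipeline triggered by an upstream pipeline's completion and receiving selected tokens, reduces to a small amount of plain data. A sketch under stated assumptions: the field names (`upstream_pipeline`, `on_states`, `propagate`) and the helper below are illustrative, not the product's actual trigger schema:

```python
# Illustrative trigger declaration: run the analytics pipeline only after
# the ingest pipeline COMPLETED, propagating the selected token kinds.
trigger = {
    "upstream_pipeline": "ingest-to-staging",
    "on_states": ["COMPLETED"],        # could also trigger on "FAILED", etc.
    "propagate": ["runtime_arguments", "plugin_properties"],
}

def downstream_args(trigger, upstream_tokens, own_args):
    """Merge the tokens the trigger chose to propagate (e.g. the staging
    location where the upstream pipeline landed data) into the downstream
    pipeline's own runtime arguments."""
    merged = dict(own_args)
    for kind in trigger["propagate"]:
        merged.update(upstream_tokens.get(kind, {}))
    return merged

args = downstream_args(
    trigger,
    {"runtime_arguments": {"staging.path": "gs://staging/2019-04-10/"}},
    {"output.dataset": "analytics"},
)
```

This is the mechanism that lets the analytics team's pipeline find the staging area without hard-coding it: the ingest pipeline publishes the path as a token and the trigger carries it forward.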
For example, if you specify a 15-minute or one-hour window on the incoming data, it profiles that window, and the profile information is written to Pub/Sub. All the good data gets written to BigQuery, and any records with errors caused by bad data get separated out and written into GCS. So this is a real-time program and, of course, it’s a micro-batch; when we integrate this with Dataflow, you will have true streaming capability and you’ll still be able to do this kind of thing. So those are very simple things. Here is how we are taking it to the next level. This is an example of a use case where you want to move data but have no idea of its risk profile, and you want to make sure that if there is any sensitive data, you exclude it before you send it to, let’s say, your final data store, right? This is something we are going to make available soon, in Q2: integration with DLP. If I want to integrate with DLP, I just drag the DLP node in, fill in the configuration, and the DLP work is done, right? Now, you might ask: how do we ensure that every data set I have always gets passed through DLP? Think of those two nodes: you are essentially creating different kinds of data sets. For example, this is a GCS source whose data is being sent to, let’s say, DLP.

So you will have the ability to create templates. You take GCS and you take DLP and you create a template out of them, and what that allows you to do is ensure that anytime anyone uses that template, it will always apply the DLP transformations, filtering, or tokenization. That is a much easier way for you to [INAUDIBLE] things. This is a very simple filter example; the filter only removes data, but we’ll be adding more capabilities, such as being able to tokenize fields. Right now you can say, with some confidence threshold, for example even if the filter confidence is low: anytime credit card numbers get identified as sensitive data, filter them out. So I can pick and choose a bunch of different things, and soon I will also be able to tokenize them and encrypt them, a bunch of things like that. So this is all great. This was all about building the pipelines; I showed you how you can operationalize them and what metrics are available, and logs are available in the same fashion. But the most important thing from an operational perspective is: can we monitor all of the pipelines in one place? So you have the ability to monitor, and there are more things that will be enabled soon. This view looks at all of the pipelines that are running on an hour-by-hour basis, and you are able to look at each individual pipeline and how it was executed: what happened, and how much time did it take? In fact, we also include things like start delays. What ends up happening is, we have a lot of customers who schedule their pipelines to run at, let’s say, 2:00 in the morning, all independently of each other. The capacity within that project gets hit because everything starts around the same time. So how do you avoid that?
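The confidence-based filtering demoed here is backed by Cloud DLP, but the decision rule itself is simple to illustrate. A stand-in sketch: a real pipeline would receive `info_type` and likelihood findings from the DLP inspection service, whereas here the findings are passed in by hand, and the function names are invented for the example:

```python
# DLP-style likelihood levels, ordered from least to most confident.
LIKELIHOODS = ["VERY_UNLIKELY", "UNLIKELY", "POSSIBLE", "LIKELY", "VERY_LIKELY"]

def filter_sensitive(records, findings, min_likelihood="POSSIBLE"):
    """Drop any record that has a finding (e.g. a CREDIT_CARD_NUMBER hit)
    at or above the configured confidence threshold. `findings` maps a
    record index to (info_type, likelihood) pairs, roughly the shape of
    result a DLP inspection step might hand back."""
    threshold = LIKELIHOODS.index(min_likelihood)
    kept = []
    for i, record in enumerate(records):
        flagged = any(LIKELIHOODS.index(lk) >= threshold
                      for _, lk in findings.get(i, []))
        if not flagged:
            kept.append(record)
    return kept

rows = [{"note": "order #42"}, {"note": "card 4111-1111-1111-1111"}]
clean = filter_sensitive(rows, {1: [("CREDIT_CARD_NUMBER", "VERY_LIKELY")]})
```

Setting `min_likelihood` low, as the speaker suggests, errs on the side of dropping data; tokenization and encryption, mentioned as upcoming, would replace the flagged values instead of removing whole records.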
So this gives you, one, a bit of a futuristic view, and two, an idea of how much time it takes to start something up when resources are not available. You can go into the future and look at what jobs are scheduled in each hour, and what that means from a capacity perspective for running them. So you get visibility into those kinds of things, which makes life extremely easy from an operations perspective. One last thing before we go into the Q&A session is the metadata that we talked about. So I can search; this is all in the context of integration, right? And I know you guys will have lots of questions about how this integrates with Data Catalog, so let me give you the answer before you ask. Both are beta products right now, but at some point we will be making a read-only version of all of this metadata available in Data Catalog, number one. Number two, we will continue enhancing the experience of solving data integration problems in Data Fusion, so that will still remain here. Things like deep lineage, field-level lineage, a bunch of things that are very, very relevant to integration, will still be available here. So this is the data set I was looking at from the join pipeline. I can see the schema here, and we can attach business tags; I’m just going to say “doubleclick” here, and I should be able to immediately search based on that. You can look at lineage at a data set level, but the interesting thing is, you can look at lineage at the individual field level. So you are able to figure out what happens to individual fields as the data is being ingested. For example, if I’m looking at the landing page URL, it comes from the landing page data set; the referrer URL comes from the impression data set; the advertiser field comes from the advertiser data set. But the interesting thing is: what operations were applied as the data was being moved? It went through different joins, it was read from the source. Our goal is to write this out in plain English text, so that you can just read it to see how your data is moving. That is the goal; we are starting to do that, but there’s a long way to go in terms of making it full English text. Still, you’ll get an idea of where we are going with this. Just being able to see where the data has come from and where it is going gives you a lot of flexibility. [MUSIC PLAYING]