Building the BBC GoodFoods Graph (Neo4j Online Meetup #51)

– Hello everybody, welcome to the Neo4j online meetup. I think this is number 51. So welcome to everybody. If this is your first time joining us and you have any questions while you’re watching live, you can ask those in the chat on the right-hand side. I’ll be watching that chat, so I’ll answer any of your questions there. There’s going to be a bit of code in this presentation, so you want to make sure that you’ve got your resolution at 720p or higher. And what else do we need to do? If you’re watching afterwards, we have the community site, so if you have any questions, don’t forget that you can always ask those on there as well. And I guess if you haven’t attended one of these before, my name’s Mark, I work in the developer relations team at Neo4j, and today we’re going to be talking about the BBC GoodFoods Graph. And I’ve got my colleague, who you can just see to my right, though to the left on the camera. So Lou, I guess what most of our guests on the online meetup do is just give a quick introduction to yourself and maybe a few sentences on how you got into Neo4j in the first place.

– Hi everybody, I’m Ljubica Lazarevic and I’m one of the field engineers here at Neo4j. So how did I get into Neo4j? In a previous life I was looking at master data management and, effectively, the usual: how do we look at the lineage, how do we join the various parts of our data model. And somebody suggested, hey, have you seen Neo4j? And I thought, no, I haven’t. And I looked into it and it just seemed like an absolute fit for that. And it’s basically all just been going on from there. And I find it an absolutely fascinating space.

– Cool, and so I think you’ve been working at Neo4j for, I guess, a bit under two years?
– Yeah, yeah, just over one and a half years now. So yeah, it’s been very good, very interesting.

– Cool, so Lou works on our pre-sales team in London. And I guess it was a little bit under a year ago that we were talking about an interesting dataset that we could play with. I guess I’ll let you take over this story from there. This is a bit of a long-in-the-making meetup.

– Oh it’s great, it’s fantastic. Yeah, so we spend a lot of time talking about different, really great use cases with Neo4j, and really common things come up, like recommendation engines: how do we effectively do the different kinds of collaborative filtering and that kind of thing. Entity resolution: how can you resolve potentially similar things to be the same with some degree of certainty? Master data management, which is why I got into Neo4j to begin with. And then we started thinking about things like graph analytics: what kinds of clustering we can do and that kind of thing. And it’s always great to be able to experiment and work with these different things. And I was sort of thinking, what dataset could we use where we could start to test out these different elements?
And it dawned on me that actually the BBC Good Food website seemed like a brilliant idea. So you’ve got a number of recipes, you’ve got ingredients. Because you’ve got different contributors, you’re going to have the situation where you’ll have cherry tomatoes, tomato, tomatoes. So you’ve got these entity resolution options, you’ve got ways of clustering things out. When we start thinking about the algorithm side, we can start to think about what the common core components of a cake are, for example. So you’ve got a lot of really interesting interconnected data that you can apply a lot of these use case elements to. And I thought it would be a fun dataset to work with.

– Yeah, so I guess maybe we’ll switch over to your screen now. So when we first started researching this dataset, I guess step one was we wondered, do they have an API? That was our hope. There’s got to be some sort of API that will let us get hold of some of this data. Sadly there isn’t, and basically, like everywhere on the internet, everyone who has attempted to do something with this data ends up scraping it in some way or other. And so I started working on this scraping approach. I think it was in May of last year when Lou first suggested it to me. I was using Python’s Beautiful Soup library. I was like, oh cool, I’ll just build myself a little scraper and I’ll pull out the ingredients, maybe I’ll get the recipe title and perhaps some of the other information on there. And once I started doing that, I realized that it was actually quite tricky to get the actual ingredient itself. So you can maybe see on this diagram, some of the ingredients are underlined: for example, butter, and flour, and dark chocolate, and eggs. But if you take another example, ground almonds is an ingredient and that one’s not underlined.

And so it’s quite tricky to figure out what the actual individual ingredients are. So I started off down a massive, massive tangent where I was attempting to train a machine learning model to try and figure out what the ingredients were in the boxes on each of these pages. And then Lou came up with a much simpler way to do it that actually worked. So back to you again.

– [Lou] So I don’t know what happened. I think a few weeks ago I thought, let’s look back into this. And I was just having a look through the source code of one of the recipes, Seriously rich chocolate cake here. And I spotted this rather cute little JSON object here and I went to Mark and went, oh hello, I think we may have a nice way of being able to pull this. So, it’s a bit small. Don’t worry, we’re going to move onto a page where we’ve expanded this out a bit bigger. But basically, we spotted it and thought, brilliant. I’m just going to shift over here. So this isn’t the entire object here, but you can start to see we’ve got lots of really useful pieces. We’ve got information about the author, the description of the recipe. We’ve got nutrition information. You can see here, we’ve got the ingredient objects. And this is really conveniently packaged in a JSON object. So what we’ve done is we’ve basically downloaded a load of recipes, cut out that JSON object from each one, and placed those into a flat file. And this is a really convenient way for us to start thinking about loading this into a graph.

– [Mark] Effectively, we’ve got a file containing a stream of JSON objects, if you like. I think it’s one per recipe, right?

– [Lou] Yes.

– [Mark] So one per page?
Per recipe. We’ll see it in a little bit when we import the data. I’ll share the link to where we’ve got that JSON in case you want to play with it yourself. So we’ve got a question from Jamal Thorn: how did you extract that JSON object from the code? So I’ve got a Python script that I wrote that does it. We basically scraped all the HTML of the page and found where in the page the object was. It’s passed in a piece of JavaScript. You can see it’s trying to add an object: it’s doing permutive.addon with 'web', and then it’s passing an object into some sort of function somewhere in the BBC code. So we grabbed effectively the second argument to this function and put it into a Python script. And, I forget which, there was a library we used to decode it from JavaScript into a Python dictionary, and then we printed it back out again into our file as a JSON object. So it’s nearly JSON, but it doesn’t quote the keys correctly; as you can see, it needs to have double quotes around those to be valid JSON. It’s literally five lines of code. But this was significantly quicker than our previous attempt. So I guess we’ll go back again to our other page. Nearly every page has those. I think there were a few that didn’t, so we just put some exception handling code in there, and if a page didn’t have it, we just skipped the recipe. But there were maybe single figures that didn’t have it. Now, once we’d got that data, we were at the point where we needed to figure out, okay, how are we going to build a graph from this?
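A minimal sketch of the extraction step Mark describes, not the actual script from the talk. It assumes the object appears as the second argument of a `permutive.addon('web', ...)` call, and it replaces the unnamed decoding library with a simple regular expression that adds double quotes around bare keys. The sample HTML and its field names are made up, and the key-quoting regex would break on string values that themselves contain text such as `, word:`.

```python
import json
import re

# Hypothetical page fragment: the call name and the field names are assumptions.
SAMPLE_HTML = """
<script>
permutive.addon('web', {page: {title: "Seriously rich chocolate cake",
  recipe: {skill_level: "Easy", ingredients: ["butter", "dark chocolate", "egg"]}}});
</script>
"""

# Capture the object literal passed as the second argument of the call.
CALL_RE = re.compile(r"permutive\.addon\('web',\s*(\{.*?\})\s*\);", re.DOTALL)
# A bare key is an identifier that follows "{" or "," and precedes ":".
BARE_KEY_RE = re.compile(r'([{,]\s*)([A-Za-z_]\w*)\s*:')


def extract_recipe_object(html):
    """Return the embedded object as a dict, or None if the page has none."""
    match = CALL_RE.search(html)
    if match is None:
        return None  # mirrors "we just skipped the recipe"
    quoted = BARE_KEY_RE.sub(r'\1"\2":', match.group(1))
    return json.loads(quoted)


def to_json_line(recipe):
    """Serialise one recipe as a single line for the flat file."""
    return json.dumps(recipe)


recipe = extract_recipe_object(SAMPLE_HTML)
print(to_json_line(recipe))
```

Writing one `to_json_line` result per recipe to a file gives the stream of JSON objects, one per line, that the import later reads.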
So when you’re trying to figure that out, your goal in the end is, I want to get this data into Neo4j. And Lou’s now put us on a page here where you can see it a bit more zoomed in. At the moment, we’ve got it as this JSON structure. We need to figure out how we’re going to decompose or destructure that JSON into something in a graph, because we can’t just throw the JSON at the database. You’ve got to come up with a graph model for it.

– [Lou] And as with any good data problem, you need to think about how you’re going to model that data, and that’s very much going to be driven by the kinds of questions you’re looking to ask. So it’s very much question-centric. And I guess it’s probably worth talking about what kinds of questions we need to ask of this data, isn’t it?

– [Mark] So we had a few questions in mind, and I guess the thing to keep in mind when you’re building these models is that you don’t necessarily need to get it perfect the first time around. If we come up with some questions to build an initial model, and then after a couple of hours of playing with the data we decide, no, actually I want to ask some different questions, and the model doesn’t quite work,

it’s reasonably easy to go and refactor that model to do what we want. I noticed we got another question, so I’ll address that now: is this project in GitHub now? Javier, we’ve got the script that scrapes the data in GitHub and we’ve got the data in GitHub as well. Actually, the queries and the loading are also available as a browser guide that you can run in the Neo4j browser, but I guess we’re slightly ahead of ourselves, so we’ll show you where that is when we get there. So what we’ve got on the screen now is a tool called Arrows. We’ll share the link for that in the chat as well. But what we want to do is go, okay, from that JSON, what graph model do we want to come up with? And the first place to start is, we want to create some nodes. So in a graph model, nodes represent the main bits of data, like our entities, if you’re used to that type of modeling.

– [Lou] Or nouns.

– [Mark] Nouns, yeah, they’re nouns. If you’re better at English than me, then look for the nouns in the data. And so the most obvious one here is that we know we’re looking at a recipes dataset, so I guess we probably have a recipe of some type. This whole thing is a recipe in some respect. So let’s say we’ve got a node for the recipe. And we might say, okay, I’m going to have...

– [Lou] What kind of ingredients?

– [Mark] I was going to start with something even simpler. So maybe we’ve got a title. Let’s just call this chocolate cake, just so I don’t have to do too much typing. There are sort of two ways that you can do this. Either you can do the relational-style entity modeling, where you just model the types of things, or you can model actual individual bits of data. And which you prefer is kind of based on what you’re familiar with. I don’t know, what about you, do you have any better guides than that?
– [Lou] So I think usually the purist view will always be that you’re probably going to be working with metadata in the end. But it’s really hard to work with metadata straight away because it’s a level removed. You’re thinking quite conceptually there. And it’s a lot easier to be thinking, I’ve got a chocolate cake, and to make this chocolate cake, it’s going to have flour, and sugar, and eggs. When you’re starting out and you’re thinking about the problem that you’re trying to solve and the kinds of questions, it’s significantly easier to start from that stage rather than abstracting away and going, well, I have a recipe and the recipe has ingredients and that kind of thing. So generally, certainly when I start modeling, I always think, let’s work with the physical data that we have and reason around that. And then once we’ve drawn up, for example here, the chocolate cake, and the chocolate cake has flour, and it has sugar, and it has cocoa, we’re then in that situation where we can step back and go, well, okay, what are the actual components, what are the entities that we’re talking about? So I think, generally, that’s my approach.

– [Mark] If we continue from where Lou had taken us, we said, hey, we know that we want to have some ingredients. So let’s just copy those ingredients in there. Oh, it’s now a bit wide, I guess. Maybe I can put a space in there somewhere so it wraps around the line. So one approach would be to say, okay, I’m going to put the ingredients as a property on my recipe. So I could have an array property. That would be a valid thing to do. And that would be perfectly fine if our goal was, I want to just query the recipe and show the ingredients on a page somewhere. That would be quite cool. But as Lou pointed out, one thing that we would like to do is recommendations. And in here, 
sort of content-based recommendations. So hey, show me another recipe that has similar ingredients. And that’s quite difficult with this model. We can’t go, hey, I want to find another recipe that has ground almonds. I’d have to go and scan every single recipe and look through all those arrays to find my ground almonds. So that’s not such a good model for the questions that we want to try and answer. So we might back out of that suggestion quite early. And instead we might say, okay, I like to think of it as querying through something: if you want to query through those ingredients to find other recipes that might have that ingredient, I want to have it as a node. And the name of the relationship: I think in our model we’ve called it contains ingredient, but any sort of variant on that which explains what the relationship is, is fine. So we’ll create a contains ingredient relationship. And on this side, we’re going to put in the label of this node. So a label is a way of tagging or categorizing a node. It’s sort of grouping nodes so that we can find similar nodes as we want to

using the query language, which we’ll get into in a minute. And in this case, let’s just put one ingredient. So we’ll give it a name. Let’s say this one’s called butter. And we’ll save that to our model. And maybe we’ll do just one more, just so you can see how this process works. Apparently my spelling is not great. Let’s have an egg. So we’ve got an egg and we’ve got butter. And then we’ll put in contains ingredient on that. And if we had another recipe, and I’ll make one up for this example, we could then connect that. So let’s say we’ve got another recipe that has egg. So we can create another relationship over here. Let’s just reverse that, and let’s say this recipe is called scrambled eggs. It must be a really fancy scrambled eggs if it’s counted as a recipe. We’ll see later on, I actually have no idea what I’m talking about when it comes to eggs. I assume scrambled eggs is reasonably easy to make. I have to rely on Lou for the expert knowledge there. So now we’ve got a recipe, we’ve got some ingredients. So I think that’s reasonably good. That will probably help us to answer our content-based recommendation question. Shall we pick maybe one more thing on here? What would be another good thing?

– [Lou] So, author, you know, maybe...

– [Mark] Author, yeah.

– [Lou] They’ve got a fantastic recipe and we might want to know about other recipes that they’ve composed.

– [Mark] Yeah, so again, we’re probably going to end up with a node representing the author. So let’s put one of those in. I mean, that name seems okay, it indicates it’s an author. But I suppose you could’ve also gone with something a bit more generic and said, hey, recipe is written by, or created by. I think we went with author in our domain, so let’s go with that. We’ve got a partial question; I’ll let that person finish it off. What was the name of the author for this one?
– [Lou] Oh gosh.

– [Mark] You see it?

– [Lou] Author is Good Food.

– [Mark] Ah, so this is actually one that is written by the BBC’s content team. Wouldn’t it be better to have a property on the recipe? Maneesh, you’ll have to indicate what you’re referring to when you suggest a property. Metesh has asked the question, does this have a very specific application or is it just for visualization purposes? Yeah, so once we get on with it, you’ll see that the actual reason for loading this data into a graph is so that we can do certain types of queries that are quite hard to do when the data is just flat and treated as JSON. In terms of this tool here, this is mostly used as an early prototyping or modeling tool. And there is actually a way of exporting the Cypher that you create here and putting that into Neo4j, so it can create a graph for you from your diagram. But yeah, in terms of this meetup, we’re just using it as a tool for showing how you would go about modeling the data.

– [Lou] And we’re basically giving you insight into the rationale behind the approach we used to come up with the data model and how we’re going to read the data into Neo4j. It’s just to give insight into that.

– [Mark] Okay, Maneesh has asked the second part of the question now: the list of ingredients as strings instead of multiple repeated contains ingredient relationships. So I think you’re suggesting, why do we not store the ingredients as a property on the recipe? The reason is that we want to query through those individual ingredients to find other things that might have those ingredients. So you can kind of see on the bottom of this diagram, if we want to find things that have eggs in, or if we find a recipe, we can say, show me all the other recipes that have these ingredients. And it’ll show me the ones with the most ingredients in common. It’s much easier to do 
that if we have that as a node, because we can easily write a query that makes use of that relationship and sort of say, okay, take me from my chocolate cake recipe, find the ingredients that I have, now find other things that have these ingredients. If we chose to store it as an array, we’d have to write a query that said, find me this recipe, get me all the ingredients, and that’s an array, and then go through every other recipe and look through each of their arrays and check every element of their array against my original recipe’s array to see whether any of those things exist. So first of all, it’s more computationally expensive, but it’s also kind of unwieldy to write that query. And I guess you would say it’s not really using the graph if you’re modeling it that way.

– [Lou] And I think to some extent it comes back to what the questions you’re asking are. And the questions that you’re asking are going

to dictate what that data model looks like. So just to give an example, let’s for argument’s sake say we were building a recipe book or something like that. And in this recipe book, the things we care about are the authors of the recipes, and we care about what kind of mealtime the recipe is for. But actually we never ask about the ingredients, because we only ever query the ingredients when we actually bring back the recipe. In that scenario it would make very good sense to have the ingredients as properties, because we’re never doing any queries that are ingredient-centric. But in this example here, because we’re looking to do some level of collaborative filtering against ingredients, it makes sense to pull these out so they become nodes in their own right.

– [Mark] So I guess one that we haven’t pulled out as a node in this example is skill level. I think they give quite coarse skill levels. But imagine if you had really fine-grained skill levels, like if it was a set of recipes for a budding chef to learn.

– [Lou] Yeah.

– [Mark] Maybe that would be a more interesting one. You could say, hey, I’m at this level of some sort of chef training, show me what recipes are in my range. Then maybe that skill level makes sense to pull out.

– [Lou] Yep.

– [Mark] So we’ve got a couple of other questions, so I guess let’s address those. And I think that probably is enough for the modeling. But we can kind of do the same exercise for figuring out what we want to do with keywords. What else have we got on there? Courses, cuisine. Channel, I guess. And there are, um, collections, and so you’ll see what we came up with. Servio asks, can Neo4j handle billions of nodes?
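The trade-off between ingredients as an array property and ingredients as nodes can be sketched in plain Python. This is only an analogy for the two graph models, not Cypher and not the queries from the talk; the recipes are made-up sample data. The first function scans every other recipe's list, and the second follows an ingredient-to-recipes mapping that plays the role of the ingredient nodes and their relationships.

```python
from collections import Counter, defaultdict

# Made-up sample data, not taken from the real dataset.
RECIPES = {
    "Chocolate cake": ["butter", "egg", "flour", "dark chocolate"],
    "Scrambled eggs": ["egg", "butter"],
    "Tomato salad": ["cherry tomatoes", "olive oil"],
}


def similar_by_scanning(recipes, title):
    """Array-property model: compare against every other recipe's list."""
    wanted = set(recipes[title])
    scores = Counter()
    for other, ingredients in recipes.items():
        if other == title:
            continue
        shared = wanted.intersection(ingredients)
        if shared:
            scores[other] = len(shared)
    return scores.most_common()


def build_ingredient_index(recipes):
    """Node model: each ingredient points back at the recipes containing it."""
    index = defaultdict(set)
    for title, ingredients in recipes.items():
        for ingredient in ingredients:
            index[ingredient].add(title)
    return index


def similar_by_traversal(recipes, index, title):
    """Walk recipe -> ingredient -> other recipes, counting shared ingredients."""
    scores = Counter()
    for ingredient in recipes[title]:
        for other in index[ingredient]:
            if other != title:
                scores[other] += 1
    return scores.most_common()


index = build_ingredient_index(RECIPES)
print(similar_by_scanning(RECIPES, "Chocolate cake"))
print(similar_by_traversal(RECIPES, index, "Chocolate cake"))
```

Both functions give the same ranking, but the traversal only touches recipes that actually share an ingredient, which is the point of querying through the ingredient.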
And I’d just say, yeah, we’ve got people who are working with databases of those sizes. So yes, it is possible to do that. John Gorman: could you model ingredient nodes the other way around? Node, butter; label, ingredient?

– [Lou] I see what he’s saying. Yeah, so just to clarify here, when you see the name inside the node, inside the circle, that’s the label. So the label here is ingredient, and this is the property. Maybe it looks a bit counterintuitive in Arrows, but absolutely, John, it’s what you’re saying here. If I double click here, you’ll see the caption. So caption is synonymous with label. So the label here is ingredient and the property here will be the key-value pair of name to butter.

– [Mark] Okay, cool. Now our last question, which is what we’re going on to on the next slide, is: can we do an automatic import from the obtained JSON to Neo4j? So we can go from JSON to Neo4j, but as I said, this is like the first step. We’ve got to say, okay, what should I do with that JSON? Like, what projection do you want me to do from that JSON into a graph model?
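As a rough illustration of what such a projection means, the sketch below turns flat recipe records into labelled nodes and typed relationships in plain Python. The field names (`id`, `title`, `author`, `ingredients`) and the relationship names `WROTE` and `CONTAINS_INGREDIENT` are assumptions chosen for the example, not the names used in the real import script.

```python
def project(recipe, nodes, relationships):
    """Add one recipe's nodes and relationships to the growing graph."""
    recipe_key = ("Recipe", recipe["id"])
    nodes.setdefault(recipe_key, {"id": recipe["id"], "title": recipe["title"]})

    # The label is Author; the identifying property is the author's name.
    author_key = ("Author", recipe["author"])
    nodes.setdefault(author_key, {"name": recipe["author"]})
    relationships.add((author_key, "WROTE", recipe_key))

    for name in recipe["ingredients"]:
        # Label Ingredient, property name -> "butter", as in the Arrows diagram.
        ingredient_key = ("Ingredient", name)
        nodes.setdefault(ingredient_key, {"name": name})
        relationships.add((recipe_key, "CONTAINS_INGREDIENT", ingredient_key))


def project_all(recipes):
    nodes, relationships = {}, set()
    for recipe in recipes:
        project(recipe, nodes, relationships)
    return nodes, relationships


# Made-up records standing in for lines of the flat file.
SAMPLE = [
    {"id": "1", "title": "Chocolate cake", "author": "Good Food",
     "ingredients": ["butter", "egg"]},
    {"id": "2", "title": "Scrambled eggs", "author": "Good Food",
     "ingredients": ["egg"]},
]

nodes, relationships = project_all(SAMPLE)
print(len(nodes), "nodes,", len(relationships), "relationships")
```

Because nodes are keyed by label and identifying value, the shared egg and the shared author each appear once, with both recipes connected to them.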
So this is like the first step of that process. And so I guess you’ve hopefully got enough to get the idea if you wanted to continue this and add some more to this model. But now we’re going to have a look at the import. I guess I’ll let Lou talk through a little bit of the import approach.

– [Lou] Sure. So what we’ll highlight here isn’t the complete import statement that we run to get the data into Neo4j. This is just a small portion, and as Mark said, we have the scripts available both in the Medium blog post that we put out and in the GitHub repository. So we’ll talk a little bit about what we’re doing here in the script. First and foremost, to be able to run these scripts, you will need to have multi-statement enabled in your Neo4j browser. We will show you how to do that in a few minutes, so don’t worry: when we’re going through the queries we’ll show you where in the browser you need to set that. But once you get that set, effectively what that means is you can execute multiple Cypher queries within the browser. So what we do is we set a number of indexes. We’re creating indexes on certain properties within our nodes. And the reason why we’re doing this is that we’re going to be importing a lot of data and we’re using the MERGE keyword, so it just means it’s going to speed the whole process along. And as you’ll spot here, we’ve got the MERGE keyword, and what MERGE does is it goes away and has a look at the item that we’ve specified, which is what we’re saying is going to be the unique key for the node. So for example here, we’re saying the ID is the unique property for recipe. And what we do with merge, merge will basically go away.

It will go and have a look. So if we’ve got indexes, it will check the indexes to see, does a node of type recipe have this ID? If it doesn’t exist, it will go away and create that new node. If it does exist, it will do nothing. So that’s why we put indexes in. It speeds up that performance. So we do that, and the next bit I will flag here is we’re setting a parameter, the params JSON file, and what we’re doing here is giving a link to the flat file on the web. It’s where we’ve got that file of all those JSON objects. And the reason why we’re doing this is because when you look at the full import script, you’ll notice we reimport the file several times. And the reason why we do that is we’re trying to make this query as efficient as possible, and this avoids this concept of an eager query. So very briefly, an eager query hangs on to and keeps in memory all the local references of the nodes that we have while we’re importing the data, because it’s trying to hold on and say, well, I might need to keep this because I need to add the relationship somewhere, or I’ll need to connect something in some way. And we don’t really need to do that for this data. So this is why we’ve taken this approach. Mark has written a very good blog post about eager queries and approaches you can take to reduce their eagerness. And I think we also sent the link out for that, so you’ve got sight of that as well. But that’s basically what’s going on here.

– [Mark] One question from Mita is, can we explain the indexes that we’ve got at the top in a bit more detail? So the question is, are we indexing it to make the queries faster? If yes, why are most of them indexed?
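The create-only-if-missing behaviour Lou describes, and the reason an index helps it, can be imitated with a toy in-memory store. This is a sketch of the idea only, not how Neo4j implements MERGE or its indexes: a Python dict stands in for the index, and the class and attribute names are invented for the example.

```python
class RecipeStore:
    """Toy store that merges recipes by id, with or without an index."""

    def __init__(self, use_index):
        self.nodes = []
        self.index = {} if use_index else None
        self.comparisons = 0  # id comparisons made while scanning

    def merge(self, recipe_id):
        """Return the node with this id, creating it only if it is missing."""
        if self.index is not None:
            node = self.index.get(recipe_id)  # direct lookup, no scan
        else:
            node = None
            for candidate in self.nodes:  # scan every existing recipe
                self.comparisons += 1
                if candidate["id"] == recipe_id:
                    node = candidate
                    break
        if node is None:
            node = {"id": recipe_id}
            self.nodes.append(node)
            if self.index is not None:
                self.index[recipe_id] = node
        return node


scanning = RecipeStore(use_index=False)
indexed = RecipeStore(use_index=True)
for recipe_id in range(1000):
    scanning.merge(recipe_id)
    indexed.merge(recipe_id)

print(scanning.comparisons, indexed.comparisons)
```

Without the index, each new recipe is compared against all earlier ones, so the work grows with the square of the number of recipes imported; with the lookup structure, merging an existing id finds it directly and creates nothing.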
In this case, the indexing there is actually a part of the import process. I guess we can maybe talk through one of the examples. So for example, this merge that you can see, merge our recipe. MERGE is a clause that says, hey, I want to create this thing if it hasn’t already been created. And you see on the first line, we create an index on that recipe. What it means is, when we go and check, hey, does this recipe exist for this ID, we can just go to the index and check whether the recipe exists, rather than having to scan through every single recipe. So this is quite a common thing that people run across when they’re importing data, because it will get slower and slower as you do more records. If we had no index on the recipes, the first one we put in is pretty easy, right? There’s nothing there, okay, great. For the second recipe, it’s going to go and scan: does my ID match against one recipe? By the time it gets to recipe 10,000, I’m checking 10,000. When I get to 20,000, I have to check against 20,000. Whereas if we put the index on, we’ve got a log n time check of whether this recipe already exists. So in this case, all of those indexes are being used because those are the properties that we’re using as part of the import. More generally, separately from the import, you would put indexes on properties that you’re going to start your queries from. So if one of our queries was, hey, I want to find all the recipes with eggs, then it would be reasonable to put an index on ingredient name. If every one of my queries was, hey, find me all the recipes with the name chocolate, maybe I want to have an index on recipe title or recipe name. We don’t actually have that at the moment. We’ve got another question here. Well, actually, we’ve got a few questions. Maybe we’ll go from the top. John clarified a question from the previous slide: could you make the label 
butter and the property ingredient? So the butter is always butter even if it’s not used as an ingredient. In this dataset, it is always used as an ingredient, but yeah, if you had a more generic dataset, where sometimes it was used as an ingredient and sometimes it was used as something else, that would be valid. One way of doing that would be to come up with a more generic label, a more generic way of describing what butter is. Perhaps you’d define it by the family of things it comes under. That would be valid. Equally, another approach people sometimes take is actually to use multiple labels and tag the node with each of the roles that node’s playing. So if it’s playing a role as an ingredient, cool, we’ll tag it as an ingredient. If somewhere else it’s playing the role of, I don’t know, what would be a better example?

– [Lou] Okay, one extension we could do here: let’s say, for example, we wanted to do ingredient families. We’ll take that butter as an example.

So we know we’ve got butter, but actually it would fit into a hierarchy of cooking fats, and within that hierarchy of cooking fats we would have olive oil, sunflower oil, grapeseed oil, coconut butter, butter, et cetera. And we could say, well, we started off saying butter, olive oil, et cetera, were ingredients. But further on down the line, maybe we’ve done some analysis and said, well, actually, yes, they’re all ingredients, but they’re also all types of cooking fats. And then what we could do is add another label saying cooking fats. So you can add multiple labels, and like this we start to get different viewpoints and slices and hierarchies based on the kinds of questions we’re looking to ask.

– [Mark] So Maneesh had a question about the syntax of this query. I think it’s just because maybe the text is a bit small, but we do actually have curly brackets in there; as you suggested, you can’t use parens for the properties. And then Marco has a question: quantities and cooking methods, are they entities or relationships? So I think what you’re asking really is, are they properties or are they entities connected by a relationship? And in this case, I think we’ve just set quantities and cooking method as properties of the node. And then Maneesh observes that, yeah, we’re using the APOC library, so Awesome Procedures on Cypher. And there’s a load JSON procedure in there, amongst those for many other data formats. In this case we’re processing JSON, but you can also load data over JDBC, so you can get data from any relational database using that. You can load XML documents. I’m sure there are many others that I’ve forgotten. And then we’ve got another question from Ashven: is it possible to mention all properties in the merge statement instead of set?
Yeah, that’s another thing that people sometimes do when they first pick it up: put all the properties inside the curly brackets. It has a slightly different meaning from what we’re doing here. MERGE uses every single property that you put in the curly brackets as, effectively, the unique key of this node. So it will go and check, can I find a recipe with this ID, and this cooking type, and this preparation time, and this name, and this description, and this skill level, which as you can imagine is quite slow as your number of recipes increases, because it needs to go and check against every single recipe to make sure each of those values matches. Whereas in our case, we know that the recipe is identified by that ID property, so there’s no need to put all of them in there. So we’ve achieved the same thing, just more quickly. And one other thing, which Lou pointed out when we were talking about this just before we started, was that with the set command, you can also use...

– [Lou] On create.

– [Mark] On create, if you only wanted to apply the properties when the recipe gets created. And there’s also another variant called on match, which would only run if the node was already there. So you might want to use that if you said, okay, I want to keep track of the last time this recipe was found in my import script. So maybe we keep a timestamp or something.

– [Lou] Or a counter every single time somebody does a search for it.

– [Mark] Cool, so I think we’re probably good to go to the next bit. And we’ll share the links that Lou’s described as well. So yeah, I think we’re ready to go.

– [Lou] Go into the graph.

– [Mark] To the Neo4j browser. So we’re using an instance of the Neo4j sandbox here. And we’ve made a guide which contains everything from this session, in case you want to play around with it afterwards. You can type in :play recipes. I’ll show you real quickly what happens there. And it will give you a guided, sort of 
walk through tutorial that you can follow. So you can click on these buttons here. So this is the script. This is like the full script; Lou showed a little portion of it in the slide. So if we go to the top, you can see here we define the indexes. We create a parameter to indicate where our file lives, and this is where the JSON, the stream of JSON objects, lives. You’ll notice that we’re actually executing lots of queries all in one statement. And so if we go down to the little cog on the bottom left, by default, enable multi statement query editor is off, and so if you want to use that feature, you’ll need to turn it on. Otherwise it’ll tell you, hey, your Cypher has a syntax error. And so you can see this is quite nice.
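The overall shape of such an import script, including the on create and on match variants mentioned earlier, might look roughly like this. It is a minimal sketch, not the script from the guide: the $jsonFile parameter name and the value.id and value.name fields are assumptions, and the index statement uses the Neo4j 3.5-era syntax (newer versions use CREATE INDEX FOR ... ON ...).

```cypher
// Index on the identifying property so MERGE can look recipes up quickly.
CREATE INDEX ON :Recipe(id);

// $jsonFile is assumed to have been set beforehand as a Browser parameter.
// apoc.load.json yields one map per JSON object in the file.
CALL apoc.load.json($jsonFile) YIELD value
MERGE (r:Recipe {id: value.id})
// Runs only when the MERGE had to create the node.
ON CREATE SET r.name = value.name,
              r.createdAt = timestamp()
// Runs only when the node already existed, e.g. on a repeated import.
ON MATCH SET r.lastSeen = timestamp();
```

A plain SET after the MERGE would instead run in both cases.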

If you’ve got a script, previous to that feature you’d either have to manually feed your script in, or you’d have to use the Neo4j driver and feed it in like that, or you’d have to go and use the Cypher shell and feed it in like that. So this is a significant usability improvement for hey, I just want to load this data and I’m using this browser tool. I’ll just quickly show you what else we’ve got in here. So you can use a schema query. I guess maybe we’ll show how to run that. So just a quick schema query. So you see, this is what Lou referenced before. I guess the proper way of doing data modeling would be, actually, we need to work out what entities have we got, what relationships connect between them. And you’ll notice in that import query, there’s nowhere where we do like a hey, I want to create this schema and I want to make sure these properties exist. The schema kind of comes with the data. So as you put new data in, the schema evolves with it. So if we were to write another import script that added in some other node labels, we would see those appear on this diagram. And so you can kind of see our full diagram here. So we’ve got a recipe. A recipe is basically at the center of this universe. It connects to collection, and diet type, and author, ingredients– and then I said collections so I won’t be saying that again. And then once we’ve done that, we can run all sorts of queries on here. So I’m going to hand back to Lou to talk us through. But if you want to play with this yourself, :play recipes is the command. And you can run that on a Neo4j instance anywhere. Either you can use one– if you search for Neo4j sandbox, I can also put a link there while Lou is showing the queries. Or of course, you can do this on a local Neo4j, it will pick it up there as well. If you run a Neo4j desktop instance, you can do it there too. So Lou’s now going to just take us through what sort of queries can we do on this dataset. – [Lou] So got
some basic queries. So a top tip to all: what you can do if there’s certain queries that you’re working with or you run several times in your browser, when you type your query in here, you can press the star on the side, and what that will do is it will save it to favorites. So for example, let’s do call db.schema. And let’s say this is a query I like to run very often. So I’m just going to quickly give it a comment. And what we can do is we press the star here, and it will save that query. So when you click on the star here, you’ll see it’s saved my query. So rather than me having to type it again and again, I can just click it and it’ll bring it back. And I’ve used this approach here for today’s session and I’ve put in some queries here on the BBC Good Food. Many of these queries, again, you’ll see if you run :play recipes that Mark showed earlier. So, let’s start off with finding some recipes that contain chocolate. Who doesn’t want some chocolate? – [Mark] We can tell what your favorite food is, right. – [Lou] Can you tell my motivation?
It’s very simple here. So we’re creating a pattern. And for those of you who aren’t so familiar with Cypher, what we do is we’re creating a pattern. So you can notice here, we have this common feature of the colon, both in this part here, which is the nodes, and this part here, which is the relationship. And basically, everything to the right of the colon, it will either be the node label or the relationship type. And anything to the left of the colon, if there is anything, will be a variable that we’re creating. So basically what happens here is what we’re saying is match recipe, and whatever matches will have a variable placeholder for the node here called rec. And the question that we’re asking here is match recipes, and store those recipes that match into this variable rec, that have a relationship of contains ingredient with a node type of ingredient, and inside the curly braces, we’re stating some properties here. So here what we’re saying is, we want the ingredient to have the name of chocolate. And some of you may be more familiar with seeing– so if you were to use a where clause, then some of you may be more familiar with something along the lines of where i dot name equals chocolate, instead of using this bit here. So either way works. So this is the equivalent of the previous query. But here, we’re going to look for recipes that contain chocolate. And we’re going to return the name– so recipe name, that we’re storing in this variable rec, and the identifier. So if we run that, we’ve got some recipes here. Now, taking it a bit further, what’s caught my eye, what did catch my eye, I think it was the

there were some truffles. – [Mark] Yeah, it was the truffles. I think you like the truffle star cake. – [Lou] Yes, the chocolate truffle star cake. – [Mark] It’s a shame we can’t get pictures into the Neo4j browser, although this would be an extremely hunger inducing bit of Cypher. – [Lou] Maybe that’ll be next time, we can write an application that would show the pictures. – [Mark] Yeah, yeah. I don’t know if we’d actually finish it though. Oh, I’ve got to make a cake! – [Lou] So looking at this, what we’re going to do is I’m going to take the recipe ID for the chocolate truffle star cake. So I’m going to grab that. And I want to find out what the ingredients are. – [Mark] Yeah, ’cause you’re trying to check, do I have the ingredients in my kitchen to make this? – [Lou] I know, could you imagine the disaster if I go in there and discover I don’t have any vanilla or something. – [Mark] I don’t know, let’s see, what ingredients does this need. So it’s a little bit similar to before. – [Lou] Yep, absolutely. So same idea, we’re putting the recipe identifier in here as the property that we’re interested in. And again, what we’re doing, we’ve got the variable here, i. And we’re going to pull these back in again. What we’re doing is we’re going to return the ingredient names here and we’re going to order them. So I want them in alphabetical order, so just order by i dot name. – [Mark] So in this one, the i, does that matter? Or can it be anything you want? – [Lou] It can be anything you like. – [Mark] Okay, cool. – [Lou] I could’ve called it Dave, but I decided to be a bit more conventional and call it i. – [Mark] If there’s any Daves watching, they’d be quite happy to see their name in the query. Cool, so what have we got there?
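The queries described so far can be sketched roughly as follows. The Recipe and Ingredient labels, the CONTAINS_INGREDIENT relationship type, the id and name properties, and the $recipeId parameter are assumptions based on the spoken description; the exact names in the guide may differ.

```cypher
// Recipes containing chocolate, with the property stated inside curly braces.
MATCH (rec:Recipe)-[:CONTAINS_INGREDIENT]->(i:Ingredient {name: 'chocolate'})
RETURN rec.name, rec.id;

// The equivalent form using a WHERE clause.
MATCH (rec:Recipe)-[:CONTAINS_INGREDIENT]->(i:Ingredient)
WHERE i.name = 'chocolate'
RETURN rec.name, rec.id;

// Ingredients of one recipe in alphabetical order.
// $recipeId stands in for the identifier copied from the previous result.
MATCH (rec:Recipe {id: $recipeId})-[:CONTAINS_INGREDIENT]->(i:Ingredient)
RETURN i.name
ORDER BY i.name;
```

The variable names rec and i are arbitrary, as Lou says; any valid identifier would work as long as it is used consistently within the query.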
So we’ve got chocolate, that’s good. We’ve got chocolate and dark chocolate. – [Lou] And cocoa powder. – [Mark] Yeah, so I guess that points out– and white chocolate. This is like a thing that we’re not going to cover today, but as an interesting future extension, there’s loads of types of chocolate and there’s nothing in the dataset that indicates, hey, these things are actually all the same. – [Lou] Yeah, or that they belong to a family that all have got chocolate in there. So absolutely, that’s something we definitely want to look at soon. – [Mark] Before we go on to the next one, let’s just loop back because we’ve got a question from Leo Vlogs: can you explain very quickly what is an APOC library please? Leo, about two comments above where you asked that question, I’ve put a link to the documentation for APOC. So APOC is a plugin library for basically extending Neo4j with extra functions and procedures. I won’t attempt to summarize what is in it because there’s more than 400 or maybe even 500 different things now. But it kind of has stuff like data integration tools, data loading tools. It has utility functions for massaging data. It has data clean up tools. I don’t know what else. Can you think of anything else? – [Lou] Time to live, or virtual graphs. – [Mark] There’s lots of different things in there. If you use the Neo4j desktop, APOC is one of the plugins you can get by default. So there are three of them. There’s APOC, graph algorithms, and the GraphQL plugin. If you create a project in the Neo4j desktop you’ll be able to see in there, you can just one click install the APOC library. I’m going to put a link to a blog post that our colleague Jennifer Reif wrote, where Jennifer sort of talks through, step-by-step, how do you install plugins, and what the different plugins do. So that might be helpful. John asks, to your point about chocolate variations, is it easier to do that if chocolate is a node or property?
So this is around can we figure out that all those things are chocolate. – [Lou] I’ve done a bit of thinking around this and there’s a couple of ways to do this. In a purist sense, chocolate, white chocolate, dark chocolate, they’re all properties, because we’re doing that key value pair of name, colon, and what the value is. Now there are a few ways we can look at this. And the real fun thing with regards to that is we either do the sort of normal things, where you may look at things like Levenshtein distance or something like that, where you see how different the words are. But what we’re quite keen to do, and definitely this would be coming out soon when we start looking at different approaches, is how can we use the graph and the relationships between, say, the different types of chocolate, so that we can start to put levels of certainty to say that these are all part of the same group. So watch this space. – [Mark] And then we’ve got one more question, I guess, before we do our next one. So from Ashven, can you explain the relationship direction? So in Neo4j, all relationships that you create need to have a direction, so that we say it goes from a source to a target node, or a start to an end node. It doesn’t actually matter in terms of querying; you’re able to query outgoing and incoming relationships pretty much the same. There’s only like one– but you would normally want to indicate which direction am I working with.
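To make the point about direction concrete, here is a small sketch using hypothetical User nodes and a FOLLOWS relationship type. These are not part of the recipe dataset and are only there to show the arrow syntax.

```cypher
// A relationship is always stored with a direction: a FOLLOWS b.
CREATE (a:User {name: 'a'})-[:FOLLOWS]->(b:User {name: 'b'});

// Outgoing: who does 'a' follow? Returns 'b'.
MATCH (u:User {name: 'a'})-[:FOLLOWS]->(other)
RETURN other.name;

// Incoming: who follows 'a'? Returns no rows for this data.
MATCH (u:User {name: 'a'})<-[:FOLLOWS]-(other)
RETURN other.name;

// No arrow: match the relationship in either direction. Returns 'b'.
MATCH (u:User {name: 'a'})-[:FOLLOWS]-(other)
RETURN other.name;
```

The expected results assume the three MATCH statements run against a database holding only the two nodes created in the first statement.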

The classic example of that is, imagine that you’re modeling the Twitter graph. And you want to capture the followers relationship. Let’s say we’re doing like a really popular person. So let’s say we’re modeling Lady Gaga. We’re going to say Lady Gaga follows these people. Lady Gaga is followed by these people. We don’t want to accidentally leave that direction off, because there’re going to be like 70 million people who follow Lady Gaga, and if we’re just trying to find who does Lady Gaga follow, we don’t want to have to go and look through all of those people. In this case– the way I came up with that model is I just thought, how was I thinking about it? And I thought okay, the recipe contains the ingredient. Equally you could probably argue the ingredient is in a recipe. I don’t know there’s necessarily a correct way to do it. – [Lou] As long as you’re consistent. As long as you’re consistent. – [Mark] Cool, so shall we go back to our– so we’ve got quite a lot of ingredients we need to get into the kitchen to make this cake. – [Lou] It’s looking a bit scary. So let’s take this a little bit further. We start to think about– and this is a very, very basic recommendations query. But just to give you a flavor, so going back to that chocolate cake, let’s say we really like that recipe and we want to find out what other recipes did the author of that chocolate cake write? And how we do that is you can see here, we’ve got our reference to recipe and here is the recipe identifier. And then what we do, is we’ve got this pattern. So we’re going from here, but we’re saying, well, the recipe was written by this author. And then what we’re saying is, what are the other recipes that author has written?
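A sketch of the author query Lou is describing might look like this. The Author label and the WROTE relationship type (pointing from author to recipe) are assumptions, as is the $recipeId parameter; the session does not spell out the actual names.

```cypher
// Start at one recipe, hop to its author, then out to that author's other recipes.
// Within a single MATCH pattern a relationship is never traversed twice,
// so `other` cannot be the starting recipe itself.
MATCH (rec:Recipe {id: $recipeId})<-[:WROTE]-(a:Author)-[:WROTE]->(other:Recipe)
RETURN other.name, other.id;
```

The same shape, with a start point, a pivot node, and a mirror-image hop back out, is a common starting point for simple recommendation queries.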
So we’ve always got this sort of palindrome pattern going on here. And what we’re going to return is the other recipes that this author’s written and the IDs. So we’ve given it a start point here, which is this recipe ID. We’re saying our pivot point is this author, and then give us the other recipes that the author has written. And again, return the recipe name and ID. And there we go. We’ve got a list of these. – [Mark] It’s disappointing that this person is not like a mega chocolate cake – [Lou] I know. – [Mark] Just only reviews chocolate cakes. – [Lou] Right, should definitely look into that. So I think we’ll hand over to Mark and we start looking at how we can be a bit more specific with the recipes. – [Mark] So then the next one we want to have a look at– so this is kind of like, what can I make if I’ve got some specific ingredients? And what can I make if I’ve got someone who’s really allergic to stuff, like me, who’s coming around and I want to make sure I don’t kill them in whatever I cook. So we’re going to have a look at how can we find things that contain ingredients, and how can we have a look at things that don’t– or exclude ingredients. So the simplest way would be to actually just do something like a query like this. So what we’re saying here is hey, I like stuff that has chili. I’ve got very simple taste. And I want to say, hey, show me the recipes that contain the ingredient chili. So we’re doing kind of a similar thing as Lou showed us in the previous queries. This time we’re just looking up, find me all recipes where– so the where clause is for sort of filtering the results. So what we’re saying in this case is I want to find nodes where there’s a relationship from my recipe called contains ingredient, and on the other side there should be a node with the ingredient label with the name chili. I.e.
in English, find me recipes that contain chili. So it’s reasonably simple, I guess. And we get back lots of different things. So depending if this is like dinnertime, maybe you get some ideas on here. But you can see we’ve got green apple salad, we’ve got Korean fried rice, we’ve got all sorts of things. Spiced smokey barbecue, barbecue chicken, so it’s really broad. I guess you probably wouldn’t want just chili as a filter, but maybe we’ve got two things we want to filter. Suppose I fancy chili prawns. This would not be for me, I can’t eat prawns either. So this is not a great one for me. But like the simplest way to extend the query would be okay, we’ll just add another and statement and we’ll find recipes that contain both of them. So that works. You can see now we’ve got things that contain prawns and chili. Probably quite tasty. I guess if you’re in our timezone or a little bit ahead, it’s sort of coming up to dinnertime. The disadvantage to this is, what happens if I want to look for three ingredients or five ingredients? My query is not very flexible. But we can refactor it to make it more flexible. But the way we would probably want to use it is, actually, we might want to pass in the ingredients from a parameter. So Lou showed this before. This is how you can set parameters in the Neo4j browser. So what we’re saying here is I want to create

a parameter called ingredients. And I’m going to give it three ingredients: egg, onion, and milk. So let’s just set those up. There we go, we’ve got our parameters. And then we can– so I’ll just scroll up a bit. We can use a function, it’s a collection function called all. And what it does is you can use it to basically iterate over a collection of items. So in this case it’s those ingredients, so egg, onion, and milk. And then it checks a predicate on each of the values in that collection: is it true? So every single predicate that we execute needs to be true. And the predicate, in this case, is just some piece of code that returns true or false. And in this case, it’s this piece of code that I’ve selected here. So it’s basically checking, does there exist, i.e. is there a relationship called contains ingredient from the recipe to an ingredient with the name i, where i is one of the values in this list. And it just goes through this list and checks for recipes that contain all of those things. So these are things that contain milk, onion, and eggs. And if we wanted to change that and say hey, actually I want to find things that contain egg, onion, milk, and apples, I can just update my parameters. And one of the nice things, it’s been in the Neo4j browser for a while, but you can just hit refresh on that. And now there’s only one thing. So I have to have country style toad in the hole, if I want to make sure that I have every single one of these ingredients. You can also vary this, so you could say, I think it’s any, which will check that it has at least one of these ingredients. But what I wanted to finish this section off with is to show you another function called none, which is really handy for excluding things. So I made an extension for this query here. So imagine we want to make something with coconut milk and rice, but I’m allergic to eggs and milk. So we’ll make another parameter called allergens. And then this time, we can find a recipe for me. Let me just make that a
bit bigger maybe. So we’re saying find me a recipe where it contains all of the ingredients in my ingredients parameter, so that’s exactly the same as before, but none of the ingredients in the allergens list. And then the rest of the query is exactly the same as before. There we go. So these are some of the things that I would be able to have. Again, with the query we could say, imagine we had someone else coming around who’s allergic to mango, for example. So you can see number three has mango in. Then we want to get rid of mango from our list too. Hitting the wrong key, there we go. So we just add mango to the allergens list. And you can see when it runs these, you get a little tick to tell you it’s done. You can have a look, it says hey, parameter was successfully set. That’s great. And then we can run it, and now the mango will be gone from our listing, so we’ve filtered it down a bit more. So I guess maybe– – [Lou] We have some questions. – [Mark] We have some questions. And then we can wrap. – [Lou] So we’ve got a couple of questions about how would you manage amount properties for ingredients. And yes, absolutely. It always comes back down to the kinds of questions you’re asking and how they’re going to be referred. But absolutely, in the contains ingredient relationship, you could put the property on there, and the property could be amount or quantity. And then perhaps you can specify what type. So it could be, for example, amount 25, and type could be grams, or something like that. So absolutely, that’s one approach you could do. If there was something a bit more generic, so let’s say, for argument’s sake, all of your recipes are always going to be in quantities of, let’s say, cups. Sort of a more imperial U.S.
style place. And let’s say for example you always have very specific things, then for speed, you might say, well actually what I’m going to do is I’m going to change my relationship type from contains ingredient to contains three cups of, contains half a cup of, that kind of thing. And that way you have a much faster query speed because you’re not having to scan through the relationship properties. That’s a slightly more complex thing where you’re looking for really fast performance. But absolutely, you can attach properties onto the relationship for the amount. So that is definitely one approach to do that. – [Mark] Then there’s a question about the pipe. So the pipe character. So we’re not actually using it as a pipe operator in this case. It’s effectively being used as a projection. So just to show you a really simple example of how you can use it. It works on collections. So if I said I’ve got a value in a collection, so imagine I’ve got a range of numbers from zero to ten. I could use the pipe to say

I want to double each value. So you see here, I said give me the numbers from zero to ten. And then apply this times two on each item. And so what we were using it for in the other place was to just grab the name for each node rather than– so I guess, this is here. This here is called a pattern comprehension. So what it lets you do is you can take some part of an earlier part of the query and kind of apply, effectively, like a match query to that node in this case, and get what’s at the other end, which is an ingredient. And then just get the name of the ingredient and it will give you it back as an array of values. And you could choose to say, hey, just get me the first ingredient, for example, like this. Or you could say, hey, get me the first five ingredients, like this. So the syntax is sort of, it’s a little bit based around GraphQL and it’s a little bit based around how Python collections work. Jamaal asks, is this being recorded? So yeah, this is recorded at exactly the same URL where you are now. Alec, for your question about can you show how to use the similarity algorithm to find similar recipes, that’ll be, I think, for part two of this. – [Lou] Part three. – [Mark] Or part three. – [Lou] Part three of the media blogs, part two of the video. – [Mark] Okay, yeah. And then does Neo4j have a way to manage recipes in different languages that can connect, say, to an English equivalent?
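Going back to the include and exclude filters and the pipe syntax covered above, the pieces can be put together in one rough sketch. The labels, relationship type and property names are assumptions based on the spoken description. The $ingredients and $allergens list parameters are assumed to have been set in Neo4j Browser beforehand, for example with `:param ingredients => ['coconut milk', 'rice']` and `:param allergens => ['egg', 'milk']`.

```cypher
// List comprehension: the expression after the pipe is applied to every element.
RETURN [value IN range(0, 10) | value * 2] AS doubled;
// doubled = [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20]

// Recipes that contain every wanted ingredient and none of the allergens.
MATCH (r:Recipe)
WHERE all(i IN $ingredients
          WHERE (r)-[:CONTAINS_INGREDIENT]->(:Ingredient {name: i}))
  AND none(a IN $allergens
           WHERE (r)-[:CONTAINS_INGREDIENT]->(:Ingredient {name: a}))
RETURN r.name AS recipe,
       // Pattern comprehension: match from r and keep the name at the other end.
       [(r)-[:CONTAINS_INGREDIENT]->(ing:Ingredient) | ing.name] AS ingredients,
       // The same list sliced to its first five entries.
       [(r)-[:CONTAINS_INGREDIENT]->(ing:Ingredient) | ing.name][..5] AS firstFive
ORDER BY size(ingredients)
LIMIT 20;
```

The query described in the session appears to wrap the inner pattern in exists(...); whether that form or the bare pattern shown here is accepted depends on the Neo4j version. Swapping all for any would relax the first filter to "at least one of these ingredients", and indexing with [0] instead of [..5] would return only the first ingredient.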
That’s a good question. – [Lou] I think that’s going to be a modeling exercise. So I think generally, we’ll need to investigate, but generally there will probably be some work, probably akin to a knowledge graph, where maybe you have eggs and then maybe you connect it to what the French equivalent is, œufs, and maybe– like that kind of thing. And Spanish, huevos. So effectively, you may well be mapping that. So some of that will be a modeling exercise about how you want to represent that. And then potentially you can sort of do it around that way. – [Mark] And then Leonardo, for your question about Neo4j 3.5.3 and Neo4j 3.5.2, I guess this is probably not the easiest place to debug that. So at the top of this chat, I’ve put a link to the community site. And I would suggest if you can post it in there and sort of show what error message you are getting, then either Lou and I, or somebody else if they get there quicker, can help you figure it out. Kat asks, is the pipe similar to a Python lambda function? I guess it is a little bit. So on the other side of the pipe, in the value example that I gave, you have access to do whatever you want with value. So I guess it is pretty similar to the lambda function. What else did we have? Did we have some other questions? Can Cypher read rules and update the database like a rules engine?
I would say this: not directly in Cypher, but in the APOC library that’s linked a little bit up the chat, there’s a concept of triggers. So you can set triggers to apply pieces of Cypher based on something happening. And I think you maybe need to have a look at the documentation of that and see whether it achieves what you want. I think that would be the closest we have for that. – [Lou] And our colleague Max De Marzi has written some very interesting blogs as well. When you want to write some really heavy duty rules engines and things, he’s written a lot of content around that too. – [Mark] Okay, another question. Could you please add how would you go about classifying a list of ingredients to a collection or diet type? So there they’ve actually given it to you. In the JSON, they already told us. So you see here what I’ve put on the screen on the right hand side. They’ve actually already told us which diet type this belongs to. So what is it, somewhere – [Lou] Some like indulgence. – [Mark] Yeah, so they’ve given us– so these are the collections. So all we did for this was we just took that array of collections and then we created a node connecting to it. Oh, you mean like a machine learning type classifier? That’s going to be like part four, I guess. Again, maybe the community site is the best place, if you can describe in a bit more detail what exactly you want to do. If you can narrow down the problem a bit for us, we can try and address that in the future. Is there any plugin that can do predictions or work with neural networks with the data stored in Neo4j, such as an integration with TensorFlow or Keras? Our colleagues Andrew Jefferson and David Mack have been doing some sort of, I guess, work around this.

And so I’ll link to that. They’ve written a bunch of blog posts. They are called Octavian AI. So if you search for that, you’ll find it. But I will put a link there. I guess maybe we can do a quick wrap up now, and then if you have any other questions we can handle those on the community site afterwards. So, I’ll stop sharing this screen again. Come back to us. Cool, so I guess I’ll let you wrap up. So we’ve talked a little bit about some other things we want to do. – Yeah absolutely, so I mean, what we’ve shown you today is how to model and import the data. We’ve given you sort of insight into the kinds of questions we’re looking to ask. And this is pretty much going to drive the data model that we’ve done. Maybe in the future we may need to tweak it slightly. Actually, we’ll talk about that. We’ve given you a very, very simple idea of how you do a recommendations engine, how can you include and exclude ingredients. Please, you’ve got the :play recipes guide. Go have a look, have a play. What we’re going to be exploring next time is how can we start to do different kinds of recommendations. So recommendations could be, for example, on a certain ingredient, it could be an author. There are various elements there. So that’s what we’re going to explore. Have a go, have a go beforehand. And let’s see how we contrast and compare. – And I guess before we end, if you enjoyed the talk and you think it would be useful for other people, don’t forget to like the video on YouTube. And we’ll be back again next week. Next week, our meet up is on a Tuesday. So I think it’s a little bit earlier. It’s probably a little bit early for the west coast in the U.S.
but it should be okay for other timezones. I’ll be with my colleague Michael Hunger and he’s going to be talking about how we can build what he calls TagOverflow. So this might be interesting if you’re interested in similarity algorithms in the Neo4j graph algorithms library. We’re going to be showing how you can apply those to the tag data on Stack Overflow to find a taxonomy in there. We’ve got a final question. Leonardo: I’m interested in AI with Neo4j. Yeah, maybe bring your question to the community site and we can talk about it a bit there. I’m currently working on a data science machine learning online training course that we’re going to be launching hopefully in the next couple of months. So maybe that’ll be interesting for you and give you some ideas. Otherwise though, I think we should probably wrap up because it’s probably time for people to go home or get back to work. So thank you for spending the last hour with us. And I guess thanks to Lou for the idea and all the work around building this. And we’ll see you all next time. Thanks, bye bye! – Bye!