Ubot Tutorials v4 – Scraping 101


Hey everybody, it's John from learnubot.com, and I'm bringing you the latest tutorial from learnubot.com. What I wanted to go through today, after several requests, is a scraping tutorial. I'm going to use the site that you see up here, and we're going to go through it step by step. This is going to be a complete walkthrough of how to scrape, and I'll tell you what we're going to scrape: the title of the movie, the thumbnail image, and the large image. So we're going to scrape those three things and, like I said, I'm going to walk you through this in order, and hopefully you will get some insight into how you can accomplish some of these things.

Just for the record, I am not going through all the pages of 66,000 items. For the purpose of this tutorial I'm only going through the first page. To go through all of them, you would simply put the process in a loop and repeat it as needed.

OK, so the first thing I'm going to do (this is not required, but I like to do it) is add some stat monitors, so I can see some of the numbers without having to keep referring to the debugger. So let's go ahead and make a stat monitor for some of the list totals, because we're going to be adding this stuff to lists. We'll call this one Movie Titles, and we are going to monitor the list total, and we'll call this list movie titles. OK, and there we have it, set to zero. Let's go ahead and add two more. This one will be Thumbnails; we're going to do the exact same thing, and we'll call this list thumbnail URLs. The last one we'll call Full Size, for the full-size images, and that will use the list large image URL.

Now, as you probably noticed from the names of the lists, we are initially going to scrape the URLs of the images, and that can be beneficial in a couple of ways. For the purpose of this tutorial it's just one way that you can scrape these, but with the URLs, if you were
posting someplace else and wanted to hotlink the images, if that was possible, you would be able to do it, because you would have the URLs already handy. I'm not saying you should do that; I'm saying that if that's what you wanted to do, you would be able to.

OK, so what I'm going to do is set this up in two custom commands, which in version 4 we call "define". The first one we're simply going to call scrape data, and click OK. All of our scraping commands are going to go inside of this custom command. The very first thing is that we have to get to this site, so we can go ahead and just drag in a navigate, and it's going to bring us to the page that you see here. Just so you know, I am posting the code for this tutorial; you will find it below in the description.

At this point, if we wanted to see what we have (and you can do this whenever you want), under Custom Commands you can see that I have scrape data. If I just pull that in now, that will run it, and at this point we should simply be navigating to that page, which we are.

So what I'm going to do first: I'm using three lists and a table, and the very first thing I'm going to do is clear them. So we'll go ahead and clear our lists (I think I said three; it looks like it's four), and we have one table to clear, which we haven't created yet. That one we're calling movie posters, just for convenience, that's all. Again, this is just me: I like to do all my clearing first and keep things as organized as possible. So it's going to clear the table and the four lists, and then navigate to this site.

We're already here, so the very first thing we want to do is scrape the movie titles. We're going to go into our data commands and use an add list to list, with the list movie titles, and we are going to scrape attribute. Now I'm going to choose the title, and as you see, this has set an offset, and we don't want that. We can see there is a unique class for the product title, so I'm going to hit the advanced editor, where I can see the class right there, and I'm going to choose just the class and add it, and that removes the offset. We don't want an offset, because things can change with offsets and it could be unreliable. For the attribute to scrape, I'll scrape the outer text.

Now here's the really important part. In this particular case you need to go into the advanced settings, and where it says delete duplicates, you need to change this to don't delete. The reason is that we do have duplicate titles with different posters, and there are a couple of them in here. If you don't do it, your numbers are not going to come out even. In this case, if you keep delete in here, you're going to end up with 31 titles and 36 images, which is not what we
want. So change it to don't delete and we're good. You'll notice there are twelve rows of three, so 36 is the number we're looking for. If I run this, my movie titles monitor should say 36, and there it is, 36. If we go into the debugger and look at it, you'll see Rocky, The Godfather, Jaws, Fear and Loathing in Las Vegas, etc., so we have those scraped in the correct order.

What I'm going to do now is start creating my table. I want to add this list to a table as a column. Our table is movie posters, and we want this to be in the first column, so that's going to be row 0, column 0, and we want to add the list movie titles. And that's it, we're done with the titles.

What we're going to do now is create our second list, which is our thumbnail URLs. We're going to do that the same way, using add list to list, only now with thumbnail URLs. We're going to scrape attribute, and I'm going to choose a thumbnail.
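To see why the delete duplicates setting matters for the titles list, here is a minimal Python sketch (not UBot code). The sample titles and file names are made up, and the helper assumes that deleting duplicates keeps the first occurrence of each value, which the tutorial does not spell out.

```python
# Hypothetical sample data: two different posters share the title "The Godfather".
titles = ["Rocky", "The Godfather", "Jaws", "The Godfather"]
thumbs = ["rocky.jpg", "godfather1.jpg", "jaws.jpg", "godfather2.jpg"]


def dedupe_keep_order(items):
    """Drop repeated values, keeping the first occurrence of each (assumed behavior)."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


deduped_titles = dedupe_keep_order(titles)

# With duplicates removed, the title list is shorter than the image list,
# so row N of one list no longer describes row N of the other.
print(len(deduped_titles), len(thumbs))  # 3 4

# With duplicates kept, the two lists stay aligned row for row.
print(len(titles), len(thumbs))  # 4 4
```

The same effect at full scale is the 31 titles against 36 images mentioned in the recording.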

We don't want the selector it picked, because it's specific to that one thumbnail. So again we go into our advanced editor, and you see that there's a class for the thumbnail. In this case again I'm going to choose the class and click OK, but instead of the outer text, now I'm going to choose the source, which is the URL of that thumbnail. Then I'm going to do the same thing: add this list to my table as a column. The table was called movie posters, and we're going to start them all in row 0 so they start from the top, but this time we're going to put it in column 1. If you look up here, we started in column 0; that's the first column, with the names. Now the second column is going to be the URLs for the thumbnails, and I'm going to click OK.

The last thing that I'm going to add to a list is the movie poster URLs, but they're not the actual URLs of the posters yet, because what we need to do here is go to each page and scrape the data for the large image, which as you can see is there. The way I'm going to do that is to go back to my add list to list. I'm going to call this one movie poster URLs. We're going to scrape attribute, and this time we're going to choose the same class we chose before and add that, except now we want the href, so that it gives us the URL of the page where the large poster is located. And that's it; this is the list we're going to use to get the other URLs.

So now we need to loop through this list and start getting the URLs for these 36 images. We're going to do that very simply by adding a loop. We're going to use the list total for movie poster URLs, which we already know is 36, and inside our loop we're going to start with a navigate. We can delete what's in the URL, because what we want to navigate to are the list items, so I'm just going to use next list item from movie poster URLs and click OK. Right now, because it's going to change pages, for the purpose of developing this script I need to know what the information is on the next
page. So what I do is put a pause script in here, so that it'll navigate to the first one and I can see what I need to do next. From here we can just run it. We have 36 movie titles, 36 thumbnails, and now we're on the first page, so I'm scrolling down. Now, because it's one per page, I no longer want to add a list to a list; I'm going to add this item to the list. Let me stop the script, since I can't make changes while it's still running. So I'm going to add item to list, and the list we're going to use is large image URL. The item that we're going to add, as usual, is going to be our scrape attribute, because that's what we're doing, and we're going to look for some unique identifier for this particular image. So let's choose it. OK, main image shadow is the class.

It's named that because the image has a shadow underneath it, if you can see that. So we're going to choose the class, and what do we want to scrape? Just like the thumbnails, we just want the source, which is the URL of that image. So I'm going to click OK.

Finally, I get rid of the pause, and I'm going to go outside of my loop and add my list to table as column. We want to add it after it has collected all of them; that's why I go outside the loop. The table is movie posters, the starting row is zero, and the column is two, so that'll be the third column. So the first column is the movie titles, the second column is the thumbnail URLs, and now the third column is the full-size image URLs. The list that we want to add is our new one, large image URL, and we click OK to close that.

Our final command is to save this to a file. Just so it's universal, I'm going to place it on the desktop, and we'll call it movie posters.csv, and the content that I'm going to save is my table. So this is the first half of our tutorial, where we're just collecting all the data. In the second half I'm going to download all the images, the thumbnails as well as the large-size images.

But for now let's go ahead and run this. You can see up here it grabbed the first one; now let's grab the second one. If you have the Pro license of UBot Studio, you can actually use multithreading and navigate to these pages a lot quicker, but for the purpose of this tutorial I'm just giving an idea of how the actual process works for collecting the data, downloading the images, etc. So we're just going to give that a minute. OK, we are winding down, one more to go, and there it is. That was our second Godfather poster; if you remember, the first one was in the first row.

OK, so let's go take a look at what we got: movie posters. You can see we have all the names, all the thumbnail URLs, and all the full URLs, and there seems to be an issue with one of them.
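The table-building part of the first half (three lists added as columns 0, 1 and 2, then saved as CSV) can be sketched in Python roughly as follows. This is an illustration rather than the UBot script: the sample values, the helper names lists_to_rows and save_table, and the temp-folder location are all assumptions, and the file is written to the system temp folder instead of the desktop.

```python
import csv
import os
import tempfile

# Hypothetical values standing in for the three scraped lists.
movie_titles = ["Rocky", "The Godfather"]
thumbnail_urls = [
    "http://example.com/thumbs/rocky.jpg",
    "http://example.com/thumbs/godfather.jpg",
]
large_image_urls = [
    "http://example.com/large/rocky.jpg",
    "http://example.com/large/godfather.jpg",
]


def lists_to_rows(*columns):
    """Turn equal-length column lists into table rows (column 0, 1, 2, ...)."""
    lengths = {len(col) for col in columns}
    if len(lengths) != 1:
        # Mismatched totals (e.g. 31 titles vs 36 images) would misalign rows.
        raise ValueError(f"columns have different lengths: {sorted(lengths)}")
    return [list(row) for row in zip(*columns)]


def save_table(rows, path):
    """Write the table rows to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)


rows = lists_to_rows(movie_titles, thumbnail_urls, large_image_urls)
csv_path = os.path.join(tempfile.gettempdir(), "movie posters.csv")
save_table(rows, csv_path)
print(rows[0])
```

The length check is an addition of this sketch; it makes a mismatch between list totals fail loudly instead of silently shifting rows.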
Other than that one issue, you will notice that this all worked as planned. Because they were scraped in the proper order, these are all relative to each other: under Rocky, this is the Rocky thumbnail and this is the Rocky large image; this is the Godfather thumbnail and the large image, etc. So that worked well.

What we're going to do next is download all the images, all the thumbnails as well as all of the large images. Again I'm going to create another custom command, and we'll call it download images, because that's what it's going to do. The first thing we're going to do (technically, if you run this all at once, you don't have to do this; I do it just for the sake of keeping it clean) is clear my table and create that table again from the file that we just created. Again I used special folder, Desktop, so that it's universally usable by anybody. I think we called it movie posters.csv; let me just make sure. Yes, movie posters.csv, and we'll recreate that table.

Now I'm going to set a couple of variables. I'm going to set one called thumb, and set it to thumb with an underscore. I'm going to set a second one called bot, and that is going to be bot with an underscore, and you'll see why I did that in just a minute. Now I'm going to set my row position for my loop. I always do it this way. There are other ways to do it; this is just the way I prefer. I like to set the position and increment it myself.

Now, just so we don't clutter up our desktop, let's go into our system commands and create a folder on the desktop, and we will call it movie poster images. Click OK. I need to set one more variable. We'll call it path, and yes, you guessed right, this is going to be the path to that directory. We'll start out with the desktop, then movie poster images and a forward slash, and that's it. This way I don't have to keep retyping it; I can just use the variable.

Now we can start our loop. The number of cycles we're going to use is the table total rows, because we are looping through a table. Of course we know it's 36, so that's how many times we're going to loop. Our first command is going to be download file. The download URL is going to be table cell: the table is movie posters, and the row is going to be the row
that we've set. On the first cycle it'll be 0, because we set it to 0 up here. The column is going to be 1, because that's where the thumbnail URL is. We're going to save it as path, which is this path, and then thumb; see, I'm putting in a prefix. Now I have to give it a name, so I'm going to go back to my table cells. Remember, we have the names in the first column, and that's what I'm going to use: movie posters, the appropriate row, only this time column 0. Click OK. After that we have to put our .jpg. So the first one, we know, is Rocky, and the file we're going to save is thumb_Rocky.jpg. The next one will be thumb_The Godfather.jpg, etc. That way we know it's the thumb when we look at it. The large images have the prefix bot, so now you can see why I did that. So it's the path, which is desktop, movie poster images, with the forward slash. By the way, notice that I added the slash in the path, so I don't need it here. So it's path, thumb, table cell, .jpg. Right. OK, I'm going to confuse myself.

Now we're going to download the large files in the same fashion. For the download URL, we're going to go back to our table cell: movie posters, the appropriate row, and the column is going to be 2, and that's going to be the URL for our large image posters. For save as, we're going to do it the same way: path, followed by the large prefix, bot underscore, followed by the name, which is in our table. So we go back to our table cell: movie posters, the row is always going to be the appropriate row so that it's working on the right one, and in this case the column is going to be zero, because that's where the name is. At the end we append our file extension. You can see these two look the same, except this one is going to download the
thumbnails and this one is going to download the large images. The only thing we have left to do now is increment our row inside of our loop and click OK, and that is going to be it for downloading images.

Since we've already scraped the data and we already have that CSV file, I'm not going to make you watch that again. Let's go ahead and... I can't believe I just did that. All right, well, no big deal. I just deleted that defined command when I meant to delete this one, but we still have download images, so we can run that one. It's OK, because I have this bot made over here already; I just wanted to walk you through each step.

OK, so let's go ahead and run this and download our images. On our desktop, as you can see, it created the folder. Let's see if it's done. It is done. When we open our folder, we now have bots, and if we scroll down we have thumbs for each one: thumb Fear and Loathing, thumb Gone with the Wind, thumb Goodfellas, bot King Kong, bot Jaws, bot Goodfellas. By the way, you'll see some that don't have a prefix; that's because I ran this before putting on the prefixes, so there was some old data in here. There are more than 72 pictures in here.

In any event, that's it, that's all there is to it. We got all of our images downloaded and all of our data scraped. We now have our titles and all of our URLs saved, in the event we need them for something else, like hotlinking. I'm not saying I condone that; I'm just saying, if you want to do that. I hope you enjoyed this and learned something from this tutorial, and I will see you next time on learnubot.com. Have a great day, and don't forget the code is down in the description area.
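For readers who want the download step in text form, here is a rough Python sketch of how each save path is put together: path, then prefix, then the title from column 0, then .jpg. It is not the UBot code from the description. The sample rows, the function names build_jobs and download_all, and the relative folder name are assumptions, and the bot_ prefix is written as it is heard in the recording.

```python
import os
import urllib.request

THUMB_PREFIX = "thumb_"  # prefix for thumbnails
LARGE_PREFIX = "bot_"    # prefix for large images, as heard in the recording


def build_jobs(rows, folder):
    """Return (url, save_path) pairs, two per table row.

    Each row is [title, thumbnail_url, large_image_url], i.e. columns 0, 1, 2.
    """
    jobs = []
    for title, thumb_url, large_url in rows:
        jobs.append((thumb_url, os.path.join(folder, THUMB_PREFIX + title + ".jpg")))
        jobs.append((large_url, os.path.join(folder, LARGE_PREFIX + title + ".jpg")))
    return jobs


def download_all(jobs):
    """Fetch every URL to its save path (needs network access; not called here)."""
    for url, path in jobs:
        os.makedirs(os.path.dirname(path), exist_ok=True)
        urllib.request.urlretrieve(url, path)


# Hypothetical rows standing in for the table loaded from the CSV file.
rows = [
    ["Rocky", "http://example.com/thumbs/rocky.jpg", "http://example.com/large/rocky.jpg"],
    ["The Godfather", "http://example.com/thumbs/gf.jpg", "http://example.com/large/gf.jpg"],
]

jobs = build_jobs(rows, "movie poster images")
for url, path in jobs:
    print(url, "->", path)
```

One thing this sketch makes visible: two rows with the same title produce the same save path, so in this Python version the later download would overwrite the earlier one unless something like the row number is added to the name.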