Bioconductor Workshop 1: R/Bioconductor Workshop for Genomic Data Analysis


SPEAKER 1: Thank you for coming and thank you Ashok and others for arranging this I’m sorry about the technical aspect of how you’re going to interact with R in the course of these lectures There are many different ways to do it In fact, there’s a little slide that we’ll look at that reviews some of them And I think, in the not too distant future, we’ll come up with a way so that you can do computations in an environment like the one that I’m running, if not identical to it, so you can try things out yourself That’s the basic idea We instructors will give some ideas, underline some things that make Bioconductor the influential environment for doing genome scale statistics that it is And the idea was that the four folks who are here to help out will be wandering around the room to see if people are having trouble with any part of these computations I think it’s really important that we have an interactive relationship, so I do ask that you interrupt me If I’m not looking in your direction while you’re waving your hand, just go ahead and ask me to stop and we’ll clarify what is going on here Now, how many of you work with R on a regular basis? How many of you work with Jupyter notebooks on a regular basis? Not that many, OK So I was in that boat as well I work with R just about every day, never worked with a Jupyter notebook, and then Ashok said, we’re going to use that And then I found that it was really a very congenial environment for doing what we’re going to do in the course So it’s worth learning about and I don’t think that’s going to be an obstacle in any way, as long as we get it running Now, how’s the font here? Is this big enough for folks to see? 
It’s not really so much of a lecture slide presentation, but I’ll make sure that we get through the basic points It’d be nice to ask people why they want to know about Bioconductor, but I think we have too many to really go over it The appreciation for R can come from many different domains One of the most basic ones is there’s a lot of code in it and I need to use some of it But to get more of an appreciation of it, what we’ve put a lot of our time into is interoperability so that it runs on many different platforms, all the key operating systems, and so on And the software component collection is very rich so that you can do all kinds of different statistical modeling approaches, you can use all kinds of database back ends, you can work with the web, and so on and so forth And yes, other languages accomplish this, but R integrated statistical reasoning with those functionalities a bit earlier than some of the other languages and we’ve kept it going since 2000 or so when we started Bioconductor So this is a motivation for thinking about R as a place to do genome scale statistics Let’s just talk about what we are going to be making use of in R The two concepts that I think are worth bringing to the fore, no matter where you’re working in R, are the concepts of functional programming and object oriented programming They’re really important for computer science scholars How important are they for people who are doing bioinformatics? I think it’s worth recognizing how easy it is to create a function in R. 
You basically define a new variable and give it the value, a function, of some parameter, some argument And in this case, we’ve written a function We call it cube That’s a natural way of saying what we’re going to do is take a number x and take the third power And so that’s a little bit of R that we’ve written and when we run it– wonderful So there we go We were able to do it before, but we’re not able to do it now That’s OK I have it running on my own machine I think I have it here Local host, yes So let us blow this one up a little bit and keep going We’ll just use some R here to demonstrate– well, it’s already done Let’s clear this cell Oh my goodness, everything is changing on me I want to clear all outputs
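A minimal sketch of the function being defined on screen, plus the ninth-power composition discussed a little later (the name `ninth` is our choice; the lecture does not name it):

```r
# Define a function by binding a 'function' value to a name
cube <- function(x) x^3

# Compose existing functions to build new ones:
# the cube of the cube is (x^3)^3 = x^9
ninth <- function(x) cube(cube(x))

cube(4)   # 64
ninth(4)  # 262144, identical to 4^9
```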

I’m on mobile display, folks Just hang on a minute Why is that? Yeah, I increased the font too much Well anyway, here we go Let’s run our cube function You’ve already seen what it looks like All right, this is fantastic Let’s run that bit of code first, which shows you already how to run cube of four, and then we can come to the cell and run it again with a different input and we get the cube of 7 So we have defined a new piece of software called cube, it takes numerical arguments, and it gives you the answer Now here’s an exercise: given the cube function, what’s a concise way of defining a function that computes the ninth power? And if we remember our rules of arithmetic, we know that we can just take the cube of the cube, multiply the exponents, we get the ninth power, and this is another way of writing a new function We make a function that is built up out of calls of other functions and then we run this one And we see, first of all, that when we run our new function, we get the number 262,144 And then we can also do that in primitive R notation, taking four to the ninth power, and we get the same answer So this is a little bit of learning about R as a functional language We can write new functions using this syntax where we may have a number of parameters here, and then we define what we want to do with these parameters within these brackets And this defines a new function, g, which accepts a series of arguments The body of the function uses those inputs and R’s syntax to compute new values OK? It’s not very earth-shattering, but everything in R comes from evaluating functions of this type So if you’re an R user, you’re familiar with those ideas, but I want to make sure we’re all on the same playing field with respect to that aspect of R Now one may want to ask, well, if I know R, why do I need Bioconductor? 
And one of the reasons is that Bioconductor has used R to create a bunch of artifacts, packages, data objects, data structures, and so on, that help you do genome scale statistics So Levi wrote this nice table that compares R to Bioconductor R is a general purpose– how did that happen? Somebody did something out there You’re all looking at this and you can actually modify my environment? No, this is my local one Anyway, general purpose programming environment Bioconductor goes and packages things up for bioinformatics R is a decentralized open source project, but Bioconductor– I should have put more acknowledgments here– is funded by several NIH grants and has a core team of full time developers R provides mainly generic data structures, Bioconductor provides integrative data structures for dealing with things in the Omics world R is enhanced by 10,000 packages in the CRAN ecosystem, Bioconductor has an ecosystem of 1,200 add-on packages that are quality controlled in a certain way that is not shared by CRAN I won’t go into that right now, but if people are interested in it, maybe by the end of the course, we’ll talk about some of the approaches to quality control that characterize Bioconductor that I think are quite useful Yes, CRAN enforces basic package requirements, Bioconductor has more requirements In particular, we’ll talk about something called the vignette for each package and the reuse of core Bioconductor data structures And then yes, as you said, there are commercially supported versions of R. We have several NIH and European Commission grants that help us do this work That’s R and Bioconductor compared in a nutshell So you have a reason for coming here Now how about the objects? We talked about functions for a minute What is an example of one of these objects that helps us do genome scale statistics? 
Well, there’s a library called Homo.sapiens Homo dot sapiens, to be precise You have to use the dot And if you install that package and you run the methods function on class Homo dot sapiens, you see that there’s a bunch of methods that have been defined for this entity And if you go over this list, some of the items are kind of dry– [INAUDIBLE] what would that [? be?– ?] but genes five prime UTRs by transcript, three prime UTRs by transcript, micro-RNA and so forth, those are interesting from a biological, bioinformatic

perspective What do those things do to Homo dot sapiens for you to work with your bioinformatic data analysis? Let’s take a look So let’s run the genes Let me run each one of these guys so that I don’t get into trouble And we’re going to run the genes method on Homo dot sapiens Now there’s a lot of infrastructure that gets loaded up in order to help you do this, but finally, when the genes method returns its result, we get this thing Now, some of you have used R, some have used Bioconductor How many are familiar with GRanges? A modest number Good So GRanges is one of the fundamental contributions of Bioconductor We wrote a paper in PLoS Computational Biology a couple of years ago describing the infrastructure And it’s a highly efficient implementation of an algebra for dealing with intervals on the natural numbers So sets of integers that are ordered, and then we make subsets of them That is a model for regions of the genome and then the idea of a gene as a region of the genome that has a start and an end and a strand and also some metadata, like its Entrez Gene ID, this is all assembled in this GRanges object which comes back to you when you run genes on Homo dot sapiens If you work on other model organisms like Mus musculus or Rattus norvegicus or whatever, we have objects of this type as well You run genes on that, you get back a GRanges Same object type You may have a different type of metadata about it, depending on what type of annotation was used to generate these addresses, but you get this very homogeneous approach to working with the collection of gene addresses for different types of organisms Let’s go a little bit further The nice thing about these notebooks is that you can add new code Actually, I’ll add a new cell here and just run some more code Let’s do seqinfo of Homo dot sapiens You can see that there are 93 sequences, and the reason for that is that we have the ordinary chromosomes and then we have these more difficult to place 
sequences that are of biological interest that have been organized by the UCSC, but they haven’t been placed on chromosomes quite yet Now notice also that we have labeled this as being from the HG19 genome, and we also have instances of annotation for HG38, but we don’t have a Homo dot sapiens, as far as I know, that deals with HG38 You have to know a little bit more to implement this Homo dot sapiens for the more recent genome build than I will get to in this course Got any questions or comments? Pretty straightforward stuff Kind of the most basic reason for getting involved in Bioconductor is that we can work with reference information about genes of different model organisms Yes? AUDIENCE: [INAUDIBLE] SPEAKER 1: Oh, I see So you are supposed to be walking along Yeah, yeah Do you see anything when the kernel has died? If you keep scrolling through, do you at least see this text? AUDIENCE: [INAUDIBLE] SPEAKER 1: Yeah AUDIENCE: [INAUDIBLE] SPEAKER 1: Yeah No, no, no This talk is being done on my machine, so do whatever you like Let me comment on this a little bit The first is that the failure has nothing to do with Bioconductor [LAUGHTER] SPEAKER 1: And it’s the kind of thing that happens when you’re working with these cloud providers It can happen for any reason whatsoever And I have a feeling we’ll get through it to have the desired environment, but what I think we’re going to do is get each of you to have your own machine with an image of R on it that has all of the software that I’ll be working through and then

you’ll do things, but probably not through the Jupyter notebook framework We’ll talk about how to deal with that, but right now, we’re just going to talk about some rather general issues that you don’t necessarily have to be doing code in order to appreciate them and then, as the technical solution converges, we’ll see what to do in terms of hands-on computing So to review, what we’ve shown is with R and Bioconductor, you can improvise software in your session with user defined functions And then we also have these ideas of objects, and that’s probably one of the more controversial things about Bioconductor itself Back in 2000, there was a new object system introduced in R called S4, and we decided that we would use S4 to implement things like Homo dot sapiens and so on And that is something that is objectionable to certain subgroups of the R development population, but we stuck with it and I think it has paid off So methods on objects are ways of expressing high level concepts about genomics and that’s what Bioconductor gives you We’re going to learn a lot about different types of objects and all the methods that apply to them in order to do things like differential expression analysis, network construction, and so forth So those are relatively nuts and bolts aspects of working with R and Bioconductor, but then there’s this idea of library And this is a somewhat complicated issue because if you don’t know the name of the library that has the object or the functions that you’re interested in, you may be in a bit of trouble Not too much trouble, because you can ask your friends or you can ask the Bioconductor mailing list, or you can do a certain type of search of the available software to find the thing that you thought you knew about So we’ll talk about ways of solving this problem, but it’s very important to have a working memory of all the different libraries that give you software and data that you need to solve your problem We might as well take a quick look at 
the Bioconductor website just so that we’re all on the same page there Bioconductor dot org So let me blow the font up a little bit, and if you haven’t been here, it’s worth spending a couple of minutes scrolling through What I wanted to show you is the support site So there is here something which is very much like Bio Stars, which is a forum for asking and answering questions about Bioconductor and genomics generally And people get credits for the solutions that they give and we have scholars and teachers and so on So this is a good place for coming for help and you can also search the website as necessary to find things that you might be interested in I’ll mention also that the software is managed using an ontology so that all the software packages have these tags that lie in these high level concept terms that have been adopted here So if you look, for example, what could biological question mean? You can expand that and see that, well, there are packages that have labeled themselves as dealing with alternative splicing or driver mutation or functional prediction and so on So there’s a lot of help on the site to get you to the resources that you need And when you have problems, go to the support site and tell people, this thing isn’t working well enough for us, can’t you improve it? We don’t want to deal with that Here we are So, library Everything in R proceeds by evaluations of functions You can write scripts, but they wind up being function calls And for your own strategic awareness, here’s an idea Anything you do more than three times should be a function, and every function should be in a package How many folks here have created an R package? 
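As a small, entirely hypothetical illustration of the three-times rule just stated: a snippet you keep retyping across a script is better captured once as a named function (the name `mean2` and the computation are ours, chosen only to make the point):

```r
# Repeated ad hoc code like round(mean(x, na.rm = TRUE), 2)
# scattered through a script invites copy-paste errors;
# capture it once as a function instead
mean2 <- function(x, digits = 2) round(mean(x, na.rm = TRUE), digits)

mean2(c(1, 2, NA, 4))  # 2.33
```

From there it is a small step to putting such a function in a package so that every analysis can share one tested copy.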
OK It’s not the hardest thing in the world to do Making it relevant for your own research is something which is perhaps a bit of a challenge I don’t think we’ll get into packaging here, but there’s plenty of resources to help you do that And this idea that you should think about function design for things that you do repetitively is one that’s worth bearing in mind So many ways of using software in R: you can write scripts and execute them at the command line, you can use RStudio, you can use R as a command line interpreter– we’ll probably do a fair amount of that You can use Jupyter notebooks and you can use R through online apps We won’t get into that That seemed a little too advanced for here,

but if people have questions about that, come to me offline and we’ll show how to build shiny apps We have some nice examples of that So just to wrap up these very introductory remarks Data structures to work conveniently with genomes and genome scale data So to parse, for example, data that come out of microarray scanners or sequencers, preprocess the raw data so that you have comparability across assay outputs that may need some statistical modification in order to be interpretable in a comparative way Combine assay quantifications with sample level data So you did an experiment, there are many different samples, you have to annotate them in order to structure the comparisons that you’re going to make in order to test your hypotheses We’ll talk about how to organize this type of information in a way that makes it more difficult to mismatch and have bioinformatic errors because of the coordination structure that we’ve adopted here And finally, we’ll be reviewing objects and functions so that you can do all these things So one of the things we’re not going to do here is get into installing R Once you’ve got R going, you can use this to get a version of the BiocInstaller package that is appropriate for the version of R that you’re using And this idea of versions is actually very serious You should keep your R up to date, but it may be the case that you’re working on a project that’s taking years to complete and things change between versions of R and you want to be able to work with the old version That problem has been addressed in a number of different ways, and you can have parallel, different versions of R on a machine and you can switch between them conveniently I’m not going to get into that, but I just want you to know that whatever version you’re working on, you need to use the Bioconductor software that is relevant to that version You do not mix versions Even though it can be done technically, it is a very bad idea You ultimately run into trouble So keep 
versions of R separate from one another and you have the appropriate Bioconductor for each one of those versions Once you have it going, you can install a package And this is an example of installing a package that’s relevant to an online course that we teach And I’ll just run this in my shell here And I think what it’s going to tell me here– it’s kind of intelligent– I already installed it previously, and it checks to see whether the version that’s in GitHub– this is a [INAUDIBLE] package that’s actually in a GitHub repository So as it hasn’t changed, it doesn’t bother doing it So that is how easy it is to go and install packages that are perhaps in GitHub repositories, more likely they’re in the Bioconductor repository or in CRAN and you use biocLite to get the package and any other assisting packages that you may need And there’s plenty of help at the Bioconductor site to do that Yes? AUDIENCE: [INAUDIBLE] SPEAKER 2: Actually I thought I would take issue with something up a little bit above where it said that [INAUDIBLE] SPEAKER 1: Yeah SPEAKER 2: There is another approach, the double [INAUDIBLE] more and more just to avoid loading SPEAKER 1: Yeah SPEAKER 2: Make it clear which package functions come from SPEAKER 1: We’re getting a little technical here Certainly for programming, I think, knowing about double colon is really important, but I guess the point that you’re making here– let’s see if we can make it live Do you want to come up and demonstrate a working example of it? SPEAKER 2: Sure SPEAKER 1: OK I’ll make you a new cell AUDIENCE: [INAUDIBLE] SPEAKER 1: How about limma?

SPEAKER 2: Double question mark first just to see what some functions of limma are? SPEAKER 1: How about lmFit? All right, sit down SPEAKER 2: OK Like this? SPEAKER 1: Yeah Just show the code Yeah SPEAKER 2: It ran [LAUGHTER] SPEAKER 1: OK The point is well made, but I think it’s a little technical for what we want to get into He’s absolutely right If you want to get access to a function, you do not need to use library to get access to the function You can use the name of the package and a double colon and the function name And we need to revise these notes to reflect that because you’re absolutely right I was hoping that you would just actually be able to see the code here and that is often worth doing There it is So again, nice thing about R is that it’s open source Nice thing about R is that when you’re in R and you’re running some function and you’re concerned about what it does, you can print out the function code and see it So here’s a case where Levi’s point I think is well made You do not have to say library to pollute your namespace with all the functions in limma if you want to just see what lmFit looks like It’s worth knowing Now what I was in the middle of saying was that you can use question mark– not double question mark– to find out about functions that you may be interested in And I thought it was very nice the way in which the Jupyter notebook gives you access to the help So it starts a new web sub-page, which gives you all the information here about the way in which mean is computed And interestingly enough, this mean is coming from BiocGenerics That’s a package that is created in Bioconductor that deals with means somewhat more generally than R itself It’s nice to go off topic So let’s see whether there’s a mean function in stats Not exported So where do you think it is? Base? 
Yeah And you see how informative it is So now we’re getting something called S3 object oriented programming We’re not going to take it any further But I wanted to make sure that you saw that you could use, as Levi demonstrated, double question mark to get help, which is going to search all topics for the string that you put the double question mark in, or you can use question mark to generate a man page Another very nice piece of R’s functionality is the idea that you can have working examples in the man pages for different functions So if you’re interested in understanding the lm function of R, which is for linear modeling, you can say example(lm) and this code will be run for you and it will show you how lm behaves So here we are, we’ve run lm with a certain dataset We’ve regressed weight on group and we’ve done it again and then there are some other methods going on, but ultimately, we have a plot of residuals and other diagnostics about the regression that has been fit And that’s all taken care of by the example function So I think those are nice features of R to get very familiar with Now let’s try this one Help, package equals genefilter, help type equals HTML So here we are Package level documentation, which is essentially going to tell me who wrote the package, what packages does it need, and what are the different functions that are defined in that package with some explanatory text So many different ways of getting help in R that I wanted to make sure we have had a look at What about vignettes? So again, a Bioconductor package that we won’t be visiting for some time, but I just want to illustrate this concept for you

that when you run the vignette function, you are taken to a man page or PDF document that gives you a very careful description of how to use a package, not just a function, but how to use many different functions in the package to do some biology So this vignette function is very powerful and all Bioconductor packages are required to have vignettes and they look like this They tell you how to go through an analysis, they may include graphics, and so on And I just wanted to make sure that we don’t miss that concept there OK? That is the end of my discussion of Bioconductor and installation and very general ideas We can take a break, we can have some questions or discussion Where’s Ashok? Are we ready for a break yet? It’s 9:34 Maybe I’ll do a little bit more and then we’ll take a break Maybe at 9:45 You’ve got it going? All right, let us talk a little bit about data management and data structure for genome scale experiments and then we’ll take a break and hopefully everything will be running and then we can all do these things jointly on the notebooks So data structure and management Who likes data management? What’d I say here? It’s often regarded as a specialized and tedious dimension of scientific research And then, because failures of data management are extremely costly in terms of resources and reputation, highly reliable and efficient methods are essential So I don’t know why nobody raised their hand So the customary lab science practice of maintaining data in spreadsheets– how many are using Excel to do your scientific data? 
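The help mechanisms walked through just before the break can be collected in one short sketch; the genefilter lines assume that Bioconductor package is installed, so they are shown commented out:

```r
?mean                 # man page for a function you know by name
??"linear model"      # full text search over all installed help topics
example(lm)           # run the worked examples from lm's man page

# Package level documentation, as shown for genefilter
# (requires the Bioconductor package to be installed):
# help(package = "genefilter", help_type = "html")

# Long form, narrative documentation for a whole package:
# vignette(package = "genefilter")
```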
OK, you can say Yeah It’s regarded as risky We want to add value to our data by making it easier to follow reliable data management practices So what I’d like to do is have you be able to make the choice between Excel and something else– maybe Bioconductor– for managing scientific data in ways that take you more reliably and in a more streamlined way through the scientific inferences that you want to make It’s not free You have to learn about these structures, you have to understand their justification, but that’s the hope that you’ll at least be able to weigh the options So here’s some more stuff In Bioconductor, principles that guide software development are applied in data management So data structures that are modular and extensible So if you think of a good way of managing microarray data, and then some other technology comes along, maybe you can reuse that structure, which is already known to have good properties, to have the metadata tightly bound to the assay data and so on So this idea of modularity and extensibility, Bioconductor really hangs a lot of value on that Packaging and version control protocols applied to data structure definitions So you have a good idea about how to manage a microarray, but then a few months go by and people think, well, I think we should put some other things on it We stamp the new approach with a version The old approach has a version and you know what you’re getting when you get the object Motivate and illustrate these ideas by giving examples of transforming spreadsheets to semantically rich objects Working with GEO and dealing with families of BED and BAM files And then maybe we would even think about how to work with something like TCGA as a unified entity How many people are working with TCGA? 
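A toy sketch of what "metadata tightly bound to the assay data" means in practice. This is not the real Bioconductor class (that would be ExpressionSet, discussed after the break); it is a minimal S4 class, with names of our own choosing, whose validity check refuses to let assay data and sample metadata fall out of alignment:

```r
library(methods)

# Minimal S4 class binding a G x N assay matrix to N rows of
# sample metadata; loosely in the spirit of ExpressionSet
setClass("MiniESet",
  slots = c(assay = "matrix", samples = "data.frame"),
  validity = function(object) {
    # the coordination guarantee: assay columns and sample rows 1:1
    if (ncol(object@assay) != nrow(object@samples))
      return("assay columns and sample rows must correspond 1:1")
    TRUE
  })

exprs <- matrix(rnorm(6), nrow = 3,
                dimnames = list(paste0("g", 1:3), paste0("s", 1:2)))
pdata <- data.frame(group = c("control", "treated"),
                    row.names = paste0("s", 1:2))
es <- new("MiniESet", assay = exprs, samples = pdata)

# Subset the controls with no chance of mismatching the two tables
keep <- es@samples$group == "control"
es@assay[, keep, drop = FALSE]
```

Version stamping works the same way: when the class definition changes, the definition gets a new version, so you know what you are getting when you get the object.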
Just a few OK The Cancer Genome Atlas, cancer biology, cancer genomics Many people do have to confront it and maybe in your future you will as well OK, so let’s talk about a spreadsheet approach and then we’ll talk about a Bioconductor approach This is what we’re going to get to So I want us all to be on the same page with respect to the thing we’re trying to solve Microarrays– how many people are interested in microarrays? A reasonable number of you How about sequencing? RNA seq?

More OK, you’re younger than me Doesn’t matter which You have done N of your RNA seqs or your microarrays, right? What is N? N is the number of biologically separate samples There may be some replication structure in there We’ll ignore that for now We’ll say there are N samples I need to manage them somehow So I have an array– I have a matrix, really There are G features Might be the number of probes on the microarray, might be the number of intervals that you decided to pile up your reads in Whatever it is, those are the features of your samples that you’re going to measure uniformly across the samples At least that’s your intention– to get G features on each one of these samples So I get this G by N matrix, whatever you want to call it, and the (i, j) element is the i-th feature on the j-th sample measured by your assay Well, Excel takes care of that Nice spreadsheet Anything you want We have this G by N matrix that we will interrogate using indices i and j Well there’s more than that because with each one of these N samples, I’ve recorded some information What treatment was used, what was the batch, et cetera, et cetera, et cetera There may be as many as R features of each sample that I want to record and usually, these are a little more involved than just real numbers It could be categorical, could be sex, it could be some intricate representation of a protocol identifier, what have you These are attributes of the samples and they must be managed in a coordinated way Sometimes I may want to just get all the data, let’s say, on the controls So I want to know which of the samples are controls and then get only the assay results for those I want to be able to do something like that The information on which is a control and which is treated is maintained in this other object How do we coordinate these things? And how do we also keep track of things like, well, I have feature names here, what are those feature names? Are they gene names? Gene symbols? 
Are they probe IDs from a microarray? Are they Ensembl IDs? How do I tell people what these tokens are? And how do I bind other information about the features or the experiment or whatever into an object so that there’s this coordinated, coherent thing that represents all this information? If it’s spread out over a bunch of spreadsheets, there are more opportunities for error in merging things, linking them together, ad hoc We want one coordinated bundle for this type of information Is that clear? That’s what I’m going for here So we’re going to use all the things I talked about with respect to R and Bioconductor to do that coordination And here’s an example of the uncoordinated data We’ll finish with the uncoordinated data and take the break So the uncoordinated data come out of this package, GSE5859, you can go in GEO and find it But they took a subset of it to teach the course And what do you have? Well, we have that matrix G is 8793, N is 24 There’s a matrix of gene expression numbers and the Jupyter notebook prints it out very nicely Well, these are the sample names and these are the feature identifiers Some of you who are older will recognize those feature identifiers as Affymetrix probe IDs, but their real interpretation would depend upon which array they’re associated with, so you want to know that But that’s the gene expression data The other piece that I talked about, which was the N by R, is this sample info table And as you can see, you have complicated data structures there– it may be a file name, there may be tags for ethnicity, and there could be some numerical data in there as well So you don’t want to use a matrix to represent that You want something like a data frame So this is the sample information These file names would associate with the column names of the assay data That would be a good thing Actually, let’s check that Here’s a little bit of R to check it Check that file name is always equal to the column name of the gene expression, 
element by element That turns out to be true The other thing that I didn’t talk about much was well, you may want to have annotation about those genes bound to these data So these are the probe IDs, but there’s more interesting information in the symbols that are mapped

and that is actually a dynamic process of saying which gene a given probe interrogates So these are more complications that we try to take care of for you And so you have to understand R to work with each one of these things separately And for example, we have a little exercise here Tabulate the ethnicity for the samples in this dataset So in order to do that, you just use the table function You can get the ethnicity out of it using a dollar sign That’s just data frame syntax and we see that there are 24 people, 23 of them are of Asian ethnicity One of them is from Europe And the type of code that you would use in order to compare expression values for a given gene is something like this I have to take the gene expression data, find the row that corresponds to PAX8, and then split that data on that gene by group, and then make a box plot So that’s doable And that’s how many people do their work You write the right function on the right data and you get the comparison that you’re interested in OK, what we will do– we have the exercise, then, how would you do this for DDR1? You have to go back here and substitute in two places the gene symbol that you’re interested in, and that can be done So now the question is, how are we going to unify these tables using expression sets in Bioconductor? And that will be the thing that we do after the break Thank you SPEAKER 3: Yes, sorry about that I think we can get you all up and ready [INAUDIBLE] exercise up here that it’s probably important for you to follow along [INAUDIBLE] different format than what you’ve done before because of the way this turned out, so really what we want you to do– AUDIENCE: [INAUDIBLE] SPEAKER 3: This should actually be working for most of you [INAUDIBLE] Add a cell here AUDIENCE: [INAUDIBLE] SPEAKER 3: Second one will be [INAUDIBLE] So could half the people in the room raise your hand please? SPEAKER 1: Half? 
SPEAKER 3: Yeah SPEAKER 1: Count down, one, two, one, two SPEAKER 3: Let’s do this So I guess this side, we would probably go with– I’ll let you just login [INAUDIBLE] Just put that up in your browser bar SPEAKER 1: I think you need another dot there SPEAKER 3: And on the right, [INAUDIBLE] and that will be the password Did everybody get into the login screen? AUDIENCE: [INAUDIBLE] SPEAKER 1: Just one colon SPEAKER 3: Extra one here So that’s the first one The second one should be fine, right? Is everybody on the right side in? So now when you login, [INAUDIBLE] you will come

to a page that should– SPEAKER 1: But you have to put it in the browser SPEAKER 3: So everybody will open up– SPEAKER 1: Now put in [INAUDIBLE] SPEAKER 3: One notebook and save it with your name SPEAKER 1: Everything working OK? SPEAKER 3: Make a copy and then you can change the name up here to– [INAUDIBLE] And that would be the notebook you’d be working with SPEAKER 1: You have to put it in the browser bar SPEAKER 3: Yeah So the moment you logged in, it should come to the home page So first, is everybody logged into the page? Do they all have a home page? Great So once you click on the period one [INAUDIBLE] and then you go to file and say make a copy OK? And then it’ll open up another page with your notebook and add your name to that and that’s the one you’ll be working with [CHATTER] SPEAKER 1: I’m sorry to be rude before SPEAKER 3: No, no, no [CHATTER] SPEAKER 1: So you’re getting them to save the notebooks individually SPEAKER 3: Yeah [INAUDIBLE] So your instructor stuff works fine, too, if you want to use that as well SPEAKER 1: All right [CHATTER] SPEAKER 1: Does anybody not have the connection that they need? No? OK So why don’t we start again at 10 o’clock That sound good? How about 10:10? 
I want to thank Ashok, and we’ve got a new toy here SPEAKER 3: Yeah, so this is a catch mic As you are all aware, you are on camera, so make sure you smile and because we’re recording this if you have any questions, let us know and we’ll throw this at you and then you can ask your question into it, and then, once you have your answer, you can throw it back to us SPEAKER 1: Let’s see how safe this is If it hit me in the glasses– [INAUDIBLE] [LAUGHTER] SPEAKER 1: I don’t wanna throw this around too much SPEAKER 3: But we’ll try to come closer and– [INAUDIBLE] Again, the notebooks are working and some of you might have trouble getting the commands to run, it may be taking a long time or something Try to see if you can refresh your browser at this point and if not, we will try to spin up more instances and move people over as we progress along Yeah? AUDIENCE: I want to run all of this locally on my computer in my own RStudio I’m wondering about GSE5859 Where would I get that package? SPEAKER 1: OK, good question and it gets a good answer

I’m going to scroll down, because I think we took care of this It’s a very nice feature of Bioconductor– or actually, I should say it’s a nice feature of install.packages in R If you use this syntax to specify the package name, it will assume that this is a GitHub repository and it will go to GitHub and get the package for you OK? AUDIENCE: There are some more packages in the second one that are like that Everyone that [INAUDIBLE] install [INAUDIBLE] SPEAKER 1: No You have to go package by package It’s a pull request if you figure out how to do that AUDIENCE: [INAUDIBLE] SPEAKER 1: Yes, it can be a nightmare, actually Get to R? Whoa, whoa All right, that’s not Bioconductor [LAUGHTER] SPEAKER 1: Now let’s review where we’ve been We had some nice tables here and I’m really thankful to Ashok for tuning me into these Jupyter notebooks because this type of rendering is quite attractive and it comes for free Any data frame that you want to show like that comes out very nicely But we are not happy about the fact that this data is spread out over all these different objects and so we’re going to think about expression sets And these are passe We didn’t really adapt them to doing RNA-seq, which everybody’s so interested in, but you have tons of expression data out in GEO and they will come to you as expression sets for the foreseeable future, and so it’s good to know the principles underlying expression sets and how to work with them So as I said before, a G by N table records gene expression values and an N by R table records the sample information Now I guess the links on these fine notebooks that you have are not worked out, but what I had for you here is the picture that we saw before So that picture is out there somehow, but they didn’t work out the links into the new notebooks, I think No problem We want to put these all together in one object, and the way we do that is we use ExpressionSet So you bring in library Biobase– you should have this code in your
notebook es5859 is created by an ExpressionSet function call with the argument assayData equal to geneExpression That’s the matrix of G by N numbers And what we’ve done here is created es5859, and then we’re going to bind in there the sample info data frame using pData So this is very peculiar syntax now We’ve talked about function calls, and that one should make sense, but what about this? We have a function call followed by an equal sign This can be done and it is because everything in R is a function call So even these assignments can be written as function calls and all this does is rearrange things so that this object gets updated so that it has a pData component that is equal to this data frame If you didn’t follow that, don’t worry about it It will work and it will give you a much nicer thing than the sprawling tables in your environment that you started out with So let us break this up a little bit Now there’s a nice thing in these Jupyter notebooks that allows you to split up these cells and so we can run them one at a time You don’t have to do this, but I’ll just show you the pieces as we go piece by piece So what is es5859 at this point? Gene expression not found So I have to keep running all of my code Sorry about that You have to run a little bit here Assuming that you installed it from the task before, this should work, and now this little ExpressionSet call should work right here
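A minimal sketch of the construction just described, assuming the workshop data package lives at the genomicsclass GitHub organization (adjust the repository name if yours differs):

```r
# Install the workshop data package from GitHub (repository name assumed):
# BiocManager::install("genomicsclass/GSE5859Subset")

library(Biobase)
library(GSE5859Subset)
data(GSE5859Subset)  # provides geneExpression, sampleInfo, geneAnnotation

# Wrap the G x N expression matrix in an ExpressionSet
es5859 <- ExpressionSet(assayData = geneExpression)

# Row names of the sample table must match the assay column names
rownames(sampleInfo) <- sampleInfo$filename

# Bind the sample-level table in as phenotype data; note the
# function-call-on-the-left-of-= syntax discussed above
pData(es5859) <- sampleInfo
es5859
```

The `pData(es5859) <- sampleInfo` line is the "function call followed by an equal sign" idiom: R rewrites it into a replacement-function call that returns an updated object.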

Let’s do this one as well OK, so I’m jumping around a little bit I’m sorry about this And we’ve changed the modality a little bit, so let’s go step by step through this one more time We will start here and make sure everybody succeeds with this Library GSE5859Subset and we run a data command that gets us a number of data sets geneExpression, geneAnnotation, and sampleInfo Does that work for everybody? AUDIENCE: I don’t have as many of those– SPEAKER 1: That’s OK This is my personal thing There’s a bunch of stuff, but as long as you have those three, you’re in good shape geneExpression, geneAnnotation, and sampleInfo As long as you have those three things, we can keep going So then we’re going to take the dimensions of geneExpression and find that it has 8,793 probes We look at some of them and we see the values for the different columns Everybody with me? I see a lot of heads nodding, so I’m going to assume yes If you have a problem, please let me know The sampleInfo is a table and the geneAnnotation is another table We’re going to check that the sampleInfo file name is coordinated with the column names of geneExpression If they were permuted in any way, that would be a problem That would tell me that my sample data is not coordinated with the assay data And then we do another little example here to tabulate the group variable This exercise, just make sure that you know how to tabulate ethnicity, how to get the ethnicity variable out of a data frame And this one is an illustration of box plotting for a given gene, and the exercise would have been to say, OK, come down here, but show that you can substitute for a new gene symbol So let’s do a different gene Let’s try BRCA2 I don’t know if it’s on the array, but we can give it a try I don’t mind if there’s an error, but there it is It’s nice to see errors because then you can see me recover from errors and that’s something that is good for learning, but we didn’t have an error Let’s make an error Let’s
make a gene that doesn’t exist What happens? There you go So when this kind of thing happens, you have errors, there are ways to respond to it when the function calls are more complicated You can run a debugger I won’t go into that now, but it’s good to know that when you have an error, this guy doesn’t die, but you can just fix the error Or, I forgot, let me try BRCA1 Maybe there’s a BRCA1 Yes OK So these are very forgiving notebooks You can throw errors and then you can fix them up and you keep going OK? So that’s a little bit of improvisation And now we’re going to get to this point where I’m splitting up some of the tasks here The first thing I need to do, I’m going to split this up, and right here, we will split again We will look at sampleInfo And let’s evaluate sampleInfo Notice that the row names of sampleInfo are these integers That’s not good When we put our things together, we want to have row names that correspond to the column names of the assay that we’re going to bind this to, and that’s why I run this command here I’m just going to put in the row names, and that’s that strange syntax that I talked about before where you can evaluate a function on the left hand side of an equal sign This happens a lot in Bioconductor Don’t be afraid of it If we now go and print the value of sampleInfo, you will see that we now have nice row names They’re better than these anyway So that’s what we’ve got there Now how about the gene annotation? Let’s look at that What’s the matter?
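The checks and the tabulation exercise walked through above can be sketched as follows, assuming the GSE5859Subset objects are loaded and that the annotation table has a SYMBOL column:

```r
library(GSE5859Subset)
data(GSE5859Subset)  # geneExpression, sampleInfo, geneAnnotation

dim(geneExpression)  # 8,793 probes by 24 samples

# Check that sample metadata is coordinated with the assay columns
identical(sampleInfo$filename, colnames(geneExpression))

# Exercise: tabulate ethnicity with the table function and $ extraction
table(sampleInfo$ethnicity)

# Compare expression between groups for one gene, here PAX8;
# substituting another symbol (DDR1, BRCA1, ...) changes the plot
idx <- which(geneAnnotation$SYMBOL == "PAX8")
boxplot(split(geneExpression[idx, ], sampleInfo$group),
        main = "PAX8 by group")

# Give sampleInfo row names that match the assay column names
rownames(sampleInfo) <- sampleInfo$filename
```

If the symbol is not on the array, `which` returns an empty index and the boxplot call errors, which is the error-and-recover behavior demonstrated above.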

AUDIENCE: [INAUDIBLE] SPEAKER 1: OK, let’s try and find out why What kind of errors are you seeing? I think you’re just not seeing any results, but it is running So when we get to the next stage and we print out some values– [INAUDIBLE] So here we are This is geneAnnotation, a very long table, and I’m just going to pop in the row names to be the probe ID I’m hearing a lot of rumbling and so I am improvising a little bit to show you the background of each one of these calculations, but we’ll get back on track Everybody on track now or are there a few questions that we should take from the audience? Where we really want you to be is here, evaluating es5859 Yes? We can throw That’s OK AUDIENCE: If you don’t change the row names– SPEAKER 1: Yes? AUDIENCE: What happens when you try to do the library Biobase thing? SPEAKER 1: That’s a great question Let’s try it out OK? So what I can do here is say row names of geneExpression equals NULL OK, let’s try it Row names are gone It still works It’s a little different though Let’s go back over here to our friend Let’s look at this carefully If you look at this expression set– doesn’t look any different Well, let’s keep going There are no row names there When I put in the sampleInfo, that seems to work as well So it doesn’t seem to penalize us too much for not doing that But this is a way of just ensuring that the metadata are there So what we’ve done now is we’ve bound in the geneAnnotation here and we have a more extensive report because the phenoData has sample names, it has variable labels, label descriptions, and also we bound in this featureData that gives us all of these probe IDs It tells us that there’s also information on symbols and so forth And this is a very rich annotation of those three tables that we started out with And what that enables us to do is to take advantage of that unification and reliably create something like a function call two sample [INAUDIBLE], which takes an expression set as an input, the
stratification variable, which will define the two groups, the symbol of a gene that I’m going to want to compare distributions on, and then another parameter, which is a little obscure, but it has to do with the fact that sometimes there are multiple probes for a given gene on the microarray And you have to pick one of them in order for this to make sense So this little function works with the fData to find out which gene we’re working on The pData, the sample data, to figure out how to split into the different strata, and finally, it has some labeling for the box plot Let’s see how this function works We’re going to run it, grouping and running it on BRCA2 So now we have an error And I think the reason is that I took away the row names of the geneExpression So let’s put it back There are certain things that require those names to be there as we build this through So I’m going to rerun all of these little commands here,

add sampleInfo here Now I’ll just step through each one of them, but I will not set the row names to NULL There we go So there is a place where it helps to have those things in there, because of the way I wrote this up, and there we are We have a function now for BRCA2 and if you want to change the gene that you’re going to plot, let’s say, DDR1, you can just change the name of the gene and run your function again and you get a different comparison So what we’re trying to do is motivate the use of an expression set to put all the data together and understanding the components of the expression set so that you can write a function that does something that’s of interest to you I’m not expecting you to be able to write these functions at this time, but this is really a motivational set of remarks Now here’s a very nice additional thing we can do There’s a nice library called annotate, and if you have a PubMed ID of a paper that’s relevant to some data that you have, you can go and run a pmid2MIAME function Let’s do that Again, I’m just going to split this up so that we can see what’s going on AUDIENCE: [INAUDIBLE] SPEAKER 1: Oh, yeah You do shift control minus where you want to split Put the cursor someplace and then do shift control minus and it will split And then when you run the cell, you now have this entity called mi, which I will evaluate This is a new object It’s the MIAME– minimum information about a microarray experiment– schema populated by information about the paper that this experiment was derived from So a 145 word abstract is also kept with this object, so what’s happened here is R has gone out to NCBI, taken this PubMed identifier, made a query to PubMed, and pulled back a bunch of XML that describes the paper that we have the PubMed ID for And then we can bind that into our expression set and that way anybody who we hand this off to has information about the PubMed ID, and you can also run the abstract command on that to derive
the abstract So that is living now inside this entity, this expression set, es5859 I don’t know about you, but that’s a type of data management that I think is worth doing Because when I hand this thing off to somebody, they can learn a lot about the underlying experiment just with that object So that’s the motivation there Any questions, comments about this? Yes? AUDIENCE: [INAUDIBLE] SPEAKER 1: Yes I think you’re referring to– it has to do with the libraries being attached as you bring in some software Is that what happened? Yeah So just to illustrate that, I’m going to just run into the terminal here because the notebook is slightly different But if I say, for example, library here, annotate Yeah, there’s a lot of stuff that comes out This is R telling you, oh, you wanted annotate, but you have to get all these other things in order for annotate to work If you don’t like that, you can do it another way Well, Levi has one way Suppose I say annotate colon colon pmid2MIAME You can get the function– you notice it’s kind of slow– but the function will come and you didn’t have to see all that junk You can also do the following: suppressPackageStartupMessages And you put an expression in there, library annotate, and then you won’t see those– SPEAKER 3: You might want to increase the font SPEAKER 1: Yeah [LAUGHTER] SPEAKER 1: OK So there’s another way of taking care of that problem It’s not really a problem, it’s just verbosity AUDIENCE: [INAUDIBLE] SPEAKER 1: No Good So adding some metadata by publication origins I love it Now, let us use NCBI GEO to retrieve an experiment
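The publication-metadata step just described can be sketched like this; the PubMed ID is an assumption (taken here to be the paper behind GSE5859), so substitute the PMID of your own study:

```r
library(annotate)  # provides pmid2MIAME
library(Biobase)   # provides experimentData<- and abstract

# PMID assumed for illustration; replace with your study's identifier
mi <- pmid2MIAME("17206142")  # queries PubMed, parses the returned XML
mi                            # a MIAME object describing the paper
abstract(mi)                  # the abstract travels with the object

# Bind the publication record into the ExpressionSet so whoever
# receives es5859 also receives the provenance
experimentData(es5859) <- mi

# Quieter attachment, as discussed above:
suppressPackageStartupMessages(library(annotate))
# or call a single function without attaching: annotate::pmid2MIAME(...)
```

Either quiet form avoids the start-up messages; `::` also skips attaching the package's dependencies to the search path.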

This is a wonderful package called GEOquery It’s by Sean Davis and some colleagues at NCI Library GEOquery and this is going to be sort of the winding up of this session We’re going to go into GEO, we have a GSE How many folks have worked with GEO before? I don’t want to go too fast here, but it’s kind of de rigueur for computational biologists that there are thousands of microarray experiments sitting on the internet through the Gene Expression Omnibus and they have these identifiers called GEO series identifiers And it just so happens that this is a GEO series identifier for a study of glioblastoma tumor cells So let’s go ahead and run it We’ve already put up with a lot of stuff So let’s run it and hope for the best And again we see verbosity here It’s telling you they’re resetting some options, we’re getting some things back I hope you see this Depending on the system that you’re using, it should work, but it takes a little while to process the data So you can see here I’m trying to evaluate this expression set– there it is It’s ready So I’ll wait for you folks Has anybody succeeded in evaluating glioma yet? You have? Already? Good So people on one host have succeeded with this Just get a feel for what’s out there You go to GEO, you see that there’s some experiment related to an experiment that you’re interested in– yes AUDIENCE: It’s more help than anything I’m trying to use that function, but it’s not letting me It’s finding the files and it’s being unhelpful in terms of downloading it SPEAKER 1: What’s the error message? AUDIENCE: It just says error in downloading file SPEAKER 1: How many people have seen that?
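A sketch of the retrieval being run above; the series accession is not audible in the recording, so GSE78703 (an LXR-623 glioblastoma study) is assumed here purely for illustration:

```r
library(GEOquery)

# Accession assumed; getGEO returns a list of ExpressionSets,
# one per platform in the series
glioma <- getGEO("GSE78703")[[1]]
glioma

dim(glioma)          # features by samples
head(fData(glioma))  # GEO-supplied per-probe annotation
head(pData(glioma))  # sample-level metadata from the submitter
```

The download can fail for local reasons (full disk, permissions, network), which is the "error in downloading file" situation raised in the room.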
AUDIENCE: So– SPEAKER 1: All right So let’s just make sure we appreciate what’s going on here You’re just giving an identifier here This function is setting up a query to GEO to pull down information about this experiment It may fail because your disk is full It may fail because you don’t have permission and so on and so forth I don’t want to deal with those things right now, but I do want to get it and I want you to see that there’s now an expression set here Let’s see what’s in there There are 12 samples How many features are there? 49,395 So this comes from a certain microarray that is GPL15207 It’s one of the [INAUDIBLE], I don’t remember exactly which one it is I think it’s PrimeView, actually So here you go You’ve got the data all managed for you And if you use the fData function on this, you’ll get the feature data that GEO provides for you, and it’s really quite extensive The feature data is about each probe individually It tells you about the RefSeq transcript ID, the protein IDs, [INAUDIBLE], and so on That’s all bound into the experiment for you You don’t have to go looking up, what’s the association of this tag with some annotation? Then it tells you about the actual samples, and what’s important here is that there is this sort of standardized notation about what’s on the microarray and what type of samples they were There are controls– I’m only printing out the controls– but there are also some that are treated with this drug, LXR-623 So 12 samples, six are controls, six are treated Yes? AUDIENCE: So I was just wondering how consistent are these kinds of annotations across GEO data sets and are they provided by submitters or are they required before submitting?
SPEAKER 1: That’s a good question I don’t really have a definitive answer to that They’re pretty standardized What you can see here is an artifact of the early stage of microarrays, where there were two-channel arrays Channel one was red and channel two was green Well, that doesn’t really play out anymore, but they reuse these terminologies, these terms, so that there’s a consistent annotation across data sets So it’s somewhat consistent, but you can’t always rely upon the ability to easily say, oh, this was a control sample You have to go in there, you may have to do some parsing, some string changes, whatever, to find out which one is control and to label it then Good question And many people have complained to GEO about trying to do that better and I

think they decided that they did the best they could And there have been several projects– I don’t know, do you know any of these GEO re-annotation projects? Some people have tried to curate I mean, you’ve curated TCGA, for example It’s a big job to get uniform terminology across them Some people do it and then when they make it public, everybody can use it, but GEO itself has never been able to AUDIENCE: [INAUDIBLE] SPEAKER 1: It’s a fraction of GEO, yeah AUDIENCE: [INAUDIBLE] SPEAKER 1: You do have to do some custom, GEO-set-specific programming It may work for a family of GEO experiments Yeah, you basically are never going to get away with just what comes off the press AUDIENCE: What about the GEO sets that are in SRA? Are they standardized to NCBI or are they still a little messy? SPEAKER 1: There are standards, but they don’t get you to the semantics that you need most of the time You have to do some work to get them AUDIENCE: [INAUDIBLE] A lot of work that we’ve done is picking small subsets of GEO and SRA and standardizing them and it’s a lot of work SPEAKER 1: Yeah, I would guess the curated metagenomic data is mostly SRA AUDIENCE: It’s all SRA SPEAKER 1: So if you’re interested in those data resources, we have experts here who can describe that situation to you The question is, for people who are thinking futuristically, how can NIH go further in making the data? They want it to be public, right? It’s all out there How can you make it easier to work with these things?
And there is this idea of a genomic data commons where many more standards would be imposed and you’d be able to do a lot more things without all this nuts and bolts work And that is sometime down the road, but learning about genomic data commons and contributing to it would be relevant if those are concerns of yours So the upshot of this is that we can do our plots again, using this two-sample plot function We have to know the names of the variables we want to work on, so treated with:ch1 is the representation of the drug or control And the gene that I’m interested in, and in this case it’s ABCA1, and that was reported as one of the targets of this LXR-623 OK? Now there’s an exercise This exercise is real So let’s see how you do with this Let’s see if we can understand the formulation of the exercise Our plot, this plot here, does not fully respect the design of the experiment There are normal and glioblastoma-derived cells, but all we did was say, if you were treated, then you get one [INAUDIBLE], and if you were untreated, you get another There’s actually a cell type stratum up in there that we didn’t deal with So the exercise is to look at this little table here Now here I have to reduce the font a little bit so you can see this more clearly So I can make a two by two table that shows the layout It’s not just two by six, it’s two by two with three replicates inside each one of the cells And your job is to use the two-sample plot to compare only the normal [INAUDIBLE] with respect to expression of ABCA1 Only the normal, but compare treated and control [INAUDIBLE] So that’s the problem, and I’ve set it up for you a little bit This is going to be Markdown There it is I set it up for you so that all you really have to do is get something in here that [INAUDIBLE] The question is, do you know enough about R and Bioconductor to do that at this point?
I’m not sure I could do it I’m going to wait one minute and I’m going to let somebody tell me how to do it and then I will start doing it OK, I’m going to break this up into pieces

So first of all, we’re going to do something to this expression here Am I going to do something to this part or this part? Left of the comma or right of the comma? Right Everybody agree with that? Why? Remember, I didn’t really dwell on this very much, so I’m glad you got it, but I think there’s a need to focus more on what this structure is really like when you’re using brackets It’s like an array It’s like a matrix This part refers to the rows– the features, the genes on the array This part refers to the samples So the problem says I want to only deal with these samples Therefore, I need to do something with this part of the expression It’s a way of subsetting an expression set that is relevant here Now the next question is, what variable should I use in order to change this part of the expression? It’s not easy because the semantics are so strange It’s this variable here I’m going to take this and copy it That’s the nice thing about these things– I can copy that there And now I’ve got this vector of treatment types, and the only ones I want to use are the normal astrocytes Well again, I need to use some R syntax I need to use a double equal sign and then this is the value of the variable that I want to use I just put that string in there and now I push run and hope for the best Didn’t work So what could have gone wrong? I’m trying to change my glioma here, so let’s see I’m going to take this out and see whether the error is in this part of the expression So let’s make a new cell and put this in here The claim is that that should give me a new expression set with just a smaller number of samples And indeed it does That gets me to six samples Is it clear what’s happening? I start out with 12 samples, both normal and glioblastoma-derived, and now I am left with just six, only the normals So that part is good What’s wrong?
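The subsetting idea being worked out above can be sketched as follows; the phenotype column names and factor levels are assumptions read off the on-screen metadata (check `names(pData(glioma))` on your own system), and the two-sample plotting function is the one developed earlier in the session:

```r
library(Biobase)

# ExpressionSets subset like matrices: rows are features,
# columns (right of the comma) are samples
# Column name and level below are assumed from the on-screen metadata
norm <- glioma[, glioma$`cell type:ch1` == "normal astrocytes"]
dim(norm)  # 12 samples should drop to the 6 normals

# Then compare treated vs control among the normals only, e.g.
# twoSamplePlot(norm, stratvar = "treated with:ch1", symbol = "ABCA1")
```

The `==` comparison on a pData column yields a logical vector that selects the matching samples; all assay data and metadata stay coordinated through the subset.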
[INAUDIBLE] So let’s just reassign this guy here So we’ll call this norm We’ll save that That’s a new expression set I should be able to just run on norm here So it’s a new expression set, I’m going to just try and run it there And it probably will fail because I haven’t really done anything new There’s some error here, so what must it be? There’s something about this “treated with.” What could it be? Sorry about this So norm treated with colon CH one, is that OK? Yes So there are three controls and three treated,

so that variable should be working as a stratification variable So I must confess that I’m not getting the exercise right here We’ll have to figure this out Is that it? It’s really hard to know what to do here I’ll have to work on this another time It worked for you? AUDIENCE: Do you have a comma one at the end? SPEAKER 1: Yeah, that I put in just recently I have a typo? I’m good? I did something wrong I’m glad it works for you and I hope the principle is clear I would run the debugger here and find out where it’s failing but I don’t think I can do that OK, so that brings us pretty much to the close of this part one And we were saying we take– I think we’re on time Expression sets unify sample-level data, [INAUDIBLE] data are useful here GEOquery can be used and we haven’t gotten into the alternative to expression sets, but we will eventually So any comments or questions? Sorry about that mess-up at the end, but it was live OK, let us start period two and then we will take a break after we get our bearings So you go back to your original– you should have a tab called home Oh yeah, you want to save your original period one Yes, yes, that’s right OK So do a Save and Checkpoint Remember how to do that?
You click on File and hit Save and Checkpoint and then when you go to period two, you’re going to want to make a copy So this one is on my machine You will do a File, Make a Copy, and put your name on it OK, go ahead and get that ready and I’ll be back in a minute So period two is working with general genomic features This may seem somewhat dry, but ultimately, it has great value Make sure you have your own copy [INAUDIBLE] I’m hearing a lot of rumbling, so I’m happy to take questions if there are complications I don’t know whether you’re going to get the figure here, but this figure is fundamental What I want you to do is think of this diagram as a– these are bases on a chromosome and we’re going to define a certain range that we colored pink here It’s bases five, six, seven, eight, nine, and 10 And we’re going to call that IR That’s our first IRanges And what we want to do is understand these operations– shift IR minus two takes this interval and moves it two bases to the left The narrow of IR will take some arguments, such as start

or end, and it will tell you whether you want to narrow it using a starting position of the second position, or at the end you want to narrow it so that you only go up to position five in the interval We have a function called flank which has parameters start and both So if start is true but both is false, then the flank is this– actually, this should also have a width of three It’s missing from here But the flank, the three-base flank that has both equal to false, will move– it occupies the three bases that are not part of the range to the left And you can also have a flank that is at the end, that is to the right And then when you set flank with both equal to true, then the flank will occupy both the part that is out of the range and also within the range If we multiply a range by some number, we will be narrowing it, zooming in This original one was of width 6 When we multiply it by 2, we get a new range of width 3 If we multiply it by a negative number, we’re zooming out If we add, we are simply increasing the extent by the number added And if we subtract, we are reducing linearly OK And then we have a nice resize function, which will resize to a sub-interval of a given width And you can learn about all these things, all the detailed parameters, by using the help function So we can start by defining our IRanges– IR is this entity here– starting at position 5 and ending at position 10 It has a width of 6 because it includes these positions And then when we shift it or resize it, we get these results So this is something you just have to internalize Ultimately, you’re going to be doing things like finding overlaps between SNPs and genes and so on And these operations will all be under the hood But we want to give you some of the basics Now, it’s very easy to define, as I showed you before, a collection of ranges which could, for example, model the genes on a given chromosome, let’s say In this case, we just defined a bunch of starts and ends, and we are told
the widths, as well And if we use the resize command on that collection of ranges, it’s applied element-wise OK, so I’ve resized each one of these to be size 1 And because the defaults are taken, it just takes the starting interval and gives me back that How about if I made it a resize of 2? What would the width be? Everything has width 2 OK And IR is just the original interval that we started out with Now, inter-range operations This is one way of looking at the IR Each one of our intervals has been plotted as a block on this range, or this domain, let’s say OK, so here we are We have these blocks here and each one of them is a range from the IR object that I talked about before So we plot IR to see this like this And those of you who were working in RNA-seq may say, oh, well this looks kind of familiar This is kind of like a pile up of the reads that I align to the genome So now we’re starting to see some applications There’s an operation called reduce, which will just project all the occupied spaces into one contiguous occupation and leave empty spaces alone OK, so that’s what the reduce operation does It takes the complexity of this collection of ranges and simplifies it by taking all the overlapping entities and just treating them as one contiguous region There’s also an operation called disjoin, which preserves all the break points that were present in the original collection of intervals So that adds complexity to what you have And then finally, there’s gaps So you can imagine that this could be, let’s say, a collection of exons for a given gene

And then the gaps would be the introns So that’s why these operations were defined And they’re implemented very efficiently OK Now, here is some code for working with these ranges in a more familiar way, in the sense that we can put names on them Just as with R we can take a vector of numbers and assign a character name to each one of them, we can do the same with our ranges So what I’ve done here is said I’ve got seven ranges I’m going to use the letters A, B, C, D, E, F, G to name those ranges And then I can select from them using names So this is associative memory I can use the names or I can use the positions for this vector-type thing And I get back the range that I was interested in Not only that, but it may be the case that I have some other metadata that I want to associate with each one of these ranges And this is a somewhat contrived example But all I’ve done is taken a data frame which has some numeric variables in it and assigned that as the mcols of my original collection of ranges So the mcols
is just this part And then I’m going to say, well, I’m going to associate a row in my data frame with each one of these ranges And I get this more unified object of metadata and range information in IR It’s been updated to include some additional variables If I then operate on it with one of these intra-range operations like resize, that’s fine The metadata gets propagated along, but the ranges have been shrunk according to the resize operation If, on the other hand, I want to look for the gaps in this set of ranges, it’s not clear what you would do with the metadata here, so it goes away And this little piece here is telling you that there are lots of methods defined for the class IRanges, 300 of them So there’s a lot that you could learn You don’t have to learn them all But developers have created many different approaches to working with these ranges, and all of them get used in Bioconductor’s activities Now, it turns out that we put the code in here, I guess, for the solutions for these exercises But let’s just look at this syntax, make sure we’re comfortable with it Write the code to determine the maximum width of ranges corresponding to six cylinder cars OK Let’s see how we would do that Here is our unified object We want to pick out the elements that have six cylinders So that’s this one, this one, this one, and that one, and then find the maximum width of those intervals And actually, we talked about the original intervals here So let’s look at the syntax here Write the code to determine the maximum width of ranges corresponding to six-cylinder cars Take the mcols Pick out the cylinder variable Check that it’s equal to 6 for those entries Take the width of the IR that satisfy that condition, and then get the maximum OK That’s R. 
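Spelled out as code, the selection pattern just described looks like this (a minimal sketch, assuming the IRanges package is installed; the interval coordinates and the mtcars-derived metadata are illustrative stand-ins for the notebook’s object):

```r
## Sketch only: intervals and metadata are illustrative, not the notebook's exact data
suppressPackageStartupMessages(library(IRanges))

ir <- IRanges(start = c(3, 8, 14, 15, 19, 34, 40),
              end   = c(14, 9, 19, 29, 24, 35, 46))
names(ir) <- letters[1:7]                        # named, vector-like selection
mcols(ir) <- DataFrame(mtcars[1:7, c("mpg", "cyl")])

## "maximum width of ranges corresponding to six-cylinder cars":
## mcols -> pick the cyl variable -> test == 6 -> width -> max
max(width(ir[mcols(ir)$cyl == 6]))
```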
That’s our user friendly language And here, show that the average value of miles per gallon for cars corresponding to ranges with start values greater than 14 is 18 something or other Do it again IR is like a vector So inside the brackets, I can put an expression that satisfies this condition that the start– which is another method on IRanges, it just pulls out the start values– if start is greater than 14, I want to keep those ranges, take their miles per gallon out of their metadata, and compute the mean OK So the syntax is a little bothersome But that is how you would solve those problems All right How are people feeling? You want to keep going? It’s 11 o’clock I think we can finish this one by lunchtime if we’d like There’s quite a bit But if you would like to take a break now, maybe we should take a 10 minute break AUDIENCE: Take a break SPEAKER 1: Take a break Got it See you at 11:10 And as I said, this is somewhat dry material And I had a question about visualizing these ranges Some of the PNGs are missing
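Before the break, the intra- and inter-range operations walked through above can all be tried in one cell (a sketch, assuming IRanges; the intervals are illustrative):

```r
library(IRanges)

ir <- IRanges(start = 5, end = 10)   # width 6, positions 5..10 inclusive
shift(ir, 2)                 # slide the whole range to the right
narrow(ir, start = 2)        # narrow relative to the range's own start
flank(ir, 3)                 # 3-base flank to the left of the start
flank(ir, 3, start = FALSE)  # 3-base flank off the end
flank(ir, 3, both = TRUE)    # flank straddling the boundary
resize(ir, 1)                # keep just the first position
ir + 2                       # extend both ends; ir - 2 contracts them

## inter-range operations on a collection
ir7 <- IRanges(start = c(3, 8, 14, 15, 19, 34, 40),
               end   = c(14, 9, 19, 29, 24, 35, 46))
reduce(ir7)    # merge overlapping ranges into contiguous blocks
disjoin(ir7)   # keep every breakpoint from the originals
gaps(ir7)      # the unoccupied spaces, e.g. introns between exons
```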

Do you want to say anything about going to the individual machines, or are we just going to keep going with what we’ve got now? SPEAKER 2: We can go– I think we should just keep going– SPEAKER 1: Let’s do it for period three maybe Yeah SPEAKER 2: Yeah, and then hopefully we can test it out– SPEAKER 1: OK SPEAKER 2: [INAUDIBLE] SPEAKER 1: During lunch, yes Yes OK So I’m sorry about some of the issues with the visualization And I don’t know whether you guys will already have PH525x I think you do So if you want to visualize any of this stuff, you could start bringing in library PH525x And then I’m going to use par, which is part of base graphics– and we’re going to talk a lot about visualization in the last section of the class, so that will be at the end of the day But inside PH525X, there’s this plot ranges function that can be used to look at all the different variations that we have talked about here So par mf row– AUDIENCE: Sorry, where are you? Yeah, where– SPEAKER 1: I just wrote this now AUDIENCE: OK SPEAKER 1: So this is a new cell Where we are, you’re going to add a new cell to your notebook under exercises, let’s say OK? So we’re scrolling down– SPEAKER 2: We got through 18.125 SPEAKER 1: Metadata and indexing You know, it’s just a new cell This is going to be self-standing So you can write this anywhere in the notebook AUDIENCE: How do you add a new cell? SPEAKER 1: Oh, yeah How do you add a new cell? Hit the plus button See that plus button up there? 
That will open up a new cell And you need to make sure that you’re writing code So when you make a new cell, it can either be markdown or some other thing or code And you want it to be code And then you’re going to open up that cell and put in this text Library PH525x brings in the code for plot ranges par mf row equals C4, 1 is something I haven’t told you about, but it basically says, take the plotting range and turn it into a matrix which has four rows and one column AUDIENCE: Can you increase the font a little? SPEAKER 1: I can increase the font AUDIENCE: Perfect SPEAKER 1: OK, here we go All right So this is good This tests your plotting, your typing, and your ability to cut and paste The pattern is I’m going to do plot ranges of something And the parameter xlim is going to be set for each one of these to the same interval, minus 3 to 40 And that’s going to give me an x-axis that goes like that And I want that to be the same for all of my plots And I’m going to do plot of gaps of IR, a plot ranges of resize IR 1, a plot ranges of flank IR 3, and a plot ranges of IR And at the bottom, that is the original set of ranges that we started with, IR So that’s why I put that at the bottom there [AUDIENCE MURMURING] SPEAKER 1: [MIMICKING MURMURING] OK So when you’re done, you will hit play And then you will see this picture [SIDE CONVERSATION] SPEAKER 1: If you don’t have PH525X– is that the problem? AUDIENCE: It says loading required package png [INAUDIBLE] SPEAKER 1: Ah AUDIENCE: And it’s tapping the page for the PH525x [INTERPOSING VOICES] I’m in the notebook for this one SPEAKER 1: You’re in the general notebook? Does it work or does it fail? 
AUDIENCE: Well, this is what I’m getting I think it might be– SPEAKER 1: Yeah, it’s working You’re fine Yeah AUDIENCE: Oh, it worked SPEAKER 1: Yeah The graphics may be slow to come on But if you type this in, you will get this [SIDE CONVERSATION] Come on, cut and paste Just cut and paste that four times and then change the inside [SIDE CONVERSATION] I see you squinting back there
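For reference, the cell being typed in this exchange looks roughly like the following (a sketch, assuming the ph525x package, which supplies plotRanges, is installed; the interval coordinates are illustrative):

```r
## Sketch of the plotting cell described above; intervals are illustrative
library(ph525x)    # provides plotRanges (loads the png package)
library(IRanges)

ir <- IRanges(start = c(3, 8, 14, 15, 19, 34, 40),
              end   = c(14, 9, 19, 29, 24, 35, 46))

par(mfrow = c(4, 1))                          # four stacked plotting panels
plotRanges(gaps(ir),      xlim = c(-3, 40))   # unoccupied spaces
plotRanges(resize(ir, 1), xlim = c(-3, 40))   # each range shrunk to one base
plotRanges(flank(ir, 3),  xlim = c(-3, 40))   # 3-base upstream flanks
plotRanges(ir,            xlim = c(-3, 40))   # the originals, at the bottom
```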

AUDIENCE: [INAUDIBLE] SPEAKER 1: Yeah, you’re in good shape there AUDIENCE: [INAUDIBLE] SPEAKER 1: Yeah, and the three You got rid of the three, as well AUDIENCE: [INAUDIBLE] Yeah Pull the trigger [SIDE CONVERSATION] SPEAKER 1: OK What do we got here? Library 525X– AUDIENCE: PH 5– SPEAKER 1: PH, right Can’t be– Scroll down It’s going to take a little while to render the graph, but I think you’re in good shape [SIDE CONVERSATION] How’s it going over here? OK, you have an error [INAUDIBLE] That one looks good Right there, you need to get that one parenthesis comma out of there And get the one Yeah Now comma And now I think you’re ready to go AUDIENCE: Thank you SPEAKER 1: Yep Everybody got it going? [SIDE CONVERSATION] All right, the reason we’re doing this is to just make sure that we’re all on the same page with respect to the interpretation of these functions And we’ll come back to reduce in a minute So IR was the original set of ranges OK 1, 2, 3, 4, 5, 6, 7 ranges, each defined by a start and an end And you can see them in the code that we gave originally So if I change that a little bit and I put IR there and we run that, there you are These are the original ranges And what we’ve done is visualize different operations If we run flank, we’re going to– and run flank with argument three– we’re going to get a three base pair– a three base flanking interval here For this one, we get that For this one, we get that and that, and so on If we change the flank to some larger number, then those flanking intervals will be wider OK? 
The resize takes each one of these guys and shrinks it down to a single base at the start There are other parameters that you can use to do other things with resizing But this one just takes each one of them and makes the seven ranges that consist of a single base or a single integer of width And then finally, the gaps says look at this pile here and show me only the regions that are unoccupied OK Now I’m going to change the flank one to reduce, just so we can see what reduce looks like We will run reduce on IR and run it again So now instead of flanks I have reduce What is reduce doing? If you look at your original– look at the original IR Reduce says I want to put together all the occupied regions in the smallest number of intervals that I can get So that means I just take this because it’s all occupied continuously and I get back one interval here And these two guys have no complications at all They just get preserved So this is a reduce of this original collection So one could say, well, if this is my pileup of reads, then this is maybe my transcript If you’re doing some de novo discovery, that’s the usefulness of reduce, one of the uses of reduce OK, any questions about this? AUDIENCE: If we add reduce to that window, it’ll just add another plot to the image? SPEAKER 1: You may get into a little trouble because if you add it here, you better increase that so that you have five rows OK? Coordinate the layout of the terminal, which

we don’t like to talk about, with what you’re asking it to do All right, a little improvisation to sweeten the exposure to IRanges OK Now, we haven’t talked about overlaps So now I’m going to take two groups of intervals I’m going to let IR one be the ranges that are the third and seventh and IR two be all the ranges excluding the third and the seventh That should be familiar to you from R If I want to keep just a few things, I put an integer index vector in my vector If I want to get rid of them, I turn it to a negative So now IR one is this pair of ranges And IR two are these five ranges All the metadata comes along And now I’m going to use find overlaps OK That’s a pretty natural thing to want to do Take IR one and IR two and look for overlaps Now, if we come back to our little graph here, it’s not going to be so easy to do because I haven’t defined IR one and IR two But I can take that code and use it here This is actually not a bad exercise Let’s say we just want to plot IR one and IR two I use plus here I already have my I range, my PH525X So I can do par mf row equals C2, 1 to have a two row plotting surface Plot ranges IR one xlim equals c(-3, 40) And plot ranges IR two xlim equals c(-3, 40) OK Let’s hope for the best No good I don’t have IR one or IR two, so I guess these were never run Let’s run them and then run our plotter There they are Now IR two is kind of large Let’s reduce the size of our– whoops Something strange happens when you– there We want to look at IR one and IR two in context AUDIENCE: Can you slide down to the code again? SPEAKER 1: The code The code is– AUDIENCE: There it is SPEAKER 1: Yeah I’m setting up a two row plotting surface And I’m just asking for the ranges from IR one and the ranges from IR two, a total of seven ranges And all I’m going to do is try and understand how find overlaps works So what do you think? If I want to find the overlaps between IR one and IR two, what should the answer be like? 
It’s kind of a good applied problem Suppose these are your reads and this is your gene model I want to know which reads are hitting my genes Or these are my promoters, something like that So what is that going to look like? Well, for each one of the reads I want to say, yes, it’s in IR one or no, it is not And likewise for IR two It’s a pretty complicated process You’re going to say, well, what’s the overlapping relationship here? So here it is Let’s say I do find overlaps I’ll blow up my fonts now If I do find overlaps of IR one to IR two– let’s run this here– what I get back is what’s called a hits object And it has three hits Why does it have three hits? Well, if I think about the query, IR one, it is going to overlap this one, this one, and this one I don’t think it overlaps this one Maybe it does Maybe it fails to overlap this one This one, this one fails to overlap OK So these three guys are somehow overlapping that And this one doesn’t get anything

So the subject of my query was IR one IR one, this guy, hit three of these intervals And this one didn’t hit any of them And the hits object has two components, the query hits and the subject hits And it’s telling me that of the query, IR one, only the first interval had any hits And for IR two, the first, third, and fourth had hits And that is in terms of the original ordering of the ranges in IR one and IR two OK So let’s just spell this out a little bit more IR one, IR two, IR one query hits of fo and IR two subject hits of fo We can run all that code We run the find overlaps and we look at it We go back Here’s IR one and IR two IR one had two ranges And IR two had five ranges And if I can ask which of IR one had overlaps, it’s always Mr. C. And if I ask which of the IR two were overlapped, then I find it’s these three ranges Because 14 to 19 overlaps at 14 14 to 19 overlaps with 15 to 29 And 14 to 19 overlaps with 19 to 24 AUDIENCE: Can you back up to the code? SPEAKER 1: Yep We’re just breaking everything up so that you can see what’s going on The find overlaps generates this hits object The hits object can be operated on by a query hits function or a subject hits function And that tells us which of these guys actually had overlaps with the query that we posed OK So that’s all abstract working with interval ranges that we don’t have any real interpretation of Let’s go into the world of genes and then maybe you will see more motivation from a genomics perspective And that is what G ranges are for Are there any questions about this overlap? 
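The query/subject bookkeeping just described can be reproduced in a few lines (a sketch, assuming IRanges; the interval coordinates are illustrative, chosen so the hits come out as in the discussion above):

```r
## Sketch of the findOverlaps pattern; intervals are illustrative
library(IRanges)

ir  <- IRanges(start = c(3, 8, 14, 15, 19, 34, 40),
               end   = c(14, 9, 19, 29, 24, 35, 46))
ir1 <- ir[c(3, 7)]      # keep the third and seventh ranges
ir2 <- ir[-c(3, 7)]     # negative index drops them, leaving the other five

fo <- findOverlaps(ir1, ir2)
queryHits(fo)           # 1 1 1: only the first range of ir1 hits anything
subjectHits(fo)         # 1 3 4: which ir2 ranges were overlapped
ir1[queryHits(fo)]      # the overlapping query ranges themselves

ir2 %over% ir1          # one TRUE/FALSE per ir2 range: five answers
```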
It’s a little complicated, but if you give yourself some time I think you’ll understand that this operation has some intrinsic complexity to it You’re asking, which part of this overlaps with this part of that And you have to coordinate that information and the hits object takes care of that for you One other operation is here I can use %over% as a binary operator, which will tell me for each of the ranges in IR one whether it overlaps with IR two or not Now, how about in the other direction? Anyone want to tell me what’s coming out there? What’s the length of this second one? Remember what IR two looks like This is it If I’m asking whether this IR two overlaps IR one, how many results am I going to get? One person has flashed a signal Four? AUDIENCE: Five SPEAKER 1: Five The answer is five I’m asking, for each one of these ranges, does it overlap any of the ranges in IR one? Yes or no, that’s all There will be five answers So let’s run it It’s this one here True, false, true, true, false So that’s telling me that the first one, the third, and the fourth overlap with IR one And that’s exactly what we saw with this query hits, subject hits The subject hits of our find overlaps is that 1, 3, and 4 have overlaps This will become second nature to you once you start working with genes and exons and so on OK Questions, comments? All right How are we doing? 11:31 Lunch will be served at noon, I assume? SPEAKER 2: Yeah, somewhere around noon SPEAKER 1: Somewhere around OK Always approximating Well, I think we can make some good headway here

We might not finish by noon, but we’re going to keep going So let’s start again GRanges is telling us a lot more than just ranges tell us Let’s look at what it does Well, there’s a seqnames The seqnames is coming at the issue that we’re generally going to work with multi chromosome organism contexts And so humans have 22 autosomes plus the sex chromosomes And you want to know which chromosome a given gene lives on So we’re going to always have this seqname structure It’s very general It’s not a chromosome It’s a seqname So they’re saying that this organism is going to have a genome that lies in a collection of sequences And we’re going to make a mapping from some entities to those sequences using the seqnames And you’ll see this little guy here, RLE That stands for Run Length Encoding And that’s a very efficient way of dealing with potentially repetitious information So for example, a lot of the data are going to consist of chr9 repeated many times And so we just use a run length encoding to count the number of repetitions and the value That’s somewhat technical, but those of you who like computer science details would like to know that The same is the case with the strand So we have a sequence We understand that many of our organisms will have multiple strands And then the genes are actually going to occupy certain intervals We use IRanges to deal with those And then we can have any kind of metadata we want Each range gets some vector of metadata elements Question? AUDIENCE: Can you use the run length encoding to find the number of repeats of chromosome eight, for instance? SPEAKER 1: Can you use the run length encoding to find the number of repeats of chromosome eight? 
Well, yes You can do all kinds of things with the run length encoding But I’m not– your question is complicated Let’s just take a quick look at it So I’m going to make sure I ran this And there we go Library Homo.sapiens takes a little while to load And there We have that now OK, so let’s look at the seqnames of HG And in fact, I will comment out HG now so this is all we see There So it’s getting complicated This is a factor RLE of a certain length with a certain number of runs And the runs are the important part, OK? And they’re not that much smaller in this ordering than they could be But you can work with this You can do something with the seqnames, let’s say save it to rs equals that And then you can start operating on it, for example, to see that run values of RS one to 20 That’s a new vector I think that works No, runValue, I think it is Yes, runValue like that and runLength, RS So I hope that answers your question OK, good So we have these wonderful things, HG, the human genes, with the obligatory metadata structure called seqnames that gives the chromosome occupied The strand– features have a biological direction, and that is there for you And in terms of IRanges, the plus strand go from start to end and the minus strand go from end to start We’re not going to get too deeply into this issue of strand But the information is there And it’s well-documented Now, here is an ordering of the genes so it makes a little more sense The first chromosome comes first, second chromosome the second, and so on And now the RLE is a much more efficient representation of the data But you might ask the question, well, why don’t we also put the genes in order in terms of the chromosome position? 
And that’s an exercise [INAUDIBLE] code that orders the ranges by starting position within chromosome And there are two ways of accomplishing that One is strand-invariant, so you just take the start position and you use that as a second argument to order Order can take multiple arguments And it just orders hierarchically

Or you can use sort So it’s interesting to know that these have different results So if I add this next one here– if I do sort of HG, it’s not the same When I use this approach to just order by start, I get the strands mixed up If I use sort, it puts the positive strand before the negative strand AUDIENCE: Can you go back to the code? SPEAKER 1: Going too fast? AUDIENCE: Yes What is the code? [INAUDIBLE] SPEAKER 1: Yeah And these are Entrez gene IDs So NCBI has annotated a given gene to this interval on the chromosome, told us what strand it’s on, and all the genes, all 23,056 are available to you in this way Question? Yeah? What? AUDIENCE: I just wanted to add, because I can’t help myself, [INAUDIBLE] Besides the [INAUDIBLE] stuff behind RLE, what else I think is so cool about it is you took these R courses before you started here and learned about basic vectors and stuff like that And now there are all these exotic new things like RLE and IRanges But most of the stuff you learn there still works, which is, I think, just amazing So if you want to know about the fancy stuff about an RLE, that’s fine But if you don’t, you just treat it like a vector like you learned, and that stuff still works like sort and order and square bracket subsetting just all still works SPEAKER 1: Yeah Here’s a nice one So I’ll just add one more cell And if you want to know how many genes there are on different chromosomes, table seqnames of HG should work And it does And so if you wanted to tell your friends, well, there’s different numbers of genes on chromosomes, you can start seeing that Now, that doesn’t look so good One of the issues here is that I have 96 some-odd chromosomes It’s not really what we intended So you would have to say something like keep standard chromosomes here So you just go and make the onion a little bit thicker Keep standard chromosomes And I think we have to say something like prune pruning.mode equals coarse I don’t know whether that will work Yeah I don’t know But 
there are ways to prune it down to the 23 canonical chromosomes And we’ll get to it later I just don’t remember the exact syntax So as I believe I was just saying, these GRanges can be treated as any standard vector You can just pull out four elements or you can sort them And you can set the strand to star if you don’t care about strand And then the strand will be ignored The seq info, as I mentioned to you before, is a separate data structure that tells you the lengths of the chromosomes because you might do operations on these ranges that actually take you beyond the end And you’d like to know that And you are sort of automatically warned if you’re using standard operations to do that that you’ve tripped over the edge of a chromosome And then how many circular chromosomes are there? Well, just the mitochondrial one And we did the table there So if I want to limit myself to just chromosome 22, this is the type of annotation I can use– seqnames, check it against chromosome 22, and I use which a lot, which just gives me the integer indices And this just gives me back the genes on chromosome 22 There are 535 of them And here is that keep standard chromosomes function with pruning.mode equal coarse It gets rid of all the random unassigned sequences, keep standard chromosomes OK So now there is also more structure, which is that genes are not just contiguous intervals

on the genome, but they are composed of UTRs and exons and introns And in this case, what I’m going to do, instead of using genes, I’m going to use a function called exons by on homo.sapiens with by equal gene And that is going to enumerate the genes And for each gene, it’s going to have a separate GRanges that tells me the positions of the exons OK Now, gene models may be complicated And we have our plot ranges, so why don’t we see whether we can actually plot one of these gene models here for ebg1, let’s say So ebg1 is the first element of ebg, this exons by result Now, that function is a little slow Exons by is digging into homo.sapiens and pulling out many, many exons And here is just the first one, the first gene Now, what I’m going to do is get the ranges out of there And I think if you say ranges of ebg1, you would just get the ranges Now, I should have just saved that Yeah, very nice So we’ll do that and then we’ll use our plot ranges There So now you’re seeing all the exons And the way plot ranges works is if some of the ranges that you’re asking to plot collide with one another, it’ll make another column So this isn’t the gene model in the sense that the transcripts are connected to one another This is just showing you all the exons and how they are sorted in this particular gene body We’ll go into visualization at the end, which actually will look at transcript– AUDIENCE: Can you scroll up a little bit? SPEAKER 1: Yeah, yeah Of course We took ebg1 and we just ran our plot ranges function on it So this is just to orient you with what’s going on when we get these exon collections from GRanges OK, so here’s our exercise Note that unlist ebg is efficiently computed and retains the gene Entrez ID as the name of each exon range Give the Entrez ID of the gene with the longest exon And unfortunately, the answer is there, but [INAUDIBLE] So what do we need to do to solve this exercise? Yes? 
AUDIENCE: We have to run it, then sort the columns and choose the– SPEAKER 1: Here, use this AUDIENCE: Oh SPEAKER 1: No, you AUDIENCE: Yeah I think we have to run the function and then sort and just choose the first in the sorted list SPEAKER 1: Well, which function are you talking about? AUDIENCE: Exons by That– yeah SPEAKER 1: OK, so let’s see what we got here I think we already have ebg, right? So now you’re saying we can do something with this to find the longest exon What are we going to do? Sort it? AUDIENCE: To find the largest range– I’m not sure SPEAKER 1: It’s OK So I gave you a little hint here We need to unlist So the exons by gene is a list of GRanges So let’s take uebg equals unlist of ebg Now I think we’re getting closer to what you’re proposing In that uebg, we can print this out, as well Notice that we have this nice way of abbreviating the screen dumps You know, if you list it out, these 270,000 exons, that would be a pain in the neck Your screen would just keep on scrolling on But we’ve put these nice– there are options that can say how many do I want at the top and at the bottom But that’s a very nice little feature there So you’ve got uebg and you’re trying to find the Entrez ID of the longest exon You have a clearer idea of how to do that now? It’s just R With uebg, one of the nice things you can do is say which.max width of uebg

I’m just using R here OK, so let’s think about what happened uebg is just a GRanges, therefore, width has a very trivial implementation for it Which.max goes and finds the index of the largest value of its argument And then I just use brackets to get back that particular range Well, fortunately, it keeps the metadata So I know now that I’ve got this gene– this is the Entrez ID of this exon, which has the largest width of any exon Now, here’s another problem What’s the distribution of exon widths in the human genome? Yeah? AUDIENCE: 1.5% of the– Human calculator SPEAKER 1: Oh, yeah That’s a great one All right Interesting So I was looking for the distribution in the following sense, hist of width of uebg Take a histogram It’s a horrible histogram because there’s one really large exon So we do something a little more informative by taking a transformation– let’s say I take the log Let’s do log 10 Now, I think your point is different You’re saying how much of the genome is made up of exons AUDIENCE: Right SPEAKER 1: OK, so now here I go I’ve got some information on the distribution of exon width and I can see that the mode is about 100 bases And if you look at the mode, you start to see some regularity You start to see, well, there’s a lot of exons that are length 96 and a lot that are length 108 It might be interesting to understand why that is But that’s something you can explore Now, your point was, well, let’s figure out how much of the genome is actually taken up by exons And one way to do that, I guess, would be to say, well, I’m going to take the reduce of uebg and take the sum of the width So then I’m putting together all the ones that are lying on top of each other And let’s see if we can get that No good Syntax error Fix the syntax There you go 83 million So you take 83 million and you divide it by 3 times 10 to the ninth, right? 2.8% You’re close I don’t know whether that’s the right answer or not, but that’s our calculation here, OK? 
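Put together, the exon computations just walked through condense to a few lines (a sketch, assuming the Homo.sapiens annotation package is installed; the exact counts and the coverage fraction depend on the annotation build):

```r
## Sketch of the exon exercises; results vary with the annotation build
suppressPackageStartupMessages(library(Homo.sapiens))

ebg  <- exonsBy(Homo.sapiens, by = "gene")  # GRangesList, one element per gene
uebg <- unlist(ebg)                         # one big GRanges of exons,
                                            # named by Entrez gene ID

uebg[which.max(width(uebg))]   # the single widest exon, with its gene ID

hist(log10(width(uebg)))       # exon-width distribution on a log10 scale

## rough fraction of a ~3e9-base genome covered by (merged) exons
sum(width(reduce(uebg))) / 3e9
```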
Right, so strand aware operations– remember now, we’ve got GIR here I’ll split this so you can see what I’m doing here And let’s not do any plotting quite yet Well, we’ll just go with it So what is GIR? Remember our old friends IR? 10 more minutes, guys Our old friends IR were seven intervals And what we’ve done is we’ve embedded them in a GRanges object and we put a chromosome on them and we told that the first four are on strand plus and the last three are on strand minus And so now we’ve taken this abstract set of intervals and put them on a chromosome And now we can use plot GRanges with all the operations that we learned about with respect to IRanges So for example, the flanks of GIR with start equal false, these are all going to be ending flank intervals And you have made some changes to this so that you can see better which strand we’re on SPEAKER 3: Yeah, there’s– I could give a link I just tutored something that’s just another plot GRanges function that colors the different strands different ways SPEAKER 1: Yeah Yes, that would be nice So within here, some of these are on the plus strand and some of them are on the minus strand, and it’s going to make a difference Let’s just try to understand the resize before we go on
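The GIR object being described can be built like this (a sketch, assuming GenomicRanges; the same illustrative intervals as before, with made-up strand assignments):

```r
## Sketch: the IR intervals embedded in a GRanges with strands
suppressPackageStartupMessages(library(GenomicRanges))

gir <- GRanges(seqnames = "chr1",
               ranges = IRanges(start = c(3, 8, 14, 15, 19, 34, 40),
                                end   = c(14, 9, 19, 29, 24, 35, 46)),
               strand = c("+", "+", "+", "+", "-", "-", "-"))

resize(gir, 1)                # strand-aware: the first base of each range,
                              # which is the end coordinate on "-" ranges
flank(gir, 3, start = FALSE)  # downstream flanks, also strand-aware
```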

So these are the original ranges on chromosome one And I’m going to run the resize command to get– what this would be is, let’s say if these were genes, resize GIR one would be, give me the transcription start site of each gene Give me the first base of the gene body And you can question that But that’s the basic idea So here we are This interval mapped to this This one mapped to this These are both on the plus strand, so that is fine This one is on the minus strand And its resize to one is to the right Likewise for this one So that idea of taking the strand context and reinterpreting these operations so that they make sense given the strand is taken care of for you OK Now I want to do some real human biology if you feel comfortable where we’ve gotten to now We’re going to work with the GWAS catalog, Genome Wide Association Studies OK Every month or so– this used to be at NCBI– they would go through the literature, find all the Genome-Wide Association Studies, take the SNPs, and see whether these SNPs have been replicated And if they were, they put them in a catalogue What kind of SNPs? Well, they’re SNPs that have been shown in some sort of cohort or population study to be associated with some phenotype of interest So here we go We have a library called GWAS CAT And there’s a data structure in there called EBI CAT 37 And it’s a type of GRanges It’s a slightly extended GRanges because there’s so much metadata recorded on each one of these studies that if you just treat it as a plain GRanges, you’d never be able to show it nicely So I wrote a little method that just wraps this information And it only gives you three of the pieces of meta data What is the disease? What is the name of the SNP? And what was the p value of the association? 
There’s a lot more information in there, which we could get at But right now these are what are presented in the show method So for here, we have 22,000 records recording phenotype SNP associations This is a fundamental tool of genetic epidemiology I’ve got links here For example, this link takes us out to the catalog And just so you get a little clearer on this, there’s this nice diagram You don’t have to do this But this is presenting it in another way Each one of these is a chromosome and what they’re showing you is how many different associations have been found for different phenotypes by analyzing the genome and genetic epidemiology Very impressive So you’re going to work with that data here You have it as a GRanges And so we run this piece We run the library, GWAS CAT And then we assign, because it has no– it has a certain genome tag here, GRCH 37 That’s how the EBI guys refer to this And the Genome Research Consortium comes up with their own build names But for us to work with what we’ve got, we need to use the label HG19 They’re the same genome build, a slight difference in the way they deal with the mitochondrion But let’s not worry about that So we’re just going to re-label it to HG19 And now we’re going to do something which is actually pretty amazing We’re going to ask for the overlaps between the gene bodies that we’ve got in HG and the Genome-Wide Association Catalog We want to find overlaps There are 20,000 some-odd genes and 22,000 associations in this version of the catalog And we get back the hits object So what is this telling us? The query was the set of genes The subject is the set of genome-wide association studies and their SNPs And this is telling us that the second gene in our gene catalog overlaps the 140th SNP in the catalogue of GWAS All right
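The steps just described condense to a few lines (a sketch, assuming the gwascat package and a gene GRanges like the HG object built earlier from genes(Homo.sapiens); here that gene object is called hg and is an assumption of the sketch):

```r
## Sketch; assumes `hg` is the gene GRanges built earlier in the session
suppressPackageStartupMessages(library(gwascat))

data(ebicat37)               # GWAS catalog as an extended GRanges, GRCh37
genome(ebicat37) <- "hg19"   # relabel the build so it matches hg's seqinfo

fo <- findOverlaps(hg, ebicat37)
fo                              # Hits: gene index vs catalog-record index
length(unique(queryHits(fo)))   # number of distinct genes hit by a SNP
mean(ebicat37 %over% hg)        # fraction of catalog SNPs inside gene bodies
```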

So that can be used to figure out which gene overlaps which SNP And you can ask very general questions For example, how many unique genes actually have associations of this type? Whoops, I didn’t run that yet So run this here and this here There’s our findOverlaps And now we’re going to get the length of the unique query hits So even though there are 12,000 events of overlap, you can see that there are some genes that overlap multiple hits So we want to just reduce down to the number of unique genes And that turns out to be 4,700 And then if we want to ask, what is the frequency with which SNPs lie in the gene bodies as given by our catalog, 51% of the SNPs in this catalog lie over genes Now, the genes are made up of introns and exons And so if I ask for the frequency over exons– and this is sort of a nuisance because most of these are not in coding regions Only 7% of our SNPs lie in coding DNA So there’s a mystery They must have regulatory effects They don’t just change the proteins that are being produced And that’s one of the reasons that genetic epidemiologists are in such heavy business And then here we are We can use promoters, for example This is a simple definition of promoters, just says go upstream from each of the genes 20,000 bases and then ask whether your SNPs overlie these promoters And that tends to be much more frequent than overlapping coding DNA OK? Any questions about that? These are pretty powerful questions that have been posed with just a couple of lines of code Yes? AUDIENCE: Simple question This works wonderfully for the human genome, but how about fly, or– fly or sea urchin or some other– cause this is all based on the– SPEAKER 1: If you did a GWAS on sea urchin– did you?
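The gene, exon, and promoter frequencies described above can be computed with a couple of lines each; a self-contained sketch, again assuming gwascat and Homo.sapiens, with results that depend on the catalog version:

```r
library(gwascat)
library(Homo.sapiens)

data(ebicat37)
genome(ebicat37) <- "hg19"
hg_genes <- Homo.sapiens |> genes()

fo <- findOverlaps(hg_genes, ebicat37)
length(unique(queryHits(fo)))        # number of distinct genes with a catalog SNP

## proportions of catalog SNPs falling over gene bodies, exons, and
## a simple 20kb-upstream definition of promoters
mean(ebicat37 %over% hg_genes)
mean(ebicat37 %over% exons(Homo.sapiens))
mean(ebicat37 %over% promoters(hg_genes, upstream = 20000))
```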
AUDIENCE: There’s a lot of GWAS on fly data SPEAKER 1: Yeah, great I don’t know where the GWAS catalog is for flies But if there is one, we turn it into a GRanges We do have a D. melanogaster G– we have these things And you would just plug in the catalog for your fly and the gene catalog for your fly and you’d do the same things AUDIENCE: Cool SPEAKER 1: All right Yes? AUDIENCE: Is there a way to create a catalogue for an organism? Because you can make the [INAUDIBLE] SPEAKER 1: Yes AUDIENCE: [INAUDIBLE] right? So is there something similar that will help you create catalogs or any of these sort of generic resources that are out there? SPEAKER 1: Yes, as a matter of fact, there is And you can use BioMart or you can use Ensembl They give you GFF files or other very standard representations of these We have functions that will convert them into these types of structures, yes I didn’t get into that here But maybe for the second course, if there’s a big interest in that, that can be covered OK, lunch is here And I think we can stop This is a pretty good point Any questions before we break for lunch? We can start in an hour’s time, I think OK, thank you SPEAKER 2: Again, please feel free to ask any questions during the lunch or [INAUDIBLE] Is that mic off? All right I hope everyone’s well-fortified after their lunch So one– I need to make a couple of quick announcements Please, you can save your notebook by going to File and clicking Download As And if you save it as HTML, it’ll be a fully encapsulated HTML file So do that for every period as you’re going through so that you can take it home with you But not to worry, we will also have these notebooks available afterwards, as well, online, in case And in a few days, we’ll also have the videos and things so that you can come back to the material if needed So with that, I’ll let Vince take over again SPEAKER 1: OK
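The GFF-to-GRanges conversion mentioned just before the break is handled by rtracklayer’s import; a minimal sketch, where the file name is hypothetical:

```r
library(rtracklayer)

## import a GFF3 of gene models, e.g. one exported from Ensembl or
## BioMart; the path here is a placeholder for your own file
gr <- import("dmel_genes.gff3")
gr          # a GRanges, with the GFF attributes as metadata columns
```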

Thank you, [? Ashok, ?] and thank you for coming back I think we are– let’s just review where we’ve come thus far We started out talking about R and Bioconductor, the purpose of it And we moved into data management And we created this thing called the expression set out of a couple of tables And I think that all went well We brought in some data from GEO And we saw that the expression set was used to represent 12 samples Then we saw that those samples had a structure There were six controls and six treated And within each of those groups, there were three normal cells and three glioblastoma cells So what you’re seeing is that genomic data can be thought of in terms of these tables, feature names, sample identification, and so on in a coordinated way And then we changed gears and we went into these IRanges And as I said, I feel that’s somewhat dry But it’s actually essential to this new generation of assays You see, when we were working with the microarrays, there was a fixed number of features that was spotted on the array or synthesized oligos And that was it You had everything that was on the array and nothing else In the domain of next generation sequencing, however you want to think of it, that fixed feature set no longer exists Instead, you have freedom to interpret the reads that you get out of your sequencer any way you like You can align them You can look for overlaps of certain genomic regions that no one else ever thought of And so to have the flexibility to deal with assays that do not have this fixed feature set in advance, we had to change gears We had to have a new paradigm of how to represent our data We want to keep this idea that there’s an object that coordinates all the sample information and the assay information We want to keep that idea We want to think of that as an R array so that there is X and then there’s a bracket and some feature selection and some sample selection and close the bracket And that makes sense But we don’t have the
convenience or simplicity of this fixed feature set We need to know how you want to organize the genome to represent your features as you count up the reads over the genome So that’s why I spend so much time on genomic ranges Now I want to move to something which I think is much more down to earth for many of you, how to use these genomic ranges to work with the outputs of experimental systems or assay technologies or what have you And we’ll start out with a collection of BED files How many people have worked with BED files? A good number of you OK So the definition of a BED file is it’s a textual file, typically tab delimited And it has information about an interval on the genome And there’s a very specific way of ordering the fields in the BED file so that this standard is met And tools that know how to produce BED files can operate effectively without any special interventions Now, what I wanted to do in order to think about genomic ranges in BED files was to have examples that are relevant to the epigenomics roadmap And just in case you haven’t gotten acquainted with that, a lot of this information comes out of the ENCODE project And the basic idea– I’ve probably done it too much here– is that there’s a lot of tissue-specific assays that have been done Many of them are ChIP-seq And we can initialize this grid visualization here to get some sense of what’s going on there Yeah There we are So this is the sort of thing we wanted to have some familiarity with These are histone modifications that ChIP-seq experiments are able to identify in genomic DNA And they all have certain biological implications for the regions in which they are identified That’s one of our dimensions here And then another dimension is the tissue from which we’re going to derive DNA

and all its epigenetic modifications and try to figure out what those modifications are, tissue by tissue OK So this is sort of bread and butter bioinformatics You have to be able to work with something like this in order to understand how to solve questions about gene regulation, and ultimately understand how these tissues are differentiated from one another, how disease tissues are differentiated from healthy tissues, and so on OK That’s why the erma package was created to help people understand how Bioconductor can work with data of this nature Now, we can run this little block here to load the erma package and make something called an erma set And an erma set object– here it is– has 31 files So it manages information about 31 files that are sitting on your computer, in my computer, or on the computer where you’ve got your R session running And they are all BED files, gzipped BED files In fact, I’m pretty sure they’re tabix indexed so that we can make quick extractions given addresses that we are interested in So if you have a bunch of BED files, you can wrap them up in something called the genomic files object And this erma set is just a slight variation on genomic files– again, dealing with the fact that we want to be able to manage the data And this is a very nice little technology here The DT package will take a data frame and generate this searchable HTML table for you where you can get all the information about the different BED files that have been organized by this epigenomics roadmap So there’s plenty of metadata there, all organized and bound into this erma set So if you have a bunch of BED files and you have a table that describes each one of them, this table can be put into this genomic files object as the col data So we’re thinking of features as rows, columns as samples, and col data is the information about the samples There’s a new model It’s not expression sets It’s a bit more general AUDIENCE: Does the erma set always populate with whatever’s on the
local disk for BED files? SPEAKER 1: No, I had to– this is a special little function here Let’s take a look at it makeErmaSet OK, so we will run this cell here Here’s the function So it turns out that there’s a collection of BED files sitting in a folder And it’s actually with the installation of the erma package This set of BED files is sitting in there And we can get the paths to them And we supply those paths to this genomic files function And then we take care of some issues with the col data And then we hand back this erma set So if you had your own BED files, you’d be up somewhere around here saying, I’m going to set up this genomic files object That’s going to give me one object that refers to all these files I’m going to set up the col data which tells me about each one of them And then I have a genomic files object that’s organized OK All right, now the way that the epigenomics roadmap works with the data is somewhat involved But what they did is they tell you something about the origins of the DNA that was analyzed They come from different cell lines, many different types of T-cells I only took a small number of the hundreds of epigenomes that have been analyzed now and I was focusing on immune cell subsets So that’s what this dataset is focused on And there’s a field in there called anatomy And you can use R, then, to get the anatomy variable out of this erset and tabulate it and see that most of the data is coming from the blood There are seven samples from brain and so on And there are more details about the actual tissues that were sampled And you can get that by looking in that table All right Now this is where things get really interesting I have this erset It’s made for you by the erma package I can use the rtracklayer package to import a piece of one of those files Which piece? Well, I’m going to use the GRanges to say I want all the positions between one million and 1.1 million

on chromosome 1 What do we know about this particular epigenome in terms of what’s going on in this part of the genome? Well, according to the assay that was used and its interpretation through something called ChromHMM, which was developed at the Broad Institute, this part of the genome for that type of tissue is a bivalent promoter component This part is repressed polycomb You have to know these abbreviations I think I have a link to it somewhere here where you can learn more about how did these guys figure out to call this part of the genome bivalent promoter, this one poised promoter, this one a transcription start site, and so on What we’ve done is we’ve taken genomic sequence and labeled it according to its likely regulatory function And there are ways of analyzing sequence or analyzing patterns of methylation and histone modification and so on that lead to these classifications They are statistical in nature You might want to actually get some of this tissue and see whether this is the right– you have some wet lab assay to see whether this is actually the right way of thinking about this piece of the genome But here we’re just using the statistical inferences of Manolis Kellis’s lab to come up with these ways of characterizing DNA from this particular cell type OK, so here we go That’s a look at a particular piece of the genome for a particular cell type And the next one is looking at the next cell type And there it is And you can see that in this interval, there is a slight change when we get to– oh, I don’t know– yes, we look at the end towards the 1.1 million The first cell type is said to be an enhancer but the other one is a promoter So there are different epigenetic patterns observed in these two cell types at that interval of the genome OK Notice that there are 68 intervals for the first cell type, 72 for the second There is no guarantee that the intervals are common across the different cell types That’s the nature of the BED files The BED files just 
tell you a bunch of intervals There’s no restriction to a common set of intervals over a set of BED files Now, here is a way of looking at changes in state for the promoter region of a given gene for a number of different cell types So it’s called stateProfile It’s a function And remember, I’m using this annotation I’m thinking of my erset as a matrix This says I’m looking at all of the– I’m not going to restrict it with respect to the features I’m going to restrict it with respect to the samples And the samples that were actually retained are different tissues And we’ll go down to the bottom here So what we’re seeing is that the region that we looked for is just a little bit upstream of IL33, I think it was And so that corresponds to a certain set of genomic locations And if we have tissue from the adult lung, we see that there is enhancer activity in the promoter region, let’s say here But in the fetal lung, there is more of a poised promoter or perhaps an enhancer here OK So this is helping you to summarize possible differences in epigenetic state around a given gene, the gene we selected here, among these different tissue types IL33 is the gene OK So what we’re saying is that by managing our BED files in this way and writing some functions that extract information about the different epigenetic states that we have, we can make a display of this type to show people that there may be interesting regulatory differences for this gene that differentiate cells from different organs So to summarize, BED files are representations of outputs of genome scale experiments You can have a BED file that’s just telling you where the peaks are, let’s say, in a ChIP-seq experiment Or you can have a BED file that’s telling you a refined interpretation of a bunch of ChIP-seq experiments, where what’s been recorded on each one of the intervals is just some aspect of chromatin state

or an inference about regulatory state Collections of BED files can be managed with genomic files objects We have something called col data which tells us about the samples We can restrict information, restrict attention using row ranges And we can use import in the rtracklayer package to get the information that we’re interested in into our R session And state profile is one way of visualizing these things Yes? AUDIENCE: So I have a question about sort of doing most of the analysis using these kind of objects, for example, this is an erma-type object But let’s say that I only want to do half of the analysis in this particular pipeline, but then I want to take the data out and do a ggplot figure or something like that How can you basically subset any of this data and turn it into a regular matrix for ggplot, for instance? SPEAKER 1: Yeah, that’s a good question So as you say, ggplot knows how to deal with data frames You take data frames, you make some bindings for the aesthetics, you get some geoms, you get some stats We’ll talk about that at the end of the day Do you want to export data from here for the sake of a customized visualization? You can do that Let’s see what’s going on in state profile, which may help you see where that information is coming out OK? Because state profile may well be a ggplot I think it is, yeah So let’s quickly look at this function It’s actually running the state profile There it is So you’re getting into some advanced stuff here But let’s see if we can read it OK?
So the gene model for the symbol that we’re asking for is just going to give me a GRanges for that interval, not very interesting Then we resize it to find the transcription start site So that’s something you already learned about Then we have a promoters method here, which just goes and finds the upstream region that we’re interested in OK, no big deal Then over here, we’re going to run this thing called liberal import And the liberal import is going to each one of the BED files and getting the numeric data you might be interested in, or the character data, for all of the intervals So it’s this import step that’s going over each one of the elements in this object that’s actually extracting character or numeric data And then, why we use liberal import is a long story It has to do with the fact that there’s no coordination between the BED files on which intervals are actually going to be pulled Now, here we are building up There you go ggplot, you see? So everything that’s going on here is taking that data, which is all this uncoordinated, smoothing it out so that it can fit in a dataframe, and then running ggplot with geom_rect in order to get the different colors in place, and so on So it’s all in the source there to do that And it’s up to you If you don’t like this particular way of doing it, maybe you can pull this code and make some changes and then get a different dataframe that you do something else with, OK? All right I had no idea I could answer that But you know, you’ve got the R there, so you can often get somewhere OK So, any other questions? 
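The mechanics just walked through can be recapped in plain code: wrap a set of BED files in a GenomicFiles object, then pull a restricted slice of one file with rtracklayer’s import. The paths and metadata below are hypothetical, and the which= restriction assumes the files are tabix-indexed:

```r
library(GenomicFiles)
library(rtracklayer)

## hypothetical paths to bgzipped, tabix-indexed BED files
bed_paths <- dir("my_bed_dir", pattern = "bed.gz$", full.names = TRUE)
gf <- GenomicFiles(files = bed_paths)

## one row of metadata per file; the column name is illustrative
colData(gf) <- DataFrame(anatomy = rep("BLOOD", length(bed_paths)),
                         row.names = basename(bed_paths))

## import only the intervals falling in a region of interest
roi <- GRanges("chr1", IRanges(1e6, 1.1e6))
seg <- import(files(gf)[1], which = roi)
```

This is the same pattern makeErmaSet follows internally: paths in, genomic files object out, with the sample table bound in as col data.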
That was a good one All right Let’s go into one exercise here There are 15 samples with anatomy labeled blood List their cell types using the standardized epigenome dot name component of col data Well, I think the answer is here Let’s just go over it I can subset my erset using the column index to just take those which have anatomy equal to blood And then once I’ve got that, I run the col data function to get out the col data and just pick out the standardized epigenome dot name And then I get this list, which is somewhat unpleasant to read So if we change it a little bit– given our wonderful IPython notebook or Jupyter notebook, if you just change it a little bit, you can do that And now you get a beautiful table And I think I did it twice OK That’s the exercise Can you go in and find the variables and tabulate them the way we ask? Yes? AUDIENCE: Can you scroll up? SPEAKER 1: Oh You know, it’s ugly, dollar signs, double equals,

dollar sign, and so on But once you get that sense that I can go and get some metadata by using dollar sign here, subset the big data by doing that, and then get another piece of metadata out using dollar sign again, you can do such things very naturally Maybe they’re a little too natural for me I need to forget All right, now we’re going to go to something which is more for the RNA-seq people AUDIENCE: Can you go back to that– SPEAKER 1: Ahh You didn’t get your dollar signs All right The next topic is going to be for RNA-seq mavens Any of you who are familiar with the Tidyverse or Hadley Wickham’s stuff in R will know that this is anathema to people, where you have to constantly refer to the object as you make subsets And that’s why we have this wonderful dplyr chaining of arguments and so on It’s not well-developed for Bioconductor tools yet, so you wind up having this But it all makes perfect sense And once you get used to it, it’s fine OK to move on? All right BAM files So there’s a very nice package called RNAseqData By the way, I always put the links into the packages– or I try to, anyway So if you click on that link, I open the link in the new window, you go to the man page– not the man page, but the package download page for the package that has been referenced And this is telling you all about the package here And it includes links to the place where you actually get the data and so on So it’s nice to have these Jupyter notebooks with all their hyperlinks In this case, we’re going to run a little bit of code that works with this RNAseqData package And what has happened is, I think there are 8 files, right Eight files from this study where only the reads that map to chromosome 14 have been retained So we make a genomic files object And this is just a constant This is just a list of eight file paths And you get them So now you have a new object called gf And you see some hints here Here are the files we’ve got There are eight of them And we can use things
like files or row ranges or col data to work with this to make extracts to learn about what’s going on Now, the features that we’re interested in are defined by GRanges It turns out that this is a GRanges corresponding to this gene, HNRNPC, which was knocked down in this experiment So what I’m going to do is take this region of interest, a GRanges, and bind it into the gf object as its row ranges And now it’s no longer a genomic files object with zero ranges It now has one range OK? It’s still just an object There’s not a lot of data in R. It’s just sitting on disk And I’m going to do a little bit more I’m going to make a col data for it which distinguishes the wild type samples from the knockout samples That’s just a data frame with this vector as the variable trt And now we’re going to run– make sure I ran this here Run this here Keep going OK Now we’re going to run something that’s a little involved I’ll just start it now We’re going to use a function called summarizeOverlaps that will work through the files by looking in its row ranges and determining which of the paired end reads fall into the range that we have asked for, which is specifically the region for HNRNPC I put a suppressWarnings around it because there are some unaligned reads, I think, that it wants to complain about We just leave that off for now And there it is It’s done So what happened? We ran this function, summarizeOverlaps
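The steps just described can be sketched end to end; this assumes the RNAseqData.HNRNPC.bam.chr14 package supplies the BAM paths, and the HNRNPC coordinates and sample labels below are approximate, for illustration:

```r
library(GenomicFiles)
library(GenomicAlignments)
library(Rsamtools)
library(RNAseqData.HNRNPC.bam.chr14)

## eight BAM files restricted to chr14 reads
bams <- RNAseqData.HNRNPC.bam.chr14_BAMFILES
gf <- GenomicFiles(files = bams)

## bind in the region of interest (the HNRNPC locus, hg19, approximate)
## and a col data distinguishing wild type from knockdown samples
rowRanges(gf) <- GRanges("chr14", IRanges(21677296, 21737638))
colData(gf)$trt <- rep(c("WT", "KO"), each = 4)

## count paired-end reads falling in the region, per file
se <- suppressWarnings(
  summarizeOverlaps(rowRanges(gf), BamFileList(files(gf)),
                    singleEnd = FALSE))
assay(se)
```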

We have one region of the genome that we wanted to look at And it created this thing called a RangedSummarizedExperiment This is the new version of expression set for sequencing data And it has a schema I think I show it here There So remember the old– the freehand that I showed you for the expression set This is a nicer picture that was actually published in Nature Methods that shows how we organize data on sequencing We need to know information about all the different genes that we’re going to be studying, or any type of feature And in particular, we’re probably going to have ranges that tell us where on the genome they live And those are the ranges we’re going to use for our read counting Then we have the actual data These are the matrices, usually one matrix for the counts, let’s say But we might want to have another matrix of the same dimensions for FPKM or what have you And then, as usual, there’s information about each one of the samples stored in another table And then there’s information about the experiment as a whole And these all get wrapped together in one entity, which we call an instance of the summarized experiment class And what has happened for us here is that we’ve gone from a bunch of BAM files to a summarized experiment Now, it’s a very special type of summarized experiment because it only has one feature that was studied But if you had a bunch of ranges in this row ranges, then there would be one feature measured for each one of the ranges there And there’s now a method– it’s not exprs anymore to get the expression data out, but it’s just something called assay, which actually pulls the data out So let us run this little assay command here Make sure that runs And then we see this table And the nice thing about the table is that these were the wild type samples and they actually have a reasonable number of reads falling into the HNRNPC interval And the knockout have relatively low numbers OK And when Levi speaks, he will tell you how to do
this the right way, and do correct statistical inference on the differential expression in RNA-seq studies I’m just showing you how we manage the data OK? All right So we are finishing up interval two This is another experiment This is called the airway experiment And all you do is bring in library airway and data airway This is a much different structure There are eight samples, but information is recorded on all of 64,000 Ensembl gene entities, and some other things And this is data that I guess came out of SRA And the col data is as you would find in SRA And if you run these commands, all of this gets taken care of for you We package the data as an illustration of how to work with count data in a relatively small RNA-seq experiment And what we illustrate here is as you know What is this expression doing? It’s taking the airway summarized experiment and subsetting down, taking only the first six genes and the first five columns, using the brackets, and generating a new summarized experiment which has only those features and those samples, to which we can apply the assay method to get back the counts Now, when we look at the row ranges, the row ranges actually record more than just the gene intervals They also tell us the addresses of the exons And the counting of this experiment is described in these notes, how to do the counting We’re just working on the data structure right now The col data looks like this There were some samples These are actually paired samples Some of them were treated with dexamethasone Some of them were not treated And this part of the experiment, they also did some treatments with albuterol, but none of the data

that we reserved in this experiment have albuterol treatment So the differential expression study you might like to do would be to compare dexamethasone to control And here, we have a variation on the two-sample plot that we did previously Oh yes, create the visualization comparing expressions So this is an exercise How do you run this two-sample plot SE on this thing? Well, you would put airway as your summarized experiment The stratvar is dex And the row name is this Ensembl gene Oops Something else is on the clipboard here OK, let’s run There OK, so the paper was about the fact that dexamethasone seems to knock down this CRISPLD2 gene, and we’re able to see that in the raw data doing something like this It’s the same sort of thing we did with the expression set We need to pull out the assay results We need to find the piece of the col data that corresponds to the samples that we want to stratify And then, we put some labels on them, and that’s that You get a box plot So we’ll leave that there Yes? AUDIENCE: I had a question SPEAKER 1: Yeah AUDIENCE: I noticed that the erma set you showed us before is derived from summarized experiment And I was wondering if– SPEAKER 1: Genomic files, I think, no? AUDIENCE: Hm? SPEAKER 1: I thought it was a genomic files instance AUDIENCE: The erma set?
SPEAKER 1: Yeah AUDIENCE: I don’t know I just did is– well, maybe you could show us how to– SPEAKER 1: Sure Let’s take a look AUDIENCE: –explore this SPEAKER 1: Yup AUDIENCE: I was just wondering how that affects the user SPEAKER 1: Good question So ER set here– oops I have to plus this There So ER set is an erma set object And we can learn more about it by running getClass on erma set And getClass erma set tells me all the different slots of the class We’re not getting into that But I think what you’re seeing here is that genomic files actually extends the summarized experiment I don’t quite understand that But the class relationship is like this, and there must be a good reason for it There’s some sharing of code that is ultimately helpful for genomic files But I wasn’t thinking in those terms You must know AUDIENCE: You want the mic? AUDIENCE: Well, one of the major classes that Bioconductor uses is the summarized experiment And then, a lot of them just extend that So that I think genomic files makes use of summarized experiment as an extension AUDIENCE: Obviously, you’ve got the col data SPEAKER 1: Yeah, maybe just to get the col data or something like that, yeah You wind up extending the class All right, those are minutiae So we’ve gone through reading and counting and visualizing for these data So let’s wrap up This was all about IRanges IRanges represent intervals GRanges put those intervals in a genomic context You know lots of operations about them The GWAS catalog can be represented as a GRanges And findOverlaps can be used to look at different things to see when they coincide Different types of things on the same genome can be looked at together using findOverlaps Then, genomic files manages collections of BED or similar files We went through a lot of stuff with that, specifically that you can bind a region of interest to a genomic files instance And then, your operations just pertain to that region of interest summarizeOverlaps will be used to get
read counts in regions of interest And finally, SummarizedExperiment and RangedSummarizedExperiment coordinate information about all these different things– samples, features, and assay results And it works as we’ve shown a number of times Questions, comments? So we’re going to move on now to annotation
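The coordination just summarized can be seen in miniature; every value below is invented, purely to show the shape of the container:

```r
library(SummarizedExperiment)

## toy count matrix: 5 invented features by 8 invented samples
counts <- matrix(rpois(40, lambda = 10), nrow = 5,
                 dimnames = list(paste0("gene", 1:5), paste0("s", 1:8)))

se <- SummarizedExperiment(
  assays  = list(counts = counts),
  colData = DataFrame(trt = rep(c("WT", "KO"), each = 4)))

assay(se)       # assay() is the analog of the old exprs()
se[1:2, 1:3]    # bracket subsetting works on features and samples at once
```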

So save out your Notebook And let’s move on to the notebook for period three And remember to save a version of the Notebook with your name So we’ll let everybody make their change, and then we’ll go through the roadmap here Well, who cares about genomic annotation? I think everybody must And Bioconductor started out with three basic things Pre-processing– because stuff was coming off microarray machines, and it needed a lot of work to make it actually interpretable Annotation– because once you did the analysis of the properly normalized microarray, you had to figure out what was going on In particular, you had these probes which were interrogating genes, but you needed to put the genes together, put them in context, link them to gene ontology, put them in terms of pathways and so on That was the second thing Pre-processing, annotation, and analysis– so how do you actually do the analysis, fit linear models and so on– those were the three things going on in Bioconductor And now, annotation has become much more complicated We need to have the actual reference sequence, all kinds of catalogs of regions of interest, and pathways And there’s no consensus about what the right way to do this is, and there are different things for all different kinds of organisms Well, we need to simplify the interface to all these things And there’s a system of packages called OrgDB There’s reference sequence that we want to be able to work with There are different builds of the human genome that we may need to have simultaneous access to There are all these transcript and exon
catalogs from different institutions for many different model organisms And then, there is finding and managing gene sets These are topics we’re going to cover here We’ll talk a little bit about ontology and round trips, getting data in and putting it out, and finally, this thing called the annotation hub, which is pretty amazing but is somewhat complicated as well So our hierarchy, as I’ve said already– genome sequence, regions of interest, and functional properties that may be cataloged in pathway catalogs or what have you So we run our first chunk, which is to get all these packages into our workspace And then, we want to do something that makes sense early on So let’s look at this Let’s think about what we will do in order to work with data on humans We’re going to bring in this package called org.Hs.eg.db And there are a lot of packages like this If you work on mouse, you can put capital M, small m On rat, you can put capital R, small n We may not have installed them all, though, so you may have to install some of them If you work on yeast, you would put Sc But then, I think you have to put sgd here, because it’s not an Entrez Gene catalog But it’s from the yeast database out in California And if you were doing dm, I think it would also be eg, but there might be a different institution that does the fruit fly So there’s this pattern of package names that pertains to all the major model organisms, and within each one of those, you’re going to have a bunch of fields that are recorded for each one of the genes that the database has information on So for example, Entrez ID is the major key for getting into this org.Hs.eg And for any given gene, there may be some RefSeq IDs There may be Ensembl gene identifiers you’d want to know about, the transcripts, the proteins that have been mapped by Ensembl, gene ontology categories, and so forth So how do we work with these?
Well, we use a method called select So get into one of these packages You can set some keys to be, in this case, symbols that you’re interested in learning more about, and you can ask what types of things you want to record in columns And in this case, we took Entrez ID and Ensembl for the symbols BRCA2 and ORMDL3 Anybody want to give me a gene that they’re interested in? Somebody

AUDIENCE: CFTR SPEAKER 1: CFTR, good choice Put quotes around it, and let’s hit– there we go Now, we have more information about another gene Now, how about another thing you’d like to learn about? Gene ontology– put that in the columns Now, I may have done it again Now, let’s go and run it again Now, what happened? AUDIENCE: [INAUDIBLE] SPEAKER 1: Well, look at that, yes So now, what’s going on? We originally had just three lines, but these genes have been annotated to so many gene ontology categories that the table gets very long Familiar situation? There are many different functions that seem to be annotated for BRCA2, and all of them are labeled in these gene ontology categories Now, shall we look one of these guys up? Let’s do that We need a different data set We need GO.db So I will just add in this new cell here, and we’ll do select GO.db key equals that And if I say columns equals term, I think that will work Oops Argument two– I need to say keys I guess There So that’s telling us that one of the things that BRCA2 was annotated to is DNA synthesis involved in DNA repair So is that pattern clear enough? 
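The select() pattern just demonstrated can be sketched roughly like this — assuming the org.Hs.eg.db and GO.db packages are installed; which GO ID comes back first depends on the annotation version, so the last call is illustrative:

```r
# The select() lookup pattern: keys in, columns out
library(org.Hs.eg.db)
library(GO.db)

# Map gene symbols to Entrez and Ensembl identifiers
AnnotationDbi::select(org.Hs.eg.db,
                      keys    = c("BRCA2", "ORMDL3", "CFTR"),
                      columns = c("ENTREZID", "ENSEMBL"),
                      keytype = "SYMBOL")

# Asking for GO annotations expands the table: one row per GO category
go_tab <- AnnotationDbi::select(org.Hs.eg.db, keys = "BRCA2",
                                columns = "GO", keytype = "SYMBOL")

# A second database resolves the opaque GO IDs to readable terms
AnnotationDbi::select(GO.db, keys = go_tab$GO[1],
                      columns = "TERM", keytype = "GOID")
```

Writing AnnotationDbi::select explicitly avoids clashes with other packages (notably dplyr) that also export a select.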
You look up some keys in one database There may not be resolution You need to find another database to find out what these things mean And with select, you have to tell it what the keys are Maybe sometimes you have to tell it the key type and then what columns you want back, and you get them back So here we go again I think this is just pretty much the same process but showing us that just for BRCA2, there are many different gene ontology category and pathway combinations In this case, we’re getting pathways from KEGG, Kyoto Encyclopedia And again, our syntax is what makes it all work in the sense that if I want to filter down to those gene ontology attributions that are traceable author statement, then I use R to go and take only the rows of this table that are traceable author statement attributions So they’re– AUDIENCE: Is that because you can’t use the– what is it– the path as a key for the evidence [INAUDIBLE] that you had? SPEAKER 1: That’s a good question Can you use the evidence as a key here? If I use the evidence as a key, I’d get every gene that had a given evidence type I’m not sure it’s among this So that’s a good question Let’s see You can actually run keytypes here of org.Hs.eg.db What are the legitimate things to ask for in terms of keys? And you could see here that evidence is one of them So if I used evidence as a key, I would get back lots of genes that have been added in a certain way AUDIENCE: How is this different from using something like BioMart? Which one is more up to date in terms of [INAUDIBLE]?
SPEAKER 1: It’s a very good question I don’t know that BioMart has a much longer life to live But assuming that it does, the only issue with BioMart is that it’s constantly changing, whereas we’ve derived a lot of annotation from BioMart and versioned it So you know what you get when you use a given version But you’re welcome to use BioMart We didn’t do any exercises in that, but there is a biomaRt package And it works fine to get those types of annotations AUDIENCE: [INAUDIBLE] SPEAKER 1: All right So using GO.db and then using KEGG through the RESTful interface– so here we are with our KEGGREST package KEGGREST is a package that will go out on the internet and execute commands like keggGet You give it the name of a pathway in its own terminology It’s one of the things that BRCA2 was annotated to And it comes back with a list We can get elements off that list In this case, the first two elements are the entry, given KEGG name, and the name, which is homologous recombination,

and then a list of genes that are in that pathway In this case, there are 41 So you have the Entrez IDs and you have the names of the genes And finally, with respect to KEGG, you can also get a picture of the pathway that you’re thinking of here So this is a simple raster image that describes this pathways in cancer pathway And it’s a bunch of genes connected to one another and different modules that you can drill down on if you learn a lot about KEGG So this is very straightforward stuff I wanted to put it in as sort of a taster So if you’re interested in thinking about genes in the human or any other organism, these are the patterns you’re going to use The org package will tell you about the basic gene to function annotations And then, things like KEGG or GO.db can be used in the way I’ve described So now, we will talk briefly about the Ensembl resources for humans It’s a different package, a different set of packages And it has annotation that I think more people are more enthused about than the NCBI annotation You have to bring the libraries into scope, and then you can run code, like select, with certain keys and certain column names And you get back what Ensembl has to tell you about these things, in particular the positions, sequence name, and then you can get the protein sequence that they’ve recorded when it’s available Now, what are the organisms for which we have packages that describe the full genomic sequence?
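A rough sketch of those KEGGREST calls, assuming the KEGGREST package is installed — hsa03440 and hsa05200 are KEGG’s standard identifiers for the homologous recombination and pathways in cancer pathways:

```r
# Query KEGG over its REST interface
library(KEGGREST)

hr <- keggGet("hsa03440")[[1]]   # "Homologous recombination" pathway
hr$ENTRY                         # KEGG's own identifier for the pathway
hr$NAME                          # the human-readable name
head(hr$GENE)                    # alternates Entrez IDs with gene descriptions

# The pathway diagram itself can be fetched as a raster image
img <- keggGet("hsa05200", "image")   # "Pathways in cancer"
```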
There’s a nice package called BSgenome And it has a function called available.genomes And if I go and run that function and then make a data table, we then see that there are 87 entries here And you can search to find your organism of interest We’ve already had someone ask about fruit flies, so there you go You can search and find that there are multiple builds of the fruit fly genome available to you in terms of the full genomic sequence through the BSgenome package Now, I always make a big deal about the fact that the build version is very important, and it is definitely possible to combine information from one build with information from another one without knowing about it And a lot of us are still working with build 37 or hg19, which is now eight years old, even though build 38 is five years old I don’t know when the next one is scheduled What I really think is important are two things First, when you have some data, mark it, stamp it with information about the build that it came from We try to do that when we deliver data But you may create data, and you have to do it yourself The other thing to know is that you can use liftOver within R if you need to translate regions from one reference to another So there are these things called chain files, and we have a workflow on Bioconductor to show you how to use it But in essence, you take a GRanges You run a liftOver on it You get the GRanges that map to the new build or map from a new build to an old build using that utility So just to give you a taste for the actual genomic sequence for hg19, we have this code here This package is 800 megabytes if you bring it in, so it is a bit of a wait to install But once you get it, you can get, for example, the entire sequence of chromosome 17 very conveniently And really, 81 million bases are there for you in a very compact representation And then, you can take GRanges to go and pull out the sequences for the different genes if you’re interested in doing that But usually,
people do not need the sequence for what they want to do with RNA-Seq
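The sequence retrieval and liftOver steps described above might look like this — the hg19 interval and the chain file path are placeholders; chain files such as hg19ToHg38.over.chain can be downloaded from UCSC:

```r
# Full genomic sequence via BSgenome, and coordinate liftover between builds
library(BSgenome.Hsapiens.UCSC.hg19)
library(rtracklayer)

head(available.genomes())        # catalog of installable genome packages
chr17 <- Hsapiens$chr17          # ~81 million bases, compactly stored
length(chr17)

# liftOver maps a GRanges between builds using a UCSC chain file
gr <- GRanges("chr17", IRanges(41196312, 41277500))  # an hg19 interval (illustrative)
ch <- import.chain("hg19ToHg38.over.chain")          # hypothetical local file
gr_hg38 <- unlist(liftOver(gr, ch))
```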

The aligners take care of that And they really just want to work with the addresses of known genomic elements And that’s where these TxDb packages come in I would like to know all the exons and all the transcripts and so on that are available in a given build And the metadata about these TxDb objects is pretty extensive We’re getting them from UCSC, the genomes hg19, and the date and time at which it was done is all recorded for you And then, as we’ve done before, you can run a genes method on this and get back a GRanges that has the intervals for each one of the genes And these are the Entrez gene IDs You can use filters on these operations, like exons and genes and so forth In this case, what I’ve done is I’ve simply said, I want to know the exon ID, of course the ranges, the transcript name, and the gene ID for two genes, which have these Entrez IDs And the data come back As I noted to you before, we really work with GRangesLists mostly when we are working with genes and their exons And so you can split– you can use the split operator, which is familiar from base R, to split a GRanges into multiple components In this case, all the exons from each gene are segregated into different list values So that’s all about genomic sequence Now, we can get into gene sets Any questions?
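A sketch of the TxDb pattern just described, assuming TxDb.Hsapiens.UCSC.hg19.knownGene is installed — the Entrez IDs 675 and 94103 (BRCA2 and ORMDL3) are used for illustration:

```r
# Gene and exon coordinates from a transcript database
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

g <- genes(txdb)   # GRanges with one interval per Entrez gene ID

# Filtered exon query: exon IDs, transcript names, and gene IDs for two genes
ex <- exons(txdb,
            filter  = list(gene_id = c("675", "94103")),
            columns = c("exon_id", "tx_name", "gene_id"))

# exonsBy returns a GRangesList: exons already split by gene,
# analogous to base R's split()
exByGene <- exonsBy(txdb, by = "gene")
exByGene[["675"]]
```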
I think we’re getting near break time Let’s go and talk about gene sets, and then we’ll take a break So the first thing we looked at was this KEGGREST call We looked for a certain pathway, which is about homologous recombination And there’s a great deal of information that comes back when we make that KEGGREST call telling us all about the modules from which that diagram is composed So homologous recombination has modules like RPA complex and MRN complex and so on, and you can learn more about those by interrogating KEGG with other tokens Then, diseases in which this pathway is implicated include these here So this is just a big list that you have to figure out how to work with if you are interested in diving in there I don’t think we have many high-level functions that operate on these returned objects yet There’s also a bunch of literature references that come back So that’s one form of gene set that you may be interested in The genes that come back are in the component gene So here, there were 82– it’s a vector of length 82 And it’s a vector of really 41 pairs, gene and then some character annotation about each gene This is the Entrez gene ID and then some verbal annotation So you would take the list that comes back from KEGGREST, get the gene element of the list, and that would give you information about the Entrez IDs We want to talk about something that I hope is familiar to some of you, which is this MSigDB, Molecular Signatures Database, which is managed at the Broad Institute And they have a special format for representing gene sets with a GMT suffix And we’ll run this code here There’s a package called GSEABase And we have created a GMT extract, which is a set of pathways, or gene sets, really, annotated to glioblastoma So we’ll run this little chunk here And we get back this gene set collection So the Broad hasn’t just given you a gene set related to glioblastoma It’s given you a bunch of gene sets which are published in different
papers And the author of the paper and some characteristics of these gene sets are given in the names

And here’s a little table of the names And so here again, we have a high-level object, glioG, which is a gene set collection We have a names operator on it, and we have the ability to take the length of the gene IDs in each one of these sets So there are 47 sets, and the name of each one of these sets is given here And the size of the set is given here So we’ll make it a little smaller, so we can see both at once Well, that’s not so good But you get the idea And now, we want to understand, how redundant are these gene sets? Well, we can use the intersect function to take two of these sets, any two of them, and ask whether there are genes that recur in these two sets And in this case, there were four So here’s a question, an exercise, to work with this if you’ve gotten to this point How many genes are shared between the TCGA copy number up and classical gene set identified by [INAUDIBLE]? How would you do that? Well, you come back to this list of names of gene sets and see if you can find what I’m talking about here What we see here is there’s a TCGA glioblastoma copy number up That’s what I was asking about, copy number up So that’s the 40th gene set Then there’s another one by [INAUDIBLE] that is called classical Let’s go to the next one here [INAUDIBLE] classical is the 42nd one So I think we’re looking at the 40th and the 42nd We take the intersection, and we find how many genes are shared between these two gene sets Is that clear?
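The gene set collection operations just walked through, sketched with GSEABase — the GMT file name is a placeholder for the glioblastoma extract, and the indices 40 and 42 match the sets discussed above:

```r
# Read an MSigDB-style GMT file and compare two gene sets
library(GSEABase)

glioSets <- getGmt("glioSets.gmt")   # hypothetical GMT extract
length(glioSets)                     # number of gene sets (47 in the example)
head(names(glioSets))                # set names
sapply(glioSets, function(gs) length(geneIds(gs)))  # set sizes

# Double bracket pulls out a single GeneSet; intersect their gene IDs
shared <- intersect(geneIds(glioSets[[40]]), geneIds(glioSets[[42]]))
length(shared)
```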
You have an object, which is a gene set collection You can use double bracket to get elements of gene set collection, which are gene sets themselves There’s a geneIds operation on a gene set And you take an intersection of two vectors of gene IDs, and you get the common genes AUDIENCE: [INAUDIBLE] gene is [INAUDIBLE] SPEAKER 1: The full structure of the GSEABase gene set class lets us have lots of metadata, and that comes out of a details function So the 44th gene set is the neural type glioblastoma gene set due to [INAUDIBLE], 129 genes, and we can add the PubMed ID for this paper to the data set– to the gene set And then, it’s a little richer We can also use pmid2MIAME with that PubMed ID and get the abstract for that paper So very nice techniques to go and query PubMed to learn about papers related to gene sets in this case Now, I want to wrap up with the translation between identifier types, so here we go The gene ID type of this gene set can be set to be symbol identifier And that enables us to say that these are of gene ID type symbol If I then change the gene ID type to be an annotation identifier, I will instead use the Entrez IDs And what you’ll find is that when we do this translation, 129 genes that were given as symbols reduce to 122 genes that have Entrez identifiers And the exercise was to try to understand why that was And I don’t give the answer there Notice that this change in annotation type leads to a different number of genes in the gene set And the reason is very simple It’s just because some of the symbols that they use don’t have Entrez IDs, and so we just lose them
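A sketch of that identifier translation using GSEABase’s geneIdType machinery — the set index and GMT file name are illustrative, and the mapping relies on org.Hs.eg.db being installed:

```r
# Translate a GeneSet from symbols to Entrez identifiers
library(GSEABase)
library(org.Hs.eg.db)

gs <- getGmt("glioSets.gmt")[[44]]     # hypothetical set, IDs given as symbols
geneIdType(gs) <- SymbolIdentifier()   # declare the current identifier type
length(geneIds(gs))                    # e.g. 129 symbols

# Re-map via the annotation package; symbols with no Entrez ID are dropped
geneIdType(gs) <- AnnotationIdentifier("org.Hs.eg.db")
length(geneIds(gs))                    # fewer, e.g. 122
```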

All right, that’s somewhat technical But if you’re interested in gene set enrichment analysis and so on, having this technology under your belt, there’s not a lot of commands that you have to know It’s just more knowing the potential of it, I think, that is really relevant here And I’m sure that when Levi does the statistical analysis next time, there will be some examples of gene set enrichment analysis that may make use of this infrastructure Now, what about ontologies? I just want to quickly take us through this idea of the various ontologies that are available for use in Bioconductor There’s a package called ontoProc It gives us access to, for example, the cell ontology or the cell line ontology And there are visualization tools in a package called ontologyPlot that enable you to think about relationships between cell types For example, if we look up GABAergic neurons and glutamatergic neurons in this ontology, we may want to know, what are the higher level categories within which these cell types are found? And this package helps us to produce displays of this type Now, this coming back to– [PHONE RINGING] –sorry about that It’s undoubtedly one of those credit card– [LAUGHTER] So we had the question, how do you get data out of Bioconductor into some other form? And this is a quick example of doing that So let’s say we bring in some– we have some data out there in a BED file It looks like this In this case, it’s a narrowPeak file, so tab-delimited You read it into R. You have this as a bunch of strings And then, you use rtracklayer to make something that makes sense out of it So if I import that file with rtracklayer, I get all this wonderful structure now It’s a GRanges, and there are different fields And that’s great You have to know about the BED format in order to deal with that, which you can look up easily I think I have a link in there But what’s important is, well, what if I want to go back out?
What if I want to produce this data in another format or just make a new BED file after I filtered it somehow? Then, this code will take care of that for you I set up a temporary directory, and then I make a file name for that demo extract, demo export And then, I run the export command from this GRanges out to my temporary file, and then I read it back in And so you can see that export/import can be used together to get data After you’ve done some processing in R, you can throw it back out to disk very easily And then, we will wrap this up with a mention of AnnotationHub, a very cool package that includes tens of thousands of files, GFF and FASTA files for different genomes and so on, that are sitting out on a server that Bioconductor takes care of And if you’re interested in getting some of them, you can simply use the double bracket notation So we get an annotation hub instance by calling AnnotationHub here We get metadata about it by mentioning that object And then, we can extract information Let’s take AH2 What do I mean by that? So we’ll just do it here, AH2 With that, I have requested that Bioconductor go and download from the AnnotationHub the information in AH2, which is some kind of DNA information about this organism And I hope it doesn’t have such a big genome, but we are going to be pulling down the FASTA files for its genome Maybe I should interrupt here Well, I don’t know

Will it run in the background? No Should I interrupt? AUDIENCE: Do you know how big it is? SPEAKER 1: We’ll just interrupt Download failed No problem Yes, this is actually a slow-running command To get the metadata about all 47,000– 43,000 entities there is very substantial And we can go and query that in terms of the data classes that are available in AnnotationHub And here’s what it turns out they are There are 10,000 BigWig files that are information about the ENCODE project’s epigenomic studies There are [INAUDIBLE], which I suppose have to do with sequence annotation, two-bit files, and VCF files And I thought we would finish up with one query to a VCF file How many people work on human genetics, work on human variants? So you know about VCF files, right? We will go and find out about these VCF files by running this data table command here These eight VCF files are from ClinVar And there are different subtypes There is VCF, papu, and so forth And you can learn about the variations here in the metadata What I wanted to do was extract one of these guys from AnnotationHub And in my case, it runs very quickly, because I already did it before And we can look at the header So each VCF file has a header, and this is telling me that there are no samples recorded in this file But there is a lot of information, and that’s because ClinVar isn’t recording information about individuals It’s just recording information about variants And the meta component tells me all the different things about this particular instance of the ClinVar data There, that’s easier to read So it tells me it’s a VCF version 4, what date it was made, what’s the reference genome, and something about the variation property documentation for this particular ClinVar And then, we can use GRanges to go and take a piece out of that file So here we are, another GRanges I want to look on chromosome 1 between bases 949,000 and 996,000 I’m running readVcf on this downloaded entity And what I get back is
information that tells me that there were 42 SNPs in this interval that have information on many different features, all of the things that ClinVar has reported on, including things about the disease database, the disease name, and so on that ClinVar knows about these SNPs So if you know a region of interest, you bind it in to your query through the param argument, read the VCF You’re only going to get that small piece of the VCF and analyze it using VariantAnnotation in order to get information about those variants So I want to wrap up and then take a break Levels of annotation; reference sequence; annotation databases; available genomes; getting genomic sequence and using the catalogs, like Ensembl or UCSC knownGene; getting gene sets; pathways; doing some imports of BED and exporting; and, finally, working with AnnotationHub to get all kinds of data Could be reference genomic sequence, could be ClinVar, and we’ve shown how to do each one of those things So that’s the scope of annotation It’s very broad Hardly anybody uses all of it But you’re going to use some of it, and I hope that you find this a useful introduction to that I think we should take 15 minutes and maybe meet at 2:25, unless– is there a coffee break formally made? AUDIENCE: Well, we just kind of play it by ear

There’s coffee out here SPEAKER 1: There’s still coffee there Yeah, OK So let’s take a break until 2:25 or so AUDIENCE: One quick question, though So just to follow up on this, usually I know that this is something that everybody will use at the end of the experiments to look at the data and kind of can make some sense out of the data That’s how it works out So a new experience, like in general usage, what sort of– can people do it on their laptops? Or how much resources does it take? Or is it going to be like– just, can you give us– SPEAKER 1: Oh, well, everything I’ve used is this laptop AUDIENCE: OK SPEAKER 1: So yeah, the general annotation requirements for things like the human genome, the genome sequence itself is a gigabyte when it’s been properly compressed, maybe less I mean, we send it out in 800 megabytes So you have to prepare for that if you’re going to go to the base level And then, yeah So here’s a question Suppose you want to work with 1,000 genomes data at the variant level, so you get these VCFs They’re multiple gigabytes We didn’t deal with that It’s sitting, of course, in S3 Amazon S3 has all this VCF that has been Tabix indexed So we can actually query that directly using R And I would just do it that way I would never have any other features as here AUDIENCE: [INAUDIBLE] SPEAKER 1: Yes, that is correct Use the GRanges– [SNAPS] –and you get back that Yes, that’s right So that’s a strategy for dealing with large numbers of VCF files You Tabix index them You put them in some place Nobody has to copy them anymore, but all the queries can be resolved Yes? 
SPEAKER 2: Just to brag a little more about [INAUDIBLE],, that 1,000 genomes thing is amazing, because there’s a [INAUDIBLE] on the internet You can have this local object with the GRanges You use findOverlaps, just like we learned, and use these exact same things to fetch just the [INAUDIBLE] overlaps It’s super easy [INAUDIBLE] SPEAKER 1: There’s a lot of people who contribute to the ability to have that infrastructure resolved at that level But yeah, it’s definitely worth trying out and recognizing the scale of work you can do just using the tools that are sitting on our laptops, yeah So any other questions Yes? AUDIENCE: Is there a rule of thumb for– I guess when we’re running some of the code, there are some error messages or warning messages I guess, what are your recommendations if we see any? Thank you SPEAKER 1: Don’t worry about it [LAUGHTER] OK, so it’s a great question, and it brings me to a topic that’s dear to my heart, which is debugging So in some cases, you’re going to hit an error, because something is really wrong, and you didn’t put the right argument in that’s needed And eventually, the code trips at some place, and it’s not really clear what’s going on At that point you can run the debugger on the function that failed, and it will step through and help you to find what part of the function threw the error So that’s an important thing to know about– I can demonstrate that if you really want to see that AUDIENCE: Have you tried it in Jupyter? 
SPEAKER 1: No, I don’t think you can run a debugger in Jupyter You’d have to do it in the command line interpreter, or RStudio, I guess With respect to warnings, warnings are always important You always need to understand why they are happening Oftentimes in what we’re doing here, it’s because we’re being very conservative We’re saying there’s something going on while I’m loading a package That package might conflict with some other package, so I’ll warn you I’ll let you know that it’s happening Those sorts of things, you dismiss, you don’t worry about it And you can suppress them if you want So I would say always be alert to warnings Sometimes, they are a real nuisance, and you know they’re not a problem You can suppress them, or you can set options(warn = -1), which will just say ignore them all So you have ways of dealing with it But it’s a really important technique to have in working with R, understanding exceptions, understanding errors, and so on If there’s really something going on with this code, please let me know the details, and we’ll come and work that out This code has been tested So it’s just because I’m skipping through things that there may be some computation that isn’t found, and we try it prematurely But for the most part, these exercises do not have any meaningful errors OK? All right, so you’ve won another few minutes We can meet at 2:30 [SIDE CONVERSATIONS] Thank you very much for coming back again 2:34– I think we’re going to wind up a little early today, and I hope you don’t mind [LAUGHTER] First, a quick question– because of the way I’ve developed this material, Ashok was clear that he wanted to spend some time on visualization

And I think visualization is a wonderful topic to deal with in working with R and working with Bioconductor For some reason, I felt that I should introduce this in terms of the concept of what you’re really trying to do with visualization, which is to show something about the structure of variation And thinking systematically about variation is fundamentally involved with statistics and probability And so I want to know, how many of you have had a course in statistics? Just about everybody That’s correct How about probability? Quite a few– OK, that’s good So you’ll recognize some of the things that I thought should be dwelt on a little bit in thinking about visualization We’ll start with this little package setup here And then, I felt that we should do a little bit of formalism about what we’re going to do to interpret our graphs of different kinds And so the concept of a density function seemed really important to get clear on and just to introduce the Gaussian density function So the idea is we can really make sense of things straightforwardly when we have N data points that we’ll denote x sub i and they are statistically independent If they are not independent, we’re going to have to do more intricate things with respect to the structure of variation, and it is not always straightforward to do visualization But I’m going to assume that they’re statistically independent in what we’re doing here And their relative frequencies are given by functions of this type This is the Gaussian density, and this is a Gaussian cumulative distribution function I didn’t want to get into definitions of random variables and such things, but if you’ve had courses in probability and statistics, then it should be fine So this is a view of this cumulative distribution function So for a standard Gaussian variate, the probability that the value is less than zero turns out to be 0.5 and so on That is a familiar formalism to all of you And the
density function can also be looked at here So that is a Gaussian density AUDIENCE: We’ve got dead kernels over here SPEAKER 1: Dead kernels [LAUGHTER, CHATTER] SPEAKER 1: Is [? Ashok ?] still here? AUDIENCE: Yes [LAUGHTER] [SIDE CONVERSATIONS] SPEAKER 1: Sorry about your kernels Well, why don’t we just not worry about the kernels right now and look at this code? Because some of you may be interested in understanding this code a little bit better Remember, I wrote a little formula there, the Gaussian density And it’s somewhat forbidding even now, after studying it for years, to think about pi and e and so on And I thought it was kind of nice that you can just write that function in R. Square root of 2 pi Take the reciprocal, and multiply by the exponential of minus t squared over 2 And then, we want to plot this function, which is an integral and a bunch of values of x And so the way we do that is by running– we write this function which actually integrates that function from minus infinity to x, just what we said here It’s kind of remarkable that you can do that in R. You have a function, which is this density function, and we’re going to integrate it up to the different values Well, what is x going to be? 
Well, I’m going to pick the interval minus 3 to 3 and take steps of 0.1 And then what I do is I plot this interval as the x-axis, and I sapply over this x-axis the function myf, which is to integrate And that function can be integrated for all those points very rapidly It gets done interactively right here So this is just a way of being very concrete about the things that are quite abstract here And that seems, to me, very nice from a learning point of view to say, well, I can program this in R And I can create these values of the cumulative distribution function with this small amount of code, which is doing these integrals over and over again for different limits It’s not the most efficient thing in the world, but it works And it’s very elementary Now, the density function is just plotting over the domain that I wrote there the function myf
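The hand-rolled density and its numerically integrated CDF, as just described, fit in a few lines of base R — myd and myf are the function names used in the discussion above:

```r
# Standard Gaussian density, and its CDF by numerical integration
myd <- function(t) (1 / sqrt(2 * pi)) * exp(-t^2 / 2)
myf <- function(x) integrate(myd, -Inf, x)$value

xs  <- seq(-3, 3, by = 0.1)      # the plotting domain
cdf <- sapply(xs, myf)           # integrate up to each x in turn

plot(xs, cdf, type = "l")        # cumulative distribution function
plot(xs, myd(xs), type = "l")    # the bell-shaped density itself

range(myd(xs) - dnorm(xs))       # agrees with R's built-in density
```

Not the most efficient approach (one call to integrate per point), but it makes the abstract formula completely concrete.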

This is a vector This is a function that can be evaluated on a vector to give the values of the density function over this domain So those are just ways of getting a handle on aspects of variation that are extremely elementary There are no data that follow this distribution But this distribution can be used to describe the distributions of statistics when we do certain sampling or aggregation activities with them And this just shows that the built-in function in R called dnorm has exactly the same values as my handwritten version of it over this domain So that was nice to see I’ll run that there, and you’ll see that the range of the differences is 0 to 0 Now, what about Q-Q plots? Everybody familiar with that? It’s a way of assessing whether a distributional model is appropriate for a given set of data So the idea is the theoretical quantiles, values of the variable that I’m thinking about can be compared to the sample quantiles And if the plot of these sampled points is close to the straight line, then this sample can be thought of as being approximately normal 0, 1 That’s the interpretation of this Q-Q plot for the data that have been sampled We didn’t talk much about simulation We didn’t talk about it at all But you can simulate data in R The functions rnorm, rchisq, et cetera, are there to help you simulate data And then, we can check whether the data that had been simulated are approximately normal using a Q-Q plot All right, these are things that should be familiar enough to people who are doing bioinformatics and have had some statistics How are our kernels doing? AUDIENCE: Well, one’s set up to run to [INAUDIBLE] SPEAKER 1: One is running AUDIENCE: It might be [INAUDIBLE] AUDIENCE: Do we need to re-log in? AUDIENCE: Just go to your kernel on the– dedicated kernel [INAUDIBLE] SPEAKER 1: Anybody got a working kernel yet? Yes?
AUDIENCE: Yeah, it seems to work SPEAKER 1: OK, we have it over there AUDIENCE: –shutting down [INAUDIBLE] [SIDE CONVERSATIONS] SPEAKER 1: All right, let’s just keep rolling a little bit to talk about some more concepts, and then we’ll go back to some computations So the Iris data of Fisher is often used to illustrate aspects of multivariate statistics It’s a data set that you have in R, and it’s a collection of measurements of different features of the flowers of the iris And there are different species that have been measured I think there’s 50 observations on each one of three species And the visualization we’re going to start out with here is called a histogram And I’m just looking at one of these features, the petal width, and we see with the histogram that it is at least a bimodal distribution There’s lots of values near 0, lots of values in the vicinity of 1.2, and maybe there’s another mode here The histogram can be tuned You can have wider or narrower intervals And how do we find out about that? How can we tune our histogram? Whoops Just say, ?hist, and you start to get information about histograms And here, the breaks argument can tell us how many cells we want to use So let’s make this a more fine-grained histogram Looks like we’ve got about 12 cells there Let’s do 20 And now, you can see that this histogram

has much less smoothness to it We're seeing more bumps in this interval It's not so clear So what we've done here is we've gotten a higher level of resolution, but we've lost the aggregation that we got with the default histogram bin width So that's a tunable histogram And I've superimposed on this a default density estimate, which is another way of thinking about the variation in the data Now, I want to skip this optional information, but we will come back to the Iris data in terms of a multivariate data set, where we start looking at variation between these features, in a little while This is a classical way of looking at univariate variation in a continuous valued variable And it's useful, for example, for gene expression measured with microarrays, because we are getting real-valued measurements on expression in a microarray In an experiment like the airway data, one could say that it's discrete data and that you would not necessarily use this type of visualization, because it is discrete counted data It may still be useful Let's give it a shot So you're Bioconductor experts now You can say library(airway); data(airway) And we can do hist(assay(airway)[1,]) So we're going to look at the first gene in airway And I'll get rid of the rest of this stuff for now AUDIENCE: Just to interrupt, [INAUDIBLE] make sure you save the [INAUDIBLE] SPEAKER 1: Oh yeah, forgot about that So here is a histogram of the first gene And there are only eight observations in the airway data set, and you can see that it's not all that informative What if we look at the distribution of counts over samples, over genes in one sample? How would we do that?
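[A base-R sketch of the tunable histogram and superimposed density estimate just described, using the iris petal widths:]

```r
# Default histogram of petal width: coarse bins already show a bimodal shape
hist(iris$Petal.Width, main = "Petal width, default breaks")

# More bins give higher resolution, but a bumpier, less aggregated picture
hist(iris$Petal.Width, breaks = 20, freq = FALSE,
     main = "Petal width, breaks = 20")
lines(density(iris$Petal.Width))  # superimpose a density estimate
```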
Just put that over there No good, because the range is too large So take a log And there may be zeros, so add 1 This is just seat-of-the-pants computation to visualize the data in a given sample Assay unexpected– so there we are Now, we are visualizing all the genes for a given sample and seeing that the log distribution is such that there must be a great many genes that have value 0 and a relatively modest number of genes that have positive counts going up to e to the 10 So R will help you get at your genomics data as long as you know the syntax and know how to dig the data out Any questions about using histograms to visualize continuous data? Very straightforward, right? OK The optional stuff here is to look at more intricate forms of density estimation I will get rid of that other stuff, and I’ll just show you one problem with the density estimate that we got out of the Iris data, which is that you see this density estimate here remains positive as we go towards zero And it’s not possible to have a negative width So you want to have a density estimator that knows that there are constraints on the data And the stuff that I’ve commented out there shows how to use a special type of density estimator that takes care of that Now, I want to turn to a completely different problem, which is visualization of discrete data And there’s a nice little data set about the correlation between hair color and eye color stratified by gender And the source of the data is given in the man page for this data set And what I wanted to do was collapse over the genders to begin with and ask, what can we do to visualize this? What can we learn about variation

in these categorical measurements? And there is a discipline in statistics called categorical data analysis or log-linear modeling, and there's a technique for doing this The goodness of fit of an independence model, which asserts that eye color is uninformative about hair color, can be assessed using Pearson's chi-squared statistic And we have a way of visualizing whether that is a good model or not, and this is the method of visualizing It's called a mosaic plot or a mosaic plot with residuals only, and the idea here is that there are some categories, let's say, where the model seems to fit well The probabilities of the events that we see, brown hair and brown eyes, seem to be consistent with the product of the marginal probabilities But other events, like black hair and brown eyes together, occur much more frequently than you would expect if the two traits were independent That is also the case for blonde hair and blue eyes– much more frequently occurring than if these two traits were independent So this is a way of visualizing information related to the hypothesis and, in fact, contrary to the hypothesis that eye color and hair color are genetically independent If you only saw gray boxes that weren't too big, then you would be more comfortable with the independence hypothesis Since you see these large boxes, we would think that this is not the right hypothesis We would say there's evidence against it That's a two-dimensional problem It's just eye versus hair We can also look at the three-dimensional problem, which takes into account the gender of individuals And this is called a mosaic plot, because it really does break the data up and allow you to see the relative frequencies of the different combinations And again, because there are exceptions to the small residual from the independence model– specifically, let's say, here– this would be blonde with brown hair– blonde with brown eyes– very rare for both males and females That's an exception to the
independence hypothesis So I thought it'd be very useful, especially for people who are doing genetics of discrete traits, to have some visualizations for discrete data as well Any questions or comments about that? The man page on these things is very good, so you can learn more about it there I've also given some links to the Wikipedia pages on categorical data analysis I think it is very much worth doing Now, go back to the Iris data and think about how we would do distribution modeling for these continuous traits that may vary between species So here's sepal length These box plots show us the medians and the ranges of the data, and the first and third quartiles, approximately, for the different species So you can see here the clear relationship between species and sepal length And perhaps if there's some genetic relationship that you could find varying between these, an additive model might work Sepal width doesn't quite have the same ordering So maybe it's a trait that works in a different manner Well, thank you Am I failing again? We can plug it in here Thank you SPEAKER 2: I thought maybe this would fit in there SPEAKER 1: Yeah Thank you That's good So these are called marginal distribution visualizations– marginal by species and also marginal by feature that I'm looking at And we'd like to look at this in terms of joint distributions How do these features covary? And before I get into that, what's being lost here? The information is being summarized into about five different points for each one of these samples And if you want to look at it at a higher level of resolution, you can use a plotting algorithm called beeswarm So now, you can actually see the actual observations here And what you also see here is that there's a bit of a problem with the species labels And I'm not going to dwell on this, but I did have the solution here to tilt the labels And so you can run this thing one more time here,

beeswarm with axis=false And then, we’re going to embellish it Let’s run this one more time here Oops, what did we do here? Oh, that’s the wrong symbol Whoops, axis is not a symbol that you can use there, so some other– I think it’s axes that you have to put there There we go Axes are equal to false And now, I’m going to put something in to take care of the labels and put them at an angle I think that should work There we go Now, I’m not going to expect you to copy this I just want to give you some sense of how this works You turn off the axes when you do the original plot You then put one of the axes on that you know is going to be OK In this case, it’s axis two And then, you must do something to the text and specify that it’s going on the bottom axis, and you are rotating these labels by 45 degrees And that’s what enables them to all be plotted here So there’s some very technical stuff that can be done when you have a problem with default plotting and it’s in the code for the box plots Let’s go back to the plain old scatterplot here So this is marginalizing over all the species, plotting sepal length against sepal width But I’ve introduced this thing called jittering, because there are so many ties in the data that there’s actually a lot of overplotting if you don’t jitter one of the variables So now you can see when these variables are close to another It’s just a little random perturbation to each one of the x points– x-values for these points And now, we’re going to put some more embellishments on this plot So now, I can distinguish the different species by colors I’ve used jittering so that there isn’t severe overplotting And I can see now that there’s some structure in the variation When I plot sepal length against sepal width, one of these species seems to be quite well separated from the other two And that’s the kind of thing we’re trying to expose in our visualizations So let’s look at the code for that one more time You can ignore this This just has to 
do with Jupyter I'm using plotting– I'm using a formula here, where I take a variable, Sepal.Length; jitter(Sepal.Width); and use the iris data as the source of these variables I use pch=19 to get myself a solid dot, and the color is just the species, which is also coming out of the iris data All right, that gives me the points and the axes And then, I'm going to put a legend on it I have to figure out where the x and y-coordinates of the legend should be The colors are 1 to 3, and the legend is going to be levels of the species factor And I will not put a box around it So this is telling me that these are setosa and so on Is that plot clear? Now, here's a nice way of looking at all the features together It's called a pairs plot And you now see that when I plot petal length against any of the other features, the setosa tend to be very well-separated The only problem here, it's kind of hard to put a legend in one of these plots, these pairs plots I'm sure Marcel has a good way of doing it, but You could figure out where this is, for example, and then you could put those– that legend there But I'm not going to get into that right now We'll show some other ways of using ggplot to make nice visualizations of this type that I think may be more effective Now, another nice way of looking at multivariate data is with the cluster package And I'm not going to go into any of the details here, but I just wanted people to have this potential in mind A default method of doing cluster analysis

will take a representation of the data, define the centroids of different clusters that are defined only by the features in the data, and allow you to visualize them in this way These turn out to be the first two principal components in the default plot And I've set some options so that you can see the actual species using the color of the visualization And one of the strange things here is this member of the green species that is right in the middle of a different cluster And you wonder, then, whether there may have been a mislabeling of one of these flowers That's something one can spend some time on in another context This call is really very simple You have the iris data You're using the cluster package You use something called partitioning around medoids (pam) with three clusters You have to tell it the number of clusters And then, you just use a plot So this is the kind of thing that is used all the time with dimension reduction for genomic data, and I'm sure we'll have examples when the next course comes around So those are just a couple of ways of using what's really base graphics in R with different sets of options and so forth to do scatterplots and so forth The grammar of graphics, ggplot2, is a much more sophisticated software tool that produces correspondingly more sophisticated visualizations And to get going and justify this, let's go ahead and just try it out We will use a ggplot here to develop a view of the histograms by species So here we are We run a ggplot command We'll start with the iris data set And we'll tell it, through this thing called the aesthetics binding, that we'll use Sepal.Length as the x-variable And we'll build histograms as the geometric representation of variation in sepal length, and we'll facet this visualization by species So these are the different facets, and the histograms just count up numbers of observations in different bins Very nice single call that builds up this way of representing variation And it takes care of many
things for you For example, the x-axis will be common across all the facets The y-axis will be common across the facets You can free them up if you want You add additional options to the facet Now, here's another very nice display to look at what you might call two-dimensional histograms So now, we're looking at petal length versus sepal length; again, stratifying by species; and counting up and showing with a color how many observations fall in each one of these bins So you can see that for setosa, for this bivariate distribution, it's much more concentrated in a small region of this plane than for the other two species So you're visualizing variation, again very conveniently just setting up the x and the y bindings through the ggplot command and then saying I want to do this two-dimensional binning of the data with a facet on species So there's an exercise here Let's see what this display looks like This is very nice So instead of using two-dimensional binning and counting, we're going to compute density estimates– bivariate density estimates And these are contour plots, as on topographic maps, that show that the concentration of the sepal length– petal length bivariate distribution is, in a certain region there, slightly different for virginica and very different for setosa And I'm sure you can imagine many places where you'd like to have this type of visualization for bivariate distributions The data for setosa plants seem peculiar Now, what's going on there?
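[A sketch of the calls discussed above, assuming the CRAN packages cluster and ggplot2 are installed; the bin counts are illustrative choices:]

```r
library(cluster)
library(ggplot2)

# Partitioning around medoids with k = 3, on the four iris features
p3 <- pam(iris[, 1:4], k = 3)
plot(p3)  # default display projects onto the first two principal components

# Faceted histograms of sepal length, one panel per species;
# the x-axis is shared across facets by default
ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(bins = 15) +
  facet_grid(. ~ Species)

# Two-dimensional binning: petal length vs sepal length, stratified by species
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_bin2d() +
  facet_grid(. ~ Species)
```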

We have to turn this into markdown There are at least five data points outside the extreme data contours So I thought that was interesting When you look at this, you see that the same density algorithm is being used for each one of these There is one point here that’s completely out, but for this one, there are five And the question is, can you write the code that will zoom in to see what’s going on with these data? So we modify this code to zoom in So let’s try it What’s the first thing we can do here to zoom in? What should I put here? Don’t all shout out at once [LAUGHTER] OK, so we’ve done that What did that do? That’s a new piece of R called subset And what are the dimensions of IR2 now? Whoops, sorry about that What are the dimensions here? It’s a 50-by-5 So now, we’re going to keep going We’re going to try and embellish our plot so that we are working with IR2, and we’re going to do a [INAUDIBLE] density 2D So let’s see if that works Well, it seems to have succeeded And now, I need to mention this to the interpreter– this is G1– and it will plot it out There So something is missing What do I need to do here to see those points that are outliers? We haven’t lectured about this, but if you’ve seen ggplot, you will know that you can just take a plot and add some additional parameters to it to embellish it And now, we can see that there are quite a few points that don’t lie within these standard density contours, and it will be interesting to know what’s going on with them And how to locate them and so on is a matter of interactive graphics that I’m not going to get into right now, but it can be done Yes? AUDIENCE: I have a question Why isn’t the main argument [INAUDIBLE] defined as global aesthetics? Why are you doing the original data as opposed to just half the [INAUDIBLE]? Like why are you modifying the original data? 
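[The zoom-in just worked through can be written out as follows; the names ir2 and g1 match the ones used in the session, and ggplot2 is assumed:]

```r
library(ggplot2)

# Subset down to the setosa flowers: 50 rows, 5 columns
ir2 <- subset(iris, Species == "setosa")
dim(ir2)

# Bivariate density contours for the zoomed-in data...
g1 <- ggplot(ir2, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_density_2d()

# ...then add the raw points, to see the observations
# that fall outside the extreme density contours
g1 + geom_point()
```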
SPEAKER 1: I'm not modifying the original data I mean, the data are coming in, and I'm applying a function to them to alter their positions So let's just do it like this Nothing is happening to the data, but you don't see any more of the overplotting that's going on The overplotting, the jittering is just sort of an inline perturbation of the data so that this visualization doesn't have the overplotting But the data are just– they're immutable, unless you actually do something to the IR2, in this case Any other questions, comments? All right Now, I want to take us to an extension of the gg, the grammar of graphics, for bioinformatics It's called ggbio, and it works somewhat like ggplot But actually, many things have been wrapped up for you so that you can do wonderful things like this So these are all the chromosomes, all the standard chromosomes, and we have a GRanges called HepG2 which is showing where a certain transcription factor binds to the genome And what you have here is a little line drawn wherever a binding site was recorded for HepG2 Now, we have another data set called GM12878 HepG2 is a liver cell line GM12878 is an immortalized B-cell And if you do the same plot for 12878, you see that there's quite a bit larger number of binding sites So this is just exploratory data analysis You get a GRanges You want to see what's going on in the genome You just use ggbio You use autoplot with a GRanges
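[A sketch of these autoplot calls The ERBS data package (which supplies the HepG2 and GM12878 GRanges of transcription-factor binding sites in the related course material) is an assumption here; any GRanges will do:]

```r
# Sketch, assuming ggbio and the course's ERBS data package are installed
library(ggbio)
library(ERBS)
data(HepG2)
data(GM12878)

# One mark per recorded binding site, laid out over all standard chromosomes
autoplot(HepG2, layout = "karyogram")

# The immortalized B-cell line shows many more binding sites
autoplot(GM12878, layout = "karyogram")
```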

You could have different types of layouts, and that's the karyogram layout The ggbio documentation needs some work, and there will be [? ggbio2, ?] and I don't really want to go into that too much more Gene models we talked about a little bit If you start out with the gene symbol data which is coming out of a package called biovizBase, this is a mapping that's familiar to us It's only 29,000 genes And its origins can be found in the biovizBase package But here we are We're going to take a small number of genes– in this case, just two genes– and subset our ranges down, call it [? oo. ?] And then, we'll use this autoplot function on the Homo.sapiens object with a which parameter, which limits us to the genes that have been listed here And then, we had set the gap.geom to be chevron That's just another special parameter And what that means is we get a picture now of the gene models that is a little more refined than we have seen before So what's going on here, there are isoforms of GSDMB that have skipped exons in different places And some have left off– some exons can be left off completely, and we still see isoforms So this autoplot can give you a quick view of the complexity of a gene model once you've got a GRanges OK, now, I want to talk about another package called Gviz
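[A sketch of the gene-model call ggbio, Homo.sapiens, and biovizBase are assumed to be installed, and the two gene symbols are illustrative; substitute any symbols present in genesymbol:]

```r
library(ggbio)
library(Homo.sapiens)
data(genesymbol, package = "biovizBase")  # named GRanges keyed by gene symbol

# Subset the ranges down to two genes of interest
oo <- genesymbol[c("GSDMB", "ORMDL3")]

# which limits the display to those genes; gap.geom = "chevron"
# draws chevrons across the gaps between exons
autoplot(Homo.sapiens, which = oo, gap.geom = "chevron")
```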
The aesthetics of the display is really what's distinctive about this There's a very rich infrastructure We'll go through the code in a minute What we're going to do here is just show how you can visualize data on transcription factor affinities in the vicinity of some gene models and show the models themselves And it takes a while to produce the display because of a lot of formatting that is going on, but there it is Now, this is a pretty nice display We start out with what's called a genome axis track, which shows us where on the chromosome we have zoomed in, so it's on chromosome 11 And we have a data track where the binding scores for this ESRRA binding factor are shown And then we have the genes in a fairly crude form for the major isoforms in this panel So you can start to get a sense for genes that may have, let's say, promoter regions that may be occupied by this ESRRA It's not particularly easy It's something you can also do perhaps with a genome browser, but this allows you to do it programmatically and to have a sense of quantities in ways that it would be a bit challenging to do with the genome browser And I wanted to conclude with what's called a Sashimi plot The Sashimi plot comes from the MISO project And I don't know how familiar this is to you, but it's basically a way of showing what's going on with the RNA-Seq data when you may have alternative splicing going on And so we'll go through the code here And here's the idea that you're seeing how reads can be found in different parts of the experiment that may skip exons In this case, this is the wild type In the knockout line, there is much less data How do we do this? Well, remember, the RNA-Seq data is coming to us in BAM files And in Gviz,
there's a nice function called AlignmentsTrack that will work on these BAM files and use specific parameters to control the display of those alignments These are paired-end alignments And we're telling it which problem to solve and what genome we're working with
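[As a sketch of the Gviz setup described here "reads.bam" and the chromosome 11 coordinates are placeholders, not the files and positions used in the workshop:]

```r
library(Gviz)

# Build a track from a (hypothetical) paired-end BAM file
alTrack <- AlignmentsTrack("reads.bam", isPaired = TRUE)
axTrack <- GenomeAxisTrack()  # shows where on the chromosome we have zoomed in

# Coverage plus a sashimi arc display of junction-spanning reads
plotTracks(list(axTrack, alTrack),
           chromosome = "chr11", from = 1000000, to = 1100000,
           type = c("coverage", "sashimi"))
```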

And then, we use our gene symbol data to define a GRanges that's the interval that we want to work with And then, we use something called plotTracks to plot the various tracks here, which include the knockout track And this is showing us that the coverage here is extremely low for these exons that we can see here– extremely low, much less than 100 Whereas when you look in the wild type data, the exons can have coverage exceeding 300 And the isoform distribution can be sketched, let's say– which I suppose you're actually observing the data– which exons are left out of the data that we observed in this particular sample That's the purpose of the Sashimi plot And we found that the– your computer may succeed in running this But if you only had four gigabytes on your computer, this could crash the system So this is a fairly involved visualization But it's the kind of thing that RNA-Seq users are often very interested in doing So, we will wrap up Just a review– the structure of variation is the focus of our visualization We showed how to deal with certain univariate visualization concepts, Q-Q plots for checking adequacy of a distribution [INAUDIBLE] model, histograms, densities, categorical data visualization with mosaics, and then box plots, scatter plots, dimension reduction, and the grammar of graphics So it's a lot of material, and I thank you for hanging in there And I'll take any questions, but I think we're going to wrap it up [APPLAUSE] SPEAKER 3: A few things– first, again, thank you, [? Vince, ?]
for taking the time and effort and putting together this [INAUDIBLE] workshop Thanks to everybody for being patient with some of the technical glitches we had during the early part of the day Thanks to [INAUDIBLE], Marcel, who have been [INAUDIBLE] for being here to help us So first thing is, please if you do want to save your notebooks, you can do a file download as– easiest thing is to save it as HTML so that you have all the code and all the [INAUDIBLE] you have generated as-is on the local machine Second, we will make these [INAUDIBLE] available to you in a few days on our website so that you can download the [INAUDIBLE] We'll also have a [INAUDIBLE] container and [INAUDIBLE] sections on all the installed things and so on and so forth so that this kind of becomes [INAUDIBLE] that we can go back to as well We'll eventually have the video up there as well Oh, and then the second registration just opened up for the next lecture, which is on the 19th And Levi will be presenting there And please sign up, because we're already up to like 50% as of 2:00 PM, so [INAUDIBLE] And finally, now, I think we can– it's only 3:30, so I'm sure that we can wait around for 20 minutes or so, so that if anybody has any additional questions– questions about the next workshop, this workshop, please hang around And then, please feel free to email me or [? [email protected] ?] if you have any further questions Again, thanks, everyone And thanks– SPEAKER 1: Yeah, I also want to put in a plug for Andrew [? Life, ?]
who did a great job of setting these up Of course, there were some glitches I don't think they're his fault We're working with Amazon We're working with very complex technology So I really want to thank you, Andrew, for the effort you put in We've been working on this for a while You've taken a lot, and I'm sure it was very painful at times But if you give it some time, I think it will work into your work and will be productive for you These notebooks are also in my GitHub repo, and so if you want to raise issues or improve them in any way, you can always file an issue on those And we should coordinate If you put them in yours, then let's make sure we are [INAUDIBLE] SPEAKER 3: [INAUDIBLE] SPEAKER 1: Or link to mine, whatever you want to do