Anomaly Detection: Algorithms, Explanations, Applications


>> So, it’s an honor to have Tom Dietterich with us today. I think I remember the first moment I ran into Tom: I was passing a seminar room at Stanford, in Margaret Jacks Hall, looked in, and then walked in. There was some sort of very interesting debate going on, some machine learning topic; I think Rich’s folks were there. It’s a memory I’ve had ever since, that first glimmer of the energy of Tom Dietterich, and I’ve come to know him and love his work over the years. Tom mentioned during our fireside chat, the video we did last night together, that we share one thing, what we call “Research Attention Deficit Disorder.” We get very curious and excited about many topics, and it’s amazing to me how deep Tom goes on each of the topics he attacks over the years. Tom did his undergraduate work at Oberlin and then did his master’s at UIUC. I think you mentioned you were following at the time the line of where you could find the best computer. >> Right. >> Sticking phones into those 300 baud sockets and so on in those days. And then he did his graduate work at Stanford University, following, I guess, in [inaudible] after hanging out with Michalski at UIUC, which got him into the machine learning world. He has worked on many different areas, I think following all the recent trends and leading some of them. Remember, you and I were at the first UAI conference in 1985 together, and I remember you wearing a pair of shorts and asking hard questions in front of the room back in those days. I cannot forget anything, Tom, about these interesting interactions we’ve had over the years. Going through the probabilistic graphical model phase, his work has always touched on applications. There was quite a lot of work during the CALO project on desktop events, trying to infer the intentions and goals of users from desktop streams of evidence. More recently he has written, like several of us here, about the challenge of AI in the open world, issues about robustness and understandability of these algorithms. He was a leader in this whole area that’s now called Computational Sustainability, working with Carla Gomes at Cornell on defining this area and somehow convincing the National Science Foundation to fund it deeply, looking at applications of optimization and other areas of machine intelligence to hard challenges with the environment, speciation, and the management of planetary resources, resonating with our AI effort now at Microsoft. Today, he’ll be talking about a piece of this interesting area on robustness, which I think is an important aspect: anomaly detection algorithms, explanations, and applications. So, Tom, thanks very much for being with us. >> Thank you. Is this working yet? >> Yes. >> Can you hear me? >> Yes. >> Okay.
So of course, everything that I’m talking about today was done by someone else. This is a collaboration with two of my fellow faculty in Computer Science, Alan Fern and Weng-Keen Wong. Weng-Keen is currently rotating at NSF as a program manager, but he’ll be back in the fall. And then two faculty from statistics, Sharmodeep Bhattacharyya and Debashis Mondal, and at least one of them did their thesis at UW, but I don’t remember which one. And then several graduate students; as their contributions come up, I’ll point those out in the talk. So, I have a sort of long agenda here. This is more than I would usually try to cover in one talk, but some of the items are fairly straightforward. What I want to do is start by defining this question of anomaly detection, what we mean by that, because it’s not a terribly well-defined term. Then I’ll briefly review a big benchmarking effort we did. Mostly in the knowledge discovery and databases community, there have been many algorithms published in the anomaly detection area, I think largely because they have
many business problems in fraud detection and management, and this is a big tool there. But a lot of those papers are just publishing an algorithm, a mechanism, without any kind of theoretical foundation. And so we wanted to understand them, and we’ve done a bit of work on probably-approximately-correct-style machine learning theory for what we call rare pattern anomaly detection. Then I’ll talk about scenarios like fraud detection, where you have a ranked list of anomalous transactions or users or whatever, and now you have an analyst in the loop. And so you want to show the analyst, say, your number one candidate and explain to them why you think it’s anomalous. Obviously, it’s not sufficient to dump a 180-element vector on them and say we think this is an anomalous user. So, I’ll talk about some very simple feature-based explanations. And then, if we’ve got the analyst in the loop, we can also get feedback from them and use that to re-rank our candidates as we’re presenting them, and that turns out, at least in our benchmarking, to be really powerful. And then I’ll wrap up with two applications. One is a big sustainability application involving deploying 20,000 sensor-equipped weather stations in sub-Saharan Africa and then trying to automatically detect when stations break, sensors break, and so on. The other is an open world AI problem, the problem of open category classification in, say, multi-class classification. So, defining anomalies. We’re going to assume that our data are just feature vectors in d-dimensional real space and that we have a data set that is a mixture of what we’ll call the nominal points, so that we don’t call them the normal points, and the others, which we’ll call the anomalies. Our assumption is that the anomalies are generated by a different generative process than the nominal points. But typically our data will be unlabeled; we’ll talk about that, I guess, on the next slide, which is that there are really three settings. In some cases people have tried to formulate this as a supervised learning problem, so we get some labeled data, but in most of the anomaly detection applications, the anomalies are quite rare. Fortunately, it’s not the case that 50 percent of our users are trying to defraud us, although maybe it seems like that at times. And so here we have severe class imbalance if we approach things from that perspective. Another possibility is that we are training on clean data that has no anomalies in it, but then maybe at test time we are going to get contaminated data and we need to identify those contaminants; that’s what happens in the open category case. And then another situation is where our training data is contaminated with some kinds of anomalies and we want to figure out which points those might be. So, we’ll be looking at those. I think a very important consideration here is this assumption I call the WDAD assumption: the assumption that the anomalies are drawn from a well-defined probability distribution. Just to back up, if you’re formulating things in the supervised learning way, you are essentially assuming my anomalies are coming from some well-defined distribution, right, and I can learn about that. But if we’re in an adversarial setting, like in the fraud or cyber security kinds of settings, then that really is not a valid assumption to make, right?
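To make the setup concrete, the WDAD view just described can be written as a simple mixture. This is a sketch in the talk's notation; here α stands for the anomaly fraction, which comes up again below as the parameter Alpha:

```latex
% Data-generating distribution under the WDAD assumption:
% a mixture of the nominal process and a well-defined anomaly process.
P(x) \;=\; (1-\alpha)\,P_{\mathrm{nominal}}(x) \;+\; \alpha\,P_{\mathrm{anomaly}}(x),
\qquad x \in \mathbb{R}^{d},\quad 0 < \alpha \ll 1 .
```

In the adversarial settings discussed next, the second term is exactly the piece that cannot be treated as a fixed distribution.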
In fact, our adversaries are constantly trying to defeat whatever distributional assumptions we’ve made. Another case is the weather network case. It seems adversarial in the sense that the number of ways something can be anomalous is sort of an unbounded set, and I guess technically speaking it’s still a well-defined probability distribution, but you don’t have nearly enough data to have sampled the whole thing, so you really can’t model it. And another case has come up in a collaboration with Kiri Wagstaff at the Jet Propulsion Laboratory. Their user is usually a scientist, and they’re looking at data coming from some sensor; she’s worked on some spectrometers that are on the Mars Science Laboratory. What the expert wants is to find the interesting data points. Now, the first few interesting data points are actually ways that the instrument fails, and they need to understand those. But then they start being interested in unusual spectra that might signal some interesting compounds. But then, once they’ve seen those, they say, well, don’t show me any more of those, I want to see,
show me real anomalies and so, it’s a time changing notion So I think, we need to be sensitive to when it’s appropriate to make this assumption So there are two main strategies for Unsupervised Anomaly Detection It depends a lot on this parameter that we call Alpha, which is the fraction of the training points or test points that will be these anomalies If Alpha is reasonably large, then we can think about fitting something like a two component mixture model and actually trying to model that anomaly distribution, if we’re willing to make this well-defined distribution assumption There are various conditions for that to work Alternatively, if the proportion of anomalies is very small as it typically is in these fraud cases, then the main strategy is to treat it as a problem of outlier detection So we’re going to try to model the distribution of the nominal points, the ordinary users, and then treat the outliers as our candidate anomalies The advantage there is that we don’t have to make any assumption about the distribution But of course, if the anomalies are not outliers, then we will miss some of them Also, if the nominal distribution itself has very heavy tails so that it just naturally has a lot of outliers, this won’t work either So as I mentioned in the KDD literature, there must be 20 or 30 different algorithms that have been published There’s a huge number of papers there A problem with many of these is that, unlike in a lot of other parts of machine learning and Computer Vision and natural language processing, there’s not a large collection of standard benchmark datasets that you could do a thorough evaluation on these algorithms Often, there’s some sort of a corporate application with a proprietary dataset that they can’t share Then the most popular benchmark dataset is this computer intrusion dataset that was collected as part of a DARPA project back in the late 90s, and it turns out to be extremely easy So, we strongly believe in benchmarking and the need to have a large and constantly growing collection of anomaly benchmarks So, my student Andrew Emmott conducted a big benchmarking study Our methodology was to repurpose existing supervised learning benchmarks either classification or regression problems So we chose 19 datasets from the UC Irvine repository, which were all the datasets that satisfied a set of criteria we had We wanted real-valued features We wanted enough data. 
We wanted enough dimension. Then we would choose one or more of the classes to be the anomaly classes, the source of our anomalies, and the rest to be the nominals. This ensures that our anomaly points are being created by some process that’s different from the one creating the nominals. Then we systematically varied things like Alpha, the relative frequency of the anomalies; the difficulty of the points, like how deeply embedded they are inside our nominal distribution versus how much they are outliers; we added redundant or irrelevant features to see how much that could defeat the algorithms; and we also looked at how clustered the anomaly points are. Because to a large extent, these anomaly detection algorithms are all doing some form of approximate density estimation. You’re essentially trying to model the probability density of the nominal points and then score points according to that density; the very low density points are your anomalies. Now, if the anomalies are all tightly clustered, then they form a little peak of density, they don’t look like they’re low-density anymore, and so you’ll miss them. So, we were interested in that. We created 20 replicates for each configuration of these manipulated values, and that gave us about 25,000 different benchmark datasets. So then, we compared a set of eight different algorithms. I won’t go through all of them. But basically, there are some that are doing density estimation; some that we call the quantile-based methods, of which the one-class Support Vector Machine is probably the best known: it tries to find a boundary around your data that contains, say, one minus Delta of your data for some quantile Delta that you choose. There are nearest-neighbor-type distance methods, and then there are these projection methods. I want to talk about these because these two algorithms turn out to be very attractive for many reasons. So, the first algorithm, and
the one that actually performed the best on our benchmarking exercise, is something called the Isolation Forest. Those of you who know me know I love decision trees and anything random forests, boosting, all these things. So, I had a natural affinity for this algorithm, although when I first saw this paper come out, I said, “This is such a weird idea.” But the basic idea is that we’re going to construct a completely random binary tree. Again, recall that our data are not labeled, so we just choose one of the attributes at random. Then we choose a splitting threshold uniformly at random out of the observed range of the data. Then we do that again, splitting the data recursively. We split, in some sense overfitting radically, until every data point is in its own leaf in the tree. In the paper, actually, they have ways of avoiding having to do that; you can stop the splitting at some point. We’ll build an ensemble of these, and the statistic we’re interested in is actually the depth of a point in the tree, the depth at which a point becomes completely isolated. Let’s say that my one data point x_i is the only data point in this leaf node here; then we would say x_i is isolated at depth three. The intuition here is that, in some sense, we’re just throwing random separators at the data, like darts at a dartboard. How many random separations do we have to do before a data point is separated from all the rest of the data? The intuition is, well, if you’re an outlier, it’s going to take only a few random shots to get you isolated, whereas if you’re deeply embedded in a tight cluster, I may have to split, split, split, split forever to get down to isolating you. So, for the anomaly score, d-bar is going to be the average isolation depth over the, say, 100 trees in the ensemble. One nice thing in this paper is that they calculate the expected isolation depth from a little bit of theory, so they can normalize the score to lie on a zero-one scale, where anomalies are given a score close to one, and the nominal points typically have scores less than or around 0.5. The other algorithm I want to talk about. Yes? >> Does that favor a particular basis of the features? If you have a linear transform of the features, you get a different basis; is this invariant? For example, a point might not obviously be an outlier in one basis, but if you transform it, for example even like a sphere, then it’s fuzzy around that. I don’t know if you can tell just by looking at this via one basis. >> It’s true. So, of course, we’ve also looked at randomly rotating the data before we chop it up. It helps a little bit, but the randomness increases the variance of the isolation depth, so then we have to make the ensemble bigger. So, it hasn’t been enough of a gain to be worthwhile to us, because the memory starts to be an issue there. >> Does this basis problem arise frequently enough in nature, or in practice, that you should actually care about it?
>> Well, in nature, I don’t know. In our benchmark datasets and in our fraud detection applications, the features are generally meaningful. We’re not doing this on images or things like that, where pixel 25 has no particular meaning at all. So I certainly wouldn’t use this out of the box on images, that would be crazy. Maybe I would use it on the penultimate layer of a CNN or something; I’m not even convinced that would be a good idea, but yeah. >> Maybe something like a relationship between variables, like say x squared plus y plus z squared, is the meaningful variable for the outlier. >> But usually, just as decision trees do, these are basically looking at higher-order interactions as you go deeper. These things pretty much follow the data distribution; we only put splits where we have data, so they do a pretty good job. There was an interesting paper by Nina Mishra, I think, and colleagues at ICML a couple of years ago, where they analyzed a slight variant of this algorithm and showed that the isolation depth is related to an Lp norm of the distance between the outlier and the data, so in this case basically an L1 norm. You can sort of imagine how this would be something like an L1 distance. If you look at these different algorithms,
I think a lot of them are essentially doing something very similar. I mean, the nearest neighbor algorithms are calculating some distance, and the Isolation Forest is definitely calculating something like a distance also. Clearly, these things are using a kernel distance, and these things are using something like a Gaussian kernel in the original space. So yeah, of course, you could apply PCA to try to rectify the data in some sense. The other method that I want to mention, or I should say, are there any other questions? Yes? >> Do any of your benchmark datasets include behavioral data or sequential data that was generated by capturing traces of someone’s behavior, like a network intruder’s behavior? >> No. So when we applied this under DARPA, we were in this DARPA program called ADAMS. There, we were looking at an organization that had 5,000 employees. We had actual desktop spyware installed, monitoring software on their desktops; they were working for some organization in the military-industrial complex, which is about all I’m allowed to say. So, we could go down and really look at every email message they sent, and even the content and all that stuff. But we summarized that at a daily time step and turned it into a vector of how many emails they sent, how many websites they visited, how many of this and that and the other thing. Then we also did some normalization to try to ask both how unusual this day was compared to the typical day of this user, and how unusual this user was on this day compared to everybody else in the company on that day. Because otherwise, we had cases where some new web link would be sent out to the whole company and everybody would go visit it, and nobody had ever visited that link before, and we would get all excited if we were only looking at one user. So we had to do some pretty serious feature engineering to turn this into something that was almost IID. But then, all these algorithms are purely for that sort of feature vector case. In my weather data, we’re again just having to figure out how we can engineer the features to take what is very much a spatio-temporal time series and turn it into little database views that are approximately exchangeable. Okay, well, the second algorithm I want to mention is even simpler; it’s called LODA. Given a set of data points, what it does is create a set of capital M, say 100, sparse random projections. So, it randomly chooses square root of d of the dimensions, where d is the dimension of the data, then defines a random projection onto one dimension, and then fits a one-dimensional density model. I’ve drawn it as a kernel density estimate, but they actually use an adaptive histogram estimator. Then what they do is just calculate the average of the minus log density of a query point, so the average surprise. We can see, if we took a data point like this one right here, it would have a pretty high probability in this projection, but of course, if we got some other projection, then it would be mapped to a low-density tail. So, this is an extremely simple idea, but it also works surprisingly well, presumably because there are not super-high-order interactions in our data, so looking at three or four dimensions at a time works pretty well. So then, how do we do our analysis?
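Before the evaluation details, here is a minimal sketch of the LODA scoring just described. It is a hypothetical simplification: fixed-width, smoothed histograms stand in for the adaptive histogram estimator used in the published method.

```python
import numpy as np

def fit_loda(X, n_projections=100, n_bins=50, rng=None):
    """Fit a simplified LODA model: sparse random projections plus 1-D histograms."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    k = max(1, int(np.sqrt(d)))              # number of nonzero entries per projection
    projections, histograms = [], []
    for _ in range(n_projections):
        w = np.zeros(d)
        idx = rng.choice(d, size=k, replace=False)
        w[idx] = rng.standard_normal(k)      # sparse random direction
        z = X @ w                            # one-dimensional projection of the data
        counts, edges = np.histogram(z, bins=n_bins)
        density = (counts + 1) / (counts.sum() + n_bins)   # smoothed bin probabilities
        projections.append(w)
        histograms.append((edges, density))
    return projections, histograms

def loda_score(Xq, projections, histograms):
    """Anomaly score = average surprise (-log density) across the projections."""
    scores = np.zeros(len(Xq))
    for w, (edges, density) in zip(projections, histograms):
        z = Xq @ w
        bins = np.clip(np.searchsorted(edges, z) - 1, 0, len(density) - 1)
        scores += -np.log(density[bins])
    return scores / len(projections)
```

The returned score is the average surprise, so larger values mean a query looks less like the training data under the random projections.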
I’ll point you to the arXiv papers for the details. But we basically chose a couple of metrics: either the area under the ROC curve, which we put through a logit transform, or the log of the lift, which is related to the area under the precision-recall curve. Then we just tried to set up a linear model to see, across all of these benchmarks, how much of the variation in the metric could be explained by the relative frequency of the anomalies, the difficulty of the points, the clusteredness, the amount of irrelevant features, what we call the mother set (the domain from which the original dataset was drawn), and which algorithm we are using. Because I wanted to say, well, a typical finding in machine learning is that the choice of problem is much
more important than which algorithm you use, and we were curious whether that was true here or not. The answer is that it’s pretty much true here also. So, here are those five factors, and here is the change in the R squared, basically how much variation can be explained in these two metrics. For the area under the ROC curve, if we take a model and delete the dataset feature, how much R squared do we lose, or how much can we assign to the dataset? And that’s the most important variable for explaining the observed AUC. Next is the relative frequency. So, if you have a dataset where the anomalies are very rare, we can do very well because we have very little contamination. As the anomalies get more common, these techniques fail, because really only one of them makes any explicit attempt to be robust and use a robust loss, and that’s the technique called robust kernel density estimation. All the others are just plunging forward, fitting some kind of a model. So, the choice of algorithm is really only the third most important factor, and really not a whole lot more important than the others. But with that caveat, here are the different algorithms that we tested, sorted by decreasing change in the metric compared to a control baseline. What we see is that the Isolation Forest was actually performing better than the other methods, and then there’s sort of a five-way tie for second place. Surprisingly, the one-class Support Vector Machine and Support Vector Data Description really performed the worst on these problems, which was quite a shock. Because I think if you asked a random machine learning person, “If I had a one-class problem and I needed to fit it, what would you use?” they’d all say, “Well, gee, how about Schölkopf’s algorithm.” We’re a little puzzled about this. Of course, it’s possible that we just didn’t tune these properly. Although we tried a lot of things, we wanted a fully automated tuning procedure, so it may be that we didn’t do that well. But there have been some papers recently in the literature on auto-tuning those, and so we were trying to use the current best practices. My own hypothesis here is that the metrics we’re evaluating on are really related to how well you can rank the data points, whereas these methods really just set it up as: are you inside or outside of a particular quantile, a particular level set of the density. So, it’s not clear that they’ve really been trained to do well on these metrics, and maybe that’s why they don’t do as well as some of these more distance-based or density-based approaches. We also find in general that the Isolation Forest is a bit more robust to irrelevant features, and that it’s also a bit more robust to clustered anomaly points, so that may be why it has this slight advantage over the other methods. Okay.
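For reference, here is a from-scratch sketch of the isolation-depth scoring described a few paragraphs back. It is illustrative only: a real implementation (including the original algorithm and, for example, scikit-learn's IsolationForest) builds the trees once and reuses them rather than re-splitting per query, and caps the tree depth; c(n) is the expected-isolation-depth normalizer from the paper.

```python
import numpy as np

EULER_GAMMA = 0.5772156649

def c(n):
    """Expected isolation depth for n points (the normalizer from the paper)."""
    if n <= 1:
        return 0.0
    return 2.0 * (np.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def isolation_depth(x, X, rng, depth=0):
    """Depth at which query x is separated from X by random axis-parallel splits."""
    n = len(X)
    if n <= 1:
        return depth
    j = rng.integers(X.shape[1])                 # random attribute
    lo, hi = X[:, j].min(), X[:, j].max()
    if lo == hi:                                 # cannot split this range further
        return depth + c(n)
    t = rng.uniform(lo, hi)                      # random threshold in the observed range
    branch = X[X[:, j] < t] if x[j] < t else X[X[:, j] >= t]
    return isolation_depth(x, branch, rng, depth + 1)

def iforest_score(x, X, rng=None, n_trees=100, subsample=256):
    """Anomaly score in (0, 1]: near 1 suggests an outlier, around 0.5 or below nominal."""
    rng = np.random.default_rng(rng)
    m = min(subsample, len(X))
    depths = [isolation_depth(x, X[rng.choice(len(X), size=m, replace=False)], rng)
              for _ in range(n_trees)]
    return 2.0 ** (-np.mean(depths) / c(m))
```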
Well, let’s move on now to thinking about whether we can have some kind of theory to explain what’s going on. Because one of the things we were quite surprised at is that when we look at the learning curves, these algorithms all learn quite quickly. Of course, if we look at what existing sample complexity theory we have for this kind of one-class or density estimation, it’s often quite discouraging. If you look at full density estimation, you typically need a number of examples that’s exponential in the dimension of the data, and that doesn’t match our experience. For the quantile methods, there are polynomial sample complexity results, so our observations are maybe more consistent with something polynomial. So, my colleague Alan Fern came up with this theory we call rare pattern anomaly detection, and it works as follows. We imagine we’re going to define, and this is a very Valiant-style pattern-learning kind of thing, a space of patterns or regions. We’ll define a pattern as a Boolean function that takes a point in d-dimensional space and maps it to either zero or one. So it might be a half plane or an axis-parallel hyper-rectangle, some kind of region. We’ll let capital H be the set of all the patterns in our pattern space. Then, the idea is to define a notion of a rare pattern and a common pattern. To do this, we need to be a little bit careful about measures. So we’re going to assume some baseline measure.
So if we’re in, say, a hyper-box in d-dimensional space, we might use just the Lebesgue measure, the uniform measure over that box. But if we’re in unbounded real space, we need to use something like a standard multivariate Gaussian as our baseline measure. We’ll let mu(h) be the measure of pattern h. Then let p be the probability density of the nominal data points, also defined over R^d, and let p(h) be the probability assigned to pattern h. So, we’ll say a pattern h is tau-rare if the normalized probability, p(h)/mu(h), is less than or equal to some threshold tau, and otherwise we’ll say that pattern is tau-common. Then our definition of what we want is: a query point is tau-rare if there exists a tau-rare pattern that the point belongs to, that is, the point belongs to pattern h and h is tau-rare. So it means there’s some way of looking at the data, some bin in the data that this query point x belongs to, that is very unusual under the nominal distribution. So then our goal, an ideal anomaly detection algorithm, would be to output all of the tau-rare query points in a test set and not output any of the tau-common points. I guess what’s interesting about this model is that instead of trying to estimate the entire density, or even a level set of the density, we’re just choosing some set of regions in the space and trying to figure out roughly what their probability is. I think this is why, if we don’t have too many of those regions, we can get uniformly good estimates over most of them from a relatively small sample. So, we define an algorithm to be probably approximately correct for RPAD if, for any probability density p and any threshold tau, it draws a polynomial-size sample and then, with probability one minus Delta, it detects all the tau-rare points and rejects, that is, does not output, any tau-plus-Epsilon-common points. So we have to introduce a sort of elbow-room parameter to allow the algorithm to make some mistakes: if a point is between tau-plus-Epsilon rare and tau rare, then the algorithm can do anything it wants there. So, what we end up with is an algorithm such as the following. We draw a polynomial-size sample, and we’ll talk about what that size is in a moment; then we just estimate p-hat as the fraction of the sample that belongs to each of our patterns, and then we can calculate the normalized rareness. Then we just declare a query point to be an anomaly if one of these tau-rare patterns is made true by that query point. So we can prove a sample complexity result that says the number of samples we need is on the order of one over Epsilon squared times the log of the number of patterns plus the log of one over Delta, a very typical PAC result. It’s a little unfortunate that this is Epsilon squared, because that’s going to mean that if we want Epsilon to be 0.01, then we’re going to need 10,000; this will be a big constant. Then we can get a VC-dimension version of this, where the VC dimension plugs in here, times log of one over Epsilon squared. So, there are lots of different practical algorithms you can design with this procedure, half spaces and axis-aligned hyper-rectangles, and in our paper we worked out the VC dimension of all of these. In particular, we designed a version of the Isolation Forest that exactly matches this theory. To do that, the pattern space has to be chosen a
priori in this theory; it can’t be adaptive to the data. To deal with that, we use the usual trick of taking our training data and dividing it in half, using half the data to grow an isolation tree to some fixed depth k, which fixes our pattern space, and then using the second half of the data to estimate the probability of all of these bins. So, here are some examples on some datasets, where the plot on the left is for
the actual Isolation Forest algorithm grown to a fixed depth k, and this is our theoretical version of it, also at depth k. The vertical axis is the area under the ROC curve, and I apologize that the axes are different in the two plots. These different lines are for different depths of tree. But if we just look at this case, the depth-one tree actually works the best. We see that in this case, the pattern-min method actually does a little bit better than the Isolation Forest on the area under the ROC curve. For another dataset, particle, the Isolation Forest does much better, because notice this goes from 0.71 down to 0.66, and this from 0.66 down to 0.52; all of these curves are basically below all of these, so the Isolation Forest was much better. Then here’s a dataset, shuttle, where RPAD, the theoretically inspired algorithm, is actually doing quite a bit better than the Isolation Forest; this time I did get the axes aligned. So I think qualitatively, well, the shapes of the curves aren’t always the same, and there’s sometimes some funny business going on with very small samples. It’s particularly weird that we do very well with very small samples, and there’s an interesting phenomenon going on here that we don’t fully understand. When you have a training sample and it includes some amount of contamination, a standard practice with the Isolation Forest is to train on subsets of the data, not on the entire dataset, and the idea is that you’re kind of thinning out the data; it’s a robustification strategy. By training on smaller sub-samples, the absolute number of contaminating points is reduced. The fraction isn’t reduced, but it seems that if you have tightly clustered points, you’re reducing the local density of that cluster, and I think this helps the algorithms do a better job, but we don’t have a good theoretical understanding of it. Empirically, even the original authors of the Isolation Forest algorithm say that you should train on small subsets of the data rather than the entire dataset, and that this often helps. For those of us in machine learning it’s like, “What, you’re not using all your data?” But of course, we’re building an ensemble, so across the whole ensemble we’re using all of our data. The smaller the subsets, the bigger the ensemble needs to be, because you also have to reduce the variance that you’ve introduced by choosing these random subsets. So this is a paradox, this kind of thing where the small samples are actually performing well. It’s not observed all the time, but it’s observed enough of the time that we think it’s a real phenomenon, and we’re trying to find a nice theoretical understanding of it. Okay. So the conclusion from this is that we feel this rare pattern anomaly detection theory captures a lot of the behavior of the Isolation Forest or of LODA, because these methods are solving a simpler problem than full density estimation or even estimating level sets of the density. So, I guess maybe I should pause and ask, are there any questions about that? Yes? >> How does this algorithm compare on high-dimensional data, very high-dimensional data? >> Well, I would guess that on very high-dimensional data, you need to use a very large ensemble of trees. Let’s see, what’s the right way to talk about this?
If your space of rare patterns is, in some sense, if you have a distribution of the data, you want to have these rare patterns kind of surrounding the data Obviously, one intuition is in high-dimensional space, you’re going to need a lot of those patterns surrounding the cloud of data So, the number of patterns you might need to really tile the entire region might grow exponentially So, our result says the sample size is polynomial in the number of patterns, but the number of patterns that you need might be exponential So, that’s a good question >> Have you tried this on a bag-of-words? >> We have not tried this on a bag-of-words thing for example, yeah Now, one might hope, so I’ll do the standard machine learning thing and say, “Oh, but the data really lives in a low-dimensional manifold.” So all we have to do, is identify that manifold then Well, but we know that tree-type algorithms will in
some sense follow that manifold, so maybe we’ll still do okay. I’m waving my hands. Yes? >> So, it seems like it still requires knowing the pattern space. But then, if you’re dealing with anomaly detection, many successful intrusions happen exactly because there is some sort of behavioral pattern that you didn’t take into account, and so the adversary was able to exploit it. So, is there a way to extend or fix this theory to maybe at least give you some indication that there’s something in the data for which you don’t have a pattern, or you missed the pattern, that may be an anomaly? >> Well, if we think about what, for instance, the Isolation Forest does, it actually partitions the entire input space into patterns, so there is always a pattern for every possible observation. Many of those patterns will be essentially empty. In fact, I skipped over a little detail of the Isolation Forest. When you do a split at a node and you split the data, there are little empty spaces between the observed range of the data in the children and the observed range in the parent, and a query point can fall into that empty space in between the observed ranges. I’m being very inarticulate here about this, but if I have a set of data points along a line and I choose an arbitrary split, there’s going to be some space between adjacent instances, and in effect the algorithm allocates two additional patterns for that empty space. Because if a query point comes down, it’s considered to be isolated if it falls into this space that was inside the range of the parent and outside the range of the child. That turns out to be really important: if you don’t do this right, then the performance of the algorithm suffers quite a lot. So, I guess my first point would be that if we use some of the data to build the tree structure, and if the tree structure is really following the structure of the data, then we will have tiled the space nicely. And every leaf node by construction is going to be rare, because it will have at most one point in it. So we’re going to have a lot of regions that have essentially zero points or one point in them, and we’ve also completely tiled the space. So, if there is an unusual behavior, it will fall into one of these empty spaces, basically, and will be detected. >> Then how do you tell between interesting anomalous behaviors and uninteresting anomalous behaviors? Because essentially, you won’t be able to do that. >> Well, yeah, because we’re only doing outlier detection here. We can’t tell the difference between an outlier that’s part of the natural variation of the distribution and an outlier that’s due to some unusual behavior caused by something different, like a fraudster. In fact, of course, a fraudster is going to try to mimic the nominal distribution, and so we might want to try to use essential behavior, things that the fraudulent user is not able to manipulate, to help us detect their behavior. But all I’ll say is that in the competitions inside this DARPA program, we were number one for several months using these kinds of techniques. So, maybe that’s a tribute to our feature engineering skills, or maybe the red team made it too easy, I don’t know. But these methods were working pretty well. We were competing against an IBM team that actually has a product in this space, so I don’t know. Anything else?
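To recap the rare-pattern recipe from a few slides back in code form, here is a minimal sketch. It is a hypothetical simplification: a single fixed-depth random tree grown on half the data defines the pattern space, the other half estimates each pattern's probability, and the (normalized) volume of each leaf box plays the role of the baseline measure mu(h); the real construction uses an ensemble and also accounts for the empty-space regions just discussed.

```python
import numpy as np

def grow_patterns(X, lo, hi, depth, rng):
    """Grow a random isolation-style tree of fixed depth on X; the leaves,
    represented as axis-aligned boxes (lo, hi), form the pattern space H."""
    if depth == 0 or len(X) <= 1:
        return [(lo.copy(), hi.copy())]
    j = rng.integers(X.shape[1])
    lo_j, hi_j = X[:, j].min(), X[:, j].max()
    if lo_j == hi_j:
        return [(lo.copy(), hi.copy())]
    t = rng.uniform(lo_j, hi_j)
    left_hi, right_lo = hi.copy(), lo.copy()
    left_hi[j], right_lo[j] = t, t
    mask = X[:, j] < t
    return (grow_patterns(X[mask], lo, left_hi, depth - 1, rng)
            + grow_patterns(X[~mask], right_lo, hi, depth - 1, rng))

def rpad_detector(X, depth=4, tau=0.05, rng=None):
    """Half the data fixes the patterns, the other half estimates p_hat(h);
    a query is flagged if it lies in a tau-rare pattern."""
    rng = np.random.default_rng(rng)
    half = len(X) // 2
    lo, hi = X.min(axis=0), X.max(axis=0)
    patterns = grow_patterns(X[:half], lo, hi, depth, rng)
    X_est = X[half:]
    total_vol = float(np.prod(hi - lo))

    rare = []
    for p_lo, p_hi in patterns:
        p_hat = np.all((X_est >= p_lo) & (X_est <= p_hi), axis=1).mean()
        mu = (np.prod(p_hi - p_lo) / total_vol) if total_vol > 0 else 0.0
        if mu > 0 and p_hat / mu <= tau:          # tau-rare pattern
            rare.append((p_lo, p_hi))

    return lambda x: any(np.all((x >= p_lo) & (x <= p_hi)) for p_lo, p_hi in rare)
```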
Okay, well, let me move on now to assuming that we can find these anomalies, how can we explain them to the user? So, our basic idea is the following We were really troubled by what constitutes a good explanation, and how could we measure that in a rigorous way without doing human subject studies So, here is the thing that we came up with We imagined that the goal of our anomaly detection system is to persuade the analyst that a candidate anomaly is a real anomaly The idea is we’re going to expose one feature at a time to the analyst, and in the worst case until we’ve presented all of the different features to the analyst Then with an appropriate visualization tool, until we’ve either convinced the analyst that it really is an anomaly or the analyst says, no, it isn’t We sort of view the analyst, their job is really to decide whether to say launch
an internal investigation, like a fraud investigation, or not. So, this is the kind of model here. We call this a sequential feature explanation; we had a paper at UAI two years ago on this. In pictures, the vertical axis here is the expert’s or analyst’s assessment of the probability that the data point is nominal, not an anomaly, given the features they’ve seen so far. We’ll say that notionally it starts out pretty close to one, say one minus Epsilon, where Epsilon is the background probability of being an anomaly. Our sequential feature explanation is some permutation of the features; say feature number three is the one we think is the most important. So we start presenting these features to the expert. If we’re being persuasive, the expert’s posterior belief that this is a nominal point starts to drop, and hopefully goes to zero. What we do is choose some threshold, and our metric is something we call the minimum feature prefix, MFP, which in this case would be four: it took the first four features in our ordering to drive the expert’s assessment down below, say, 0.25 or whatever the threshold is. So this MFP is our metric; across a whole dataset of anomalies, we’ll look at the average of that as our performance measure. If our anomaly detector is working well, or if we have a good explanation algorithm, it should produce short explanations. So this is sort of the minimal kind of thing we might do, and obviously it’s only appropriate for problems where the features are meaningful. So, in this case, our technique relied on another one of our anomaly detectors, which is an ensemble of Gaussian mixture models, the EGMM algorithm. We had originally developed this just as a control. We apply PCA, and then in the PCA space we fit an ensemble of Gaussian mixture models, where we randomly vary the number of mixture components and the starting initialization, and we discard mixture components that didn’t fit the data very well. From the ones that survive, we get our anomaly scores, which are, again, the average surprise across those models, so it’s the product of the density estimates, I guess. Then we use a simple greedy algorithm. The score of a particular feature is how much the log likelihood drops when we marginalize it out, or, I’ve got this backwards, there should probably be a minus sign here: it’s how much the surprise decreases when we compare the marginalized distribution to the original distribution. The density will always be higher after we’ve marginalized out that one feature than the full distribution had. So, we score each feature, and the one that led to the biggest change in the density we say is the feature with the biggest impact; then we remove it from the set of candidate features, add it to our answer, and just continue in a greedy fashion. Again, we used a bunch of our datasets, though only seven of the UCI domains instead of 19 in this study. Instead of using real human analysts, we trained regularized random forests on the supervised version of the data, so they can see all the data. Then the question was whether we would convince them: if they saw, say, the first four features, has their predicted probability that the point belongs to the anomaly class rather than the nominal class crossed some threshold? So, we compared against a random ordering of the features and against an oracle that actually knows the labels of the data. What we saw was that we always did better than random,
and sometimes we do almost as well as the Oracle. So, this is the random ordering, this is the Oracle in blue, and this is our greedy algorithm.
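As a concrete illustration of the greedy, marginalization-based ordering described above, here is a minimal sketch. It is hypothetical and simplified: a single multivariate Gaussian stands in for the EGMM ensemble, since marginalizing out features of a Gaussian just means dropping the corresponding rows and columns.

```python
import numpy as np
from scipy.stats import multivariate_normal

def sequential_feature_explanation(x, mean, cov):
    """Greedy sequential feature explanation (sketch). At each step, add the
    feature whose inclusion makes the query look most surprising under the
    marginal density of the features chosen so far."""
    d = len(x)
    remaining, order = list(range(d)), []
    while remaining:
        best_f, best_logp = None, np.inf
        for f in remaining:
            idx = order + [f]
            marginal = multivariate_normal(mean[idx], cov[np.ix_(idx, idx)])
            logp = marginal.logpdf(x[idx])   # log-density of the query on these features
            if logp < best_logp:             # lower log-density = more surprising
                best_f, best_logp = f, logp
        order.append(best_f)
        remaining.remove(best_f)
    return order                             # present features to the analyst in this order
```

The first element of the returned order is the feature shown to the analyst first; the MFP metric above is just how far down this list you have to go before the analyst is convinced.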
Sometimes we don’t do as well. I guess one thing that’s kind of interesting is that for most of these Irvine-type domains, on average, the Oracle only needs to show the analyst one or at most two features to give a convincing explanation. So, that’s kind of interesting, and it gives us a kind of metric of how hard a dataset is. If we look at the KDD Cup 1999 dataset, for instance, it’s always the case that you need only one feature; even our algorithm only needs one feature to explain an anomaly. Whereas if we look at our ADAMS insider threat database, you need about seven on average, so it’s a much more challenging problem. Well, if we’re going to be talking to an analyst and convincing them, we can also get feedback from the analyst. So, we wanted to look at whether we could take the feedback from the analyst and feed it back into our anomaly detection process to re-rank the candidates that we have. We had a paper at ICDM a year and a half ago showing that this could be a potentially big win. What we do is require one parameter, which is a quantile; it basically corresponds to how many cases an analyst can look at in a given period, like a day or a month, whatever the time period is that we’re analyzing. So, the idea is that the analyst can look at, say, sixty cases in a month, and we want to optimize the 60 cases we present. We start by presenting to the analyst the highest-scoring anomaly we have, and then the analyst says, “Yes, that’s a case I’m going to open an investigation on,” or “No, it isn’t,” and we feed back that answer. What we’ve tried to do is learn to reweight. In this case, we’re working with the LODA anomaly detector, which, remember, creates a bunch of one-dimensional random projections. So we learn to reweight those projections to try to rank all of the known anomalies above quantile tau, and all of the known nominals below quantile tau. To do this, we used a modification of a method from Stephen Boyd and famous colleagues called Accuracy at the Top, which was published at NIPS several years ago now. So, we’re re-ranking based on feedback from the expert, and we call this active anomaly detection. I guess the only interesting thing is that if the anomalies are fairly rare, then we will have a significant false alarm rate, so it could very well be that of the first three, or four, five, ten cases we show to the analyst, none of them are actual fraud cases. What can we do?
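Before the results, here is a minimal sketch of that feedback loop. It is only a stand-in: a simple hinge-style update on the LODA projection weights replaces the actual Accuracy-at-the-Top optimization, and `oracle_label` is a hypothetical callback playing the role of the analyst.

```python
import numpy as np

def active_anomaly_loop(S, oracle_label, budget=60, tau=0.03, lr=0.5):
    """S is an (n_points, n_projections) matrix of per-projection anomaly scores
    (e.g. the -log densities from the LODA sketch). Labeled anomalies are pushed
    above the tau-quantile of the combined score, labeled nominals below it."""
    n, m = S.shape
    w = np.ones(m) / m                       # start with uniform projection weights
    shown, found = set(), []
    for _ in range(budget):
        scores = S @ w
        q = np.quantile(scores, 1.0 - tau)   # the "top" threshold
        candidates = [i for i in np.argsort(-scores) if i not in shown]
        if not candidates:
            break
        i = candidates[0]                    # present the top-ranked unseen candidate
        shown.add(i)
        is_anomaly = oracle_label(i)         # analyst feedback
        if is_anomaly:
            found.append(i)
        # hinge-style update: anomalies pushed above q, nominals pushed below q
        margin = scores[i] - q
        if is_anomaly and margin < 1.0:
            w += lr * S[i]
        elif not is_anomaly and margin > -1.0:
            w -= lr * S[i]
        w = np.maximum(w, 0.0)
        w /= w.sum() if w.sum() > 0 else 1.0
    return found, w
```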
Well, one of the advantages of having this threshold tau is that even with no positive examples, we still learn to push all of those non-anomalies down below the threshold tau. So, this seems to help us learn even without any positives, but we had to make some minor modifications. We compared against a baseline where we do no learning: we just take our initial anomaly scores and present them in that order. We also looked at randomly ordering them, and then this is our method. Then there are two other methods in the literature. One is called AI2, which I don’t think has ever been published except as a CSAIL tech report, so it’s probably a class project or something, although it works quite well in some cases. Then there’s something called SSAD, which was in a JAIR paper; their method, from Görnitz et al., is called Marginalize and Cluster, I think. We thought we had an idea for improving it using our Accuracy at the Top method, but as you’ll see it was not an improvement; it’s included just for completeness. So, what we’re going to plot in these plots: along the horizontal axis is how many cases we have presented to the analyst, and the vertical axis is how many of them turned out to be true anomalies. Again, all of this is done with a simulated analyst, and so on. We can see that our baseline method is this blue curve. If we have a budget of 60 cases we can look at, we find about, I don’t know, 25 or 26 or something, whereas if we can incorporate our feedback, we get 45. So, a huge advantage from having the analyst in the loop. However, this is the KDD Cup 1999 dataset, which is extremely easy, so let’s move on to some other ones. This is the Abalone dataset. Here, early on, we get some benefit of our method over our baseline; no feedback is the blue curve, and ours is the red curve.
Then this is the SSAD method, which is doing just about as well as ours. But if we say, well, suppose we only had time to look at 20 cases, then we’d still be seeing more than a factor of two improvement. Eventually, obviously, if you go out far enough on this, there will be no benefit, because eventually we’ll have shown everything to the analyst. Here’s another one, for thyroid. In this case, again, compared to our baseline we’re doing pretty well; here the AI2 method is doing almost as well as we are. But again, the benefit is potentially almost a factor of two in terms of the number of anomalies we can detect. In mammography, similar story. So in general, for all of these, we’re seeing that incorporating the analyst feedback can be really powerful, at least for this LODA algorithm. But now, let me speculate about why this might be overly optimistic. We have to remember that these datasets we’ve constructed here are built from the supervised learning datasets. What’s going on when we reweight these different projections is that we’re essentially reweighting the original features, putting more weight on some than on others. So, it could be that because our anomalies are coming from classes in the classification problem, and those were presumably determined to be classes because they were believed to be separable using the features, we were essentially, in a sort of sneaky way, learning a classifier and taking advantage of that. So, it’s not clear that in a real fraud detection case we would get this same factor of two improvement. However, we have a paper under review right now at KDD, where we’re doing reweighting for the Isolation Forest using a sort of online learning strategy. We have applied this to an advanced persistent threat cybersecurity case, which is not a supervised learning case, and we’re still seeing a substantial benefit from integrating the feedback, not a factor of two, but substantial. Yes? >> Would it help to perturb the labels to answer that question? >> Labeled? >> So you have your classes. >> Right. >> If you put in some label noise, it’s going to start mixing the classes, so they wouldn’t be separable anymore. >> Oh, you mean when we constructed the benchmarks? That’s interesting. We should maybe look at that; that would be a good test, at least for this particular question. I guess another thing is, we could create benchmarks where our nominal points come from just one class and the anomalies are a mixture of all the other classes. Or maybe we should use lots of classes on each side, so that there’s probably not a simple decision boundary that separates the nominals from the anomalies, because there will be very multi-modal distributions on both sides. Yeah. >> So the feedback you have here is very example-driven, right, with the analyst judging individual cases. It seems like you could also make it feature-driven, or, going back to the pattern terminology, pattern-driven. In fraud cases, people are often interested in patterns like a small gas station purchase followed by a big purchase, and the anomalies are evidence that this pattern exists. Have you thought about trying to order things in other ways, where people might be able to give more sample-efficient feedback over an entire feature or pattern as to whether or not they believe something anomalous is happening?
>> We have not. We did try to do some sort of model-based work back in the insider threat dataset, where we had some theories of insider threats and we designed features specifically to look for those threats, like someone copying a large number of files onto a thumb drive. But you have in mind maybe more of a knowledge discovery setting, where the analyst looks at our anomalies. Or maybe you could do, as we’ve been discussing today, some cluster analysis of these anomalies, the way you guys are clustering mistakes, and say, “Oh, I see sort of a pattern in this, and this looks completely bogus to me, this looks like just some side effect of some instrument, so I don’t want to see any more cases like this,” or “this looks interesting to me, so everything that looks like this, let me label that whole cluster.” Yeah. >> Okay. How do you select the examples? That seems to need some kind of set-based criterion, right? If you want to exclude a whole feature and you just randomly select examples, chances are you won’t exclude that feature, whereas if you start selecting them in conjunction,
you can do that. >> Yes. We did look at whether “select the biggest anomaly” is the right thing to do. And Weng-Keen, you may or may not know, when he was a graduate student he was working on rare category discovery, and there you do have some diversity tricks. So we could think of this as: could we also elicit from the expert, again consistent with some of the things I’ve been hearing about here, not only that it’s an anomaly, but “I’m going to call that a fraud anomaly because it corresponds to a kind of use case I know; this is the check-that-the-credit-card-is-working-before-you-use-it pattern.” No, we haven’t thought about that, but it’s a great idea obviously. Yes? >> Can you still reward diversity? Because diversity will definitely change the efficacy of the queries versus the anomalies. So, it seems like that query trick could completely change your metric, or is that not true? >> No, we would really have to also have something in our metric that rewards diversity. Well, as you say, if we could essentially present a group of proposed anomalies and say, “We think these points are all anomalous for the same reason; are we right?” and get feedback on the whole group at once, then that would be great. Then for the next query, we would obviously use a different group. But we might have to balance them. I know in real situations, I remember years ago I visited Bell Labs and Darrell [inaudible] was working on payphone fraud for, well, it wasn’t even AT&T at the time, and they had some statistical model and they were looking for weird behavior. But they also ranked the cases by how valuable they were in terms of the financial impact of those frauds. So, that was it. Okay, well, I’ve got to move along because we’re really behind. So, let’s talk about these two applications. The first one is data cleaning in weather networks. I’m part of a group that’s working on something called the Trans-African Hydro-Meteorological Observatory, and it was founded by two folks. One is a colleague of mine, John Selker, who’s a hydrologist, and the other is a guy named Nick van de Giesen, who’s at the Technical University of Delft, also a hydrologist. They were at some World Meteorological Organization meeting at which a plot was put up showing where the weather stations are around the world that reliably report data to the World Meteorological Organization. Those purple-blue dots are the good weather stations. We can see that in most of Latin America and Africa, the weather infrastructure is just very poor. In the context of agriculture in Africa, most agriculture is rain-fed; there’s not a lot of irrigation. They noted that when you have very poor sensing, you can’t buy crop insurance, because the insurance companies don’t know what risk they would be taking; they don’t know how to price the insurance. So, that leads the farmers to plant their plants very far apart, so that each plant, under the worst-case precipitation conditions, will still get enough water to do well. There have been field trials showing that if you do offer crop insurance, then the farmers plant much more aggressively, plant their plants much closer together, and get much higher agricultural productivity as a result. So, this is one of those examples where, if we could provide better data, it could lead to a crop insurance industry, which could in turn lead to higher agricultural productivity. John Selker is not a modest man; he wants to have Africa leapfrog over the whole rest of the world and become the best-sensed continent. Of course, the price
of sensors is dropping rapidly, and we can now build weather stations that are very cheap. So, we’re trying to deploy 20,000 such stations across Africa and then create various data products. I like to say, “Well, weather stations are really cool, but for some reason, people want to put them out in the weather and expose them to the rain and all kinds of problems, and then they break.” So, this is a kind of Internet of Things project, and I like to say the Internet of Things is going to be the Internet of Broken Things. So, how can we automatically detect which stations have failed, which sensors have failed? We might approach this as a sort of medical diagnosis problem: if we could say, “Well, these sensors tend to get one of five different diseases,” can we learn to recognize the symptoms? But this is one of those cases where we feel like
there seems to be an unbounded number of ways these weather systems can fail. So, where there are known symptoms, we can write rules for them or build probabilistic models, but we want to use anomaly detection to detect new kinds of failures. I think in the interest of time I won’t go through all the technical details here. But I’ll just say that we are combining anomaly detection scores in a dynamic Bayesian network, where we reason about the state of each sensor. Then we’re performing MAP inference so that, given the observations we’ve made, turned into anomaly scores, we try not just to flag the data with anomaly flags, but actually do some root cause analysis and say that the reason you got these 2,000 alarms is that one particular thermometer failed for some time interval. So, I guess I should say, maybe the most interesting or weirdest move here: a typical problem when you try to do diagnosis like this is that you want to model the sensor readings as a function of the state of the sensors, but these sensor-reading distributions can be very messy, and so we replace the actual distribution of sensor readings with the anomaly score assigned to those sensor readings. This has the nice virtue of normalizing everything to lie between zero and one, and of having a more predictable distribution. At least that’s our hypothesis, but we don’t have definitive results to prove it. I’ll say that we’re currently working a lot on comparing sensor readings at multiple stations. So then the other problem, which is more related to the open world, is open category classification. Given that I have training data for, say, K classes that I’ve collected, and I’ve trained a multi-class classifier for them, but my test data may contain queries corresponding to new classes, how can I detect them?
We call them alien queries, the queries that belong to new kinds of objects. Our approach is, given our training data, we train both our multi-class classifier and an anomaly detector. Then, when an input query comes in, we look at the anomaly score, and if it’s greater than some threshold, we reject that point and say, “We think this might belong to a new category, so we’re not going to classify it.” Otherwise, we feed it through our classifier and make our prediction. For me this was motivated by another ecological problem: for about a decade I was working with some folks at the EPA and on campus on freshwater stream health monitoring. Every year the EPA does randomized sampling of freshwater streams in the United States, and what they do is use a kick-net, which is what these guys are doing here, to collect insect larvae that live on the substrate of the streams, and then they bring those specimens into the lab. Currently, they manually identify them to genus or species, but this is very time-consuming, and so we thought, “Well, we’ll build a robotic device that can singulate the specimens, as they say, and then photograph them, and then we’ll use computer vision to tell them apart.” We did this back in the mid-2000s, when everything was bag-of-SIFT and Support Vector Machines as the classification technology. But we got up into the low 90s percent correct, over ultimately 54 categories. So we were all excited about that, and then we started to look at actually deploying it, and we realized that although our collaborators had collected specimens and we had these beautiful training data for all of them, the test samples are going to contain other stuff. It turns out there are more than 29 species, more than 54 species, out there; it’s estimated there are something like 1,200 species of freshwater macroinvertebrates, as they call them. So this led us to think about how we could deal with this problem. Of course, more recently in computer vision this has become a popular problem, so there are several techniques out there, but as far as we know, no one has looked at the question of whether we can provide any kind of theoretical guarantee. So, we are looking at a setup now where we have two training sets. One training set is our clean training data: our collaborators went out and collected all these training data for us, it’s all nicely labeled, and we know there are no contaminating samples in it. But in addition, we’re going to require that we have an unlabeled collection of samples that are representative of our test queries. So, here, we are making a well-defined anomaly distribution assumption, that we can get a sample from the distribution of anomalies.
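A minimal sketch of that reject-then-classify pipeline might look like the following (hypothetical function names; the Isolation Forest serves as the anomaly detector, as in the experiments mentioned at the end, and the threshold is left as a parameter, to be set by the quantile procedure described next):

```python
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier

def build_open_category_pipeline(X_clean, y_clean, threshold):
    """Train a multi-class classifier and an anomaly detector on the clean data;
    at query time, reject likely-alien queries instead of classifying them.
    Assumes integer class labels; -1 marks a rejected (possible new-category) query."""
    clf = RandomForestClassifier(n_estimators=200).fit(X_clean, y_clean)
    detector = IsolationForest(n_estimators=200).fit(X_clean)

    def predict(X_query):
        # scikit-learn's score_samples is higher for normal points, so negate it
        anomaly_score = -detector.score_samples(X_query)
        labels = clf.predict(X_query)
        return np.where(anomaly_score > threshold, -1, labels)

    return predict
```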

In fact, even to talk about what it means to have a PAC guarantee on detecting, say, 99 percent of all the aliens, right there I'm assuming that's a sensible statement, that there is such a distribution So, we're going to assume that this contaminated sample, which we'll call S_m for mixture, is a mixture where, again, the proportion of aliens is Alpha So, this density is a mixture: one minus Alpha times the clean distribution p_0, plus Alpha times the anomaly distribution Now, suppose this is the CDF of the anomaly scores of the anomaly points If we want to detect, say, 95 percent of the aliens or anomalies, then we want to know this quantile: the anomaly score such that 95 percent of the probability mass is greater than it So we'd like to estimate this threshold tau at 0.05 or 0.01 or whatever we want; we'll call this q, the 0.05 quantile So, we note that since this is a mixture distribution and we're in a one-dimensional world, the same holds for the cumulative distribution functions: the CDF of the anomaly scores for the mixture distribution is a convex combination of the nominal distribution and the anomaly distribution So if you do a little algebra, you can solve for the anomaly CDF We can estimate the mixture CDF from data, and we can estimate the clean CDF from data If someone will just tell us Alpha, or give us maybe a reasonable upper bound on Alpha, then we can use that to estimate this anomaly CDF So, the picture looks something like this This might be the CDF of the clean data, the black one is the mixture data, and maybe this is our anomaly CDF So, we're going to set our threshold somewhere around in here We'll get all the anomalies, but we will have some number of false alarms too Of course, this all depends on how good the anomaly detector is, and how well it can distinguish the aliens from the known classes When we look in practice, this is the empirical CDF of the clean data, this is the empirical CDF of the mixture data, and of course, we're essentially subtracting them Unfortunately, this estimated CDF of the anomalies can go negative, so it's pretty sad But if we have a larger sample, it starts to look reasonable It still can cross this threshold a few times, so we look at the largest value of the anomaly score that crosses this 0.05 line and we set our threshold there That's what this pseudocode says Then we're able to get a guarantee: if the number of samples in each of these two datasets is n, then the required sample size is something like this We have a one over Epsilon squared term, something that's roughly a one over Alpha squared term, and then a funny term that comes from the fact that we're estimating two different cumulative distribution functions, and there are concentration results for CDFs So, then with probability one minus Delta, the alien detection rate will be at least one minus (q plus Epsilon) So once again, we need some elbow room in the form of some Epsilon Anyway, so that's what we can do
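[Editor's note: the following is a simplified reconstruction of the threshold-setting procedure described above, not the algorithm from the paper under review. It estimates the alien CDF as F_a(s) = (F_m(s) - (1 - Alpha) * F_0(s)) / Alpha from the two empirical CDFs, then takes the largest anomaly score at which this estimate is still at or below the target quantile q. The function names, candidate grid, and toy Beta-distributed scores are my own assumptions.]

```python
# Minimal sketch (not the published algorithm): set a rejection threshold from
# a clean sample and a contaminated ("mixture") sample of anomaly scores,
# assuming the alien proportion alpha (or an upper bound on it) is known.
import numpy as np

def empirical_cdf(sample, points):
    """Fraction of the sample that is <= each evaluation point."""
    sample = np.sort(np.asarray(sample))
    return np.searchsorted(sample, points, side="right") / len(sample)

def alien_threshold(scores_clean, scores_mixture, alpha, q=0.05):
    """Estimate the anomaly-score threshold aiming to detect roughly (1 - q) of aliens."""
    candidates = np.unique(np.concatenate([scores_clean, scores_mixture]))
    F0 = empirical_cdf(scores_clean, candidates)     # clean (nominal) CDF
    Fm = empirical_cdf(scores_mixture, candidates)   # mixture CDF
    # Solve F_m = (1 - alpha) * F_0 + alpha * F_a for the alien CDF F_a.
    Fa = (Fm - (1.0 - alpha) * F0) / alpha           # can dip below 0 or above 1 in practice
    # Largest candidate score whose estimated alien CDF is still <= q:
    # everything above the threshold is rejected as alien.
    below = candidates[Fa <= q]
    return below.max() if below.size else candidates.min()

# Toy usage: clean scores concentrated low, alien scores concentrated high
rng = np.random.default_rng(0)
clean = rng.beta(2, 8, size=2000)
aliens = rng.beta(8, 2, size=200)
mixture = np.concatenate([rng.beta(2, 8, size=1800), aliens])   # alpha = 0.1
tau = alien_threshold(clean, mixture, alpha=0.1, q=0.05)
print("threshold:", round(float(tau), 3))
```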
Then experimentally, I didn't put the results here, but we have a paper under review for ICML where we also show experimental results for this Of course, it's a loose bound, but it is a bound, and it does work in practice For some reasonable settings, on some image datasets and some benchmark datasets, using Isolation Forest as our anomaly detector, we get false alarm rates somewhere in the 10 to 20 percent range for reasonable sample sizes So we're very happy with this, very pleased with it So, to finish then For outlier detection, I talked about the different unsupervised and mixed settings Our benchmarking suggests that this Isolation Forest algorithm is quite nice and, surprisingly, that these percentile or quantile methods are not doing so well I should say in passing, obviously these things don't work at all well by themselves for image data, or any kind of signal or time-series data, as we've kind of discussed already

So, for the next steps here, I'm very intrigued by things like GANs, and autoencoders, and other kinds of manifold learning and density estimation techniques that are coming out of the deep learning literature, and that's the next thing I want to try There are some very nice papers from the last couple of NIPS conferences about this, and I think that might give us much better anomaly detectors than we can get using these simple i.i.d. feature-based methods Then we talked about the RPAD theory, which I think accounts for their behavior reasonably well Then we talked about presenting the most important features, a simple ordered feature explanation, and about incorporating expert feedback using this Accuracy at the Top strategy Finally, guarantees for Open Category Detection I apologize for running long; I thank my funders, and I'll open it up for questions >> Thanks a lot >> Thank you very much. Yeah >> So, at the beginning of the talk, you mentioned that there was a fundamental assumption about the nominal distribution and the anomaly distribution? >> Well, the assumption I wanted to call out is: is it safe in my application to assume that the anomalies come from a well-defined distribution? >> I'm curious, can you sketch how to weaken such an assumption to allow the fraudster in the loop, not just the analyst in the loop, but an arms race between the anomaly detector and the people who are trying to game the thing? >> Well, I don't know how to do that in a theoretical framework It seems to me that we end up having to think about it more like a game, a minimax kind of adversarial learning setup, and try to get some guarantees that way, so that would be quite different from this Of course, there is some interesting work in adversarial learning for supervised learning, and everybody in deep nets right now is trying to deal with adversarial examples Some of the strategies being explored, like using GANs and trying to map queries onto the data manifold to strip out these adversarial attacks, are very closely related to the same idea of detecting the adversarial queries as outliers that fall outside of our data distribution So, I think anomaly detection will again be part of the answer, but this theory for open category detection will not be part of the answer there We need some other, more adversarial approach >> And then just a clarification: was the DARPA ADAMS thing run as an arms race between red and blue teams, or was it?
>> Yeah, so there was a red team that was actually based on people from CERT, the computer security guys They have this whole white paper on different kinds of insider threats in organizations that they've collected both from DOD and from industry My favorite one is what they call the Insider Startup, where a sub-team decides the company doesn't appreciate us, we're going to take the source tree with us and go form a startup But there were other things, like someone who's passed over for promotion, survivor guilt, all of these, so they have this whole set of scenarios, a lot of them very psychological Then they hired a script writer to write scripts about what kinds of emails and text messages these people would be exchanging in addition to their desktop data, then they acted all of this out under the desktop monitoring software, and then overlaid it on the behavior of real employees in the company We had access to this in a very controlled setting, and we had to see if we could find those red team inserts But unfortunately, we don't really know how hard they were, and of course, we don't have access to the data anymore It would have been interesting, once we had the ground truth labels, to go and see how we could train a supervised learner, like a nearest neighbor classifier, and figure this out quickly, but we don't know >> In the weather station example, I'm curious Mostly, the anomaly detection algorithms you talked about seemed to be optimizing for low false alarm rates and high detection rates But in some systems, depending on how you're reacting to the anomaly, you might be able to tolerate lots of false alarms, depending on what's important and how you're recovering from a problem

I was curious, in the weather station, what happens when you detect an anomaly? >> Well, in all existing weather networks that I'm aware of, they have a lot of flags and then they rely on some human eyeballs to look at them all So for instance, the Oklahoma Mesonet is the best weather network in the United States They have a big budget, and they have four full-time meteorologists for 130 stations Scaling that up to 20,000 stations in Africa would need a heck of a lot of people; I just don't think that strategy scales to the Internet of Things, so that's why I think we need to bring much more intelligence to bear Also, most of the existing weather networks use basically hand-coded expert systems for writing these problem detection rules, and so they are completely ad hoc So, I think this is in some sense low-hanging fruit for us, because by bringing probabilistic inference on top of the alarming, we should be able to do much better We still have lots of alarms, but we can say these 1,000 alarms are really just due to one failure, and create a trouble ticket for it We're well on the way to building a system where the weather network manager logs in every day and gets a report of the new failures we found in the last 24 hours Then he can very easily create trouble tickets and assign them to the field engineers, who will then be scheduled to go out and do the repairs In our dreams, that would be integrated with an inventory system and so on But that's how they're set up >> But you have to give the system a lot more knowledge than usual about how sensors fail, essentially error models, and that's not so much under your control, where experts have very rich models about how sensors do weird things, but they filter them in certain ways. [inaudible] >> In a mature system, presumably, we ultimately find these recurring anomalies Then we bring them into the lab and say, "Okay, what's going on here?" We've already had this happen We didn't discover it with our code, but there was a manufacturing defect or design defect in the station, and so all the generation-two stations are having to be replaced But over time, you will accumulate a library of failure models with very specialized detectors for those We still want this anomaly infrastructure as a fallback, because there's always something new We haven't had any of our stations attacked by an elephant yet, but I'm sure it'll happen In Oklahoma, they have unbelievable stories about cows getting into the stations What is the symptom of a weather station where some ants have decided to lay eggs all over the circuit board? It's hard to have a disease model for that >> I was just wondering about the online case of anomaly detection You seem to assume here in the problem statement that the nominal distribution is stable over time >> Absolutely, yes >> What happens if the nominal distribution starts drifting over time, because the way the data is generated changes, or whatever other issues?
>> So, somehow we're going to have to track that distribution Yeah, we haven't confronted that case here Although in the weather data case, it's not stationary, but it is very predictable on an annual cycle and a daily cycle; there are lots of periodic components that you can model with traditional time-series techniques and pull out of the signal Then, in your dreams, what you have left is uncorrelated white noise >> [inaudible] >> Maybe that's part of the answer, but actually we spent a whole year trying to do good feature engineering and decided this isn't working for us It's just too much work to do all that modeling, and when you make modeling errors, it introduces artifacts, and then we start making mistakes because of those So, we're looking now at nearby weather stations that should be experiencing the same disturbances, the same annual and daily cycles, and then comparing them How well can we predict one using the other, or what's their joint distribution, and we use that to do the anomaly detection So, yeah
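[Editor's note: as a minimal sketch of the neighbor-comparison idea just described, not the project's actual method, one could regress a target station's readings on readings from nearby stations and flag large standardized residuals. The use of scikit-learn's Ridge regression, the 4-sigma cutoff, and the toy data are illustrative assumptions.]

```python
# Minimal sketch (illustrative, not the actual system): detect anomalous readings at a
# target weather station by predicting it from nearby stations and flagging large residuals.
import numpy as np
from sklearn.linear_model import Ridge

def neighbor_residual_anomalies(X_neighbors, y_target, z_cutoff=4.0):
    """X_neighbors: (T, k) readings from k nearby stations; y_target: (T,) target readings."""
    model = Ridge(alpha=1.0).fit(X_neighbors, y_target)
    residuals = y_target - model.predict(X_neighbors)
    z = (residuals - residuals.mean()) / (residuals.std() + 1e-9)
    return np.abs(z) > z_cutoff     # True where the target disagrees with its neighbors

# Toy usage: three correlated neighbors, one corrupted reading at the target
rng = np.random.default_rng(1)
signal = np.sin(np.linspace(0, 20, 500))                      # shared weather signal
X = signal[:, None] + 0.1 * rng.normal(size=(500, 3))         # neighbor stations
y = signal + 0.1 * rng.normal(size=500)
y[250] += 5.0                                                 # injected sensor fault
print(np.where(neighbor_residual_anomalies(X, y))[0])         # should include index 250
```

[As the exchange below points out, such a model would need to be refit on a sliding window, because the spatial correlations themselves change with the weather regime.]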

>> So you retrain the models, or is there some drift between predictions? >> Yes. Actually, we need to retrain the models, really a sort of sliding window thing, because again, the nature of the correlations changes Precipitation, for instance, is our most important variable, our most problematic sensor, and it has the smallest spatial scale in terms of spatial smoothness In the wintertime, if we think about the Pacific Northwest, it's very smooth; a very large storm affects the whole Pacific Northwest So, if it's raining in one place, it should be raining at another station, and we can use that correlation But in the summertime, or in the late spring when we get these showers, stations become extremely uncorrelated, so then we need to, in a sense, switch models and use a model that corresponds to a showery day with lots of little cells Really, I think we need to be tracking that, because it really is happening day by day with the weather The first day of a rain event might have this large scale, and then by the third day, it's broken into these little uncorrelated cells So, I think we actually need to track that kind of larger spatial phenomenon Speaking of larger spatial phenomena, another thing is that if you have a spatially clustered set of anomaly alarms all ringing at once, it's probably a real weather event, an unusual weather event, not the stations being broken, and that knowledge is not anywhere in our system yet, but the meteorologists definitely know this Of course, there are other data sources like satellites, and radar, and things that we can use as additional sources of data for this >> Okay. With that, we probably should thank Tom. Thanks very much >> Thank you