Edward McFowland III
AT&T Labs Fellowship Award Winner
Each year, AT&T Research through its AT&T Labs Fellowship Program (ALFP) offers three-year fellowships to outstanding under-represented minority and women students pursuing PhD studies in computing and communications-related fields. This year three students received fellowships. Ed McFowland III is one.
Using anomaly detection to ensure data integrity
Edward McFowland III is someone who plans ahead, sets goals, and then sticks to them. It’s a simple formula, but a powerful one. It got him accepted into Yale, a goal he set as an unformed high-school freshman who wanted to impress but whose writing skills needed work. So he did the work, staying after school two or three times a week for three years to get extra writing help. And he achieved his goal.
Since then a lot has changed, and the goals have become more informed. Now entering his second year of graduate school, he has set a particularly ambitious one: nonparametric anomaly detection in general data.
The idea behind anomaly detection is simple (setting aside the “nonparametric” part of it for the moment). You model what you expect to find, and you model what you actually see, and you compare. Where there’s a sufficiently significant deviation between the actual and the expected, you have an anomaly, and an indication that something of interest may be occurring.
One classic use of anomaly detection is stopping credit card fraud. Because people have unique buying patterns, credit card companies can create customer profiles based on an individual’s past purchases. When a new purchase doesn’t fit the expected pattern (a 70-year-old Manhattan woman buying a chrome hubcap for a ’68 Chevy, for instance), alarms go off at the credit card company.
But what if you want to find anomalies but have no concrete means to interpret or model the past history and therefore develop expectations?
Statistics provides one solution by offering probability distributions (binomial, Gaussian, Poisson, exponential, etc.) based on assumptions about the process generating the data, or about various aspects of the data: Is it discrete or continuous? Is it temporal in nature?
. . . one hospital reporting a slightly elevated number of cases is not statistically significant; two or three other hospitals within the same area having the same slight increase is significant . . .
As one example, Ed’s academic advisor (Daniel Neill) is doing disease surveillance by looking for larger-than-expected numbers of hospital visits for various symptoms (shortness of breath, coughing, flu-like symptoms) along with sales of related medication. One data set contains counts (the number of visits to local Pittsburgh hospitals within each zipcode). Since the data involve some average probability of a discrete event happening within a certain interval, the Poisson distribution applies, giving a baseline of how many visits can be expected.
So far this is traditional statistical methodology, though adapted for the specific problem. But Ed and his advisor are using computer science techniques to go one step further, applying statistical methods to a massive dataset: in this case, all US hospitals. Instead of looking at each hospital in isolation, it becomes possible to compare each hospital to every other hospital and thus discover more subtle trends than would otherwise be apparent. So while one hospital reporting a slightly elevated number of cases is not statistically significant, two or three other hospitals within the same area having the same slight increase is significant, and much more noticeable. In fighting epidemics, the earlier warning afforded by finding clusters of outbreaks could potentially save many lives.
Searching over groups of hospitals to find these subtler patterns is computationally expensive, and the cost grows exponentially with the number of locations considered. One way to reduce the computational cost is to scan only some groups of records and not others; the problem has always been ensuring that the most anomalous groups are still found. Ed’s advisor has developed a method guaranteed to find the most anomalous group. Called Linear Time Subset Scanning (LTSS), it works like this: if a scoring function (which evaluates the anomalousness of a group of records) in conjunction with a priority function (which evaluates how anomalous an individual record is) satisfies the LTSS property, then the most anomalous subset must be one of at most N (the number of locations in the dataset) possibilities; the cost of searching is now linear instead of exponential.
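The LTSS search idea can be sketched in a few lines. The score and priority functions below are simplified, hypothetical stand-ins (a ratio of observed to expected counts, and an excess-count statistic) chosen only to make the mechanics concrete; the real scan statistics are more sophisticated.

```python
# Illustrative sketch of an LTSS-style scan. Records are hypothetical
# (observed, expected) hospital visit counts; score/priority are toy
# choices, not the actual scan statistics.

def priority(record):
    # How anomalous is one record on its own? Here: observed vs expected.
    observed, expected = record
    return observed / expected

def score(subset):
    # Anomalousness of a group; here a simple excess-count statistic.
    obs = sum(o for o, e in subset)
    exp = sum(e for o, e in subset)
    return obs - exp

def ltss_scan(records):
    # If score and priority satisfy the LTSS property, the best subset is
    # one of the N "top-k by priority" prefixes, so only N subsets need
    # scoring (after a sort) instead of all 2^N.
    ranked = sorted(records, key=priority, reverse=True)
    best_subset, best_score = None, float("-inf")
    for k in range(1, len(ranked) + 1):
        s = score(ranked[:k])
        if s > best_score:
            best_subset, best_score = ranked[:k], s
    return best_subset, best_score

hospitals = [(12, 10.0), (30, 12.0), (9, 9.5), (25, 11.0)]
subset, s = ltss_scan(hospitals)  # the three elevated hospitals are grouped
```

The point of the sketch is the loop structure: scoring N prefixes of a priority-sorted list, rather than enumerating every possible subset.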
. . . his method of fast-scanning nonparametrically in general data is being put to use for ensuring the quality of data received from cell towers.
Ed is utilizing this LTSS theory, developed originally for spatial-temporal datasets (such as the hospital-visit studies), to create new algorithms that find patterns in general datasets. His most recent and notable algorithm is Fast Generalized Subset Scan (FGSS), which efficiently finds, and characterizes, anomalous groups of data. General datasets present a new set of challenges, since Ed cannot assume a parametric distribution or rely on any pre-existing model or expectations to guide him. Right now there are few methods that find patterns or groups of anomalies in general data without knowing a priori what the anomalies “look like”, and those that exist suffer from various problems, most notably a trade-off between scalability and detection power. Ed is one of the first exploring this area from the standpoint of maintaining both scalability to massive datasets and high detection power.
His FGSS algorithm, the central topic of his PhD research, looks to the data itself to determine what is expected and what is anomalous. To do this, he first learns a conditional probability distribution from the data to model the expected data distribution. He then arranges the data as a matrix, with a row for each record and a column for each attribute. In a matrix of sensor-reported data, for example, each row represents a single sensor, and each column represents a single measurement (attribute) taken for a day.
Each value taken from a sensor (left) is converted to a probability (right), based on the distribution of values across the entire data set.
He then assigns each cell a probability based on how likely it is that a particular value should appear there, using a null model, which models how data values are distributed when sensors are operating normally (see sidebar). The model takes into account dependencies that might change the anomalousness of a value. For instance, if the sensor measures the number of cars crossing a bridge each day, the same count may be normal or anomalous depending on whether it’s a work day or the weekend.
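A rough illustration of this conditioning step, with entirely hypothetical data: a null distribution over traffic levels is estimated separately for workdays and weekends, so the same reading gets a different probability in each context.

```python
# Hypothetical sketch: estimate how likely each reading is under a null
# distribution learned from normally-operating sensors, conditioned on a
# context variable (workday vs weekend). Data and categories are invented.
from collections import defaultdict

def fit_null(history):
    # history: list of (context, value) pairs observed during normal operation
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for ctx, val in history:
        counts[ctx][val] += 1
        totals[ctx] += 1

    def prob(ctx, val):
        # Empirical conditional probability P(value | context)
        return counts[ctx][val] / totals[ctx] if totals[ctx] else 0.0

    return prob

history = ([("workday", "high")] * 90 + [("workday", "low")] * 10
           + [("weekend", "low")] * 80 + [("weekend", "high")] * 20)
p = fit_null(history)
# "low" traffic is unremarkable on a weekend (p = 0.8) but unusual on a
# workday (p = 0.1), so only the workday reading looks anomalous.
```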
Then the FGSS algorithm, which utilizes the LTSS property pioneered by his advisor, looks for co-occurring, anomalous records.
As in the hospital project, one record with an anomalous attribute value is not that interesting, but six records with anomalous values are much more interesting, strongly indicative of a systematic problem such as a malfunctioning sensor. Moreover, searching for co-occurring groups of anomalous records allows FGSS to find very subtle anomalous patterns. One sensor with a metric value that is slightly elevated, given the values of the record’s other metrics, probably indicates noise. However, six sensors with the same metric slightly elevated, given the values of their other metrics, is typically much more interesting. Similarly, one sensor reporting a slightly elevated metric value for five days straight is much more meaningful than an isolated one-day report. Groups provide much more information than single records.
Substitute “cell tower” for “sensor,” and you have the gist of his summer intern project at AT&T Research: Identifying and categorizing glitches in mobility data. Here his method of fast-scanning nonparametrically in general data is being put to use for ensuring the quality of data received from cell towers.
How does a glitch differ from an anomaly? A glitch is an artifact of an error in the data collection process (data might be missing, duplicated, or logged incorrectly), whereas an anomaly is strange or unlikely behavior, correctly logged, that helps network planners understand what’s going on in the network. A cell tower in a small town reporting 0 usage weekday afternoons until 5:00 might seem strange, but there may be a simple explanation: everyone works in a nearby city, so the tower is not used during this time. This is interesting information for network planners, who could replace the tower with one that consumes less power or requires less maintenance.
A glitch, though, is a mistake; it’s not real data and therefore shouldn’t be stored along with data used for decision-making. A tower in Manhattan reporting 0 afternoon usage is clearly sending wrong information and therefore shouldn’t be used for resource allocation or other network planning decisions.
To find data glitches, cell tower values are scored multiple times, once per anomaly detector. These separate scores are combined into a single glitch score, which is then assigned a probability (empirical p-value), taking into account the distribution of glitch scores across the data set.
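The scoring pipeline in that caption can be sketched as follows. The detectors, values, and the sum used to combine scores are hypothetical stand-ins; the point is the shape of the pipeline: several detectors score each value, the scores are combined, and each combined score is ranked against all the others to produce an empirical p-value.

```python
# Sketch of the glitch-scoring pipeline (toy detectors and data).

def glitch_score(value, detectors):
    # Combine per-detector scores; a plain sum is used here only for
    # illustration. The actual aggregation is a design choice.
    return sum(d(value) for d in detectors)

def empirical_p_value(score, all_scores):
    # Fraction of combined scores at least as extreme as this one:
    # a small p-value means few other records look this anomalous.
    at_least = sum(1 for s in all_scores if s >= score)
    return at_least / len(all_scores)

# Toy detectors: a missing-value check and a crude outlier check.
detectors = [
    lambda v: 1.0 if v is None else 0.0,                   # missing?
    lambda v: 1.0 if v is not None and v > 100 else 0.0,   # outlier?
]

values = [5, 7, None, 6, 450, 8]
scores = [glitch_score(v, detectors) for v in values]
pvals = [empirical_p_value(s, scores) for s in scores]
# The missing reading and the outlier get small p-values; normal
# readings get p-value 1.0.
```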
But glitches and anomalies are similar in that both differ from expected values, and for this reason anomaly detection is often used for finding glitches.
The usual method is to look at glitches individually, since the conventional wisdom is that different types of glitches occur independently of one another: that missing data is not connected with outliers, for example. However, this is not true for the network, where one underlying problem can spawn different types of glitches. As just one example, an overloaded link (an outlier for usage) might cause pollers to stop polling (missing data). If glitches were examined only by type, this pattern would be missed.
More interesting, from a network point of view, are co-occurring glitches, regardless of glitch type. So Tamraparni Dasu, one of Ed’s mentors at AT&T Research, together with colleagues in database research, had created a framework for using a series of anomaly detectors (Is the data missing? Is it a duplicate? Is it an outlier?) to assign several measures of anomalousness to each attribute, measures she then combines into an aggregated score.
Ed’s arrival held the promise that AT&T would be able to expand its current anomaly detection to find more efficiently exactly where the data glitches were concentrated. Working through the theoretical facets of a joint method took about four to five weeks, with Ed and his mentors making sure each step of the algorithm was statistically sound, since it was now based on multiple anomaly methods instead of solely on a probability distribution. (Ed’s FGSS method had previously used a null model to compute probabilities, which were then turned into empirical p-values.) Coding took another four weeks, since Ed’s code had been designed to work with a probability distribution, not multiple anomaly methods. After that it was smooth sailing: good results were seen almost immediately, and with little tweaking.
Ed’s method has had immediate value, making it possible to more quickly and efficiently find exactly where in the data glitches are concentrated.
He’s particularly motivated by the need for pattern detection in the policy-making process, which is particularly important in developing countries.
More specific information about the FGSS algorithm will be published later this fall in a paper to be co-authored by Ed and his advisor: “Fast Generalized Subset Scan for Anomalous Pattern Detection.” This is the first of four planned big papers, approximately one per year (interspersed with smaller papers), each further extending the method of nonparametric anomalous pattern detection, toward the goal of a General Framework for Detecting and Discovering Anomalous Patterns in Data.
After what he’s accomplished at AT&T this summer, it’s hard to believe he will be only in his second year of graduate school when he returns this fall to Carnegie Mellon University (CMU), where he also did his undergraduate work. Attending the same university for both undergraduate and graduate study is not the expected course, but in Ed’s case his undergraduate work in information systems differs enough from his graduate work, with its heavy emphasis on statistics and machine learning, that both he and CMU are extremely satisfied with the arrangement.
And Yale? Ed never made it to Yale. He intended to go, but an unplanned, highly anomalous event intervened. When he called to notify CMU he would not attend, he was challenged. How could he reject CMU without having visited? (Ed hadn’t pre-visited any of the universities he applied to.) A plane ticket was sent, and two days later, CMU was able to persuade Ed to change his mind. As Ed sees it, the decision to change course was one of the best things that happened to him. CMU’s Machine Learning, Information Systems, and Public Policy departments are among the strongest in the country, and he has forged an extremely productive relationship with his advisor.
Ed’s future plan (because with Ed, there is always a plan) is to stay in academia, where he feels his skills can do the most good for the most people. His real interest is people and society, and he wants his work with data to make a difference in people’s lives (at CMU, he sits in the Information Systems and Public Policy department). He’s particularly motivated by the need for pattern detection in the policy-making process, which is particularly important in developing countries, where data quality and the lack of data are especially problematic. This is why he was so excited about his internship with AT&T Research: he immediately saw that his work could apply to problems reaching far beyond AT&T cell towers. Data glitches in developing countries are more frequent, harder to find, much more costly to fix, and lead to even more problematic circumstances when left unfixed. Another interesting problem particularly relevant to the developing world is prioritizing data streams. If acquiring and using a data stream costs X (which can be measured in time, dollars, effort, etc.), picking the streams that are most useful makes the best use of limited resources.
On a more personal level, life in academia will give him the opportunity to mentor students, especially minority students, and instill in them an appreciation for what science, math, and engineering can accomplish for society at large.
Parametric vs nonparametric
The difference between parametric and nonparametric is simply this: In general, parametric models assume an a priori family of distributions (such as Gaussian). The data are used to estimate the parameters of the distribution, identifying the specific distribution from the family that best fits the data at hand. Nonparametric models make no assumption about what the distribution should look like. Both have advantages and disadvantages.
Parametric modeling lets you assume more information about a dataset, so the amount of data and computation needed to estimate the parameters of a distribution is relatively small. But choosing a distribution that doesn’t fit the data can lead to incorrect (perhaps even wildly incorrect) conclusions. Nonparametric modeling avoids making false assumptions about the data, but because you can assume less information, it takes a large amount of data and computation to estimate the distribution reliably.
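The contrast can be made concrete with a tiny example (the numbers are invented): a parametric model reduces the data to a few estimated parameters, while a nonparametric approach answers questions directly from the empirical sample.

```python
# Parametric vs nonparametric, side by side, on invented sensor readings.
import statistics

data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]

# Parametric: assume the data are Gaussian and estimate its two
# parameters; all later questions are answered from (mu, sigma) alone.
mu = statistics.mean(data)
sigma = statistics.stdev(data)

# Nonparametric: no assumed family; "how unusual is x?" is answered by
# an empirical tail fraction computed directly from the sample.
def empirical_tail(x, sample):
    return sum(1 for v in sample if v >= x) / len(sample)

tail = empirical_tail(10.2, data)  # fraction of readings >= 10.2
# Both can flag the same point, but the parametric answer is only as
# good as the Gaussian assumption; the empirical one needs more data.
```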
A null model is created from data which represents the distributions of values when sensors are behaving normally.
For example, when Sensor 2 = 4, then Sensor 3 = Y is the usual (or normal) behavior, and the null model would capture that dependence between Sensor 2 and Sensor 3. So for record 1, the null model determines the probability (also known as performing inference on the model) that Sensor 1 = 31.1093 GIVEN Sensor 2 = 4, Sensor 3 = Y, and Sensor 4 = 4.20938. The response is a probability (which, according to the graphic, is 95%). The null model can then be used to determine the probability that Sensor 2 = 4 GIVEN Sensor 1 = 31.1093, Sensor 3 = Y, and Sensor 4 = 4.20938, which according to the graphic is 90%. This is done for every column and every row.
If not data mining or machine learning, what else?
Academically I would study chemistry (medicine), or Latin American language and culture. Non-academically it would be professional baseball player (if I could be good enough).
Role models in life
I know it's cliché, but my mom and my sister are my best role models. They made me the person I am today, and everything I do is to make them proud.
Heroes from history or real life.
Roland Fryer Jr., Allen Newell & Herbert Simon.
Academia, the business world, or somewhere else?
Professor and head of my own academic research lab.
What motivates you?
Equally, the people along the way who didn't think I would amount to much and those who always believed in me.
What single course helped decide your future study?
Intelligent Decision Support Systems; it was the first course that introduced me to data mining.
Most fun course?
Machine Learning. I had a very tough instructor, but I learned a lot, made some very good friends, and had so much fun in the process.
Course you most regretted not taking?