AT&T Labs Fellowship Award Winner
Each year, through the AT&T Labs Fellowship Program (ALFP), AT&T Research offers three-year fellowships to outstanding underrepresented-minority and women students pursuing PhD studies in computing- and communications-related fields. This year, three students received fellowships, including Sean Sanders.
Detecting the fastest-growing app
Tracking the latest killer app, preventing pressure ulcers in patients with spinal injuries, classifying network traffic, evaluating polymers as a data storage medium—what do these projects all have in common? For one, they involved Sean Sanders, a recent Georgia Tech graduate.
Well-rounded with interests in everything from math and computers to athletics and music, Sean has always been curious about how and why things work the way they do, a curiosity not confined to one particular area. So in late high school, when he was expected to start focusing on a particular area of study, it wasn’t immediately obvious to him what that would be. He ended up choosing computer engineering (he was, after all, interested in both the hardware and software aspects of computers), and it was in studying computing that he discovered a way to specialize in one subject while still working on problems in a diversity of contexts.
It was data patterns. All his eventual projects could be better understood through analyzing patterns in data. In the pressure ulcer project, data patterns allowed Sean to determine how often wheelchair-bound patients needed to adjust their weight to prevent pressure ulcers. (These ulcers, the second leading cause of death for those with spinal injuries, result from sitting so long in one position that blood flow is cut off, and bones in the pelvic area begin to rub against the muscle and skin.)
Everything leaves clues, especially in highly interconnected networks where an activity in one place disrupts something else . . .
From sensors embedded in a seat pad, Sean collected and then classified data points according to whether the patient was stationary or in partial or full relief. No previous medical experience was needed; the information, or clues to the information, was contained in the data. Just by looking at the data and observing how certain data points correlated with events, he could make certain conclusions. The most common data points could be logically concluded to represent the most common position (stationary); other clues identified the direction of pressure relief—front, back, left, or right.
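The rule-of-thumb classification described above can be sketched in a few lines. Everything here is hypothetical: the region names, the baseline comparison, and the 50% relief threshold are illustrative assumptions, not the project's actual method.

```python
# A minimal sketch of rule-based posture classification from seat-pad
# pressure data. Regions, threshold, and labels are hypothetical.

def classify_posture(readings, baseline, relief_threshold=0.5):
    """Label one frame of regional pressure sums.

    readings / baseline: dicts mapping a region name ("front", "back",
    "left", "right") to total pressure measured in that region.
    """
    # Fraction of baseline pressure remaining in each region.
    ratios = {r: readings[r] / baseline[r] for r in baseline}

    if all(ratio < relief_threshold for ratio in ratios.values()):
        return "full relief"            # weight lifted everywhere
    relieved = [r for r, ratio in ratios.items() if ratio < relief_threshold]
    if relieved:
        # Name the direction(s) in which pressure was relieved.
        return "partial relief: " + ", ".join(sorted(relieved))
    return "stationary"                 # the most common state
```

The point is the same one Sean relied on: no medical model is needed, only a comparison of each frame against the dominant (stationary) pattern.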
The same data skills, applied to the polymer project, allowed him to correlate a polymer’s suitability for data storage to the amount of light exposure. Almost any domain—medical, financial, retail, education, communications networks—can benefit from looking at data patterns, which can often reveal unsuspected connections between variables in the data.
As an undergraduate he was most drawn to network classification where the data sets are larger, and a lot more is going on, both good and bad. His specific interest is network forensics, which deals with the capture, recording, and analysis of network events to track the source of security attacks. Here there’s an added level of complexity because data is often intentionally mislabeled by hackers, phishers, and others seeking to obscure the source, purpose, and type of data.
Patterns, though, are harder to obscure. Everything leaves clues, especially in highly interconnected networks where an activity in one place disrupts something else, breaking the expected pattern. Finding the anomalies that announce something new, something out of the ordinary, then tracking the “why”—that’s the fun part for Sean.
But there are difficult technical issues. Algorithms and methodologies have to be able to scale; new computer methods, such as parallel processing, are needed to speed processing; more data storage and more efficient data management are also needed. Everything becomes more difficult with very large data sets. In the pressure ulcer project, the specific goal was to determine whether eight sensors, rather than the original 256, could provide sufficient data. Less data, besides requiring less storage, would also require less calibration. But less data is also less robust, requiring Sean to revise the existing methodology and algorithm to account for the less robust data.
Why does AT&T care about the hottest app? . . . Because it could impact the network.
The unique combination of skills required for massive data sets is not widely taught at universities. But these skills are increasingly in demand as datasets become ever larger.
And they were essential for his summer project at AT&T Research, where Sean worked with mentors Alexandre Gerber, Zihui Ge, and Jeffrey Erman to classify AT&T’s mobility data for the purpose of detecting and tracking the fastest growing apps.
Why does AT&T care about the hottest app?
Because it could impact the network. Some apps, particularly those that stream video and music, use a lot of bandwidth. And when these apps hit the upward trajectory of a blockbuster (PANDORA® and Netflix® come to mind), the network could be in trouble. Identifying the next big app before it hits the top 10 gives network managers time to be proactive, sometimes working with developers to make the apps more energy-efficient (see “A Call for More Energy-Efficient Apps”). But first the apps must be found.
It’s not as easy as it sounds.
The sheer volume of network data obscured all but the biggest apps. And there were over 160,000 apps of various types and genres, all exhibiting a wide variety of patterns that weren’t yet well understood.
Researchers had been collecting app data for several months, analyzing it mostly on an ad-hoc basis to see general trends and look at particular apps. What was needed was a systematic, general procedure for consistently tracking app data. This became Sean’s summer project, and to do it he was given a set of existing tools created at AT&T Research, including a trend detection tool (based on linear regression) and an anomaly detector. These tools, however, were general-purpose or had been developed for other data sets; they were not designed for datasets as large as this one. “Tweaking” would be needed.
The data was given to Sean as a series of files, one for each day. All apps were lumped together, along with the normal noise that results from temporary outages, system maintenance, or overloaded sensors. He first had to separate out the 160,000 apps (identified by an ID in the HTTP header) and then normalize the data to remove the noise. The next step was reformatting the data into a continuous time series appropriate for the trending tool. It was a lot of prep work, but within a couple of weeks (there were, after all, 160,000 apps), patterns started to emerge in the time series data. And it was these patterns that had to be fully understood.
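The prep steps above can be sketched roughly as follows. The record layout is an assumption (daily records already parsed into `(app_id, byte_count)` pairs), and the noise-normalization step is omitted; the sketch shows only the split-per-app and continuous-time-series reformatting.

```python
# Sketch: turn daily per-app records into one continuous, gap-free
# daily time series per app. Field names and shapes are hypothetical.
from collections import defaultdict
from datetime import date, timedelta

def build_time_series(daily_records, start, end):
    """daily_records: {date: [(app_id, byte_count), ...]}"""
    per_app = defaultdict(lambda: defaultdict(int))
    for day, rows in daily_records.items():
        for app_id, nbytes in rows:
            per_app[app_id][day] += nbytes   # sum repeats within a day

    # Reformat each app's records as one continuous series,
    # filling days with no traffic as 0.
    ndays = (end - start).days + 1
    days = [start + timedelta(days=i) for i in range(ndays)]
    return {app: [counts[day] for day in days]
            for app, counts in per_app.items()}
```

A continuous, evenly spaced series like this is what a linear-regression trending tool expects as input.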
Some of the patterns found. Left: an upgrade of an existing app with new, bandwidth-intensive features resulted in more bytes per subscriber. Right: news apps on the day of the bin Laden raid exhibited a spike in both bytes and hits.
Different types of apps—gaming, news, streaming (music and video), and seasonal—are used differently on the network and their patterns reflect that. Some apps are used primarily during the week, others mostly on the weekend. Some are used only at certain times of the year. The Wimbledon app seen at the beginning of summer exhibited a pattern suggestive of a blockbuster except that its usage had a built-in shelf date. Some genres consume much more bandwidth than others. News apps, which mostly transfer text and are used only intermittently, make fewer demands on the network than do video- and music-streaming apps.
Characterizing the time series data by genre would help researchers better understand how apps are used on the network and help predict how a never-before-seen app might behave.
Another project goal was to look at the long- and short-term trends. Sean started with a seven-day interval, but quickly realized that the stronger-than-expected weekly patterns of some apps would be completely missed in so short a time frame. An interval of 30 days would pick up the weekly patterns but could miss temporary but potentially interesting dips or spikes. Very long intervals (90 days) would pick up the long-term trends but wouldn’t give enough warning to catch a fast-growing app before it hit the top 10. Each interval captures and misses some information, so there was nothing to do but look at each app using multiple time frames, even though it meant tweaking for each interval.
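The multiple-time-frame idea can be sketched simply: compute the trend of the same series over several trailing windows and compare. The window lengths come from the text; the plain least-squares slope here stands in for the actual trending tool.

```python
# Sketch: look at the same app over several trailing windows (7, 30,
# and 90 days, per the project), since each interval reveals different
# structure. Uses an ordinary least-squares slope as a stand-in.
def windowed_slopes(series, windows=(7, 30, 90)):
    def slope(ys):
        n = len(ys)
        xm, ym = (n - 1) / 2, sum(ys) / n
        num = sum((x - xm) * (y - ym) for x, y in enumerate(ys))
        den = sum((x - xm) ** 2 for x in range(n))
        return num / den if den else 0.0
    # One slope per window that fits inside the series.
    return {w: slope(series[-w:]) for w in windows if len(series) >= w}
```

Comparing the three slopes side by side shows whether a recent surge is genuine growth or just one phase of a weekly cycle.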
Looking at time series data at various intervals revealed different aspects of the data. Here a golf app is seen over 7, 30, and 90 days. A 7-day interval was too short to overcome the app’s strong weekly cycle.
An important feature to capture was obviously the direction and steepness of each app’s slope, information the trending tool could provide: it would not only calculate the slopes of the apps but also rank them according to steepness.
The tool, being general purpose, first had to be adjusted for this new data set. One critical decision to be made was how closely to fit the slope to the data. Sean needed to capture the general trend by including as many data points as possible while ignoring outliers. Making one formula work over many apps exhibiting a wide variability in usage was a slow, iterative process.
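One common way to fit a trend while ignoring outliers is iterative re-fitting: fit a least-squares line, drop points whose residual is unusually large, and fit again. The sketch below shows that general technique; the actual tool's fitting method and thresholds are not described in the article, so the parameters here are illustrative.

```python
# Sketch of outlier-tolerant trend fitting: ordinary least squares,
# then refit after dropping points with large residuals. The
# outlier_sigma and rounds values are illustrative assumptions.
def fit_slope(series, outlier_sigma=2.5, rounds=2):
    """Slope of y over x = 0..n-1, ignoring extreme outliers."""
    pts = list(enumerate(series))
    slope = intercept = 0.0
    for _ in range(rounds):
        n = len(pts)
        sx = sum(x for x, _ in pts)
        sy = sum(y for _, y in pts)
        sxx = sum(x * x for x, _ in pts)
        sxy = sum(x * y for x, y in pts)
        denom = n * sxx - sx * sx
        if denom == 0:
            break
        slope = (n * sxy - sx * sy) / denom
        intercept = (sy - slope * sx) / n
        # Drop points that sit far from the fitted line, then refit.
        resid = [(x, y, y - (slope * x + intercept)) for x, y in pts]
        sd = (sum(r * r for _, _, r in resid) / n) ** 0.5
        if sd == 0:
            break                      # perfect fit already
        pts = [(x, y) for x, y, r in resid if abs(r) <= outlier_sigma * sd]
    return slope
```

Making one such formula work across thousands of apps with very different usage shapes is exactly the slow, iterative tuning the text describes.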
Before putting the data through the trending tool, he narrowed the search. Using heuristics and very conservative thresholds for the number of visits (1,000), bytes (100 MB), and unique subscribers (a mere 100), Sean weeded out the smallest apps, leaving a more manageable 10,000.
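That pre-filter amounts to keeping only apps above every threshold. The thresholds below are the ones stated in the text; the per-app record layout and field names are assumptions.

```python
# The conservative pre-filter described above, using the stated
# thresholds. Field names are hypothetical.
MIN_VISITS = 1000
MIN_BYTES = 100 * 2**20          # 100 MB (binary MB assumed)
MIN_SUBSCRIBERS = 100

def keep(stats):
    """stats: dict of totals for one app over the study period."""
    return (stats["visits"] >= MIN_VISITS
            and stats["bytes"] >= MIN_BYTES
            and stats["subscribers"] >= MIN_SUBSCRIBERS)

def prefilter(all_apps):
    # Keep only apps that clear every threshold.
    return {app: s for app, s in all_apps.items() if keep(s)}
```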
The trending tool produced both a relative slope—important for the smaller (or not-yet large) apps—and an absolute slope more biased toward the largest apps, where even a 1% increase in the number of users could substantially increase bandwidth demand. Each slope type was calculated for the number of bytes, unique users, and hits.
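One plausible reading of the two slope types: the absolute slope is the raw fitted slope (units per day), which naturally favors the largest apps, while the relative slope normalizes by the series' average level, so a small but fast-growing app also ranks highly. The exact definitions the tool used are not given, so treat these as illustrative.

```python
# Sketch of "absolute" vs "relative" slope for one metric's series.
def absolute_slope(series):
    """Ordinary least-squares slope over x = 0..n-1."""
    n = len(series)
    xm = (n - 1) / 2
    ym = sum(series) / n
    num = sum((x - xm) * (y - ym) for x, y in enumerate(series))
    den = sum((x - xm) ** 2 for x in range(n))
    return num / den

def relative_slope(series):
    """Slope normalized by the series' mean level (growth rate)."""
    mean = sum(series) / len(series)
    return absolute_slope(series) / mean if mean else 0.0
```

As in the project, each definition would be applied separately to the bytes, unique-user, and hit counts.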
The general-purpose trending tool fits a linear path that intersects most data points while ignoring outliers. The time frame can heavily influence the slope.
Five weeks into the project, Sean finally had the information he sought: a ranking of the fastest-growing apps on AT&T’s mobility network for the summer of 2011, among them iHeartRadio, Sweet Talk®, and Instagram. (In the case of Instagram, outside validation of Sean’s approach came from a Washington Post article that reported on the app’s reaching the 150-million-photo mark, fully a month after Sean had first spotted the app’s steep rise.)
While he had fulfilled the stated goal—identifying the fastest growing apps—the secondary objective of data analysis is always to dig deeper and discover more about the data. In this case, what could app data say about what was going on in the network?
Anomalies can be particularly revealing. Being event-driven, they can signal something interesting in the network. It may be a one-time glitch (interesting in its own right), but it could also be evidence of an underlying problem. Or it may explain and clarify other observed events; maybe a spike in the traffic of a popular app explains a sudden spike in overall network traffic seen by network managers. One event informs and illuminates another.
To more carefully investigate the anomalies, Sean fed the time series data into an anomaly detector. Using two tools—one for identifying long-term trend patterns and one for finding anomalies that shed light on short-term events—reveals different aspects of the same data and helps ensure that an event missed in one will be found in the other.
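The article doesn't describe how the Research anomaly detector works, but a standard baseline approach illustrates the idea: flag any day whose value deviates sharply from its recent history. The window length and deviation threshold below are illustrative assumptions.

```python
# Minimal sketch of one standard anomaly-detection approach: flag days
# whose value is more than k standard deviations from the mean of the
# trailing window. Window and k are hypothetical parameters.
def find_anomalies(series, window=7, k=3.0):
    flagged = []
    for i in range(window, len(series)):
        past = series[i - window:i]
        mean = sum(past) / window
        sd = (sum((v - mean) ** 2 for v in past) / window) ** 0.5
        # With a perfectly flat history (sd == 0), any change is anomalous.
        if abs(series[i] - mean) > k * sd + 1e-12:
            flagged.append(i)          # index of the anomalous day
    return flagged
```

Run over each app's time series, a detector like this surfaces the one-time glitches and event-driven spikes the text describes, complementing the long-term trend view.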
At this point, Sean had a lot of information on 10,000 separate apps: absolute and relative slopes for each of three metrics (number of bytes, users, hits), as well as the outliers and anomalies. Part of any good data analysis is making the information usable and readable to others, and Sean spent his last weeks creating a web-based portal (again using tools created in Research) that lets researchers simply type in an app name and instantly see it graphed for a selected metric.
A web portal created by Sean enables researchers to easily pull up a variety of stats for different apps.
This summer, Sean worked with a set of general tools, seeing firsthand how much labor it took to prepare the data and then figure out the right parameters. As a computer engineer, he knows it’s more computationally efficient to combine the separate processes into a single tool: fewer files, less normalization, and less parameter-tweaking. An integrated tool would also help analysts if, instead of presenting multiple separate slopes, it compared the various computed slopes within a single visualization, with the time intervals and anomalies labeled.
His mentors likewise realize this and, with the benefits of the project apparent, plan to keep building on the work done by Sean and further streamline the process.
But it could be that Sean himself builds such a tool. It’s ambitious, especially for someone moving from the hardware-based world of computer engineering to the analysis- and data-centric world of computer science and statistics, but working this summer side by side with computer scientists has given him a leg up on understanding the data analysis process, from the raw data stage up to what makes a good visualization.
He entered the University of North Carolina this fall to begin a PhD program in computer science.
Patterns tell a story
Similar types of apps often exhibit similar patterns. The pattern of a new, unknown app can help characterize it.
Q: If not computer science or machine learning, what else?
A: Finance or biomedicine.

Q: Role models in life.
A: My Dad and Gary May.

Q: Heroes from history.

Q: Academia, the business world, or somewhere else?

Q: What motivates you?
A: Sibling rivalry with my brother.

Q: What single course helped decide your future study?
A: Decided through undergraduate research.

Q: Most fun course?
A: Embedded Systems Design.

Q: Course you most regret not taking?