The Data-Driven Approach to Network Management: Innovation Delivered
This is the first in a series of articles on the scientific and engineering challenges of managing large IP networks, and the innovative solutions being studied and put in place by AT&T Research.
The job of network management gets bigger each year. Internet traffic increases approximately 50% per year. More and more applications—gaming, stock-trading, teleconferencing, and even 911—are migrating to the Internet from dedicated networks, while network providers (AT&T included) are building and integrating services such as VoIP and IPTV right into the Internet. Mobility, the fastest-growing segment of IP traffic by far, adds a whole other layer of complexity.
Supporting these services and applications (and their features) is a whole host of devices—web servers, communications servers, media gateways, firewalls, multicast servers, and name and directory servers—each with its own protocols, interfaces, formats, and parameters. These devices and the information they produce add immeasurable complexity to network management tasks.
And as cloud computing becomes irresistible (“pay only for what you need”), more companies will access server and computing resources via the Internet, taking the Internet steps closer to a computing environment and making it even more critical to business and the economy.
The Internet wasn’t originally designed for any of this. Once, it was simply a way to exchange email and files among a few trusted hosts. Application reliability meant that any packets lost or delayed by a network glitch (a short-duration service interruption or degradation) would be retransmitted by the application. The price was a small delay, not usually noticed for files and emails.
Reliability is judged by the user experience.
But delay is disruptive to real-time applications such as voice and video. Since any resent packets would be obsolete on arrival, many applications don’t re-transmit lost or delayed packets, and depend increasingly on the network being ultra-reliable and providing seamless performance. This places the onus on network providers to minimize even the smallest of glitches that can interfere with services.
Reliability is judged by the user experience. If pixels are missing from a video image, if voices are breaking up, or if gaming instructions don’t execute (or don’t execute fast enough), it’s perceived as a network problem. Whether or not the problem is actually with the network itself (or with application devices or software) is almost beside the point. The network provider is now also a service provider, responsible for tracking down the problem and, if necessary, working with vendors or other third parties to fix problems large or small.
Services and applications over the Internet and their attendant devices and interfaces add immeasurable complexity to the network management task. There are thousands of devices from many different vendors, all with slightly different configuration requirements, and all reporting a crushing number of events and statistics in different ways. Software is becoming increasingly complex. New applications come online all the time, each contributing new reporting data and metrics to a network awash in information. All these devices, new and old, interact in myriad ways that can be very hard to predict.
Packets still get lost, but the reasons are harder to decipher. Is it an old-time fiber cut, or maybe a parameter conflict between two mega-routers, a timeout set too fast to open a new session, or somebody’s set-top box crashing when encountering channels with a certain bit rate setting? The sheer number of events and factors to consider is almost beyond human capability.
With each event generating alarms at multiple layers and on multiple devices, it’s hard to even know which alarms to react to. Which is the precipitating event? Which is merely a symptom? Is a nearby alarm related or just a coincidence?
At AT&T, network management is the responsibility of the Network Operations (Ops) group. This group has overall responsibility for monitoring network health, upgrading software and hardware, performing repairs, configuring devices, and fixing problems as they arise. It’s a tremendous job, but Ops gets support from other departments such as AT&T Labs and IT (Information Technology) that provide tools and other technological capabilities. The result is that Ops can handle and address everyday problems such as fiber cuts and outages so efficiently that the size of the Ops group has remained constant even as Internet scale and complexity have exploded.
Fiber cuts, hardware failures, and other severe, persistent error conditions require immediate investigation and intervention, so it’s natural for traditional network management systems and operations groups to focus on such events. The danger is that this intense focus allows smaller, more transitory glitches, such as short bursts of packet loss, short delay spikes, or network protocol flaps (see sidebar), to fly under Ops’ radar, even though they too impact service. Such glitches by their very nature disappear quickly, making it hard for operators to track them or learn how they affect the user experience. As more services come online, particularly real-time applications that are sensitive to the smallest and most intermittent of glitches, understanding how network performance impacts different services is increasingly important. If underlying issues go unnoticed, opportunities for improving service quality may be missed.
To prevent missed opportunities and to better understand the relationship between service performance and network issues, Ops has forged a close working relationship with AT&T Research, where scientists and engineers investigate fundamental scientific, mathematical, and engineering questions touching on information technology and communications, the business of AT&T. The idea is that Research can use its knowledge of data mining and sophisticated statistical analysis and apply it to the challenges of network management.
Like two sets of detectives at a crime scene . . . Ops examines clues at the scene. . . and Research compares them with past crimes
Working together, Ops and Research brainstorm possible technological solutions to address generic and recurring network issues. When seemingly mind-boggling network incidents occur, Ops will lead the post-mortem, but Research may get involved. Like two sets of detectives at a crime scene, Ops and Research bring different approaches: Ops examines clues at the scene to reconstruct what happened, while Research also looks at the clues but hurries back to the office computer to compare the clue set with sets from past crimes, hoping the pattern of similarities will point to a culprit. Ops has responsibility to resolve issues in the network, but Research will often take the “lessons learned” from these incidents to create innovative new techniques for detecting, isolating, and managing such conditions in the future.
Research itself collaborates with others, both within AT&T (Labs, IT) and with the academic community. Graduate-level students work side by side with researchers, and are often the leads on creating innovative new prototypes or detailed data analyses that deal with challenging network-related issues.
The long-term goal of these efforts is to minimize or even permanently eliminate glitches and other hard-to-track problems to make the network ultra-reliable and capable of providing services with consistent, high-quality performance.
The key to a data-driven approach is, of course, data. To collect data, and lots of it, Research created the Darkstar project to bring together information from across the network, including log files, SNMP counters, end-to-end service performance measures, network and customer tickets, and other sources. Data is collated across multiple networks and technologies, including IP/MPLS, mobility, IPTV, Ethernet access, intelligent optical network elements, and layer one (“transport”) systems. Data is streamed into the system in real time and stored for historical analyses (typically with at least one year’s worth of history). To host the data, Darkstar uses the Daytona database (another Research project), which effectively handles the tremendous scale of the data, being constrained purely by the physical storage itself.
Giving researchers access to massive amounts of real network data . . . increases the chance that what they learn will benefit network management.
Darkstar thus serves as a comprehensive, one-stop resource that consolidates information from all network data sources and makes the data easily available to those who need it. Operators, for example, can get a complete understanding of network conditions or events from a single data warehouse. This is a far cry from the traditional time-consuming approach, which required operators to compile data manually from many different devices and network management systems (silos), often across different vendors, each reporting different statistics, from different time zones, and at varying intervals: daily configuration reports, for example, but SNMP data (packet counts, etc.) reported every five minutes. Operators even had to take into account that the same device might be referenced in different ways by different systems or network layers (by a circuit identifier, an IP address, or an interface name).
Darkstar obliterates these silos, pulling all the data together and normalizing it so that it can be readily correlated. The normalization across naming conventions, time zones, and identifiers is performed as data is ingested into the Darkstar framework. This eliminates the need for the operator, data miner, or researcher to be painfully aware of the original data source details and mechanisms required to translate to common conventions and time zones. Correlations across multiple data sources can now be achieved in a matter of seconds, instead of hours or days.
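As a rough illustration of this kind of ingest-time normalization (the device names, offsets, and alias table below are hypothetical, not Darkstar internals), the idea is to shift every timestamp to a common time zone and map every source-specific identifier to one canonical device name:

```python
from datetime import datetime, timezone, timedelta

# Hypothetical alias map: the same router may appear in different feeds
# as an IP address, an interface name, or a circuit identifier.
CANONICAL_NAME = {
    "192.0.2.1": "router-nyc-01",
    "ge-0/0/1.router-nyc-01": "router-nyc-01",
    "CKT-000123": "router-nyc-01",
}

def normalize_record(raw_ts, utc_offset_hours, device_ref, event):
    """Normalize one raw event: shift its local timestamp to UTC and
    replace the source-specific device reference with a canonical name."""
    local = datetime.strptime(raw_ts, "%Y-%m-%d %H:%M:%S")
    utc = (local - timedelta(hours=utc_offset_hours)).replace(tzinfo=timezone.utc)
    return {
        "ts_utc": utc.isoformat(),
        "device": CANONICAL_NAME.get(device_ref, device_ref),
        "event": event,
    }

# A record logged at 09:30 local time in a UTC-5 source feed:
rec = normalize_record("2010-06-01 09:30:00", -5, "CKT-000123", "link down")
print(rec["device"], rec["ts_utc"])  # → router-nyc-01 2010-06-01T14:30:00+00:00
```

Once every record carries the same clock and the same names, correlating events across feeds reduces to ordinary joins and sorts.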
The data also has an altogether other purpose: to attract the interest of researchers who need data to test their ideas or see how their algorithms work in the real world. Giving researchers access to massive amounts of real network data so they can “play around” increases the chance that what they learn will benefit network management. The more people who look at the data, the better.
Automatically managing hundreds of database tables being updated in real-time required a whole new technology, DataDepot. Created by the AT&T database Research group, DataDepot gives a sense of order to the data by organizing the information into tables in a way that makes sense for network management tasks. Errors go into one table for easy tracking. Other tables contain link speeds, link utilization rates, network performance measurements and other information obtained by monitoring network health. As data is fed in, DataDepot tracks where the data comes from, what tables it needs to populate, and what calculations need to be done (for example, joining across multiple tables to identify the router interfaces associated with lower layer network events).
With the data organized, normalized, and aligned in time, tools are easier to build.
All this (receiving data, normalizing it, verifying it, tracking which data sources populate each cell) is done in real-time on 1,006,498 new rows per day (and growing). Data is almost instantly available and usable to applications, researchers, and network operators.
The groundwork of planning and building Darkstar took years, but the result is an easily accessible data resource that can be used for the long term in many different ways by a wide variety of people.
Darkstar also serves as a platform for easily creating innovative prototype tools designed to have fundamental impact on how networks are managed. New tools are considerably simpler to prototype and experiment with since they don’t themselves carry the overhead of obtaining access to the data sources and then normalizing the data to make it comparable. By way of example, two simple yet extremely powerful tools were rapidly designed and prototyped to help Ops troubleshoot network and service events: RouterMiner, which instantly retrieves diverse information related to network routers of interest, and PathMiner, which displays all events (and information about devices) along a path between two devices.
Prior to Darkstar, troubleshooting a network event often required pulling and manually reviewing router logs (which can record thousands of transactions per second), router and service performance measurements, and a range of other data. These sources would all have been obtained from different tools, websites, and servers, or even through active tests on the live network (e.g., running ping or traceroute to determine a likely route through the network). It could take days to collect the information if multiple groups were involved (e.g., if the IP group called on the layer one Ops team for relevant data), separate the relevant information from the irrelevant (is an alarm on another device related to this problem or another?), and create a plausible time sequence of events amidst the different time stamps, the multiple device names, and the cascade of alarms.
RouterMiner and PathMiner revolutionize this process. Built atop the Darkstar warehouse, they were readily prototyped, automating the data collection, correlation, and chronological ordering of events. But the benefit is anything but trivial: up to weeks of manual work reduced to as little as a few tens of seconds.
Investigating potential service issues across the network now requires only that Ops open the PathMiner web page, enter a time frame and the beginning and end devices. Using routing events collected from the network, PathMiner automatically infers the route(s) that traffic would have taken during the time interval of interest, and collates all relevant events across those routes. Within seconds, Ops has all the data available for isolating the issue along with a chain of evidence (with all events filtered and arranged in order of occurrence) and a more unified picture than is possible by looking at devices or individual data sources in isolation.
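The collation step itself is conceptually simple once the data is normalized. A minimal sketch (device names and event records are invented for illustration; the real PathMiner also infers the route from routing data, which is elided here):

```python
def collate_path_events(path, event_log, t_start, t_end):
    """Gather every event recorded on devices along the inferred path
    during [t_start, t_end] and return them in chronological order."""
    hops = set(path)
    relevant = [e for e in event_log
                if e["device"] in hops and t_start <= e["ts"] <= t_end]
    return sorted(relevant, key=lambda e: e["ts"])

# Hypothetical data: a three-hop route and a mixed event log.
path = ["r1", "r2", "r3"]
log = [
    {"ts": 105, "device": "r2", "event": "link flap"},
    {"ts": 101, "device": "r1", "event": "BGP reset"},
    {"ts": 150, "device": "r9", "event": "fan alarm"},  # off-path: filtered out
    {"ts": 120, "device": "r3", "event": "packet loss spike"},
]
timeline = collate_path_events(path, log, 100, 130)
print([e["event"] for e in timeline])
# → ['BGP reset', 'link flap', 'packet loss spike']
```

The hard part, in other words, is not the filtering and sorting but having all the events in one warehouse, on one clock, under one naming scheme.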
Data stored in Darkstar is a huge dataset containing clues to one-off glitches, silent failures, and other hard-to-detect problems.
The Information Visualization research group has taken the same normalized, time-aligned Darkstar information and interpreted it in a map form, with clickable links for drilling down for greater detail. A simple glance is often enough to know whether three links reporting problems are related.
This is just one example of how visual representations made from Darkstar data can help Ops quickly grasp high-level summaries of the data while interactively exploring details in context. Other experiments in visualizations might explore connections or linkages between different types of network events. A wealth of new and exciting opportunities abound with the network data at the visualization team’s fingertips.
Tools built on the Darkstar platform vastly reduce the time it takes to troubleshoot individual network issues. But the more significant and longer-lasting benefits will come from exploratory data mining on Darkstar data. Such analyses are critical to understanding the underlying relationship between network behavior and service impairments, from barely noticeable pixelations in IPTV picture quality up through network failures that impact customers for extended periods. Such analyses will answer questions too difficult, or requiring too much information, to answer using traditional means: What issues may be occurring in my network that I am not even aware of? How does network performance relate to service performance (e.g., for VoIP, IPTV, and teleconferencing)? How does performance change in my network with changes in network software and configuration?
Knowing how individual events impact services allows Ops to get ahead of the problems and address them even before customers notice, a moving goal line as customers both learn to better monitor their networks and continue to drive increasingly sensitive applications onto IP infrastructures.
Exploratory data mining is critical to delving into the next level of detail required for continued service improvement. Such analysis can automatically identify causes of performance impairments, eliminating the need to execute painful collation of domain knowledge from across many experts, while also revealing previously unknown failure modes that may be readily driven out of the network once understood (e.g., those caused by software bugs).
Exploratory data mining is dramatically broadening the types of events and network permutations that Ops can track, especially the one-off glitches that come and go so quickly it’s nearly impossible for Ops to get a good description or suspect the actual causes. If an event is truly a one-off, relatively minor glitch, there’s little point in using up valuable resources to investigate it. However, if other similar events can be found, the event can be compared with those instances to better understand the commonalities, and thus the causes. Looking at each individual packet loss or protocol flap manually is not practical when there are simply too many of these small events. But without detailed and scalable analysis, underlying issues may go unnoticed, and opportunities for improving service quality may sadly be missed.
Because Darkstar already aggregates vast amounts of data, it was relatively easy for data miners in Research to build tools to detect and analyze correlations across the network to automatically learn about the complex behaviors that exist in large-scale, operational networks.
One such tool, NICE (Network-Wide Information Correlation and Exploration), searches Darkstar data for all instances of “symptom” events of interest, such as service degradations (video glitches, slow webpage downloads, dropped calls), packet losses, and protocol flaps. It then automatically learns which other types of events may be related to the symptom events, testing statistical correlations against all sorts of other network logs to determine what other events consistently (in a statistically significant manner) co-occur with the symptom event within the same local topology (NICE’s benchmarks for establishing correlation).
In analyzing even the smallest packet losses, NICE can look for other nearby events (in the same local topology) that consistently occur immediately before or after the packet drops. If co-occurrences keep happening and are statistically correlated, it likely reveals some sort of causal or impact relationship. What seemed like random events, when correlated with other instances, may prove to have identifiable (and even preventable) causes. Some events may be well known root causes, while others may be coming from left field.
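A toy version of such a co-occurrence test (a simplification for illustration only; NICE’s actual statistics and topology filtering are more sophisticated) bins time into windows, marks which windows contain each event type, and scores how far the observed joint count departs from what independence would predict:

```python
import math

def cooccurrence_score(symptom_windows, candidate_windows, n_windows):
    """Crude z-like co-occurrence score: compare the observed number of
    windows containing BOTH event types against the count expected if
    the two event types were independent."""
    s = set(symptom_windows)
    c = set(candidate_windows)
    observed = len(s & c)
    p_joint = (len(s) / n_windows) * (len(c) / n_windows)
    expected = p_joint * n_windows
    sd = math.sqrt(n_windows * p_joint * (1 - p_joint)) or 1.0
    return (observed - expected) / sd  # large positive ⇒ correlated

# Hypothetical series over 1000 windows: the candidate event fires in
# every window where the symptom does, plus a little noise.
symptom = list(range(0, 1000, 50))             # 20 windows
candidate = list(range(0, 1000, 50)) + [3, 7]  # same 20 windows + 2 extra
print(cooccurrence_score(symptom, candidate, 1000))  # far above noise level
```

A high score flags the candidate event type for closer investigation; it establishes consistent co-occurrence, while deciding which event is cause and which is effect (or whether both share a common cause) still takes follow-up analysis.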
Using NICE, researchers discovered that lower-level events were interfering with actions at the upper layers, something that was not planned for in the protocols and shouldn’t be happening. But once the data was collected and correlated (and further investigated using RouterMiner and PathMiner), the causal relationship was undeniable, and Ops worked closely with the vendor to drive this particular problem not only from AT&T’s network, but (through router software fixes) from everybody else’s network as well.
NICE aids Ops in understanding the potential root causes and impacts of specific types of conditions. However, it is also critical to understand how network maintenance activities impact network and service performance. As new software is introduced across the network during router software upgrades, network performance metrics must be tracked closely to determine the impact in the operational network. Despite extensive lab testing before deployment, intrinsic bugs and latent issues may be revealed only in the scale of the vast and harsh environment of the operational network. Ops needs to search wholesale for unexpected and undefined problems. Whenever routers are upgraded, Ops must both verify that the expected changes occurred (and performance improved) and locate unexpected changes that may be negatively impacting customers or device health. Achieving this manually is unrealistic due to the vast volume of data that may (or may not) include symptoms, and the potential subtlety of the trend changes.
Instead, Research created the Mercury tool to examine network metrics (e.g., CPU, memory, router syslogs) before and after maintenance. But it’s far from a simple line-by-line reading of the files. Router logs in particular require a tremendous amount of pre-processing (to remove interface names and IP addresses, for example) so that the text messages can be sorted and aggregated by type across a router. This pre-processing and aggregation needs to be performed without a detailed understanding of the format of each type of “free text” syslog message; otherwise the analysis will not scale to the massive range of syslog messages logged. Advanced statistical techniques are then used to determine statistically relevant and persistent changes in the network metrics reported for each network router: the rate of different types of syslog messages (e.g., logs indicating link flaps), and the average CPU/memory utilization.
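In spirit, the pre-processing step scrubs the variable fields out of each message so that messages of the same type collapse into one template that can be counted. A minimal sketch (the regexes and log lines are illustrative guesses, not Mercury’s actual rules):

```python
import re
from collections import Counter

def template(msg):
    """Reduce a free-text syslog line to a template by scrubbing the
    variable fields (IP addresses, interface names, bare numbers)."""
    msg = re.sub(r"\b\d{1,3}(\.\d{1,3}){3}\b", "<ip>", msg)         # IPv4 addresses
    msg = re.sub(r"\b(?:ge|xe|so|eth)-?[\d/.:]+\b", "<intf>", msg)  # interface names
    msg = re.sub(r"\b\d+\b", "<n>", msg)                            # remaining numbers
    return msg

logs = [
    "LINK-3-UPDOWN: Interface ge-0/0/1, changed state to down",
    "LINK-3-UPDOWN: Interface ge-0/2/3, changed state to down",
    "BGP-5-ADJCHANGE: neighbor 192.0.2.1 Down",
]
counts = Counter(template(m) for m in logs)
print(counts.most_common(1)[0])
# → ('LINK-<n>-UPDOWN: Interface <intf>, changed state to down', 2)
```

With messages reduced to templates, the per-type rates before and after an upgrade become ordinary time series that statistical change detection can operate on.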
Even more challenging is detecting changes when events are rare; performance changes may be “in the noise” on individual routers and only perceptible when aggregated across multiple routers of a common type experiencing the same network maintenance (e.g., software upgrade). Aggregating across such routers detects issues that risk otherwise flying under the radar.
The staggered scheduling of router upgrades further complicates analysis. A few routers are upgraded initially to “test the waters.” As confidence builds in the new software version, deployment is scaled up, but it can still take weeks to roll out across large numbers of routers. By aligning all routers in time so that routers are upgraded at a nominal time “zero,” Mercury allows event counts to be aggregated across all routers even when they are upgraded at different times. Counts can then be compared before and after upgrades to identify performance changes that may only be visible in the aggregate.
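The time-alignment idea can be sketched in a few lines (the routers, counts, and upgrade days below are made up; Mercury’s real change detection is statistical, not a simple mean comparison):

```python
def aligned_rates(series_by_router, upgrade_day, window=3):
    """Align every router at its own upgrade day (a nominal time zero)
    and pool daily event counts across routers, so a shift that is 'in
    the noise' on any one router shows up in the aggregate."""
    before, after = [], []
    for router, daily_counts in series_by_router.items():
        t0 = upgrade_day[router]
        before += daily_counts[max(0, t0 - window):t0]
        after += daily_counts[t0:t0 + window]
    return sum(before) / len(before), sum(after) / len(after)

# Hypothetical staggered rollout: each router's daily flap count roughly
# doubles after its own upgrade day, which differs per router.
counts = {
    "r1": [2, 3, 2, 5, 4, 5, 4, 5],  # upgraded on day 3
    "r2": [3, 2, 3, 2, 3, 6, 5, 6],  # upgraded on day 5
}
upgrade_day = {"r1": 3, "r2": 5}
pre, post = aligned_rates(counts, upgrade_day, window=3)
print(pre, post)  # → 2.5 5.166666666666667 (rate roughly doubles post-upgrade)
```

Because each router contributes its own before/after windows relative to its own upgrade date, the staggered schedule stops being an obstacle and simply adds more samples to the aggregate.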
Mercury is being used by Ops to monitor at scale the performance impact of router upgrades across AT&T’s massive IP backbone. Thousands of unique metrics (distinct logs, performance measures) across thousands of routers are being continuously analyzed. Although a new capability, Mercury has already revealed interesting behaviors that are being further investigated by Ops and relevant network element vendors.
Writing rules to chase out problems
NICE and other statistical “learning” tools help reveal new issues by searching through vast amounts of data to identify the full set of characteristics surrounding a problem. Different tools are required to automatically identify the root cause of a given, individual service or network impairment, or to quantify the contribution of different root causes across large numbers of impairments. So Research created its own root cause tool, G-RCA (Generic Root Cause Analysis), which encodes expert rules obtained from network experts or automatically learned through NICE.
G-RCA’s innovation is not in the tool itself (root cause tools are by now commonplace) but in the scale of its use and the specific types of problems it’s used to analyze. G-RCA can search across hundreds of thousands of events or thousands of routers based on specific high-level events. G-RCA applications focus on analyzing events that fall outside the traditional fault and performance systems used by network operators: service impairments (e.g., video impairments), packet losses across a network, high network element CPU usage, or BGP and other types of flaps. Instead of being used purely for the individual event investigations that Ops specializes in, G-RCA can trend events and their associated root causes, showing how events are distributed across underlying causes. Presented as a pie chart, the results let operators immediately see what’s having the biggest impact on network performance. Such capabilities are leading Ops into new realms by facilitating the understanding of large numbers of small events, enabling Ops to drive service improvements by minimizing or even permanently eliminating failure modes.
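The rule-driven classification at G-RCA’s core can be pictured as a prioritized list of predicates, each mapping an event’s correlated evidence to a root-cause label, with the labeled events then tallied into the distribution behind the pie chart. A minimal sketch (the rules, labels, and event fields are invented for illustration, not G-RCA’s actual rule language):

```python
from collections import Counter

# Hypothetical rule set: each rule pairs a predicate over an event's
# correlated evidence with a root-cause label, in priority order.
RULES = [
    ("fiber cut", lambda ev: "los_alarm" in ev["evidence"]),
    ("planned maintenance", lambda ev: ev.get("in_maint_window", False)),
    ("protocol flap", lambda ev: "bgp_reset" in ev["evidence"]),
]

def root_cause(event):
    """Return the first matching rule's label, else 'unknown'."""
    for label, matches in RULES:
        if matches(event):
            return label
    return "unknown"

events = [
    {"evidence": ["los_alarm", "link_down"]},
    {"evidence": ["bgp_reset"]},
    {"evidence": [], "in_maint_window": True},
    {"evidence": ["bgp_reset"]},
]
distribution = Counter(root_cause(e) for e in events)
print(distribution)  # the per-cause counts that feed the pie chart
```

Trending these distributions over time, rather than chasing events one at a time, is what lets operators see which underlying cause is doing the most damage.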
Using G-RCA to analyze end-to-end packet loss over a single 10-day period
Adding new applications, and new rules to existing applications, is almost trivial with G-RCA. In less than a year, a small team of researchers prototyped applications that automatically identified the root causes of even the smallest end-to-end packet loss events across AT&T’s IP backbones, of protocol flaps (BGP, PIM) in the IP and IPTV backbones, and of service impairments in AT&T’s content distribution network. These applications are used by network Ops and customer service representatives to rapidly troubleshoot individual events (e.g., rapidly address customer concerns about individual BGP flaps) and to trend the rate of impairments and their underlying root causes in a bid to drive continued performance improvements.
Because G-RCA can scale across the network to identify and analyze events that previously fell outside traditional fault detection, Ops is being given almost for the first time an opportunity to fundamentally understand the complex relationship between network events and service performance, an understanding that will go far to permanently fixing service impairments.
Ever smaller events can and will be identified and potentially eliminated using G-RCA, increasing the reliability of the network to deliver applications and services with consistently high performance.
So by creating tools such as G-RCA, NICE, and others to come (tools capable of aggregating rare and transitory events and performing statistical correlations), AT&T Research is providing innovative techniques and tools to make the network ever more reliable.