
SQM Research: Reinventing Customer Experience

By Spencer Seidel, Tue May 19 11:17:00 EDT 2015

You’ve probably seen the ads. Two geeky guys traveling across the USA, making sure AT&T has the nation’s strongest LTE signal. Cute, right? Funny, too.

But have you ever really considered how all those cat videos get to your smartphone? How truly amazing it is that a little electronic gizmo can connect you to almost anyone in the world in an instant? Just how monstrous a task those two geeky guys have taken on?

The true story behind those ads is that keeping AT&T services operating at peak performance takes thousands of technicians, all supported by intelligent systems and processes. One group enabling this effort is the award-winning Service Quality Management (SQM) Research program at AT&T Labs. The team won the 2014 Industry Star Mobility Excellence Award for Innovations in Service Quality Management and the 2014 Broadband Infovision Award for Best Network Intelligence Innovation. They’re obsessed with creating a state-of-the-art network that delivers a seamless service experience for over one hundred million customers, and their cutting-edge innovations are revolutionizing the way large networks and services are managed.

 

Big Data

The term "big data" is thrown around a lot these days. Though the origin of the term is somewhat murky, in the last few years it has become trendy enough to include in job titles.

For the SQM Research team, big data is kind of the point. Telecommunications networks and the services that run on top of them transmit data. Lots of data. Consider the massive scale. Hundreds of thousands of network devices, including cell towers, routers, and interface cards, transmitting vast amounts of network and service data, 24 hours a day, 7 days a week, 365 days a year. In addition to shuttling packets containing the bits of your cat videos, all of those services and network elements produce information about their operation and performance — information like throughput, latency, CPU usage, free memory, transmission delays, errors, packet retransmission rates, log messages, and on and on.

This service performance and networking data is a technological goldmine for the SQM team.

 

Digging for Gold

At the heart of the SQM Research program is a shared-computing infrastructure: hundreds of high-powered multi-core servers running sophisticated data-management software built specifically for the highly complex task of collecting and organizing all manner of data from within AT&T’s network.

But how is it done when the size of the data in question is so staggeringly massive? The SQM team uses clever tricks to intelligently gather aggregate data from a number of collection points, including real-time database repositories, data centers, and other systems. And certainly part of the solution is knowing what data is interesting and what data is not, an acquired skill.

Says Jennifer Yates, AVP of the Networking and SQM Research program, “The idea was to become a kind of one-stop shop for gathering and collecting data for our team; creating a platform that was optimized to maximize our researchers’ ability to innovate. When you’ve got small sub-teams going off and building their own platforms and data repositories on their own servers, that’s a lot of duplicated effort. We wanted teams to innovate together.

“Our platform drives innovation by making it not only easier to explore new ideas but to build operational prototypes that are transforming how AT&T designs and operates networks and services.”

That collaboration has paid off. Big time.

 

SQM Impact: Transforming Network Operations

The SQM Research and AT&T Operations teams have together initiated a fundamental transformation in how Network Operations manages AT&T’s networks and services. The old paradigm was this: by monitoring individual network elements and reacting to faults and performance degradations on those elements, you keep the network in good health.

The trouble is that as telecommunication networks have grown to host an unprecedented amount of traffic in support of a huge number of services, the paradigm breaks down because of the sheer scale and complexity of the systems involved. Complicating matters, services today are enabled not only by the network, but also by sophisticated devices like smartphones and tablets, the data from which is carried over multiple networks, often across different service providers, and touches servers and complex applications at the end points.

Degradations in service experience can happen when any of these communication handoffs go wrong (e.g. a network outage) and also because of complex interactions between two components in the end-to-end path, such as an app interacting with a computing infrastructure entirely outside the realm of AT&T’s network.

The SQM team understood that in order to effectively manage complex services, they needed to flip the old paradigm on its head. What if network operators and engineers had a way to understand and quantify customer experience and use this information to move the focus away from individual network elements towards overall customer experience?

Ultimately, the team concluded, SQM means putting customer experience at the forefront when designing and operating networks that support a variety of services, especially when they’re as large and reliable as AT&T’s.

Says Mark Francis, Vice President of AT&T Network Planning and Operations, “The SQM Research team is inventing the technology needed to make customer experience the driving force behind managing such a massive network and set of complex services. Much of this technology simply didn’t exist before.”

Indeed. There is no off-the-shelf technology to solve many of these problems. The AT&T SQM Research team is inventing how it’s done.

The impact of such a transformation has been enormous and felt across AT&T, and more importantly, by AT&T’s customers.

So how do they do it?

The team breaks down the concept into several fundamental questions:

  • Service Monitoring: How to monitor end-to-end service experience at the massive scale of today’s networks to obtain accurate visibility at the service level across a tremendously diverse range of services?
  • Event Detection and Troubleshooting: Like finding a needle in a thousand haystacks, how to detect significant and actionable service-impacting events based on end-to-end service monitoring, where events are potentially impacting only a subset of the network locations, users, or even types of user devices? Then, how to help network engineers troubleshoot these anomalies to identify their location and underlying root cause?
  • Event Management Resource Prioritization: How to efficiently prioritize resources during an event (for example, an outage or service impairment) such that work is focused where it will have the greatest customer impact?
  • Planned Maintenance: How to ensure that high-volume day-to-day maintenance on the network (e.g. software rollouts, increasing network capacity, new service features) does not have a negative impact on the service?
  • Network Design: How to design networks in the first place that make all of these activities more efficient and more effective, while also minimizing service impact when failures occur? 

 

Enabling a New SQM Paradigm with Key Inventions

All of the following inventions are deployed and used extensively today in AT&T's Mobility network and are examples of how SQM Research innovations in cutting-edge analytics are changing the way Network Operations approaches the day-to-day business of managing networks and services.

 

Argus

The Argus platform takes its name from the 100-eyed, all-seeing giant of ancient Greek mythology. An advance in sophisticated anomaly detection, Argus watches service metrics over time across multiple dimensions, such as location, device type and mobile device operating-system versions. It detects impairments that affect service experience and alerts network operators so they can investigate and resolve problems quickly.
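To make the idea concrete, here is a minimal sketch of per-dimension anomaly detection. It is not Argus itself, whose algorithms are not described here. It assumes a simple rolling z-score, and the class name, parameters, and dimension keys are all hypothetical.

```python
from collections import defaultdict, deque
from statistics import mean, stdev


class DimensionalAnomalyDetector:
    """Hypothetical sketch: flag values that deviate sharply from recent
    history, tracked separately for each dimension key."""

    def __init__(self, window=24, threshold=3.0, min_history=5):
        self.threshold = threshold      # z-score needed to raise a flag
        self.min_history = min_history  # observations required before judging
        # One bounded history per key, e.g. (location, device type, OS version).
        self.history = defaultdict(lambda: deque(maxlen=window))

    def observe(self, key, value):
        """Record a value for key; return True if it looks anomalous
        compared with that key's earlier values."""
        past = self.history[key]
        anomalous = False
        if len(past) >= self.min_history:
            mu = mean(past)
            sigma = stdev(past)
            if sigma > 0:
                anomalous = abs(value - mu) / sigma > self.threshold
            else:
                # Perfectly flat history: any change at all is a deviation.
                anomalous = value != mu
        past.append(value)
        return anomalous


# Usage: a made-up success-rate metric for one location/device/OS slice.
detector = DimensionalAnomalyDetector()
key = ("Dallas", "phone-model-x", "os-8.1")
for rate in [0.99, 0.98, 0.99, 0.985, 0.99, 0.988]:
    detector.observe(key, rate)
print(detector.observe(key, 0.80))  # True: far outside recent history
```

Keeping a separate history per key is what lets a sketch like this notice a problem confined to one city or one device model, even when the network-wide average still looks healthy.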

 

MINT

Troubleshooting service issues is a complex business. Knowing there is a service problem isn’t enough. A service path (e.g. from a customer’s handset to the application server they are communicating with on the Internet) travels over many network elements across multiple networks and networking layers that are constantly changing over time.  Knowledge of these service paths is essential to support the automated correlation needed for the many issues that can be resolved without human intervention, as well as to support manual troubleshooting.

That’s where MINT comes in. Operators use MINT when service issues become too complex and require manual intervention.

The MINT platform makes sophisticated inferences from network-element configuration files and complex routing information to identify topology and service paths and then displays the information intelligently. Operators and engineers can then see at any point how a customer’s data travels across the network. In addition, they can drill down to identify various events that occurred along the way.
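As a rough illustration of what a service path looks like once a topology is known, the sketch below runs a breadth-first search over a small adjacency map. It is an assumption-laden toy, not how MINT works: real paths follow routing policy rather than fewest hops, and the function name and element names are invented.

```python
from collections import deque


def service_path(links, source, destination):
    """Hypothetical sketch: return a fewest-hops path from source to
    destination through an adjacency map, or None if none exists."""
    previous = {source: None}  # how each discovered node was first reached
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == destination:
            # Walk the back-pointers to rebuild the path, then reverse it.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in links.get(node, ()):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


# Usage: an invented topology from a handset to an application server.
links = {
    "handset": ["tower-1"],
    "tower-1": ["router-a", "router-b"],
    "router-a": ["gateway"],
    "router-b": ["router-a"],
    "gateway": ["app-server"],
}
print(service_path(links, "handset", "app-server"))
```

With a path in hand, events logged against each element along it can be lined up in order, which is the kind of drill-down described above.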

 

Tower Outage Network Analyzer (TONA)

TONA, a finalist in the 2013 Mobile Excellence Awards (MEAs) for Best Breakthrough Technology, was invented to untangle the complex relationship between customer experience and cell tower outages.  Understanding this relationship is essential to prioritizing responses to outages, which are inevitable in a network the size of AT&T’s. Before TONA, network operators used network metrics to estimate the size of an outage and prioritized responses accordingly. The combined SQM/Ops team discovered, however, that there is little relationship between the size of an outage and its impact on customers. This meant that engineers were not always focused on tackling critical problems first.

The most significant technical challenge in quantifying the service impact of tower outages is the natural resiliency of the network: if a cell tower goes down, mobile devices will simply find another nearby tower if the signal strength and capacity allow. This means that even in outages involving multiple cell sites, customers don’t necessarily experience a service problem. That is great for customers, but it makes customer impact harder to quantify. Using sophisticated algorithms invented by the SQM Research team and leveraging the vast amount of data warehoused by their computing infrastructure, TONA watches customer experience in real time and continually balances the impact of work being performed to make recommendations about where to work next in order to have the greatest beneficial impact on the largest number of customers.

Today, TONA is the basis of AT&T’s Global Networking Operations Center (GNOC) outage-management process. Its innovative analytics examine usage based on the time of day, overall customer experience, and population density, as well as a slew of real-time performance data in order to determine problem areas with the greatest customer impact. Its results are used to prioritize field work in real time in order to get the Mobility service experience healthy again in the most efficient way and with minimal customer impact.
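The difference between ranking by outage size and ranking by customer impact can be shown with a small sketch. TONA's actual algorithms are not described here, so the fields, the impact formula, and the numbers below are all hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    site_id: str
    sites_down: int           # raw size of the outage
    active_users: int         # users normally served at this time of day
    neighbor_coverage: float  # share of those users nearby towers can absorb (0 to 1)


def estimated_impact(outage):
    """Hypothetical estimate: users left without service once neighboring
    towers have picked up as many as they can."""
    return outage.active_users * (1.0 - outage.neighbor_coverage)


def prioritize(outages):
    """Order outages so the one hurting the most customers comes first."""
    return sorted(outages, key=estimated_impact, reverse=True)


# Usage: the larger outage is mostly absorbed by neighbors, so the
# single-site outage with poor neighbor coverage is ranked first.
outages = [
    Outage("A", sites_down=5, active_users=2000, neighbor_coverage=0.95),
    Outage("B", sites_down=1, active_users=1500, neighbor_coverage=0.10),
]
print([o.site_id for o in prioritize(outages)])  # ['B', 'A']
```

Sorting by `sites_down` would send crews to site A first. Sorting by estimated impact sends them to B, which is the kind of reordering described above.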

 

Mercury

As you might imagine, keeping such a massive network running smoothly means that technicians are constantly performing maintenance: upgrading software, updating network element configuration parameters, increasing capacity, and deploying new services and features, to name a few.

But any change can impact the health of the network, potentially in unintended ways.

Mercury watches the service impact of these network changes. The moment it detects a degradation that can be accurately attributed to a network change, it raises an alarm to the network-operations team. Mercury is flexible. It can detect large issues, such as entire cell sites that have stopped sending networking traffic, and it can also detect much subtler issues.

For example, AT&T engineers often upgrade routing software nationwide. They do not take these massive rollouts lightly. Engineers perform extensive lab and field testing beforehand, aiming to catch problems that can lead to trouble. But some kinds of degradations cannot be caught in lab tests or field trials. They show up only at scale, after a significant number of network elements have been upgraded and Mercury’s sophisticated analytics can be applied across the upgraded elements.

During one such routine nationwide software upgrade not long ago, Mercury analytics identified a less-than-0.06% degradation in overall network performance. Small as it was, AT&T simply does not tolerate such degradations. The network operations team halted the software rollout after Mercury detected the problem. They then worked with the vendor to rapidly identify a solution, all before the impact could be felt by customers.
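Why would a tiny degradation only become visible at scale? One way to see it is with a simple two-sample comparison between upgraded elements and untouched ones. This is a sketch under stated assumptions, not Mercury's method: the function, the z-score test, and the synthetic data are all hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def change_impact(upgraded_deltas, control_deltas, z_threshold=3.0):
    """Hypothetical sketch: compare per-element metric changes for upgraded
    elements against elements that were not touched.

    Each delta is (metric after the change) - (metric before) for one element.
    Returns (difference in mean delta, z-score, flagged).
    """
    diff = mean(upgraded_deltas) - mean(control_deltas)
    # Standard error of the difference between two independent sample means.
    se = sqrt(variance(upgraded_deltas) / len(upgraded_deltas)
              + variance(control_deltas) / len(control_deltas))
    z = diff / se if se > 0 else 0.0
    # Flag only degradations (negative shifts) that stand clear of the noise.
    return diff, z, z < -z_threshold


def synthetic_deltas(n, shift):
    """Made-up data: noise of +/-0.001 around a given shift."""
    return [shift + (0.001 if i % 2 == 0 else -0.001) for i in range(n)]


# Usage: the same small shift is lost in the noise for 10 elements
# but stands out clearly across 2000.
for n in (10, 2000):
    diff, z, flagged = change_impact(synthetic_deltas(n, -0.0006),
                                     synthetic_deltas(n, 0.0))
    print(n, round(z, 1), flagged)
```

The shift is identical in both runs. Only the number of elements changes, and with it the standard error, which is why some problems cannot surface in a lab test or a small field trial.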

 

SQM Research and a Next Generation Network Management Platform

A next-generation AT&T network will leverage something called Software-Defined Networking (SDN), a revolutionary technology that separates hardware from software. SDN decouples network control from the actual network elements and enables a centralized and flexible control architecture.  Need more computing horsepower? More capacity in a particular area for a particular period of time? Need to reboot network elements? All of this will be managed through intelligent automation instead of by human beings, minimizing response times and further improving customer experience.

Christopher Rice, Vice President of Advanced Technology: "The SQM team operates at the cutting edge of networking technology. Their work is data-driven, analytics-based, and vendor-equipment independent. Future focused, for sure, but their work is applicable in both traditional and in SDN networks."

 

Closed-Loop SQM

SDN technologies will enable something the SQM team calls “closed-loop SQM.” Here’s the idea. In the old days, when a network impairment was detected, an engineer received an alert about the problem. Like an electrical circuit, the problem loop is open because the engineer must perform a sequence of manual steps to fix the problem: (1) determine if the alert is real; (2) localize the issue to a specific network element; (3) try to find the underlying root cause of the problem; and finally, (4) figure out how to address the problem. This can be time-consuming.

Closed-loop SQM, on the other hand, will leverage SDN technologies to trigger automated responses to networking problems, while still keeping the focus on customer experience. This will not only improve customer experience through faster resolution of service issues, but also make the day-to-day business of managing services more efficient for Network Operations.
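The four manual steps map naturally onto an automated pipeline. The sketch below is purely illustrative: the function names, the alert fields, and the remediation table are hypothetical, and it says nothing about how AT&T's SDN controllers are actually driven.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    metric: str
    value: float
    floor: float  # values below this count as a degradation


def run_closed_loop(alert, is_real, localize, diagnose, remediations):
    """Hypothetical sketch: walk the four steps automatically.
    Returns the result of the chosen action, or None to escalate to a human."""
    # (1) Determine whether the alert is real.
    if not is_real(alert):
        return None
    # (2) Localize the issue to a specific network element.
    element = localize(alert)
    if element is None:
        return None
    # (3) Find the underlying root cause.
    cause = diagnose(element)
    # (4) Decide how to address the problem.
    action = remediations.get(cause)
    if action is None:
        return None  # unknown cause: leave the loop open for an engineer
    return action(element)


# Usage with stand-in callables; a real system would query live data.
remediations = {"stuck-process": lambda element: f"restart {element}"}
result = run_closed_loop(
    Alert("success_rate", value=0.90, floor=0.98),
    is_real=lambda a: a.value < a.floor,
    localize=lambda a: "router-7",
    diagnose=lambda e: "stuck-process",
    remediations=remediations,
)
print(result)  # restart router-7
```

Returning None whenever a step cannot be completed keeps a human in the loop for anything the automation does not recognize.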

 

Customer First

Says Yates, "The goal here is to invent technologies that have the greatest beneficial impact on our customers. They don't necessarily see it or think about it, and that's the way it should be. It might not be as thrilling as a new kind of phone or the latest app, but it affects more people in the best way possible. It allows people to connect and communicate seamlessly on AT&T's network any time they want. That's the goal."