When Cell Towers Fail: Quantifying the Customer Impact
With the world increasingly dependent on mobile communications, any interruption to service inconveniences users and can negatively affect businesses. So when cell towers fail, whether it’s a single tower failing or a cluster of towers failing simultaneously, network providers work tirelessly to quickly restore service to minimize the impact on customers.
But which customers are impacted by a cell tower failure? And what problems are they experiencing? In a dynamic, resilient network where towers hand off customers easily to other towers, the answers are not straightforward. A cell tower failure does not necessarily mean customers are having problems because those customers may simply be handed off to a nearby functioning tower (see sidebar) with minimal service disruption.
This is true whether an outage is due to a single isolated failed tower or due to a cluster of 10, 20, or 30 towers failing simultaneously. (Multiple towers may fail together if a shared resource, such as a power source or backhaul link, goes down or when towers in close proximity are hit by the same local weather event.)
While redundancy may insulate customers from the effects of an outage, it also disperses the outage and makes it harder to quantify the customer impact. It can’t be assumed that customers on a failed tower are being negatively impacted, but by the same token, customers on nearby, still-functioning towers may be impacted when their tower takes on traffic from the failed one.
None of this is captured by manual methods that assess the severity of an outage based on the number of failed towers or the local population density. Such static measurements can’t capture the complex interactions of customers being moved around different towers, especially since they only consider towers that have failed and don’t take into account what is happening to customers on nearby towers.
When multiple towers fail, the problem becomes even more widespread and network managers must then prioritize repairs to address the most severe outages first.
What is required is an in-depth analysis of network data that looks not only at the failed tower but at all towers within the general vicinity to quantify the customer experience across the entire area. For this reason, AT&T researchers created a tower-outage analyzer that performs a series of analyses of network data to take into account redundancy and other mitigating factors, including timing. (A 2:30 AM outage may have virtually no impact, while a 5:00 PM outage will have a great deal.) The analyzer thus provides an accurate, data-driven picture of the customer experience across the whole area impacted by an outage.
A data-driven approach to evaluating cell tower outages
The tower-outage analyzer is initiated either manually or automatically if a scan of outage alerts returns a positive result (such scans are done continuously). Outages clustered together in time and proximity form a single input to the analysis.
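The clustering of outage alerts by time and proximity might look something like the following sketch. This is an illustration only, not the production algorithm; the alert fields, thresholds, and planar-distance approximation are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    tower_id: str
    timestamp: float   # seconds since epoch
    x: float           # tower position in miles (simple planar approximation)
    y: float

def cluster_alerts(alerts, max_gap_s=1800, max_dist_mi=10.0):
    """Greedily merge outage alerts into clusters: an alert joins an
    existing cluster if it is within max_gap_s seconds and max_dist_mi
    miles of any alert already in that cluster. Each resulting cluster
    forms a single input to the downstream impact analysis."""
    clusters = []
    for a in sorted(alerts, key=lambda al: al.timestamp):
        placed = False
        for c in clusters:
            if any(abs(a.timestamp - b.timestamp) <= max_gap_s and
                   ((a.x - b.x) ** 2 + (a.y - b.y) ** 2) ** 0.5 <= max_dist_mi
                   for b in c):
                c.append(a)
                placed = True
                break
        if not placed:
            clusters.append([a])
    return clusters

alerts = [
    Alert("T1", 0, 0.0, 0.0),
    Alert("T2", 300, 3.0, 4.0),    # 5 miles from T1, 5 minutes later: same cluster
    Alert("T3", 600, 40.0, 40.0),  # far away: separate cluster
]
print([len(c) for c in cluster_alerts(alerts)])  # -> [2, 1]
```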
The first step is to determine which nearby towers might be affected by an outage and how far the impact area extends from the failed tower. There is no predetermined distance. At times of peak activity, the impact area may extend several hops away as traffic gets offloaded first to one cell tower and then another, rippling outward until hitting cell towers with capacity to absorb them. At times of low activity when there is capacity to spare, the impacted area may be limited to a single hop.
The tower-outage analyzer takes a prudent approach by examining all towers within a large (ten-mile) radius to see which towers are experiencing a statistically significant increase in the number of customers handled. Spikes in traffic, occurring any time after a nearby tower outage, could indicate the tower is taking on customers forwarded from a failed tower, perhaps to the detriment of its own customers. But an increased traffic load is not necessarily related to a nearby outage; it may instead simply signify the normal start of rush-hour traffic. To tell the difference, a statistical time-series analysis is performed on each tower in the impact area to compare the tower’s current number of customers with the expected number for both the time of day and day of the week.
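A minimal sketch of this kind of seasonal baseline comparison is shown below. The function name, the z-score formulation, and the threshold are illustrative assumptions; the source describes only a statistical time-series comparison against the expected count for the time of day and day of week.

```python
import statistics

def is_traffic_spike(history, current, z_threshold=3.0):
    """history: customer counts observed at this tower in past weeks at the
    same hour and weekday. Returns True if the current count exceeds that
    seasonal baseline by more than z_threshold standard deviations,
    suggesting the tower may be absorbing customers from a failed neighbor
    rather than seeing ordinary (e.g., rush-hour) load."""
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    z = (current - mean) / stdev if stdev > 0 else 0.0
    return z > z_threshold

# On Tuesdays at 5 PM this tower normally serves about 1,000 customers.
baseline = [980, 1010, 995, 1005, 990, 1020]
print(is_traffic_spike(baseline, 1400))  # -> True: well outside normal variation
print(is_traffic_spike(baseline, 1030))  # -> False: ordinary fluctuation
```

Comparing against the same hour-and-weekday slot, rather than a global average, is what lets the analysis distinguish an outage-driven spike from the normal start of rush hour.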
The absence of a traffic spike at a particular tower is not an all-clear signal that customers are unaffected. Towers already at 95% utilization may soon reach 100%, increasing the possibility of dropped calls or other problems. Also congested towers will begin offloading new traffic to nearby towers, further extending the impact area. For these reasons, the tower-outage analyzer employs image extraction techniques (borrowed from computer vision) to define the size and contour of the impact area so it includes all towers close to full capacity.
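One simple way to realize this smoothing step is morphological dilation over a grid of tower locations, so that towers sitting in gaps between flagged neighbors are swept into a single contiguous impact area. The sketch below is a stand-in under that assumption; the actual analyzer's image-extraction technique is not specified in detail.

```python
def dilate(cells, radius=1):
    """Morphological dilation of a set of (row, col) grid cells: grow the
    flagged region outward by `radius` cells in every direction."""
    grown = set()
    for r, c in cells:
        for dr in range(-radius, radius + 1):
            for dc in range(-radius, radius + 1):
                grown.add((r + dr, c + dc))
    return grown

def impact_area(flagged_cells, all_towers, radius=1):
    """all_towers maps tower_id -> (row, col). A tower belongs to the
    impact area if its cell falls inside the dilated flagged region,
    capturing towers likely to be affected even without a spike yet."""
    region = dilate(flagged_cells, radius)
    return {tid for tid, cell in all_towers.items() if cell in region}

towers = {"A": (0, 0), "B": (0, 2), "C": (0, 1), "D": (5, 5)}
flagged = {towers["A"], towers["B"]}          # A and B show traffic spikes
print(sorted(impact_area(flagged, towers)))   # -> ['A', 'B', 'C']: C is swept in
```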
Once the impact area is defined, the tower-outage analyzer performs another set of analyses to determine the number of customers impacted and the nature of the problems customers may be experiencing. Fortunately, network data is rich in information, offering a wide range of measurements that can be analyzed to infer many different customer problems, whether it’s an inability to access the network, a demotion from a higher-speed network (e.g., LTE) to a lower-speed network (e.g., 3G), dropped or not-completed calls, slow Internet speeds, or garbled voice quality.
Each of these problems requires different network metrics and measurements. Before the tower-outage analyzer, these metrics were provided on a per-tower basis and analyzed in isolation. The analyzer, however, takes all these metrics and measurements and aggregates them across many towers to characterize the customer experience as a whole across the entire impact area.
The tower-outage analyzer then performs additional time-series analysis to compare how the current customer experience differs from the expected experience for that time and day. Are customers being negatively affected? Do they notice that service is degraded from the normal expected level? There are no static metrics for what constitutes “normal.” A 99% retainability rate (the percentage of calls successfully completed) might be par for the course for some cell towers, but a noticeable decrease for customers used to 99.99% retainability.
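The retainability example above can be made concrete with a small sketch: instead of a fixed threshold, judge each tower's current rate against its own norm. The function and the "multiple of the usual shortfall" metric are illustrative assumptions, not the analyzer's actual formula.

```python
def degradation(expected_retainability, current_retainability):
    """Express the drop in retainability relative to this tower's own
    expected rate, as a multiple of its usual shortfall from perfection.
    The same absolute rate can be routine for one tower and a glaring
    degradation for another."""
    usual_shortfall = 1.0 - expected_retainability
    drop = expected_retainability - current_retainability
    return drop / usual_shortfall if usual_shortfall > 0 else float("inf")

# A tower that normally retains 99% of calls, now at 98%:
print(round(degradation(0.99, 0.98), 1))    # -> 1.0: one "usual shortfall" worse
# A tower that normally retains 99.99%, also now at 98%:
print(round(degradation(0.9999, 0.98), 1))  # -> 199.0: far more noticeable
```

The two calls return very different degradation scores for the same 98% rate, which is exactly the point: "normal" is defined per tower, not by a static metric.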
Left: Failed towers. Middle: An initial tower-scanning analysis identifies nearby towers with significant changes in the number of customers (occurring any time after a tower failure). Right: A final smoothing operation produces a single impact area that includes towers both with impacted customers and those with customers likely to be impacted.
The tower-outage analyzer in the field
The tower-outage analyzer began as a summer intern project less than two years ago (summer 2011) and was designed to replace the previous manual assessment method with an intelligent, data-driven approach to quantify the customer impact of cell-tower outages.
In a little over a year, the analyzer was deployed by AT&T’s Global Network Operations Center (GNOC), an operations facility that monitors all AT&T networks and responds to events such as cell-tower outages. Results from running the analyzer allowed GNOC analysts for the first time to accurately consider the customer experience when prioritizing and escalating repairs. By providing an accurate and fine-grained assessment of the customer experience across an entire complex outage, the tower-outage analyzer changed the way tower outages are managed and prioritized, making it more customer-focused.
The analyzer’s results also made obvious the shortcomings of the previous manual method, which relied heavily on population density. When network operators and researchers ran the tower-outage analyzer on production outages over an extended period, they found that some outages with a significant number of failed towers had very little customer impact, while other outages with fewer failed towers had fairly significant impact. The impact depends on various factors, including how the outage unfolds: if failed towers are loosely clustered, for example, the outage may have less impact than an equivalent outage of towers tightly coupled together. This demonstrated that raw tower counts and population density provide a poor approximation of customer impact.
The fine-grained information about network outages will soon be incorporated into the customer-service workflow, allowing service representatives to more precisely evaluate whether a customer-reported problem is likely caused by a tower outage or by an independent problem that just happens to be in the impact area of a failure. Given this information, representatives will be able to more accurately predict when service will be restored.
Tower-outage analyzer reports may soon annotate customer-care tickets
Work on the tower-outage analyzer continues as researchers look to expand the types of analysis performed by the analyzer to provide even more detailed customer-impact information. One such extension will be to project how customer impact changes over time, to determine for instance that a 2:00 AM outage will not impact customers for another five hours. Given the power to “see into the future,” network managers will be able to more intelligently schedule repairs, prioritizing the most pressing outages above those with minimal disruption.
A second focus is to understand the customer impact across the entire multilayered network, which consists of 4G/LTE, 3G, and 2G technologies. During an outage, customers may be moved from one layer to another, but the cascading impact is not currently well understood, leaving network managers without a way to accurately assess whether it is more or less impactful to move an LTE customer to a remote tower that still has LTE, or to a closer 3G tower. Once this type of cross-network analysis is possible, network managers will have still more insight into network complexities, and a better evaluation of how to effectively manage complex outages.
Redundancy is Good for the Customer
Mobility customers are normally within reach of at least two cell towers that can handle their calls. It’s the reason people can maintain a call or network connection as they move out of range of one cell tower into the range of another.
Which calls are handled by which tower is a function of where users are in relation to a cell tower and the tower’s current load. On the 3G network, resource allocation is the responsibility of the radio network controller, or RNC, which handles up to 100 towers.
Because more than one tower can provide service for a customer, the impact of cell-tower outages is often mitigated by offloading calls from the failed cell tower to a nearby tower.
What happens to customers when cell towers fail?
From the customer perspective, there are several scenarios.
In the best one, customers are handed off seamlessly to a nearby tower, with minimal or no disruption in service. No harm, no foul.
In other cases, especially when multiple cell towers fail simultaneously, customers may experience problems. They may be unable to access the network, they may move from a higher-speed network (e.g., LTE) to a lower-speed network (e.g., 3G), their calls may drop or not complete, or they may experience quality problems such as garbled voice quality or reduced Internet data speeds. It's important to note these problems may affect customers on nearby towers as well as those on a failed tower.
About the authors
He Yan received his MS and PhD degrees in computer science from Colorado State University, Fort Collins, US, in 2009 and 2012.
He joined AT&T Research in 2012 as a senior member of technical staff at AT&T Labs - Research. (He had previously interned while a student.) His current research interests are service quality management, network management and measurement, and Internet routing.
Zihui Ge received his MS degree in computer science from Boston University in 2000 and his PhD in computer science from UMass at Amherst in 2003. He is currently a principal member of technical staff in the network management department at AT&T Labs – Research.
Matt Osinski, a Specialist in Network Support at AT&T’s Global Network Operations Center, focuses on managing network issues to help ensure network reliability, service excellence, and customer satisfaction across AT&T’s networks. Prior to joining AT&T in 2010, he received his BS in Finance and Information Technology Management from Seton Hall University, South Orange, NJ.
Jennifer Yates is an Executive Director of Technical Research at AT&T Labs - Research, leading Research’s Service and Network Management department. The department has a strong record both in longer-term research activities and in driving research innovations into wide-scale network deployment. Her research has focused on service quality management in mobility networks and IP and optical networks. She graduated with a BSc and BE(Hons) from the University of Western Australia in 1993, and with a PhD in Electrical Engineering from the University of Melbourne in 1997. Jennifer was named one of MIT Technology Review’s TR100 Leading Young Innovators for 2003, was awarded the Victorian Photonic Networks Inaugural Achievement Award in 2004, received the AT&T Science and Technology Medal (2007), and is a 2013 AT&T Fellow.