
When Cell Towers Fail: TowerScan Evaluates the Customer Impact

by: He Yan, Zihui Ge, Matt Osinski, Jennifer Yates, Wed Feb 27 13:46:00 EST 2013

 

With the world increasingly dependent on mobile communications, any interruption to service inconveniences users and can negatively affect businesses. So when cell towers fail and affect customers, network providers work hard to restore service to the most customers in the shortest amount of time. But doing so is surprisingly difficult.

Normally network managers are not looking at a single failed tower in isolation, but rather at a cluster of 10, 20, or even 30 towers that failed simultaneously. (Towers tend to fail simultaneously because they share the same power source or backhaul link, which connects to the backbone network, or because they are close enough geographically to be affected by the same weather event.)

When a cluster of towers fails, network managers must choose which ones to address first and identify those outages severe enough to warrant escalation. But how best to prioritize repairs?

Evaluating the severity of an outage based on the number of towers and population density would seem to make sense—after all, tower failures in densely populated areas have the potential to affect thousands of people—but the underlying assumption here is that customers are always negatively affected when a cell tower fails. This is simply not true. A cell tower failure does not necessarily mean customers are having problems.

In many cases, customer calls can be offloaded to a nearby tower. In fact, towers hand off calls all the time; it’s the reason people can maintain a call as they move out of range of one cell tower into the range of another. (Call handoffs on the 3G network are handled by a radio network controller, or RNC, which manages resource allocation for up to a hundred towers. See sidebar for more detail.)


Urban areas are often heavily blanketed by cell towers that essentially serve as backups for one another. It’s in rural areas, where redundancy is less prevalent than in urban areas, that outages can sometimes have more of an impact on customers.

Besides redundancy, other factors can mitigate the impact of an outage. Timing is one. A tower failure at 2:30 AM may have virtually no impact, and may continue to have no user impact for many hours. The day of the week also matters. A weekend tower failure might have less (or more) impact than a weekday failure, depending on whether the outage is close to work areas or to homes and parks. Special days (holidays, Mother’s Day, Super Bowl day) may have patterns of their own.

Redundancy and timing as well as additional factors such as terrain are thus much more relevant to the customer impact than population density, the number of towers, and the outage duration (though population density, number of failed towers, and outage duration do have the virtue of being easy to collect).

Figuring out what’s actually happening to customers is a much harder problem.

 

What happens when a cell tower fails?

When a cell tower fails, it ceases to communicate with the network, prompting the RNC to move calls from the failed tower to another tower close to the customer and capable of taking on extra load.

From the customer perspective, there are several scenarios.

In the best one, their calls are handed off seamlessly to a nearby tower, with minimal or no disruption in service. No harm, no foul.

In other cases, especially when multiple cell towers fail simultaneously, customers may experience problems. They may be unable to access the network, they may move from a higher-speed network (e.g., LTE) to a lower-speed network (e.g., 3G), their calls may drop or not complete, or they may experience quality problems such as garbled voice or reduced Internet data speeds.
 

A cell tower failure may (or may not) affect customers—both those customers on the failed tower and those on nearby towers.

 

These problems are not restricted to customers on the failed tower. If calls are being offloaded to a nearby tower, customers at those towers may start to experience degraded service as their own tower suddenly takes on calls forwarded from the failed one. It’s the flip side of customers of failed towers not experiencing problems: customers on nearby, functioning towers may experience problems even though their own tower did not fail.

To accurately quantify the customer impact of a network event requires considering all customers in the immediate vicinity of a failed tower, not just those on the failed tower. The scope of the task thus broadens, perhaps considerably. It’s no longer a well-defined task of focusing on towers with known failures, but a much larger task of looking at all towers within the surrounding area of a failed tower or cluster to determine whether customers on those towers are experiencing problems.

How big an area is impacted? How far does the impact area extend from the failed tower?

No static metric can answer these questions because there is no predetermined distance. At times of peak activity, the impact area may extend several hops away as calls get offloaded first to one cell tower and then another, rippling outward until calls hit cell towers with capacity to absorb them. At times of low activity when there is capacity to spare, the impacted area may be limited to a single hop. Additional factors, particularly terrain, may also play a role in defining the contours of the impact area.
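
The rippling effect described above can be illustrated with a toy model. The following is only a sketch under loud assumptions: the neighbor graph, the load and capacity numbers, and the rule that excess load is split evenly among not-yet-visited neighbors are all hypothetical, and say nothing about how an RNC actually chooses target towers.

```python
from collections import deque

def impact_hops(neighbors, load, capacity, failed):
    """Return {tower: hop} for every tower that receives offloaded load.

    Hypothetical model: a tower's excess load is split evenly among
    neighbors that have not yet received any offloaded load.
    """
    load = dict(load)                 # work on a copy
    hops = {}
    queue = deque()
    for t in failed:
        queue.append((t, 0, load[t]))  # a failed tower sheds all of its load
        load[t] = 0
        hops[t] = 0
    while queue:
        tower, hop, excess = queue.popleft()
        targets = [n for n in neighbors[tower] if n not in hops]
        if not targets or excess <= 0:
            continue
        share = excess / len(targets)
        for n in targets:
            hops[n] = hop + 1
            load[n] += share
            overflow = load[n] - capacity[n]
            if overflow > 0:           # tower is full: pass the rest along
                load[n] = capacity[n]
                queue.append((n, hop + 1, overflow))
    # Report only the surviving towers, not the failed ones.
    return {t: h for t, h in hops.items() if h > 0}
```

With four towers in a line (A, B, C, D) and tower A failing, the same function reaches three hops when the neighbors are already busy and only one hop when they have capacity to spare, which is the behavior the paragraph above describes.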

 


The prudent approach is to examine all towers within a large radius (ten miles) to see which towers are experiencing a statistically significant increase in the number of customers handled. Such traffic spikes coming any time after a nearby tower outage could indicate the tower is taking on customers forwarded from a failed tower, perhaps to the detriment of its own customers. But an increased traffic load may not necessarily be related to a nearby outage. For some towers, traffic spikes may be normal occurrences at various periods of the day, the morning and evening rush hours being good examples. Identifying an anomalous increase requires comparing the tower’s current number of customers with the expected number of customers for both the time of day and day of the week.
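
As a minimal sketch of this kind of baseline comparison, the function below flags a tower when its current customer count is far above what was seen historically for the same hour and weekday. The z-score test and the threshold of 3 are assumptions chosen for illustration; the article does not specify which statistical test TowerScan uses.

```python
from statistics import mean, stdev

def is_anomalous_spike(current, history, threshold=3.0):
    """Flag an unusually high customer count.

    history: past counts for the same tower, hour of day, and day of week
    (a hypothetical input format). threshold is in standard deviations.
    """
    if len(history) < 2:
        return False              # not enough data to estimate normal variation
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return current > mu       # history is flat: any increase stands out
    return (current - mu) / sigma > threshold
```

A tower that normally serves about 100 customers at this hour would be flagged at 180, while a tower whose rush-hour baseline is already around 400 would not be flagged at 415.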

The absence of a traffic spike at a particular tower is not an all-clear signal that customers are unaffected. Towers already at 95% utilization may simply pass the extra workload on to other towers. Towers this close to full capacity should also be included in the impact area, since their customers are at risk of seeing degraded service.
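
Put as a rule, a tower belongs in the candidate impact area if it shows a spike or if it is already near capacity. The sketch below assumes a hypothetical `TowerStatus` record and uses the 95% figure from the text as the default cutoff; the field names are illustrative only.

```python
from dataclasses import dataclass

@dataclass
class TowerStatus:
    tower_id: str
    has_traffic_spike: bool   # result of a baseline comparison (assumed given)
    utilization: float        # fraction of capacity in use, 0.0 to 1.0

def candidate_impact_towers(towers, near_capacity=0.95):
    """Return IDs of towers with a spike or with utilization at or above the cutoff."""
    return [
        t.tower_id
        for t in towers
        if t.has_traffic_spike or t.utilization >= near_capacity
    ]
```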

Once the impact area is defined, another set of analyses is needed to determine the number of customers impacted and the nature of their problems. Fortunately network data is rich in information, offering a wide range of measurements that can be analyzed to infer many different customer problems, whether the problem is one of customers being unable to access the network, slower data speeds, or calls being dropped or not completing.

Each problem, however, requires different network data and separate analyses.

The data is there; the hard part is analyzing it all in close to real time, and interpreting it across all customers within the entire impact area to obtain an aggregate performance metric indicative of the average customer experience. (Analyzing each user, though theoretically possible, is not practical, both because of the sheer number of customers, with one cell tower alone potentially handling thousands, and for privacy reasons. Individual customers’ experiences can also be highly variable.)

The current aggregate performance must then be compared with the aggregate performance that would normally be expected for those customers within the impact area, taking into consideration the time and day. Again, there are no static metrics for what constitutes average customer experience across the entire impact area. A 99% retainability rate (the percentage of calls successfully completed) might be par for the course for some cell towers, but a noticeable decrease for customers used to 99.99% retainability.
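
The retainability example can be worked through in a few lines. This is a sketch only: the per-tower (completed, attempted) input format and the `failure_ratio` helper are hypothetical, and the article does not say how TowerScan aggregates or compares its metrics.

```python
def aggregate_retainability(per_tower):
    """Percentage of calls completed across all towers in the impact area.

    per_tower: list of (completed_calls, attempted_calls) pairs.
    """
    completed = sum(c for c, _ in per_tower)
    attempted = sum(a for _, a in per_tower)
    return 100.0 * completed / attempted

def failure_ratio(current_pct, baseline_pct):
    """How many times more calls fail now than under the expected baseline."""
    baseline_failures = 100.0 - baseline_pct
    current_failures = 100.0 - current_pct
    if baseline_failures <= 0:
        # A perfect baseline: any failure at all is an unbounded increase.
        return float("inf") if current_failures > 0 else 1.0
    return current_failures / baseline_failures
```

Seen this way, 99% retainability against a 99.99% baseline means roughly 100 times as many failed calls as customers in that area are used to, even though the two percentages look close.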

Understanding the customer impact of a tower outage thus requires extensive analysis across a potentially large impact area. It’s not something that can be done manually. It instead requires an automated system that can perform in near real time.
 

TowerScan

TowerScan began as a summer intern project less than two years ago (summer 2011) and was designed to replace the previous manual assessment method that relied heavily on population density. The goal was to develop a dynamic data-driven approach capable of quantifying the customer experience so that repair work would be based on restoring service to the highest number of customers in the shortest time.

TowerScan is essentially a series of analyses: statistical time-series analysis and image extraction techniques (borrowed from computer vision) to define the size and contour of the impact area, and a separate set of time-series analyses performed on network data to understand and quantify the customer experience at each tower within the impact area.

A TowerScan review is initiated either manually or, if a scan of outage alerts returns a positive result, automatically (such scans are done continuously). Outages clustered together in time and proximity form a single input to a TowerScan analysis.
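
One simple way to group outages by time and proximity is single-linkage clustering: two outages are linked if they are close in both space and time, and linked outages end up in the same cluster. The sketch below is an assumption about how such grouping could be done, not a description of TowerScan itself; the 15 km and 30 minute limits and the tuple layout are hypothetical.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    r = 6371.0  # mean Earth radius
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def cluster_outages(outages, max_km=15.0, max_minutes=30.0):
    """Group outages that are close in both space and time.

    outages: list of (tower_id, lat, lon, minute_of_failure) tuples.
    Returns a sorted list of sorted tower-ID lists, one per cluster.
    """
    parent = list(range(len(outages)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(outages)):
        for j in range(i + 1, len(outages)):
            _, la1, lo1, t1 = outages[i]
            _, la2, lo2, t2 = outages[j]
            close_in_time = abs(t1 - t2) <= max_minutes
            if close_in_time and haversine_km(la1, lo1, la2, lo2) <= max_km:
                parent[find(i)] = find(j)

    groups = {}
    for i, outage in enumerate(outages):
        groups.setdefault(find(i), []).append(outage[0])
    return sorted(sorted(g) for g in groups.values())
```

Each resulting cluster would then be handed to the analysis as a single input, rather than analyzing every failed tower separately.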


Left: Failed towers. Middle: An initial TowerScan analysis identifies nearby towers with significant changes in the number of users (occurring any time after a tower failure). Right: A final smoothing operation produces a single impact area that includes towers both with impacted customers and those with customers likely to be impacted.  

 

TowerScan has automated the analysis of cell-tower outages, but more importantly, TowerScan has changed the way tower outages are managed and prioritized. What was once a manual method narrowly focused on failed towers is now a broad-based, in-depth analysis of network data to fully understand what is happening to all customers within an area affected by an outage. The manual method has been replaced with a data-driven method that helps prioritize repairs based on what is best for the most customers.

Results from a TowerScan analysis are more accurate and more fine-grained than those provided by the previous manual method. By running TowerScan on production outages over an extended time interval, researchers were able to see that some outages with a significant number of failed towers have had very little customer impact, while other outages with smaller numbers of failed towers have had fairly significant customer impact. The impact depends on how the outage occurs: if failed towers are less clustered, for example, the outage may have less impact than an equivalent outage of failed towers tightly clustered together. This demonstrates that network metrics and population density are a poor approximation of customer impact.

The fine-grained information about network outages provided by TowerScan can serve more than one purpose. By incorporating it into the customer-service workflow, service representatives will be able to more precisely evaluate whether a customer-reported problem is likely caused by a tower outage or by an independent problem that just happens to be in the impact area of a failure. This will enable customer representatives to provide customers with a more accurate assessment of the problem along with a more accurate estimate as to when service will be restored.

 


 TowerScan reports may soon annotate customer-care tickets

 

TowerScan is now deployed nationwide and is being used by AT&T’s Global Network Operations Center (GNOC), an operations facility that monitors all AT&T networks and responds to events such as cell-tower outages. Results from TowerScan analyses for the first time give GNOC analysts actual customer data useful for prioritizing and escalating repairs.

It took a little over a year for TowerScan to be elevated from a research project to a deployed system. But work continues as researchers look to expand the types of analyses performed by TowerScan to provide even more detailed customer-impact information. One such extension will be for TowerScan to measure how customer impact changes over time, and to project the impact of an outage hours into the future. If network managers know, for instance, that an outage occurring at 2:00 AM will not impact customers for another five hours, they can concentrate on other outages that have more immediate impact. This ability to “see into the future” will give managers needed flexibility for scheduling repairs, allowing them to concentrate resources on the most pressing repairs and delay others when possible.
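
A rough sketch of such a projection, assuming a hypothetical 24-entry profile of expected demand per hour and a single number for the capacity of the surviving towers, might look like the following. The planned TowerScan extension is not described in enough detail to say how it would actually be done.

```python
def hours_until_impact(start_hour, expected_demand_by_hour, surviving_capacity):
    """Return how many hours from start_hour until expected demand first
    exceeds the capacity of the surviving towers, or None if it never
    does within the next 24 hours.

    expected_demand_by_hour: list of 24 values, index 0 = midnight.
    """
    for offset in range(24):
        hour = (start_hour + offset) % 24
        if expected_demand_by_hour[hour] > surviving_capacity:
            return offset
    return None
```

With low overnight demand that jumps at 7:00 AM, an outage starting at 2:00 AM yields a result of five hours, matching the example in the text.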

A second focus is to understand the customer impact across the entire multilayered network, which is made up of three separate network technologies: 4G/LTE, 3G, and 2G. During an outage, customers may be moved from one layer to another, but the impact is not currently well understood; managers simply do not have an accurate assessment of whether it will be more or less impactful to move an LTE customer to a remote tower that still has LTE, or to move the customer to a closer 3G tower. Once TowerScan can perform this type of cross-network analysis, it will further speed processing, quantify the customer experience at finer granularity, and give a better understanding of the cascading impact across the 2G, 3G, and 4G/LTE networks.

Redundancy is Good for the Customer

Mobility customers are normally within reach of at least two cell towers that can handle their calls.


 
Which calls are handled by which tower is a function of where users are in relation to a cell tower and the tower’s current load. On the 3G network, resource allocation is the responsibility of the radio network controller, or RNC, which handles up to 100 towers.

Because more than one tower can provide service for a customer, the impact of cell-tower outages is often mitigated by offloading calls from the failed cell tower to a nearby tower.

This redundancy in coverage, while insulating customers from the effects of an outage, makes it harder for network managers to pinpoint the source of a problem and understand the impact of cell-tower outages. With customers moving easily from one tower to another, the problem becomes more dispersed, requiring multiple types of analysis to define the scope of the impact and quantitatively determine the severity of an outage. TowerScan was developed in part to perform the analysis needed to quantify the customer impact in this kind of situation.
 

 

About the authors

He Yan received his MS and PhD degrees in computer science from Colorado State University, Fort Collins, US, in 2009 and 2012. He joined AT&T Labs - Research in 2012 as a senior member of technical staff. (He had previously interned while a student.) His current research interests are service quality management, network management and measurement, and Internet routing.

Zihui Ge received his MS degree in computer science from Boston University in 2000 and his PhD in computer science from UMass at Amherst in 2003. He is currently a principal member of technical staff in the network management department at AT&T Labs – Research.

Matt Osinski, a Specialist - Network Support at AT&T’s Global Network Operations Center, focuses on managing network issues to help ensure network reliability, service excellence and customer satisfaction across AT&T’s networks. Prior to joining AT&T in 2010, Matt received his BS in Finance and Information Technology Management from Seton Hall University, South Orange, NJ.