
Draco: Statistical Diagnosis of Chronic Problems in Large Distributed Systems.
Matti Hiltunen, Kaustubh Joshi, Edward Daniels, Priya Narasimhan, Rajeev Gandhi, Soila Kavulya
DSN 2012,
2012.
IEEE Copyright
This version of the work is reprinted here with permission of IEEE for your personal use. Not for redistribution. The definitive version was published in 2012 (2012-06-25).
{Chronic failures are recurrent problems that often fly under the radar of operations teams because they do not affect enough users or service invocations to set off alarm thresholds. Major outages are rare, often have a single cause, and, as a result, are typically easy to diagnose and fix quickly. In contrast, chronic failures often persist in a system for days or weeks, often with multiple concurrent problems active at the same time and with complex triggers (e.g., interaction problems between multiple components), and they are elusive to diagnose. In this paper, we present Draco, a "top-down" approach to diagnosing chronics that starts from user-visible symptoms of a problem, e.g., failed calls, and drills down to identify the network-level elements and associated resource-usage metrics that are most suggestive of the failures. Draco is able to diagnose multiple concurrent chronics in a complex distributed system even if the chronics have complex triggers and affect only a few of the calls. We have deployed Draco at scale for a portion of the VoIP operations of a major ISP. We demonstrate Draco's usefulness by providing examples of actual instances in which Draco helped operators diagnose service issues.}

Practical Experiences with Chronics Discovery in Large Telecommunications Systems
Kaustubh Joshi, Edward Daniels, Matti Hiltunen, Soila P. Kavulya, Rajeev Gandhi, Priya Narasimhan
ACM SOSP Workshop on "Managing Large-Scale Systems via the Analysis of System Logs and the Appli,
2011.
ACM Copyright
(c) ACM, 2011. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM SOSP Workshop on "Managing Large-Scale Systems via the Analysis of System Logs and the Appli, 2011-10-23.
{Chronics are recurrent problems that fly under the radar of operations teams because they do not perturb the system enough to set off alarms or violate service-level objectives. The discovery and diagnosis of never-before-seen chronics poses new challenges: they are not detected by traditional threshold-based techniques, and many chronics can be present in a system at once, all starting and ending at different times. In this paper, we describe our experiences diagnosing chronics using server logs on a large telecommunications service. Our technique uses a scalable Bayesian distribution learner coupled with an information-theoretic measure of distance (KL divergence) to identify the attributes that best distinguish failed calls from successful calls. Our preliminary results demonstrate the usefulness of our technique by providing examples of actual instances where we helped operators discover and diagnose chronics.}
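
The idea of ranking attributes by how differently they are distributed over failed and successful calls can be illustrated with a small sketch. This is not the authors' implementation: the record layout, the attribute names (`server`, `codec`), the `failed` label, and the symmetric Dirichlet prior with `alpha=1.0` are all assumptions made for illustration. For each attribute, the sketch estimates a smoothed value distribution for failed calls and for successful calls, then scores the attribute by the KL divergence between the two.

```python
import math
from collections import Counter


def smoothed_distribution(values, support, alpha=1.0):
    """Posterior-mean estimate of a categorical distribution under a
    symmetric Dirichlet(alpha) prior (assumed prior, for illustration)."""
    counts = Counter(values)
    total = len(values) + alpha * len(support)
    return {v: (counts[v] + alpha) / total for v in support}


def kl_divergence(p, q):
    """KL(p || q) in nats; p and q share the same support and, thanks to
    smoothing, have no zero entries."""
    return sum(p[v] * math.log(p[v] / q[v]) for v in p)


def rank_attributes(calls, label_key="failed"):
    """Rank attributes by KL(failed || successful), highest first.
    Assumes every record carries the same attribute keys."""
    failed = [c for c in calls if c[label_key]]
    ok = [c for c in calls if not c[label_key]]
    attributes = sorted({k for c in calls for k in c if k != label_key})
    scores = {}
    for attr in attributes:
        support = {c[attr] for c in calls}
        p = smoothed_distribution([c[attr] for c in failed], support)
        q = smoothed_distribution([c[attr] for c in ok], support)
        scores[attr] = kl_divergence(p, q)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical call records: failures cluster on server "s2",
# while the codec is spread evenly across both outcomes.
calls = [
    {"server": s, "codec": c, "failed": False}
    for s in ("s1", "s2")
    for c in ("g711", "g729")
    for _ in range(2)
] + [
    {"server": "s2", "codec": "g711", "failed": True},
    {"server": "s2", "codec": "g729", "failed": True},
]

ranking = rank_attributes(calls)
for attr, score in ranking:
    print(f"{attr}: {score:.4f}")
```

In this toy data the `server` attribute ranks first because both failures occur on `s2`, whereas `codec` scores zero since its distribution is identical for failed and successful calls. A production system would also need to handle missing attributes, very large supports, and calls that overlap in time.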