att_abstract={{Chronic failures are recurrent problems that often fly under the radar of operations teams because they do not affect enough users or service invocations to set off alarm thresholds. In contrast with major outages, which are rare, often have a single cause, and, as a result, are typically easy to diagnose and fix quickly, chronic failures often persist in a system for days or weeks, frequently with multiple concurrent problems active at the same time and with complex triggers (e.g., interaction problems between multiple components), and are elusive to diagnose. In this paper, we present Draco, a ``top-down'' approach to diagnosing chronics that starts from user-visible symptoms of a problem, e.g., failed calls, and drills down to identify the network-level elements and associated resource-usage metrics that are the most suggestive of the failures. Draco is able to diagnose multiple concurrent chronics in a complex distributed system even if the chronics have complex triggers and affect only a few of the calls. We have deployed Draco at scale for a portion of the VoIP operations of a major ISP. We demonstrate Draco's usefulness by providing examples of actual instances in which Draco helped operators diagnose service issues.}},
	att_authors={mh7921, kj2681, ed2527},
	att_categories={C_NSS.3, C_NSS.4, C_NSS.5, C_NSS.6, C_NSS.16},
	att_copyright_notice={{This version of the work is reprinted here with permission of IEEE for your personal use. Not for redistribution. The definitive version was published in 2012 (2012-06-25).}},
	att_tags={failure diagnosis},
	author={Matti Hiltunen and Kaustubh Joshi and Edward Daniels and Priya Narasimhan and Rajeev Gandhi and Soila Kavulya},
	institution={{DSN 2012}},
	title={{Draco: Statistical Diagnosis of Chronic Problems in Large Distributed Systems}},