att_abstract={{Chronics are recurrent problems that fly under the radar of operations teams because they do not perturb the system enough to set off alarms or violate service-level objectives. The discovery and diagnosis of never-before-seen chronics poses new challenges, as they are not detected by traditional threshold-based techniques, and many chronics can be present in a system at once, all starting and ending at different times. In this paper, we describe our experiences diagnosing chronics using server logs on a large telecommunications service. Our technique uses a scalable Bayesian distribution learner coupled with an information-theoretic measure of distance (KL divergence) to identify the attributes that best distinguish failed calls from successful calls. Our preliminary results demonstrate the usefulness of our technique by providing examples of actual instances where we helped operators discover and diagnose chronics.}},
	att_authors={kj2681, ed2527, mh7921},
	att_categories={C_NSS.9, C_NSS.6, C_NSS.3},
	att_copyright_notice={{(c) ACM, 2011. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM SOSP Workshop on "Managing Large-Scale Systems via the Analysis of System Logs and the Appli, 2011-10-23.}},
	author={Kaustubh Joshi and Edward Daniels and Matti Hiltunen and Soila P. Kavulya and Rajeev Gandhi and Priya Narasimhan},
	institution={{ACM SOSP Workshop on "Managing Large-Scale Systems via the Analysis of System Logs and the Appli}},
	title={{Practical Experiences with Chronics Discovery in Large Telecommunications Systems}},