
180 Park Ave - Building 103
Florham Park, NJ
Subject matter expert in Dependability, Probabilistic Models, Distributed Systems, Cloud Computing, Data Center Management
Today, multitier distributed systems power many online computing services that society takes for granted. These include everything from online shopping sites to communications services to online content providers. Often co-located in large data centers, these complex services can be very difficult to keep running smoothly. I am interested in the online management of such systems to improve dependability, security, performance, and power efficiency. My research straddles both theory and practice, ranging from analytical techniques to model and reason about system behaviors under conditions such as faults, attacks, and changing workloads, to runtime control systems that use the models to implement adaptive behaviors such as fault recovery and self-reconfiguration. I am especially interested in techniques that use data analysis or runtime measurements to construct models with very little human intervention, and for black-box systems for which very little design information is available. My work can be loosely categorized in the areas of dependable, adaptive, and autonomic systems, and is of special relevance to shared infrastructure approaches such as cloud computing.
Before coming to AT&T, I received my Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign. My dissertation work investigated the use of Partially Observable Markov Decision Process (POMDP) models to perform online fault diagnosis and recovery in multitier systems. This research combined incomplete, sometimes inaccurate, and often conflicting reports from multiple monitoring systems to make cost-benefit tradeoffs and choose recovery mechanisms that would bring a system back up with minimal disruption after a failure had occurred.

CloudTops: Latency aware placement of Virtual Desktops in Distributed Cloud Infrastructures
Matti Hiltunen, Kaustubh Joshi, Richard Schlichting, Nishio Yamada, Toshiyuki Moritsu
CLOSER 2013: 3rd International Conference on Cloud Computing and Services Science,
2013.
[PDF]
[BIB]
SciTePress Copyright
The definitive version was published on 2013-05-08.
Latency-sensitive interactive applications such as virtual desktops for enterprise workers are slated to be important driving applications for next-generation cloud infrastructures. Determining where to geographically place desktop VMs in a globally distributed cloud so as to optimize user-perceived performance is an important and challenging problem. Historically, the performance of thin-client-based systems has been predominantly characterized in terms of the front-end network link between the thin client and the desktop. In this paper, we show that for typical enterprise applications, back-end network connectivity to the filesystems and applications that support the desktop can be equally important, and that the optimal balance between the front-end and back-end links depends on the precise workload. To help make dynamic decisions about desktop VM placement, we propose a per-user model that can be used to automatically construct user profiles, and to predict the optimal location for a user’s desktop based on their past and current usage patterns. Using experimental evaluation of several typical enterprise applications, we show that our methodology can accurately predict which of many distributed data centers to use for a particular user’s workload even if details of the precise applications being used are not known.

Jettison: Efficient Idle Desktop Consolidation with Partial VM Migration
Nilton Bila, Eyal Lara, Kaustubh Joshi, Andres Lagar-Cavilla, Matti Hiltunen, Mahadev Satyanarayanan
Proceedings of EuroSys 2012, ACM,
2012.
[PDF]
[BIB]
ACM Copyright
(c) ACM, 2012. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in Proceedings of EuroSys 2012. ACM, 2012-04-10.
Idle desktop systems are frequently left powered, often because of applications that maintain network presence or to enable potential remote access. Unfortunately, an idle PC consumes up to 60% of its peak power. Solutions have been proposed that perform consolidation of idle desktop virtual machines. However, desktop VMs are often large, requiring gigabytes of memory. Consolidating such VMs creates bulk network transfers lasting on the order of minutes, and utilizes server memory inefficiently. When multiple VMs migrate simultaneously, each VM's experienced migration latency grows, and this limits the use of VM consolidation to environments in which only a few daily migrations are expected for each VM. This paper introduces Partial VM Migration, a technique that transparently migrates only the working set of an idle VM. Jettison, our partial VM migration prototype, can deliver 85% to 104% of the energy savings of full VM migration, while using less than 10% as much network resources, and providing migration latencies that are two to three orders of magnitude smaller.

Draco: Statistical Diagnosis of Chronic Problems in Large Distributed Systems.
Matti Hiltunen, Kaustubh Joshi, Edward Daniels, Priya Narasimhan, Rajeev Gandhi, Soila Kavulya
DSN 2012,
2012.
[BIB]
IEEE Copyright
This version of the work is reprinted here with permission of IEEE for your personal use. Not for redistribution. The definitive version was published on 2012-06-25.
Chronic failures are recurrent problems that often fly under the radar of operations teams because they do not affect enough users or service invocations to set off alarm thresholds. In contrast with major outages, which are rare, often have a single cause, and, as a result, are typically easy to diagnose and fix quickly, chronic failures often persist in a system for days or weeks, often with multiple concurrent problems active at the same time and with complex triggers (e.g., an interaction problem between multiple components), and are elusive to diagnose. In this paper, we present Draco, a "top-down" approach to diagnosing chronics that starts from user-visible symptoms of a problem, e.g., failed calls, and drills down to identify the network-level elements and associated resource-usage metrics that are the most suggestive of the failures. Draco is able to diagnose multiple concurrent chronics in a complex distributed system even if the chronics have complex triggers and affect only a few of the calls. We have deployed Draco at scale for a portion of the VoIP operations of a major ISP. We demonstrate Draco's usefulness by providing examples of actual instances in which Draco helped operators diagnose service issues.

Using CPU Gradients for Performance-aware Energy Conservation in Multitier Systems
Kaustubh Joshi, Matti Hiltunen, Richard Schlichting, Shuyi Chen, William Sanders
Sustainable Computing: Informatics and Systems,
2011.
[PDF]
[BIB]
Elsevier Copyright
The definitive version was published in Sustainable Computing: Informatics and Systems, 2011-05-01.
Dynamic voltage and frequency scaling (DVFS) and virtual machine (VM) based server consolidation are techniques that hold promise for energy conservation, but can also have adverse impacts on system performance. For the responsiveness-sensitive multitier applications running in today's data centers, queuing models should ideally be used to predict the impact of CPU scaling on response time, to allow appropriate runtime trade-offs between performance and energy use. In practice, however, such models are difficult to construct and thus are often abandoned for ad-hoc solutions. In this paper, an alternative measurement-based approach that predicts the impacts without requiring detailed application knowledge is presented. The approach uses a new set of metrics, the CPU gradients, that can be automatically measured on a running system using lightweight and nonintrusive CPU perturbations. The practical feasibility of the approach is demonstrated using extensive experiments on multiple multitier applications, and it is shown that simple energy controllers can use gradient predictions to derive as much as 57% energy savings while still meeting response time constraints.

Traffic Backfilling: Subsidizing Lunch for Delay-Tolerant Applications in UMTS Networks
Horacio Lagar, Kaustubh Joshi, Alexander Varshavsky, Jeffrey Bickford, Darwin Parra
ACM MobiHeld workshop 2011,
2011.
[PDF]
[BIB]
ACM Copyright
(c) ACM, 2011. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM MobiHeld workshop 2011, 2011-10-23.
Mobile application developers pay little attention to the interactions between applications and the cellular network carrying their traffic. This results in wastage of device energy and network signaling resources. We place part of the blame on mobile OSes: they do not expose adequate interfaces through which applications can interact with the network. We propose traffic backfilling, a technique in which delay-tolerant traffic is opportunistically transmitted by the OS using resources left over by the naturally occurring bursts caused by interactive traffic. Backfilling presents a simple interface with two classes of traffic, and grants the OS and network large flexibility to maximize the use of network resources and reduce device energy consumption. Using device traces and network data from a major US carrier, we demonstrate a large opportunity for traffic backfilling.

Practical Experiences with Chronics Discovery in Large Telecommunications Systems
Kaustubh Joshi, Edward Daniels, Matti Hiltunen, Soila P. Kavulya, Rajeev Gandhi, Priya Narasimhan
ACM SOSP Workshop on "Managing Large-Scale Systems via the Analysis of System Logs and the Appli,
2011.
[PDF]
[BIB]
ACM Copyright
(c) ACM, 2011. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM SOSP Workshop on "Managing Large-Scale Systems via the Analysis of System Logs and the Appli, 2011-10-23.
Chronics are recurrent problems that fly under the radar of operations teams because they do not perturb the system enough to set off alarms or violate service-level objectives. The discovery and diagnosis of never-before-seen chronics poses new challenges as they are not detected by traditional threshold-based techniques, and many chronics can be present in a system at once, all starting and ending at different times. In this paper, we describe our experiences diagnosing chronics using server logs on a large telecommunications service. Our technique uses a scalable Bayesian distribution learner coupled with an information-theoretic measure of distance (KL divergence), to identify the attributes that best distinguish failed calls from successful calls. Our preliminary results demonstrate the usefulness of our technique by providing examples of actual instances where we helped operators discover and diagnose chronics.

Kaleidoscope: Cloud Micro-Elasticity via VM State Coloring
Horacio Lagar, Kaustubh Joshi, Matti Hiltunen, Roy Bryant, Eyal Lara, Alexey Tumanov, Olga Irzak, Adin Scannell
Eurosys 2011, ACM European Conference on Computer Systems,
2011.
[PDF]
[BIB]
ACM Copyright
(c) ACM, 2011. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM European Conference on Computer Systems - Eurosys 2011, 2011-04-10.
We introduce cloud micro-elasticity, a new model for cloud Virtual Machine (VM) allocation and management. Current cloud users over-provision long-lived VMs with large memory footprints, to better absorb load spikes and to conserve performance-sensitive caches. Instead, we achieve elasticity by swiftly cloning VMs into many transient, short-lived, fractional worker clones, to multiplex physical resources at a much finer granularity. The memory of micro-elastic clones is a logical replica of all parent VM state including caches, and its footprint is a fraction of the nominal maximum proportional to the workload. We enable micro-elasticity through a novel technique dubbed VM state coloring, which classifies VM memory into sets of semantically-related regions, and optimizes the propagation, allocation and deduplication of these regions. Using coloring we build Kaleidoscope and empirically demonstrate its ability to create micro-elastic cloned servers. We model the impact of micro-elasticity on a demand dataset from a hosting provider, and show that fine-grained multiplexing yields infrastructure reductions of 30% relative to state-of-the-art techniques for managing elastic clouds.

FloGuard: Cost-aware Systemwide Intrusion Defense via Online Forensics and On-demand IDS Deployment
Kaustubh Joshi, Saman Aliari Zonouz, William H. Sanders
Proceedings of the 30th International Conference on Computer Safety, Reliability and Security (SAFECOMP 2011),
2011.
[PDF]
[BIB]
Springer Copyright
The definitive version was published in The 30th International Conference on Computer Safety, Reliability and Security. SAFECOMP 2011.
Copyright will be transferred to Springer, 2011-09-19
Detecting intrusions early enough can be a challenging and expensive endeavor. While intrusion detection techniques exist for many types of vulnerabilities, deploying them all to catch the small number of vulnerability exploitations that might actually exist for a given system is not cost-effective. In this paper, we present FloGuard, an on-line intrusion forensics and on-demand detector selection framework that provides systems with the ability to deploy the right detectors dynamically in a cost-effective manner when the system is threatened by an exploit. FloGuard relies on often easy-to-detect symptoms of attacks, e.g., participation in a botnet, and works backwards by iteratively deploying off-the-shelf detectors closer to the initial attack vector. The experiments using the EggDrop bot and systems with real vulnerabilities show that FloGuard can efficiently localize the attack origins even for unknown vulnerabilities, and can judiciously choose appropriate detectors to prevent them from being exploited in the future.

An Exploration of L2 Cache Covert Channels in Virtualized Environments
Yunjing Xu, Michael Bailey, Farnam Jahanian, Kaustubh Joshi, Matti Hiltunen, Richard Schlichting
Proceedings of the ACM Cloud Computing Security Workshop (CCSW),
CCSW 2011: The ACM Cloud Computing Security Workshop in conjunction with the 17th ACM Conference on ,
ACM,
pp To appear.,
2011.
[PDF]
[BIB]
ACM Copyright
(c) ACM, 2011. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in CCSW 2011: The ACM Cloud Computing Security Workshop in conjunction with the 17th ACM Conference on , 2011-10-21.
Recent exploration into the unique security challenges of cloud computing has shown that when virtual machines belonging to different customers share the same physical machine, new forms of cross-VM covert channel communication arise. In this paper, we explore one of these threats, L2 cache covert channels, and demonstrate the limits of this threat by providing a quantification of the channel bit rates and an assessment of its ability to do harm. Through progressively refining models of cross-VM covert channels from the derived maximums, to implementable channels in the lab, and finally in Amazon EC2 itself, we show how a variety of factors impact our ability to create effective channels. While we demonstrate a covert channel with considerably higher bit rate than previously reported, we assess that even at such improved rates, the harm of data exfiltration from these channels is still limited to the sharing of small, if important, secrets such as private keys.

The Case for Energy-Oriented Partial Desktop Migration
Kaustubh Joshi, Horacio Lagar-Cavilla, Matti Hiltunen, Nilton Bila, Eyal Lara, Mahadev Satyanarayanan
2nd USENIX Workshop on Hot Topics in Cloud Computing,
2010.
[BIB]
USENIX Copyright
The definitive version was published in Proceedings of WOSN 2010, USENIX, 2010-06-22.
Office and home environments are increasingly crowded with personal computers. Even though these computers see little use in the course of the day, they often remain powered, even when idle. Leaving idle PCs running is not only wasteful, but with rising energy costs it is increasingly expensive. We propose partial migration of idle desktop sessions into the cloud to achieve energy-proportional computing. Partial migration only propagates the small footprint of state that will be needed during idle period execution, and returns the session to the PC when it is no longer idle. We show that this approach can reduce energy usage of an idle desktop by up to 50% over an hour and by up to 69% overnight. We also show that idle desktop sessions have small working sets, up to an order of magnitude smaller than their allocated memory, enabling significant consolidation ratios.

Probabilistic Model-Driven Recovery in Distributed Systems
Kaustubh Joshi, Matti Hiltunen, Richard Schlichting, William Sanders
IEEE Transactions on Dependable and Secure Computing,
2010.
[BIB]
Automatic system monitoring and recovery has the potential to provide effective, low-cost ways to improve dependability in distributed software systems. However, automating recovery is challenging in practice because accurate fault diagnosis is difficult given the common monitoring tools and techniques with low fault coverage, poor fault localization, detection delays, and false positives. In this paper, we present a holistic model-based approach that overcomes these challenges and enables automatic recovery in distributed systems. To do so, it uses theoretically sound techniques including Bayesian estimation and Markov decision theory to provide controllers that choose good, if not optimal, recovery actions according to a user-defined optimization criterion. By combining monitoring and recovery, the approach realizes benefits that could not have been obtained by using them in isolation. We experimentally validate our framework by fault injection on realistic e-commerce systems.

CPU Gradients: Performance-aware Energy Conservation in Multitier Systems
Shuyi Chen, Kaustubh Joshi, Matti Hiltunen, Richard Schlichting, William Sanders
Proceedings of the 1st IEEE International Green Computing Conference,
2010.
[BIB]
Dynamic voltage and frequency scaling (DVFS) and virtual machine (VM) based server consolidation are techniques that hold promise for energy conservation, but can also have adverse impacts on system performance. For the responsiveness-sensitive multitier applications running in today's data centers, queuing models should ideally be used to predict the impact of CPU scaling on response time, to allow appropriate runtime trade-offs between performance and energy use. In practice, however, such models are difficult to construct and thus are often abandoned for ad-hoc solutions. In this paper, an alternative measurement-based approach that predicts the impacts without requiring detailed application knowledge is presented. The approach uses a new set of metrics, the CPU gradients, that can be automatically measured on a running system using lightweight and nonintrusive CPU perturbations. The practical feasibility of the approach is demonstrated using extensive experiments on multiple multitier applications, and it is shown that simple energy controllers can use gradient predictions to derive as much as 50% energy savings while still meeting response time constraints.

Performance Aware Regeneration in Virtualized Multitier Applications
Kaustubh Joshi, Matti Hiltunen, Gueyoung Jung
2009.
[PDF]
[BIB]

Dependability in the Cloud: Challenges and Opportunities
Kaustubh Joshi, Joseph Weinman, Guy Bunker, Farnam Jahanian, Aad Moorsel
2009.
[PDF]
[BIB]

Blackbox Prediction of the Impact of DVFS on End-to-End Performance of Multitier Systems
Matti Hiltunen, Kaustubh Joshi, Richard Schlichting, Shuyi Chen, William Sanders
2009.
[PDF]
[BIB]

A Cost-Sensitive Adaptation Engine for Server Consolidation of Multitier Applications
Matti Hiltunen, Kaustubh Joshi, Richard Schlichting, Gueyoung Jung, Calton Pu
2009.
[PDF]
[BIB]

Link Gradients: Predicting the Impact of Network Latency on Multi-Tier Applications
Kaustubh Joshi, Matti Hiltunen, Richard Schlichting, Shuyi Chen, William Sanders
2008.
[PDF]
[BIB]

Detecting Hidden Shared Dependencies via Covert Channels
Kaustubh Joshi
2008.
[PDF]
[BIB]

Generating Adaptation Policies for Multi-Tier Applications in Consolidated Server Environments
Matti Hiltunen, Richard Schlichting, Kaustubh Joshi, Gueyoung Jung, Calton Pu
2007.
[PDF]
[BIB]

An Off-Line Approach for Generating On-Line Adaptation Policies
Matti Hiltunen, Kaustubh Joshi, Richard Schlichting, Gueyoung Jung, Calton Pu
2007.
[PDF]
[BIB]

Performability Optimization using Linear Bounds of Partially Observable Markov Decision Processes
Kaustubh Joshi, Matti Hiltunen, William Sanders
2005.
[PDF]
[BIB]