
180 Park Ave - Building 103
Florham Park, NJ
Jettison: Efficient Idle Desktop Consolidation with Partial VM Migration
Nilton Bila, Eyal Lara, Kaustubh Joshi, Andres Lagar-Cavilla, Matti Hiltunen, Mahadev Satyanarayanan
Proceedings of EuroSys 2012, ACM,
2012.
[PDF]
[BIB]
ACM Copyright
(c) ACM, 2012. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in Proceedings of EuroSys 2012. ACM, 2012-04-10.
Idle desktop systems are frequently left powered, often because of applications that maintain network presence or to enable potential remote access. Unfortunately, an idle PC consumes up to 60% of its peak power. Solutions have been proposed that consolidate idle desktop virtual machines. However, desktop VMs are often large, requiring gigabytes of memory. Consolidating such VMs creates bulk network transfers lasting on the order of minutes and uses server memory inefficiently. When multiple VMs migrate simultaneously, the migration latency experienced by each VM grows, which limits the use of VM consolidation to environments in which only a few daily migrations are expected for each VM. This paper introduces Partial VM Migration, a technique that transparently migrates only the working set of an idle VM. Jettison, our partial VM migration prototype, can deliver 85% to 104% of the energy savings of full VM migration, while using less than 10% as much network resources and providing migration latencies that are two to three orders of magnitude smaller.

Draco: Statistical Diagnosis of Chronic Problems in Large Distributed Systems.
Matti Hiltunen, Kaustubh Joshi, Edward Daniels, Priya Narasimhan, Rajeev Gandhi, Soila Kavulya
DSN 2012,
2012.
[BIB]
IEEE Copyright
This version of the work is reprinted here with permission of IEEE for your personal use. Not for redistribution. The definitive version was published in DSN 2012, 2012-06-25.
Chronic failures are recurrent problems that often fly under the radar of operations teams because they do not affect enough users or service invocations to set off alarm thresholds. In contrast with major outages, which are rare, often have a single cause, and, as a result, are typically easy to diagnose and fix quickly, chronic failures often persist in a system for days or weeks, often with multiple concurrent problems active at the same time and with complex triggers (e.g., an interaction problem between multiple components), and are elusive to diagnose. In this paper, we present Draco, a "top-down" approach to diagnosing chronics that starts from user-visible symptoms of a problem, e.g., failed calls, and drills down to identify the network-level elements and associated resource-usage metrics that are the most suggestive of the failures. Draco is able to diagnose multiple concurrent chronics in a complex distributed system even if the chronics have complex triggers and affect only a few of the calls. We have deployed Draco at scale for a portion of the VoIP operations of a major ISP. We demonstrate Draco's usefulness by providing examples of actual instances in which Draco helped operators diagnose service issues.

Using CPU Gradients for Performance-aware Energy Conservation in Multitier Systems
Kaustubh Joshi, Matti Hiltunen, Richard Schlichting, Shuyi Chen, William Sanders
Sustainable Computing: Informatics and Systems,
2011.
[PDF]
[BIB]
Elsevier Copyright
The definitive version was published in Sustainable Computing: Informatics and Systems, 2011-05-01
Dynamic voltage and frequency scaling (DVFS) and virtual machine (VM) based server consolidation are techniques that hold promise for energy conservation, but can also have adverse impacts on system performance. For the responsiveness-sensitive multitier applications running in today's data centers, queuing models should ideally be used to predict the impact of CPU scaling on response time, to allow appropriate runtime trade-offs between performance and energy use. In practice, however, such models are difficult to construct and thus are often abandoned for ad-hoc solutions. In this paper, an alternative measurement-based approach that predicts the impacts without requiring detailed application knowledge is presented. The approach uses a new set of metrics, the CPU gradients, that can be automatically measured on a running system using lightweight and nonintrusive CPU perturbations. The practical feasibility of the approach is demonstrated using extensive experiments on multiple multitier applications, and it is shown that simple energy controllers can use gradient predictions to derive as much as 57% energy savings while still meeting response time constraints.

Practical Experiences with Chronics Discovery in Large Telecommunications Systems
Kaustubh Joshi, Edward Daniels, Matti Hiltunen, Soila P. Kavulya, Rajeev Gandhi, Priya Narasimhan
ACM SOSP Workshop on "Managing Large-Scale Systems via the Analysis of System Logs and the Appli,
2011.
[PDF]
[BIB]
ACM Copyright
(c) ACM, 2011. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM SOSP Workshop on "Managing Large-Scale Systems via the Analysis of System Logs and the Appli, 2011-10-23.
Chronics are recurrent problems that fly under the radar of operations teams because they do not perturb the system enough to set off alarms or violate service-level objectives. The discovery and diagnosis of never-before-seen chronics poses new challenges, as they are not detected by traditional threshold-based techniques, and many chronics can be present in a system at once, all starting and ending at different times. In this paper, we describe our experiences diagnosing chronics using server logs on a large telecommunications service. Our technique uses a scalable Bayesian distribution learner coupled with an information-theoretic measure of distance (KL divergence) to identify the attributes that best distinguish failed calls from successful calls. Our preliminary results demonstrate the usefulness of our technique by providing examples of actual instances where we helped operators discover and diagnose chronics.
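
As a rough illustration of the ranking idea described in this abstract, the Python sketch below scores each log attribute by the KL divergence between its value distribution in failed calls and in successful calls. The record layout, the attribute names (`server`, `codec`), and the add-one smoothing are assumptions made for the example; the paper's scalable Bayesian distribution learner is not reproduced here.

```python
import math
from collections import Counter


def value_distribution(records, attribute, values, alpha=1.0):
    """Smoothed distribution of one attribute's values (add-alpha smoothing)."""
    counts = Counter(r[attribute] for r in records)
    total = len(records) + alpha * len(values)
    return {v: (counts[v] + alpha) / total for v in values}


def kl_divergence(p, q):
    """KL(p || q) in bits for two distributions over the same support."""
    return sum(p[v] * math.log2(p[v] / q[v]) for v in p if p[v] > 0)


def rank_attributes(failed, successful, attributes):
    """Rank attributes by how strongly they separate failed from successful calls."""
    scores = {}
    for attr in attributes:
        values = sorted({r[attr] for r in failed + successful})
        p = value_distribution(failed, attr, values)
        q = value_distribution(successful, attr, values)
        scores[attr] = kl_divergence(p, q)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical call records: failures concentrate on server "s2",
# while the codec is distributed identically in both groups.
failed = [{"server": "s2", "codec": "g711"}, {"server": "s2", "codec": "g729"},
          {"server": "s2", "codec": "g711"}, {"server": "s1", "codec": "g729"}]
successful = [{"server": "s1", "codec": "g711"}, {"server": "s1", "codec": "g729"},
              {"server": "s3", "codec": "g711"}, {"server": "s3", "codec": "g729"}]

ranking = rank_attributes(failed, successful, ["server", "codec"])
```

In this toy data the `server` attribute ranks first because its distribution differs between the two groups, whereas `codec` carries no distinguishing information.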

Mining large distributed log-data in near real-time
Stefan Weigert, Matti Hiltunen, Christof Fetzer
SLAML: Workshop on Managing Large-Scale Systems via the Analysis of System Logs and the Application,
2011.
[PDF]
[BIB]
ACM Copyright
(c) ACM, 2011. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in SLAML: Workshop on Managing Large-Scale Systems via the Analysis of System Logs and the Application, 2011-10-23.
Analyzing huge amounts of log data is often a difficult task, especially if it has to be done in real time (e.g., fraud detection) or when large amounts of stored data are required for the analysis. Graphs are a data structure often used in log analysis; examples are clique analysis and communities of interest (COI). However, little attention has been paid to large distributed graphs that allow a high throughput of updates with very low latency. In this paper, we present a distributed graph mining system that is able to process around 39 million log entries per second on a 50-node cluster while providing processing latencies below 10 ms. We validate our approach by presenting two example applications, namely telephony fraud detection and internet attack detection. A thorough evaluation demonstrates the scalability and near real-time properties of our system.

Kaleidoscope: Cloud Micro-Elasticity via VM State Coloring
Horacio Lagar, Kaustubh Joshi, Matti Hiltunen, Roy Bryant, Eyal Lara, Alexey Tumanov, Olga Irzak, Adin Scannell
Eurosys 2011, ACM European Conference on Computer Systems,
2011.
[PDF]
[BIB]
ACM Copyright
(c) ACM, 2011. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM European Conference on Computer Systems - Eurosys 2011, 2011-04-10.
We introduce cloud micro-elasticity, a new model for cloud Virtual Machine (VM) allocation and management. Current cloud users over-provision long-lived VMs with large memory footprints, to better absorb load spikes and to conserve performance-sensitive caches. Instead, we achieve elasticity by swiftly cloning VMs into many transient, short-lived, fractional worker clones, to multiplex physical resources at a much finer granularity. The memory of micro-elastic clones is a logical replica of all parent VM state including caches, and its footprint is a fraction of the nominal maximum proportional to the workload. We enable micro-elasticity through a novel technique dubbed VM state coloring, which classifies VM memory into sets of semantically related regions, and optimizes the propagation, allocation and deduplication of these regions. Using coloring we build Kaleidoscope and empirically demonstrate its ability to create micro-elastic cloned servers. We model the impact of micro-elasticity on a demand dataset from a hosting provider, and show that fine-grained multiplexing yields infrastructure reductions of 30% relative to state-of-the-art techniques for managing elastic clouds.

DarkNOC: Dashboard for Honeypot Management
Bertrand Sobesto, Michel Cukier, Matti Hiltunen, David Kormann, Gregory Vesonder, Robin Berthier
USENIX LISA'11: 25th Large Installation System Administration Conference,
2011.
[PDF]
[BIB]
USENIX Copyright
The definitive version was published in USENIX LISA'11: 25th Large Installation System Administration Conference, Usenix, 2011-12-04
Protecting computer and information systems from security attacks is becoming an increasingly important task for system administrators. Honeypots are a technology often used to detect attacks and collect information about techniques and targets (e.g., services, ports, operating systems) of attacks. However, managing a large and complex honeynet of honeypots becomes a challenge in itself given the amount of data collected as well as the risk that the honeypots may themselves become infected and start attacking other machines. In this paper, we present DarkNOC, a management and monitoring tool for complex honeynets consisting of different types of honeypots as well as other data collection devices. DarkNOC has been actively used to manage a honeynet consisting of multiple subnets and hundreds of IP addresses. This paper describes the architecture and a number of case studies demonstrating the use of the tool.

Community-based analysis of netflow for early detection of security incidents
Matti Hiltunen, Stefan Weigert, Christof Fetzer
Usenix LISA,
2011.
[PDF]
[BIB]
USENIX Copyright
The definitive version was published in Proceedings of LISA 2011, Usenix, 2011-12-04
Detection and remediation of security incidents (e.g., attacks, compromised machines, policy violations) is an increasingly important task of system administrators. While numerous tools and techniques are available (e.g., Snort, nmap, netflow), novel attacks and low-grade events may still be hard to detect in a timely manner. In this paper, we present a novel approach for detecting stealthy, low-grade security incidents by utilizing information across a community of organizations (e.g., the banking industry, the energy generation and distribution industry, or governmental organizations in a specific country). The approach uses netflow, a commonly available non-intrusive data source, analyzes communication to/from the community, and alerts the community members when suspicious activity is detected. Community-based detection can detect incidents that would fall below local detection thresholds while keeping the number of daily alerts at a manageable level.

An Exploration of L2 Cache Covert Channels in Virtualized Environments
Yunjing Xu, Michael Bailey, Farnam Jahanian, Kaustubh Joshi, Matti Hiltunen, Richard Schlichting
Proceedings of the ACM Cloud Computing Security Workshop (CCSW),
CCSW 2011: The ACM Cloud Computing Security Workshop in conjunction with the 17th ACM Conference on,
ACM,
pp To appear.,
2011.
[PDF]
[BIB]
ACM Copyright
(c) ACM, 2011. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in CCSW 2011: The ACM Cloud Computing Security Workshop in conjunction with the 17th ACM Conference on, 2011-10-21.
Recent exploration into the unique security challenges of cloud computing has shown that when virtual machines belonging to different customers share the same physical machine, new forms of cross-VM covert channel communication arise. In this paper, we explore one of these threats, L2 cache covert channels, and demonstrate the limits of this threat by providing a quantification of the channel bit rates and an assessment of its ability to do harm. By progressively refining models of cross-VM covert channels, from the derived maximums, to implementable channels in the lab, and finally to Amazon EC2 itself, we show how a variety of factors impact our ability to create effective channels. While we demonstrate a covert channel with considerably higher bit rate than previously reported, we assess that even at such improved rates, the harm of data exfiltration from these channels is still limited to the sharing of small, if important, secrets such as private keys.

The Case for Energy-Oriented Partial Desktop Migration
Kaustubh Joshi, Horacio Lagar-Cavilla, Matti Hiltunen, Nilton Bila, Eyal Lara, Mahadev Satyanarayanan
2nd USENIX Workshop on Hot Topics in Cloud Computing,
2010.
[BIB]
USENIX Copyright
The definitive version was published in the 2nd USENIX Workshop on Hot Topics in Cloud Computing, Usenix, 2010-06-22
Office and home environments are increasingly crowded with personal computers. Even though these computers see little use in the course of the day, they often remain powered, even when idle. Leaving idle PCs running is not only wasteful, but with rising energy costs it is increasingly expensive. We propose partial migration of idle desktop sessions into the cloud to achieve energy-proportional computing. Partial migration propagates only the small footprint of state that will be needed during idle-period execution, and returns the session to the PC when it is no longer idle. We show that this approach can reduce energy usage of an idle desktop by up to 50% over an hour and by up to 69% overnight. We also show that idle desktop sessions have small working sets, up to an order of magnitude smaller than their allocated memory, enabling significant consolidation ratios.

Probabilistic Model-Driven Recovery in Distributed Systems
Kaustubh Joshi, Matti Hiltunen, Richard Schlichting, William Sanders
IEEE Transactions on Dependable and Secure Computing,
2010.
[BIB]
Automatic system monitoring and recovery has the potential to provide effective, low-cost ways to improve dependability in distributed software systems. However, automating recovery is challenging in practice because accurate fault diagnosis is difficult given that common monitoring tools and techniques have low fault coverage, poor fault localization, detection delays, and false positives. In this paper, we present a holistic model-based approach that overcomes these challenges and enables automatic recovery in distributed systems. To do so, it uses theoretically sound techniques including Bayesian estimation and Markov decision theory to provide controllers that choose good, if not optimal, recovery actions according to a user-defined optimization criterion. By combining monitoring and recovery, the approach realizes benefits that could not have been obtained by using them in isolation. We experimentally validate our framework by fault injection on realistic e-commerce systems.
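
To make the combination of Bayesian estimation and cost-driven action selection concrete, here is a minimal Python sketch. It updates a belief over fault hypotheses from one imperfect monitor reading and then picks the recovery action with the lowest expected cost. The hypotheses, probabilities, actions, and costs are all invented for illustration, and the one-step greedy choice is a simplification of this sketch, not the multi-step Markov decision formulation used in the paper.

```python
def posterior(prior, likelihood, observation):
    """Bayes rule: P(h | obs) is proportional to P(obs | h) * P(h)."""
    unnormalized = {h: prior[h] * likelihood[h][observation] for h in prior}
    total = sum(unnormalized.values())
    return {h: w / total for h, w in unnormalized.items()}


def best_action(belief, cost):
    """Pick the recovery action with the lowest expected cost under the belief."""
    actions = list(next(iter(cost.values())).keys())
    expected = {a: sum(belief[h] * cost[h][a] for h in belief) for a in actions}
    return min(expected, key=expected.get), expected


# Hypothetical fault hypotheses and prior beliefs.
prior = {"ok": 0.90, "db_fault": 0.05, "web_fault": 0.05}

# Hypothetical monitor model: it raises false alarms when the system is
# healthy and misses some real faults (imperfect coverage).
likelihood = {
    "ok":        {"alarm": 0.1, "quiet": 0.9},
    "db_fault":  {"alarm": 0.9, "quiet": 0.1},
    "web_fault": {"alarm": 0.6, "quiet": 0.4},
}

# Hypothetical cost of taking each action when each hypothesis is true.
cost = {
    "ok":        {"wait": 0,  "restart_db": 5,  "restart_web": 2},
    "db_fault":  {"wait": 20, "restart_db": 5,  "restart_web": 22},
    "web_fault": {"wait": 20, "restart_db": 25, "restart_web": 2},
}

belief = posterior(prior, likelihood, "alarm")
action, expected_costs = best_action(belief, cost)
```

Because the monitor can raise false alarms, a single alarm shifts the belief toward the fault hypotheses without making any of them certain, which is why the action is chosen by expected cost rather than by the single most likely diagnosis.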

Nfsight: NetFlow-based Network Awareness Tool
Robin Berthier, Michel Cukier, Matti Hiltunen, David Kormann, Gregory Vesonder, Daniel Sheleheda
Proceedings of the 24th Large Installation System Administration Conference (LISA '10),
24th Large Installation System Administration Conference (USENIX LISA),
2010.
[PDF]
[BIB]
USENIX Copyright
The definitive version was published in LISA '10, 2010-11-07
Network awareness is critical for network and security administrators. It enables informed planning and management of network resources, as well as detection and a comprehensive understanding of malicious activity. It requires a set of tools to efficiently collect, process and represent network data. While many such tools already exist, there is a lack of a flexible and practical solution to visualize network activity at various granularities and to quickly gain insights about the status of network assets. To address this issue, we developed Nfsight, a Netflow processing and visualization application designed to offer a comprehensive network awareness solution. Nfsight leverages bidirectional flows to provide client/server identification and intrusion detection capabilities. In this paper, we present the internal architecture of Nfsight and an evaluation of its service and intrusion detection algorithms. We illustrate the contributions of Nfsight through several case studies conducted by security administrators on a large campus network.

CPU Gradients: Performance-aware Energy Conservation in Multitier Systems
Shuyi Chen, Kaustubh Joshi, Matti Hiltunen, Richard Schlichting, William Sanders
Proceedings of the 1st IEEE International Green Computing Conference,
IEEE International Green Computing Conference,
2010.
[BIB]
Dynamic voltage and frequency scaling (DVFS) and virtual machine (VM) based server consolidation are techniques that hold promise for energy conservation, but can also have adverse impacts on system performance. For the responsiveness-sensitive multitier applications running in today's data centers, queuing models should ideally be used to predict the impact of CPU scaling on response time, to allow appropriate runtime trade-offs between performance and energy use. In practice, however, such models are difficult to construct and thus are often abandoned for ad-hoc solutions. In this paper, an alternative measurement-based approach that predicts the impacts without requiring detailed application knowledge is presented. The approach uses a new set of metrics, the CPU gradients, that can be automatically measured on a running system using lightweight and nonintrusive CPU perturbations. The practical feasibility of the approach is demonstrated using extensive experiments on multiple multitier applications, and it is shown that simple energy controllers can use gradient predictions to derive as much as 50% energy savings while still meeting response time constraints.
Performance Aware Regeneration in Virtualized Multitier Applications
Kaustubh Joshi, Matti Hiltunen, Gueyoung Jung
2009.
[PDF]
[BIB]
Blackbox Prediction of the Impact of DVFS on End-to-End Performance of Multitier Systems
Matti Hiltunen, Kaustubh Joshi, Richard Schlichting, Shuyi Chen, William Sanders
2009.
[PDF]
[BIB]
A Cost-Sensitive Adaptation Engine for Server Consolidation of Multitier Applications
Matti Hiltunen, Kaustubh Joshi, Richard Schlichting, Gueyoung Jung, Calton Pu
2009.
[PDF]
[BIB]
Link Gradients: Predicting the Impact of Network Latency on Multi-Tier Applications
Kaustubh Joshi, Matti Hiltunen, Richard Schlichting, Shuyi Chen, William Sanders
2008.
[PDF]
[BIB]
Generating Adaptation Policies for Multi-Tier Applications in Consolidated Server Environments
Matti Hiltunen, Richard Schlichting, Kaustubh Joshi, Gueyoung Jung, Calton Pu
2007.
[PDF]
[BIB]
An Off-Line Approach for Generating On-Line Adaptation Policies
Matti Hiltunen, Kaustubh Joshi, Richard Schlichting, Gueyoung Jung, Calton Pu
2007.
[PDF]
[BIB]
Performability Optimization using Linear Bounds of Partially Observable Markov Decision Processes
Kaustubh Joshi, Matti Hiltunen, William Sanders
2005.
[PDF]
[BIB]
Peer-to-Peer Error Recovery for Hybrid Satellite-Terrestrial Networks
Matti Hiltunen, Richard Schlichting, Vinay Vaishampayan, Eric Weigle, Andrew Chien
2005.
[PDF]
[BIB]
Quantifying The Impact Of Network Latency On The End-To-End Response Time Of Distributed Applications,
Tue Dec 06 16:02:20 EST 2011
A method for measuring system response sensitivity, using live traffic and an analysis that converts randomly arriving stimuli and reactions to the stimuli to mean measures over chosen intervals, thereby creating periodically occurring samples that are processed. The system is perturbed in a chosen location of the system in a manner that is periodic with frequency p, and the system's response to arriving stimuli is measured at frequency p. The perturbation, illustratively, is with a square wave pattern.
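The measurement idea in this summary can be sketched in Python as follows. Randomly arriving requests are converted into evenly spaced interval means, a square-wave delay with frequency p is injected, and the response is examined only at frequency p. The simulated workload, the constants, and the use of a single Fourier component as the "measurement at frequency p" are assumptions of this sketch rather than details taken from the patent.

```python
import cmath
import math
import random


def square_wave(t, period, amplitude):
    """Injected perturbation: `amplitude` in the first half of each period, else 0."""
    return amplitude if (t % period) < period / 2 else 0.0


def interval_means(samples, interval, duration):
    """Convert randomly timed (time, value) samples into evenly spaced interval means."""
    n_bins = int(duration / interval)
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for t, value in samples:
        b = min(int(t / interval), n_bins - 1)
        sums[b] += value
        counts[b] += 1
    overall = sum(sums) / max(sum(counts), 1)
    # Empty intervals fall back to the overall mean (an assumption of this sketch).
    return [s / c if c else overall for s, c in zip(sums, counts)]


def component_at(series, interval, frequency):
    """Magnitude of the Fourier component of an evenly sampled series at `frequency`."""
    n = len(series)
    mean = sum(series) / n
    acc = sum((x - mean) * cmath.exp(-2j * math.pi * frequency * k * interval)
              for k, x in enumerate(series))
    return abs(acc) * 2 / n


random.seed(0)
period, amplitude, duration, interval = 10.0, 5.0, 200.0, 1.0
true_sensitivity = 2.0  # hypothetical: each unit of injected delay adds 2 units of response time

# Simulated live traffic: Poisson arrivals with noisy response times.
samples, t = [], 0.0
while True:
    t += random.expovariate(20.0)  # about 20 requests per time unit
    if t >= duration:
        break
    delay = square_wave(t, period, amplitude)
    samples.append((t, 30.0 + true_sensitivity * delay + random.gauss(0, 3)))

response_series = interval_means(samples, interval, duration)
# Sample the known perturbation at each interval's midpoint.
perturb_series = [square_wave((k + 0.5) * interval, period, amplitude)
                  for k in range(len(response_series))]

p = 1.0 / period
estimate = (component_at(response_series, interval, p)
            / component_at(perturb_series, interval, p))
```

Looking only at frequency p lets the small injected perturbation stand out from unrelated noise in live traffic, so `estimate` should land close to the simulated `true_sensitivity`.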
System And Method For Enforcing Application Security Policies Using Authenticated System Calls,
Tue Mar 22 16:01:58 EDT 2011
Disclosed is an approach to system call monitoring in which authenticated system calls from an application are easily verified by an operating system kernel. The authenticated system call may be a system call augmented with extra arguments, which specify the policy for that call as well as a cryptographic message authentication code (MAC) that guarantees the integrity of the policy and the system call arguments. This extra information is used by the operating system kernel to verify the system call with little processing overhead. Versions of the applications in which regular system calls have been replaced by authenticated calls are generated automatically by a trusted installer program that reads the application binary, uses static analysis to generate policies, and then rewrites the binary with the authenticated calls. As a result, hacker attacks, malicious software and the like are less likely to be successful in compromising any computers or networks that employ such authenticated system calls.
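The integrity check at the heart of this scheme can be illustrated with Python's standard `hmac` module. In this sketch a trusted installer computes a MAC over the call name, its policy, and its arguments, and a checker standing in for the kernel recomputes and compares it. The key, the policy string format, and the message encoding are hypothetical, and the sketch verifies integrity only; it does not interpret or enforce the policy itself.

```python
import hashlib
import hmac


def sign_call(key, syscall, args, policy):
    """Installer side: compute a MAC over the call name, its policy, and its arguments."""
    message = "\x00".join([syscall, policy, *map(str, args)]).encode()
    return hmac.new(key, message, hashlib.sha256).digest()


def verify_call(key, syscall, args, policy, mac):
    """Kernel side: recompute the MAC and compare it in constant time."""
    expected = sign_call(key, syscall, args, policy)
    return hmac.compare_digest(expected, mac)


# Hypothetical key and policy, chosen only for the example.
key = b"hypothetical-installer-key"
policy = "path-prefix=/var/app/"
mac = sign_call(key, "open", ["/var/app/data.txt", "O_RDONLY"], policy)

# The unmodified call verifies; a call with altered arguments does not.
ok = verify_call(key, "open", ["/var/app/data.txt", "O_RDONLY"], policy, mac)
tampered = verify_call(key, "open", ["/etc/shadow", "O_RDONLY"], policy, mac)
```

Since verification is a single MAC computation and comparison, the per-call check stays cheap, which matches the low processing overhead described above.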
Methods And Systems For Transferring Data Over Electronic Networks,
Tue Dec 01 15:38:00 EST 2009
Methods and systems for managing the transfer of large data files across electronic data networks optimally in accordance with the desired results of the users. The present invention takes into consideration the user-defined transfer requirements, the data characteristics, and the characteristics of the entirety of the network, including both the access links and the backbone, as well as the processing and storage resources in the backbone. The present invention enables users to transfer data more optimally within the limitations of the existing network capabilities, negating requirements to update local or remote network facilities.
Systems, Devices, And Methods For Initiating Recovery,
Tue May 19 15:38:39 EDT 2009
Certain exemplary embodiments comprise a method that can comprise receiving information indicative of a fault from a monitor associated with a network. The method can comprise, responsive to the information indicative of the fault, automatically determining a probability of a fault hypothesis. The method can comprise, responsive to a determination of the fault hypothesis, automatically initiating a recovery action to correct the fault.