
Automatically Inferring Changes in Malicious Activity on the Internet
Shobha Venkataraman, David Brumley, Subhabrata Sen, Oliver Spatscheck
NDSS,
2013.
[PDF]
[BIB]
ISOC Copyright
The definitive version was published in NDSS, 2013-02-24. http://www.cs.unc.edu/~amw/ndss2013/ISOC_copyright_forms.txt
{Internet-based services routinely contend with a range of malicious activity (e.g., spam, scans, botnets) that can potentially arise from virtually any part of the global Internet infrastructure and that can shift longitudinally over time. In this paper, we address the challenging problem of discovering changes in malicious activity across the Internet. We take the approach of automatically inferring the partitions of the address space that can distinguish regions originating malicious activity from those that do not, based on input data. Conceptually, we learn a decision tree over the IP address space that automatically determines the appropriate (and potentially different) aggregation levels for monitoring different parts of the Internet. In our prior work, we developed an algorithmic framework to create "static" Internet-scale decision tree snapshots of malicious activity. In this paper, we build on this framework and address the problem of discovering the evolution of malicious activity as one of detecting how the tree has evolved. Our evaluations using a large corpus of mail data indicate that our algorithms are fast, can keep up with Internet-scale traffic data, and can extract changes in sources of spam activity substantially better than approaches based on predetermined levels of aggregation such as BGP-based network-aware clusters. Using our algorithms, we find that some regions of the Internet are prone to much faster changes than others, which suggests that IP-based spam filtering mechanisms should take these dynamics into account to be most effective.}
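
Conceptually, the tree described in the abstract splits the IPv4 address space bit by bit until regions are (nearly) pure in their malicious-activity labels. The following is a minimal offline sketch of that idea, not the paper's algorithm; the purity threshold, minimum sample count, and input format are invented for illustration.

```python
# Minimal sketch: greedily grow a decision tree over the IPv4 address
# space from labeled (ip_as_int, is_malicious) samples. Each internal
# node splits on the next most-significant address bit; a branch stops
# when its samples are nearly pure. Thresholds are illustrative only.
import ipaddress

PURITY = 0.9     # stop refining once 90% of samples under a node agree
MIN_COUNT = 4    # do not split regions with too few observations

def grow(samples, depth=0, prefix=0):
    """samples: non-empty list of (ip_int, label) with label in {0, 1}."""
    total = len(samples)
    bad = sum(label for _, label in samples)
    if depth == 32 or total < MIN_COUNT or \
       max(bad, total - bad) / total >= PURITY:
        net = ipaddress.ip_network((prefix << (32 - depth), depth))
        return {"prefix": str(net), "malicious_rate": bad / total}
    bit = 31 - depth  # next most-significant address bit
    left = [(ip, l) for ip, l in samples if not (ip >> bit) & 1]
    right = [(ip, l) for ip, l in samples if (ip >> bit) & 1]
    if not left or not right:  # no useful split on this bit; descend anyway
        return grow(left or right, depth + 1,
                    (prefix << 1) | (1 if right else 0))
    return {"0": grow(left, depth + 1, prefix << 1),
            "1": grow(right, depth + 1, (prefix << 1) | 1)}
```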

Making Sense of Customer Tickets in Cellular Networks
Yu Jin, Nicholas Duffield, Alexandre Gerber, Patrick Haffner, Wen Hsu, Guy Jacobson, Shobha Venkataraman, Zhi-Li Zhang, Subhabrata Sen
in Proc. IEEE INFOCOM Mini-Conference,
2011.
[PDF]
[BIB]
{Effective management of large-scale cellular data networks is critical to meet customer demands and expectations. Customer calls for technical support provide a direct indication of the issues and problems customers encounter. In this paper we study the customer tickets (free-text recordings and classifications by customer support agents) collected at a large cellular network provider, with two inter-related goals: i) to characterize and understand the major factors which lead customers to call and seek support; and ii) to utilize such customer tickets to help identify potential network problems. For this purpose, we develop a novel statistical approach to model customer call rates that accounts for customer-side factors (e.g., user tenure and handset types) as well as geo-locations. We show that most calls are due to customer-side factors and can be well captured by the model. Furthermore, we also demonstrate that location-specific deviations from the model provide a good indicator of potential network-side issues. The latter is corroborated with a detailed analysis of customer tickets and other independent data sources (non-ticket customer feedback and network performance data).}
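
The abstract does not spell out the model's form; as a hedged illustration of the general approach, the sketch below estimates per-customer call rates from customer-side factors alone, then flags locations whose observed call volume deviates sharply from that prediction. All field names and the Poisson-style deviation test are assumptions.

```python
# Hedged sketch (not the paper's actual model): predict call volume per
# location from customer-side factors, then flag locations where observed
# volume exceeds the prediction by several standard deviations.
from collections import defaultdict
import math

def fit_rates(customers):
    """customers: iterable of dicts with 'tenure_bucket', 'handset',
    'calls', 'location' keys (assumed schema)."""
    totals, counts = defaultdict(float), defaultdict(int)
    for c in customers:
        key = (c["tenure_bucket"], c["handset"])
        totals[key] += c["calls"]
        counts[key] += 1
    return {k: totals[k] / counts[k] for k in totals}

def flag_locations(customers, rates, z_threshold=3.0):
    """Flag locations whose observed calls exceed the customer-side
    prediction by more than z_threshold Poisson standard deviations."""
    observed, expected = defaultdict(float), defaultdict(float)
    for c in customers:
        observed[c["location"]] += c["calls"]
        expected[c["location"]] += rates[(c["tenure_bucket"], c["handset"])]
    flagged = []
    for loc in expected:
        if expected[loc] > 0:
            z = (observed[loc] - expected[loc]) / math.sqrt(expected[loc])
            if z > z_threshold:
                flagged.append((loc, z))
    return sorted(flagged, key=lambda t: -t[1])
```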

Large-scale App-based Reporting of Customer Problems in Cellular Networks: Potential and Limitations
Yu Jin, Nicholas Duffield, Alexandre Gerber, Patrick Haffner, Wen Hsu, Guy Jacobson, Subhabrata Sen, Shobha Venkataraman, Zhi-Li Zhang
in Proc. ACM SIGCOMM Workshop on Measurements Up the STack (W-MUST),
2011.
[PDF]
[BIB]
ACM Copyright
(c) ACM, 2011. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM SIGCOMM Workshop on Measurements Up the STack (W-MUST), 2011-08-19.
{Multidimensional distributions are often used in data mining to describe and summarize different features of large datasets. It is natural to look for distinct classes in such datasets by clustering the data. A common approach entails the use of methods like k-means clustering. However, the k-means method inherently relies on the Euclidean metric in the embedded space and does not account for additional topology underlying the distribution.
In this paper, we propose using Earth Mover Distance (EMD) to compare multidimensional distributions. For an n-bin histogram, the EMD is based on a solution to the transportation problem with time complexity O(n³ log n). To mitigate the high computational cost of EMD, we propose an approximation that reduces the cost to linear time.
Other notions of distances, such as the information-theoretic Kullback-Leibler divergence and statistical χ² distance, account only for the correspondence between bins with the same index, do not use information across bins, and are sensitive to bin size. A cross-bin distance measure like EMD is not affected by binning differences and meaningfully matches the perceptual notion of “nearness”.
Our technique is simple, efficient and practical for clustering distributions. We demonstrate the use of EMD on a practical application of clustering over 400,000 anonymous mobility usage patterns, which are defined as distributions over a manifold. EMD allows us to represent inherent relationships in this space. We show that EMD allows us to successfully cluster even sparse signatures, and we compare the results with other clustering methods. Given the large size of our dataset, a fast approximation is crucial for this application.}
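
For intuition about why cross-bin distances can be cheap: in the special case of one-dimensional histograms over the same uniformly spaced bins, EMD has a closed form as the L1 distance between cumulative sums, computable in linear time. The sketch below shows only this special case; the paper's linear-time approximation for manifold-valued distributions is a different, more general construction.

```python
# Illustrative 1-D special case (not the paper's approximation): for two
# histograms on identical, uniformly spaced bins with equal total mass,
# EMD equals the L1 distance between cumulative distributions, in O(n).
def emd_1d(p, q):
    assert len(p) == len(q)
    emd, cum = 0.0, 0.0
    for pi, qi in zip(p, q):
        cum += pi - qi   # net mass that must flow past this bin boundary
        emd += abs(cum)
    return emd

# Example: shifting one unit of mass one bin to the right costs 1.0.
print(emd_1d([1, 0, 0], [0, 1, 0]))  # 1.0
```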

Internet-scale Visualization and Detection of Performance Events
Shobha Venkataraman, Jeffrey Pang, Subhabrata Sen, Oliver Spatscheck
Usenix Annual Technical Conference,
2011.
[PDF]
[BIB]
USENIX Copyright
The definitive version was published in Proceedings of the USENIX Annual Technical Conference, 2011-06-15.
{Operators typically monitor the performance of network server farms using rule-based scripts to automatically flag "events of interest" on an array of active and passive performance measurement feeds. However, such automatic detection is typically limited to events with known properties. A different challenge involves detecting the "unknown unknowns" -- the events of interest whose properties are unknown and therefore cannot be defined beforehand. Visualization can significantly aid the rapid discovery of such unknown patterns, as network operators, with domain expertise, may quickly notice unexpected shifts in traffic patterns when represented visually. However, the volume of Internet-wide raw performance data can easily overwhelm human comprehension, and therefore an effective visualization needs to be sparse in representation, yet discriminating of good and poor performance.
This paper presents a tool that can be used to visualize performance metrics at Internet scale. At its core, the tool builds decision trees over the IP address space using performance measurements, so that IP addresses with similar performance characteristics are clustered together, and those with significant performance differences are separated. These decision trees need to be dynamic -- i.e., learnt online, adapting to changes in the underlying network. We build these adaptive decision trees by extending online decision-tree learning algorithms to the unique challenges of classifying performance measurements across the Internet, and our tool then visualizes these adaptive decision trees, distinguishing parts of the network with good performance from those with poor performance. We show that the differences in the visualized decision trees help us quickly discover new patterns of usage and novel anomalies in latency measurements at a large server farm.}
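
The adaptive trees described above are learnt online, refining prefixes as measurements arrive. A minimal online sketch of that idea follows; the split test (comparing latency means of the two child halves) and the thresholds are invented for illustration and are not the tool's actual learning algorithm.

```python
# Online sketch (assumptions, not the tool's algorithm): each node keeps
# running latency statistics for an IP prefix; a leaf splits on the next
# address bit once its two halves' mean latencies clearly diverge.
class Node:
    def __init__(self, depth=0):
        self.depth = depth        # prefix length this node covers
        self.n = [0, 0]           # observation count per child half
        self.total = [0.0, 0.0]   # latency sum per child half
        self.children = None

    def update(self, ip, latency, min_n=50, gap_ms=20.0):
        bit = (ip >> (31 - self.depth)) & 1
        if self.children:
            self.children[bit].update(ip, latency, min_n, gap_ms)
            return
        self.n[bit] += 1
        self.total[bit] += latency
        # Split when both halves are well observed and clearly different.
        if self.depth < 31 and min(self.n) >= min_n:
            means = [self.total[i] / self.n[i] for i in (0, 1)]
            if abs(means[0] - means[1]) > gap_ms:
                self.children = [Node(self.depth + 1), Node(self.depth + 1)]
```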

Identifying Diverse Usage Behaviors of Smartphone Apps
Qiang Xu, Alexandre Gerber, Z. Morley Mao, Jeffrey Pang, Shobha Venkataraman, Jeffrey Erman
in Proc. of ACM Internet Measurement Conference (IMC),
2011.
[PDF]
[BIB]
ACM Copyright
(c) ACM, 2011. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in Proc. of ACM Internet Measurement Conference (IMC), 2011-11-01.
{Smartphone users increasingly use apps as gateways to Internet services, rather than traditional web browsers. Application marketplaces for iOS, Android, and Windows Mobile platforms have made it attractive for developers to deploy apps and easy for users to discover and start using many network-enabled apps quickly. For example, it was recently reported that the iOS AppStore has more than 350K apps and more than 10 billion downloads. Furthermore, the appearance of tablets and mobile devices with other form factors, which also use these marketplaces, has increased the diversity in apps and their user population. Despite the increasing importance of apps as gateways to network services, we have a much sparser understanding of how, where, and when they are used compared to traditional web services, particularly at scale. This paper takes a first step in addressing this knowledge gap by presenting results on smartphone app usage at a national level using anonymized network measurements from a tier-1 cellular carrier in the U.S. We identify traffic from distinct marketplace apps based on HTTP signatures and present aggregate results on their spatial and temporal prevalence, locality, and correlation.}
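
The abstract's identification step matches fields of HTTP requests against per-app signatures. The sketch below illustrates the general idea only; the signature table and app names are invented examples, not the paper's actual signatures.

```python
# Hedged sketch of signature-based app identification: match the HTTP
# User-Agent header against per-app patterns. The table below is a
# hypothetical example, not the signatures used in the paper.
import re

SIGNATURES = {
    "example-weather": re.compile(r"^WeatherApp/\d+\.\d+ \(iPhone"),
    "example-radio":   re.compile(r"^RadioPlayer/\d+ Android"),
}

def identify_app(user_agent: str) -> str:
    for app, pattern in SIGNATURES.items():
        if pattern.search(user_agent):
            return app
    return "unknown"

print(identify_app("WeatherApp/3.1 (iPhone; iOS 4.3)"))  # example-weather
```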

BitShred: Feature Hashing Malware for Scalable Triage and Semantic Analysis
Jiyong Jang, David Brumley, Shobha Venkataraman
ACM Conference on Computer and Communications Security,
2011.
[PDF]
[BIB]
ACM Copyright
(c) ACM, 2011. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM Conference on Computer and Communications Security , 2011-10-04.
{The sheer volume of new malware found each day is enormous. Worse, current trends show the amount of malware is doubling each year. This exponential growth is quickly outpacing existing malware analysis techniques. As a result, defenders in practice are constantly faced with hard choices about which malware to analyze and how much analysis should be done given the limited computing resources available.
In this paper we propose efficient techniques for identifying malware families and variants in order to provide scalable malware triage capabilities. By providing the capability to quickly compare and cluster malware into families, we enable defenders to make more intelligent choices on which subsequent (potentially expensive) analysis to perform, and on which samples. In particular, we do not claim 100% accuracy, but instead strive for a balance that maintains high accuracy for common variants while providing significantly better scalability than previous approaches.
At the core of our work is an algorithm called BitShred for fast similarity detection within binary code. We have implemented BitShred, and show that it is several factors to several orders of magnitude faster than previous approaches and can take advantage of distributed resources such as Hadoop, all while offering accuracy at identifying malware families similar to previous approaches. We also note that our techniques are applicable in other settings, e.g., measuring similarity based upon dynamic traces, and automatic code reuse detection (such as plagiarism) in binary code.}
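
Per the abstract, BitShred's core is fast pairwise similarity over binary code. One common way to realize feature hashing for this purpose is sketched below; the hash function, n-gram size, and fingerprint width are illustrative assumptions, not BitShred's actual design.

```python
# Sketch of feature hashing for similarity: hash byte n-grams of a sample
# into a fixed-size bit-vector fingerprint, then estimate similarity of
# two samples as Jaccard over their fingerprints via bit counting.
import hashlib

M = 4096  # fingerprint size in bits (illustrative choice)

def fingerprint(data: bytes, n: int = 4) -> int:
    fp = 0
    for i in range(len(data) - n + 1):
        gram = data[i:i + n]
        h = int.from_bytes(hashlib.blake2b(gram, digest_size=8).digest(), "big")
        fp |= 1 << (h % M)
    return fp

def jaccard(a: int, b: int) -> float:
    inter = bin(a & b).count("1")
    union = bin(a | b).count("1")
    return inter / union if union else 1.0

fp1 = fingerprint(b"push ebp; mov ebp, esp; sub esp, 0x40")
fp2 = fingerprint(b"push ebp; mov ebp, esp; sub esp, 0x80")
print(jaccard(fp1, fp2))  # high similarity for near-identical code
```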

Speed Testing without Speed Tests: Estimating Achievable Download Speed from Passive Measurements
Jeffrey Pang, Shobha Venkataraman, Oliver Spatscheck, Alexandre Gerber
in Proc. of ACM Internet Measurement Conference (IMC),
2010.
[PDF]
[BIB]
ACM Copyright
(c) ACM, 2010. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution.
The definitive version was published in ACM Internet Measurement Conference , 2010-11-01.
{How fast is the network? The speed at which real users can download content at different locations and at different times is an important metric for service providers. Knowledge of this speed helps determine where to provision more capacity and helps detect network problems. However, most network-level estimates of these speeds today are obtained using active "speed tests" that place substantial load on the network and are not necessarily representative of actual user experiences due to limited vantage points. These problems are exacerbated in wireless networks, where the physical locations of users play an important role in performance. To redress these problems, this paper presents a new technique to estimate achievable download speed using only flow records collected passively. Estimating achievable speed passively is non-trivial because the measured throughput of real flows is often not comparable to the achievable steady-state TCP rate. This can be because, for example, flows are small and never exit TCP slow start, or are rate-limited by the content provider. Our technique addresses these issues by constructing a Throughput Index, a list of flow types that accurately estimate achievable speed. We show that our technique estimates achievable throughput more accurately than other techniques in a large 3G wireless network.}
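
The Throughput Index idea, as described, keeps only flows whose types are known to reach steady-state TCP rates and aggregates their measured throughputs. The sketch below illustrates this under assumed field names and an invented index; the paper's actual index construction is more involved.

```python
# Hedged illustration (field names and index contents are assumptions):
# estimate achievable download speed from passive flow records, keeping
# only flow types large enough to reflect steady-state TCP rates.
from statistics import median

# Hypothetical Throughput Index: (flow type, minimum bytes) pairs whose
# measured throughput is assumed to track achievable speed.
THROUGHPUT_INDEX = {("video", 1_000_000), ("download", 500_000)}

def achievable_speed(flows):
    """flows: dicts with 'type', 'bytes', 'duration_s' keys (assumed)."""
    rates = [f["bytes"] * 8 / f["duration_s"]   # bits per second
             for f in flows
             if f["duration_s"] > 0 and any(
                 f["type"] == t and f["bytes"] >= b
                 for t, b in THROUGHPUT_INDEX)]
    return median(rates) if rates else None
```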

Tracking Dynamic Sources of Malicious Activity at Internet-Scale
Shobha Venkataraman, Subhabrata Sen, Oliver Spatscheck, Avrim Blum, Dawn Song
Neural Information Processing Systems (NIPS),
2009.
[PDF]
[BIB]
{We formulate and address the problem of discovering dynamic malicious regions on the Internet. We model this problem as one of adaptively pruning a known decision tree, but with additional challenges: (1) severe space requirements, since the underlying decision tree has over 4 billion leaves, and (2) a changing target function, since malicious activity on the Internet is dynamic. We present a novel algorithm that addresses this problem by putting together a number of different "experts" algorithms and online paging algorithms. We prove guarantees on our algorithm's performance as a function of the best possible pruning of a similar size, and our experiments show that our algorithm achieves high accuracy on large real-world data sets, with significant improvements over existing approaches.}
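
The abstract combines "experts" algorithms with online paging to track a drifting target. As a greatly simplified illustration of the experts component only (multiplicative weights plus a share step so the learner can follow drift; none of this is the paper's actual construction), consider:

```python
# Simplified sketch (not the paper's algorithm): treat each candidate
# labeling of a region as an expert, penalize mistaken experts with
# multiplicative-weights updates, and redistribute a small fraction of
# weight each round ("share") so the scheme can track a drifting target.
def update_weights(weights, losses, eta=0.5, share=0.01):
    """weights: dict expert -> weight; losses: dict expert -> 0/1 loss."""
    for e in weights:
        weights[e] *= (1 - eta) ** losses[e]  # penalize mistaken experts
    total = sum(weights.values())
    n = len(weights)
    # Share step: a formerly poor expert can recover after the target shifts.
    for e in weights:
        weights[e] = (1 - share) * weights[e] + share * total / n
    return weights
```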