
180 Park Ave - Building 103
Florham Park, NJ
http://cormode.org
Method And Apparatus For Monitoring Functions Of Distributed Data,
November 6, 2012
A method and system for monitoring computer network activity, including determining a first phase frequency estimate, associated with a first frequency vector, in response to receiving first bits from a first plurality of remote computer network devices. The first bits are received from the first plurality of remote devices in response to a first activity threshold being satisfied. A second phase frequency estimate, associated with a second frequency vector, is determined in response to receiving second bits from a second plurality of remote devices; the second bits are received in response to a second activity threshold being satisfied. The second phase frequency estimate is determined in response to the first phase frequency estimate exceeding a global threshold. Finally, a frequency moment $F_p$ is provided in response to the second phase frequency estimate exceeding a refined threshold.
Methods And Apparatus To Determine Statistical Dominance Point Descriptors For Multidimensional Data,
April 17, 2012
Methods and apparatus to determine statistical dominance point descriptors for multidimensional data are disclosed. An example method determines a first joint dominance value for a first data point in a multidimensional data set, where the data points comprise multidimensional values and each dimension corresponds to a different measurement of a physical event; the first joint dominance value corresponds to the number of data points in the set dominated by the first data point in every dimension. The method also determines a first skewness value for the first data point, corresponding to the size of a first dimension of the first data point relative to the combined size of all of its dimensions, and combines the first joint dominance and first skewness values to determine a first statistical dominance point descriptor associated with the first data point.
Methods And Apparatus For Representing Probabilistic Data Using A Probabilistic Histogram,
March 27, 2012
Methods and apparatus for representing probabilistic data using a probabilistic histogram are disclosed. An example method partitions a plurality of ordered data items into a plurality of buckets, where each data item can take a data value from a plurality of possible data values with a probability characterized by a respective individual probability distribution function (PDF), and each bucket is associated with a respective subset of the ordered data items bounded by a respective beginning data item and a respective ending data item. A first representative PDF is then determined for a first bucket associated with a first subset of the ordered data items by partitioning the plurality of possible data values into a first plurality of representative data ranges and respective representative probabilities, based on an error between the first representative PDF and a first plurality of individual PDFs characterizing the first subset of the ordered data items.
Verification Of Outsourced Data Streams,
February 7, 2012
Embodiments disclosed herein are directed to verifying the query results of an untrusted server. A data owner outsources a data stream to the untrusted server, which is configured to respond to a query from a client with a query result. The data owner can maintain a vector associated with the query results returned by the server and can generate a verification synopsis using the vector and a seed. The verification synopsis includes a polynomial whose coefficients are determined based on the seed. The data owner outputs the verification synopsis and the seed to the client for verification of the query results.
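
As a rough illustration only, and not necessarily the construction claimed in the
patent: a verification synopsis of this kind can be built from a polynomial
fingerprint of the answer vector, where the seed fixes a random evaluation point
and any single discrepancy in the server's claimed result changes the fingerprint
with high probability. All names and parameters below are illustrative.

    import random

    def fingerprint(vector, seed, prime=(1 << 61) - 1):
        """Polynomial fingerprint: evaluate sum_i v[i] * r^i mod prime, where the
        evaluation point r is derived from the shared seed."""
        r = random.Random(seed).randrange(1, prime)
        acc, power = 0, 1
        for v in vector:
            acc = (acc + v * power) % prime
            power = (power * r) % prime
        return acc

    # Data owner: maintains the true answer vector while streaming to the server,
    # then releases only (seed, synopsis) to the client.
    truth = [3, 0, 7, 2, 5]
    seed = 2013
    synopsis = fingerprint(truth, seed)

    # Client: checks the server's claimed query result against the synopsis.
    claimed = [3, 0, 7, 2, 5]                        # result returned by the untrusted server
    assert fingerprint(claimed, seed) == synopsis    # any altered entry is caught w.h.p.
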
Method And Apparatus For Monitoring Functions Of Distributed Data,
December 13, 2011
This invention discloses continuous functional monitoring of distributed network activity using algorithms based on the frequency moments $F_p = \sum_i m_i^p$. The frequency moment calculations are used to raise an alarm when a value exceeds a certain threshold. Frequency moments for $p = 0$, $1$, and $2$ are described.
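
For concreteness, these moments have simple exact forms over a multiset of items:
$F_0$ is the number of distinct items, $F_1$ the total number of items, and $F_2$
the sum of squared frequencies. The sketch below computes them exactly and
centrally, only to illustrate the quantities being monitored; the patent concerns
approximating them continuously over distributed sites.

    from collections import Counter

    def frequency_moment(stream, p):
        """Exact F_p = sum_i m_i^p over the item frequencies m_i."""
        counts = Counter(stream)
        if p == 0:
            return len(counts)       # F_0 counts the distinct items
        return sum(m ** p for m in counts.values())

    stream = ["a", "b", "a", "c", "a", "b"]
    print(frequency_moment(stream, 0))   # 3 distinct items
    print(frequency_moment(stream, 1))   # 6 items in total
    print(frequency_moment(stream, 2))   # 3^2 + 2^2 + 1^2 = 14
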
System And Method For Encoding A Signal Using Compressed Sensor Measurements,
January 4, 2011
Described is a system and method for receiving a signal for transmission and encoding the signal into a plurality of linear projections representing the signal. The encoding includes defining a transform matrix. The transform matrix is defined by processing the signal using a macroseparation matrix, a microseparation matrix, and an estimation vector.
System And Method For Generating Statistical Descriptors For A Data Stream,
February 2, 2010
Described is a system and method for receiving a data stream of multi-dimensional items, collecting a sample of the data stream having a predetermined number of items and dividing the sample into a plurality of subsamples, each subsample corresponding to a single dimension of each of the predetermined number of items. A query is then executed on a particular item in at least two of the subsamples to generate data for the corresponding subsample. This data is combined into a single value.
System and Method For Identifying Hierarchical Heavy Hitters In A Multidimensional Environment,
September 15, 2009
A method including receiving a plurality of elements of a data stream; storing a multi-dimensional data structure in a memory, the multi-dimensional data structure storing the plurality of elements as a hierarchy of nodes, each node having a frequency count corresponding to the number of elements stored therein; comparing the frequency count of each node to a threshold value based on the total number of elements stored in the nodes; identifying each node whose frequency count is at least as great as the threshold value as a hierarchical heavy hitter (HHH) node; and propagating the frequency count of each non-HHH node to its parent node.
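
As an exact, offline illustration of the propagation rule described above (the
patented method works over a stream with bounded memory), items labelled by
hierarchical prefixes can be processed bottom-up: a node whose accumulated count
reaches the threshold is reported as an HHH, and otherwise its count is passed
to its parent. The label format and function names below are illustrative.

    from collections import Counter

    def hierarchical_heavy_hitters(items, phi):
        """Bottom-up HHH computation over dotted labels, e.g. '1.2.3' -> '1.2' -> '1'."""
        counts = Counter(items)
        threshold = phi * sum(counts.values())
        hhh = {}
        max_depth = max(label.count(".") for label in counts)
        for depth in range(max_depth, -1, -1):
            for label in [l for l in counts if l.count(".") == depth]:
                c = counts[label]
                if c >= threshold:
                    hhh[label] = c                              # report as an HHH node
                elif depth > 0:
                    counts[label.rsplit(".", 1)[0]] += c        # propagate to the parent
        return hhh

    stream = ["1.2.3"] * 2 + ["1.2.4"] * 2 + ["1.5.6"] + ["7.8.9"] * 5
    print(hierarchical_heavy_hitters(stream, phi=0.4))   # {'7.8.9': 5, '1.2': 4}
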
System And Method For Decoding A Signal Using Compressed Sensor Measurements,
September 1, 2009
Described is a system and method for receiving a plurality of linear projections representing a signal and decoding the plurality of linear projections to produce a reconstruction of the signal. The decoding includes defining a transform matrix, reconstructing coefficients, and transforming a dictionary into a modified dictionary for a vector of real numbers. The transform matrix is defined using a macroseparation matrix, a microseparation matrix, and an estimation vector.
System and method for encoding a signal using compressed sensor measurements,
November 11, 2008
Described is a system and method for receiving a signal for transmission and encoding the signal into a plurality of linear projections representing the signal. The encoding includes defining a transform matrix. The transform matrix is defined by processing the signal using a macroseparation matrix, a microseparation matrix, and an estimation vector.
On Unifying the Space of $l_0$-Sampling Algorithms
Graham Cormode, Donatella Firmani
SIAM Meeting on Algorithm Engineering and Experiments,
2013.
[PDF]
[BIB]
The problem of building an $l_0$-sampler is to sample
near-uniformly from the support set of a dynamic multiset.
This problem has a variety of applications within data analysis,
computational geometry and graph algorithms.
In this paper, we abstract a set of steps for building an $l_0$-sampler, based on sampling, recovery and selection.
We analyze the implementation of an $l_0$-sampler within this
framework, and show how prior constructions of
$l_0$-samplers can all be expressed in terms of these steps.
Our experimental contribution is to provide a first detailed study of the
accuracy and computational cost of $l_0$-samplers.
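
A toy rendering of the sampling/recovery/selection outline is sketched below,
assuming hash-based level assignment and storing each level exactly; the
constructions surveyed in the paper replace the per-level dictionaries with small
sparse-recovery sketches, and all parameter choices here are illustrative.

    import hashlib
    from collections import defaultdict

    def level(key, max_level=32):
        """Sampling step: hash each key to a geometric level (half survive each level)."""
        h = int(hashlib.sha1(str(key).encode()).hexdigest(), 16)
        for j in range(max_level):
            if h & (1 << j):
                return j
        return max_level

    class L0Sampler:
        def __init__(self, max_level=32):
            self.levels = [defaultdict(int) for _ in range(max_level + 1)]

        def update(self, key, delta):
            lv = self.levels[level(key)]
            lv[key] += delta
            if lv[key] == 0:
                del lv[key]                 # deletions can shrink the support

        def sample(self, s=4):
            # Recovery + selection: find the deepest level holding at most s support
            # keys and return a canonical one; None mirrors the small failure probability.
            for lv in reversed(self.levels):
                if 0 < len(lv) <= s:
                    return min(lv, key=lambda k: hashlib.sha1(str(k).encode()).hexdigest())
            return None

    sampler = L0Sampler()
    for key in range(1000):
        sampler.update(key, +1)
    for key in range(500):
        sampler.update(key, -1)             # net support is {500, ..., 999}
    print(sampler.sample())
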
Finding Interesting Correlations with Conditional Heavy Hitters
Graham Cormode, Katsiaryna Mirylenka, Themis Palpanas, Divesh Srivastava
International Conference on Data Engineering (ICDE),
2013.
[PDF]
[BIB]
The notion of {\em heavy hitters}---items that make up a large
fraction of the population---has been successfully used in a variety
of applications across sensor and RFID monitoring, network data
analysis, event mining, and more. Yet this notion often fails to
capture the semantics we desire when we observe data in the form of
correlated pairs. Here, we are interested in items that are {\em
conditionally} frequent: when a particular item is frequent within
the context of its parent item. In this work, we introduce and
formalize the notion of Conditional Heavy Hitters to identify such
items, with applications in network monitoring, and Markov chain modeling.
We introduce several streaming algorithms that allow us to find
conditional heavy hitters efficiently, and provide analytical results.
Different algorithms are successful for different input
characteristics. We perform an experimental evaluation to demonstrate
the efficacy of our methods, and to study which algorithms are most
suited for different types of data.
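
To make the notion concrete, the sketch below computes conditional heavy hitters
exactly and offline under one natural reading of the definition (a child is
reported if it accounts for at least a $\phi$ fraction of its parent's
occurrences); the paper studies streaming approximations and variants of this
definition, and the threshold is illustrative.

    from collections import Counter

    def conditional_heavy_hitters(pairs, phi):
        """Report (parent, child) pairs where the child makes up at least a phi
        fraction of its parent's occurrences, i.e. P(child | parent) >= phi."""
        pair_counts = Counter(pairs)
        parent_counts = Counter(parent for parent, _ in pairs)
        return {pc: c for pc, c in pair_counts.items()
                if c >= phi * parent_counts[pc[0]]}

    transitions = [("a", "x")] * 8 + [("a", "y")] * 2 + [("b", "z")]
    print(conditional_heavy_hitters(transitions, phi=0.5))
    # {('a', 'x'): 8, ('b', 'z'): 1} -- 'z' is conditionally frequent given 'b',
    # even though the pair ('b', 'z') is rare overall.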

Empirical Privacy and Empirical Utility of Anonymized Data
Graham Cormode, Cecilia M. Procopiuc, Entong Shen, Divesh Srivastava, Ting Yu
Privacy-Preserving Data Publication and Analysis (PrivDB),
2013.
[BIB]
Procedures to anonymize data sets are vital for companies, government
agencies and other bodies to meet their obligations to share data
without compromising the privacy of the individuals contributing to it.
Despite much work on this topic, the area has not yet reached stability.
Early models ($k$-anonymity and $\ell$-diversity) are now thought to
offer insufficient privacy.
Noise-based methods like differential privacy are seen as providing
stronger privacy, but less utility.
However, across all methods
sensitive information of some individuals can often be
inferred with relatively high accuracy.
In this paper, we reverse the idea of a `privacy attack,' by incorporating it
into a measure of privacy.
Hence, we advocate the notion of {\em empirical privacy}, based on the posterior beliefs of an adversary,
and their ability to draw inferences about sensitive values in the
data.
This is not a new model, but rather a unifying view:
it allows us to study several well-known privacy models
which are not directly comparable otherwise.
We also consider an empirical approach to measuring utility, based on
a workload of queries.
Consequently, we are able to place different privacy models including
differential privacy and early syntactic models on
the same scale, and compare their privacy/utility tradeoff.
We learn that, in practice, the difference between differential
privacy and various syntactic models
is less dramatic than previously thought, but there are still clear
domination relations between them.

Accurate and Efficient Private Release of Datacubes and Contingency Tables
Graham Cormode, Cecilia M. Procopiuc, Divesh Srivastava, Grigory Yaroslavtsev
International Conference on Data Engineering (ICDE),
2013.
[PDF]
[BIB]
A central problem in releasing aggregate information about
sensitive data is to do so accurately
while providing a privacy guarantee on the output.
Recent work focuses on the class of {\em linear queries}, which include basic
counting queries, data cubes, and contingency tables. The goal is to
maximize the utility of their output, while giving a rigorous privacy
guarantee.
Most results follow a common template: pick a ``strategy'' set of linear
queries to apply to the data, then use the noisy answers to these
queries to reconstruct the queries of interest.
This entails either picking a strategy set that is hoped to be good
for the queries, or performing a costly search over the space
of all possible strategies.
In this paper, we propose a new approach that balances accuracy and
efficiency:
we show how to improve the accuracy of a given query set by
answering some strategy queries more accurately than others.
This leads to an efficient optimal noise allocation for many popular
strategies,
including wavelets, hierarchies, Fourier coefficients and more.
For the important case of marginal queries we show that this strictly improves on previous
methods, both analytically and empirically.
Our results also extend to ensuring that the returned query answers
are consistent with an (unknown) data set at minimal extra cost in
terms of time and noise.

Verifying Computations with Streaming Interactive Proofs
Graham Cormode, Justin Thaler, Ke Yi
International Conference on Very Large Data Bases (VLDB),
2012.
[PDF]
[BIB]
When computation is outsourced, the data owner would like to be
reassured that the desired computation has been performed correctly by
the service provider.
In theory, methods based on proof systems can give the data
owner the necessary assurance, but previous work does not give a
sufficiently scalable and practical solution, requiring a lot of time,
space or computational power for both parties.
In this paper, we develop
new proof protocols for verifying computations which are streaming in
nature: the verifier (data owner) needs only logarithmic space and a single pass over the input, and
after observing the input follows a simple
protocol with a prover (service provider) that takes logarithmic
communication spread over a logarithmic number
of rounds.
These ensure that the computation is performed correctly: that the
service provider has not made any errors or missed out some data.
The guarantee is very strong: even if the service provider
deliberately tries to cheat, there is only vanishingly small
probability of doing so undetected, while a correct computation is
always accepted.
We first observe that some existing theoretical results
can be modified to work with streaming verifiers,
which shows that there exist efficient protocols for
problems in the complexity classes $\mathsf{NP}$ and $\mathsf{NC}$.
Our main results
seek to bridge the gap between theory and practice by developing
usable protocols
for a variety of problems of central importance in
streaming and database processing.
All of our protocols achieve statistical
soundness and most require only logarithmic communication between
prover and verifier.
We also experimentally demonstrate their practicality and scalability. All
these problems require linear space in the traditional streaming model,
showing that adding a prover can exponentially reduce the effort needed
by the verifier.

Tracking Distributed Aggregates over Time-based Sliding Windows
Graham Cormode, Ke Yi
Scientific and Statistical Database Management (SSDBM),
2012.
[PDF]
[BIB]
The area of distributed monitoring requires tracking the value of a
function of distributed data as new observations are made.
An important case is when attention is restricted to only a recent
time period, such as the last hour of readings---the sliding window
case.
In this paper, we introduce a novel paradigm for handling such
monitoring problems, which we dub the ``forward/backward'' approach.
This view allows us to provide
optimal or near-optimal solutions for several fundamental problems, such
as counting, tracking frequent items, and maintaining order
statistics.
The resulting protocols improve on
previous work or give the first solutions for some problems,
and operate efficiently in terms of space and time needed.
Specifically, we obtain optimal $O(\frac{k}{\epsilon} \log (\epsilon n/k))$
communication per window of $n$ updates for tracking counts and heavy
hitters with accuracy $\epsilon$ across $k$ sites; and near-optimal
communication of $O(\frac{k}{\epsilon} \log^2(1/\epsilon) \log (n/k))$
for quantiles.
We also present solutions for problems such as tracking distinct
items, entropy, and convex hull and diameter of point sets.

Streaming Graph Computations with a Helpful Advisor
Graham Cormode, Michael Mitzenmacher, Justin Thaler
Algorithmica,
2012.
[PDF]
[BIB]
Motivated by the trend to outsource work to commercial cloud computing
services, we consider a variation of the streaming paradigm where a
streaming algorithm can be assisted by a powerful helper that can
provide annotations to the data stream. We extend previous work on
such {\em annotation models} by considering a number of graph
streaming problems. Without annotations, streaming algorithms for
graph problems generally require significant memory; we show that
for many standard problems, including all graph problems that can
be expressed with totally unimodular integer programming formulations,
only constant space (measured in words) is
needed for single-pass algorithms given linear-sized annotations.
We also obtain protocols achieving essentially \textit{optimal} tradeoffs between
annotation length and memory usage for several important problems, including
integer matrix-vector multiplication, as well as
shortest $s$-$t$ path in small-diameter graphs. We also obtain non-trivial tradeoffs for minimum
weight bipartite perfect matching and shortest $s$-$t$ path in general graphs.

Sketch Algorithms for Estimating Point Queries in NLP
Amit Goyal, Hal Daumé III, Graham Cormode
EMNLP-CoNLL,
pp 1093-1103,
2012.
[PDF]
[BIB]
Many NLP tasks rely on accurate statistics
from large corpora. Tracking complete
statistics is memory intensive, so recent
work has proposed using compact approximate
``sketches'' of frequency distributions.
We describe 10 sketch methods, including existing
and novel variants. We compare and
study the errors (over-estimation and underestimation)
made by the sketches. We evaluate
several sketches on three important NLP problems.
Our experiments show that {\em one} sketch
performs best for all three tasks.
Scienceography: the study of how science is written
Graham Cormode, S. Muthukrishnan, Jinyun Yan
Proceedings of the International Conference on Fun with Algorithms (FUN),
2012.
[PDF]
[BIB]
Scientific literature has itself been the subject of much scientific
study, for a variety of reasons: understanding how results are
communicated, how ideas spread, and assessing the influence of areas
or individuals.
However, most prior work has focused on extracting and analyzing
citation and stylistic patterns.
In this work, we introduce the notion of `scienceography', which
focuses on the {\em writing} of science.
We provide a first large scale study using data derived from
the arXiv e-print repository.
Crucially, our data includes the ``source code'' of scientific papers---the \LaTeX source---which enables us to study features not present in the ``final product'',
such as the tools used and private comments between authors.
Our study identifies broad patterns and trends in two example areas---computer science and mathematics---as well as highlighting key
differences in the way that science is written in these fields.
Finally, we outline future directions to extend the new
topic of scienceography.

Practical Verified Computation with Streaming Interactive Proofs
Graham Cormode, Michael Mitzenmacher, Justin Thaler
Innovations in Theoretical Computer Science (ITCS),
2012.
[PDF]
[BIB]
When delegating computation to a service provider, as in the cloud computing
paradigm, we
seek some reassurance that the output is correct and complete.
Yet recomputing the output as a check is inefficient and expensive, and
it may not even be feasible to store all the data locally.
We are therefore interested in
what can be validated by a streaming (sublinear space) user, who
cannot store the full input, or perform the full computation herself.
Our aim in this work is to advance a recent line of work on ``proof
systems'' in which the service provider {\em proves} the correctness of
its output to a user.
The goal is to minimize the time and space costs of both parties in
generating and checking the proof.
Only very recently have there been attempts to implement such proof
systems, and thus far these have been quite limited in functionality.
Here, our approach is two-fold.
First, we describe a carefully crafted instantiation of one of the
most efficient
general-purpose constructions for arbitrary computations
(streaming or otherwise), due to Goldwasser,
Kalai, and Rothblum.
This requires several new insights and enhancements to
move the methodology from a theoretical result to a practical
possibility. Our main contribution is in achieving a prover that
runs in time $O(S(n) \log S(n))$, where $S(n)$ is the size of an arithmetic circuit computing the function of interest;
this compares favorably to the poly$(S(n))$ runtime for the prover promised in prior work. Our experimental results
demonstrate that a practical general-purpose protocol for verifiable computation may be significantly closer to reality than previously
realized.
Second, we describe a set of techniques that achieve
genuine scalability for protocols
fine-tuned for specific important problems in streaming and database
processing. Focusing in particular on \emph{non-interactive} protocols for problems ranging from matrix-vector multiplication
to bipartite perfect matching,
we build on prior work \cite{annotations, graphstream} to achieve a
prover that runs in nearly linear-time,
while obtaining optimal tradeoffs between communication cost and the user's working memory.
Existing techniques required (substantially)
superlinear time for the prover. Finally, we develop improved \emph{interactive} protocols for specific problems based on a linearization
technique originally due to Shen.
We argue that
even if general-purpose methods improve,
special purpose protocols will remain valuable in real-world settings for
key problems, and hence special attention to specific problems is
warranted.

Mergeable Summaries
Pankaj Agarwal, Graham Cormode, Zengfeng Huang, Jeff Phillips, Zhewei Wei, Ke Yi
ACM Principles of Database Systems (PODS),
2012.
[PDF]
[BIB]
We study the {\em mergeability} of data summaries. Informally speaking,
mergeability requires that, given two summaries on two data sets, there is
a way to merge the two summaries into a single summary on the union of the two data sets, while preserving the error and size guarantees. This property means that the summaries can be combined like other algebraic
operators such as sum and max,
which is especially useful for computing summaries on massive distributed
data. Several data summaries are trivially mergeable by construction, most
notably all the {\em sketches} that are linear functions of the data sets.
But some other fundamental
ones like those for heavy hitters and quantiles, are not (known to be)
mergeable. In this paper, we demonstrate that these summaries are indeed
mergeable or can be made mergeable after appropriate modifications.
Specifically, we show that for $\epsilon$-approximate heavy hitters, there is a
deterministic mergeable summary of size $O(1/\epsilon)$; for $\epsilon$-approximate
quantiles, there is a deterministic summary of size $O(\frac{1}{\epsilon}
\log(\epsilon n))$ that has a restricted form of mergeability, and a randomized
one of size $O(\frac{1}{\epsilon} \log^{3/2}\frac{1}{\epsilon})$ with full
mergeability. We also extend our results to geometric summaries such as
$\epsilon$-approximations and $\epsilon$-kernels.
We also achieve two results of independent interest:
(1) we provide the best known randomized streaming bound for $\epsilon$-approximate quantiles that depends only on $\epsilon$, of size $O(\frac{1}{\epsilon} \log^{3/2}\frac{1}{\epsilon})$, and
(2) we demonstrate that the MG and the SpaceSaving summaries for heavy hitters are isomorphic.
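
For illustration, the merge rule for Misra-Gries (MG) heavy-hitter summaries
along the lines described in the paper is sketched below: add the counters of the
two summaries, then subtract the $(k+1)$-st largest count and discard non-positive
counters, leaving at most $k$ counters. Variable names are illustrative.

    def merge_mg(summary_a, summary_b, k):
        """Merge two Misra-Gries summaries (dicts item -> counter) into one of size <= k."""
        merged = dict(summary_a)
        for item, count in summary_b.items():
            merged[item] = merged.get(item, 0) + count
        if len(merged) > k:
            # Subtract the (k+1)-st largest counter from every counter and drop
            # those that become non-positive.
            cut = sorted(merged.values(), reverse=True)[k]
            merged = {item: c - cut for item, c in merged.items() if c > cut}
        return merged

    a = {"x": 10, "y": 4, "z": 2}
    b = {"x": 3, "w": 6, "v": 1}
    print(merge_mg(a, b, k=3))   # {'x': 11, 'y': 2, 'w': 4}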

Don't Let The Negatives Bring You Down: Sampling from Streams of Signed Updates
Edith Cohen, Graham Cormode, Nick Duffield
ACM Conference on Measurement and Modeling of Computer Systems (SIGMETRICS),
2012.
[PDF]
[BIB]
Random sampling has been proven time and time again to be a powerful tool
for working with large data. Queries over the full dataset are replaced by
approximate queries over the smaller (and hence easier to store and
manipulate) sample. The sample constitutes a flexible summary that
supports a wide class of queries. But in many applications, datasets are
modified with time, and it is desirable to update samples without
requiring access to the full underlying datasets. In this paper, we
introduce and analyze novel techniques for sampling over dynamic data,
modeled as a stream of modifications to weights associated with each key.
While sampling schemes designed for stream applications can often readily
accommodate positive updates to the dataset, much less is known for the
case of negative updates, where weights are reduced or items deleted
altogether. We primarily consider the turnstile model of streams, and
extend classic schemes to incorporate negative updates. Perhaps
surprisingly, the modifications to handle negative updates turn out to be
natural and seamless extensions of the well-known positive
update-only algorithms. We show that they produce unbiased estimators, and
we relate their performance to the behavior of corresponding algorithms on
insert-only streams with different parameters. A careful analysis is
necessitated, in order to account for the fact that sampling choices for
one key now depend on the choices made for other keys.
In practice, our solutions turn out to be efficient and accurate.
Compared to recent algorithms for $L_p$ sampling which can be applied to this
problem, they are significantly more reliable, and dramatically faster.

Differentially Private Spatial Decompositions
Graham Cormode, Magda Procopiuc, Entong Shen, Divesh Srivastava, Ting Yu
International Conference on Data Engineering (ICDE),
2012.
[PDF]
[BIB]
Differential privacy has recently emerged as the de facto standard for private
data release.
This makes it possible to provide
strong theoretical guarantees on the privacy and utility of released
data.
While it is well-understood how to release data based on counts and simple
functions under this guarantee, it remains to provide general purpose
techniques that are useful for a wider variety of queries.
In this paper, we focus on spatial data, i.e., any multi-dimensional
data that can be indexed by a tree structure.
Directly applying existing differential privacy methods to this type
of data simply generates noise.
We propose instead the class of ``private spatial decompositions'':
these adapt standard spatial indexing methods such
as quadtrees and kd-trees to provide a private description of the
data distribution.
Equipping such structures with differential privacy requires several
steps to ensure that they provide meaningful privacy guarantees.
Various basic steps, such as choosing splitting points and describing
the distribution of points within a region, must be done privately,
and the guarantees of the different building blocks must be composed into
an overall guarantee.
Consequently, we expose the design space for private spatial
decompositions, and analyze some key examples.
A major contribution of our work is to
provide new techniques for parameter setting and
post-processing of the output to improve
the accuracy of query answers.
Our experimental study demonstrates that it is possible to build such
decompositions efficiently, and use them to answer a variety of
queries privately and with high accuracy.

Differentially Private Publication of Sparse Data
Graham Cormode, Magda Procopiuc, Divesh Srivastava, Thanh Tran
International Conference on Database Theory (ICDT),
2012.
[PDF]
[BIB]
The problem of privately releasing data is to provide a version of a
dataset without revealing sensitive information about the individuals
who contribute to the data. The model of differential privacy allows
such private release while providing strong guarantees on the output.
A basic mechanism achieves differential privacy by adding noise to the
frequency counts in the contingency tables (or, a subset of the count
data cube) derived from the dataset. However, when the dataset is
sparse in its underlying space, as is the case for most multi-attribute
relations, then the effect of adding noise is to vastly increase the
size of the published data: it implicitly creates a huge number of
dummy data points to mask the true data, making it almost impossible
to work with.
We present techniques to overcome this roadblock and allow efficient
private release of sparse data, while maintaining the guarantees of
differential privacy. Our approach is to release a compact summary of
the noisy data. Generating the noisy data and then summarizing it
would still be very costly, so we show how to shortcut this step, and
instead directly generate the summary from the input data, without
materializing the vast intermediate noisy data. We instantiate this
outline for a variety of sampling and filtering methods, and show how
to use the resulting summary for approximate, private, query answering.
Our experimental study shows that this is an effective, practical
solution: in some examples we generate output that is 1000 times
smaller than the naive method, in less than 1\% of the time while
achieving comparable and occasionally improved utility over the
costly materialization approach.
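
For reference, the ``basic mechanism'' mentioned above is per-cell Laplace noise
over the full contingency table, sketched below under the standard assumptions
that each individual contributes one record and adjacency means adding or
removing one record. On a sparse domain it materializes noise for every cell of
the cross-product, which is exactly the blow-up the paper's summary-based release
avoids. Parameter choices are illustrative.

    import itertools
    import numpy as np
    from collections import Counter

    def noisy_contingency_table(rows, domains, epsilon=1.0, seed=0):
        """Add Laplace(1/epsilon) noise to the count of every cell in the
        cross-product of the attribute domains (the naive dense release)."""
        counts = Counter(map(tuple, rows))
        rng = np.random.default_rng(seed)
        return {cell: counts.get(cell, 0) + rng.laplace(scale=1.0 / epsilon)
                for cell in itertools.product(*domains)}

    rows = [("m", "teen"), ("f", "adult"), ("m", "adult")]
    domains = [("m", "f"), ("child", "teen", "adult", "senior")]
    table = noisy_contingency_table(rows, domains)
    print(len(table))   # 8 cells released, most of them pure noise over empty cells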

Continuous sampling from distributed streams
Graham Cormode, S. Muthukrishnan, Ke Yi, Qin Zhang
Journal of the ACM (JACM),
v59,
#2,
2012.
[PDF]
[BIB]
A fundamental problem in data management is to draw and maintain a sample
of a large data set, for approximate query answering, selectivity
estimation, and query planning. With large, streaming data sets, this
problem becomes particularly difficult when the data is shared across
multiple distributed sites. The main challenge is to ensure that a
sample is drawn uniformly across the union of the data while minimizing
the communication needed to run the protocol on the evolving data. At
the same time, it is also necessary to make the protocol lightweight, by
keeping the space and time costs low for each participant.
In this paper, we present communication-efficient protocols for
continuously maintaining a sample (both with and without replacement)
from $k$ distributed streams. These apply to the case when we want a
sample from the full streams, and to the sliding window cases of only the
$W$ most recent elements, or arrivals within the last $w$ time units. We
show that our protocols are optimal (up to logarithmic factors), not just
in terms of the communication used, but also the time and space costs for
each participant.
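
For background, the centralized, single-stream analogue of this problem is solved
by classical reservoir sampling, sketched below; the protocols in the paper
address the harder task of coordinating such a sample across $k$ distributed,
evolving streams.

    import random

    def reservoir_sample(stream, s, seed=None):
        """Maintain a uniform sample of s items, without replacement, in one pass."""
        rng = random.Random(seed)
        sample = []
        for n, item in enumerate(stream, start=1):
            if n <= s:
                sample.append(item)
            else:
                j = rng.randrange(n)        # keep the new item with probability s/n
                if j < s:
                    sample[j] = item
        return sample

    print(reservoir_sample(range(1_000_000), s=5, seed=1))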

Approximating Data with the Count-Min Data Structure
Graham Cormode, S. Muthukrishnan
IEEE Software,
2012.
[PDF]
[BIB]
Sketch data structures have been introduced to compactly summarize massive data sets and allow key properties of the data to be approximated accurately. In this paper, we introduce a prominent example, the Count-Min sketch, and describe how it accurately solves a fundamental problem of tracking counts of items in large data sets. Applications, implementations and extensions are also discussed.
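
A compact sketch of the data structure discussed in the paper is given below; the
hash functions and the width/depth parameters are illustrative, and the paper
states the formal accuracy guarantees.

    import random

    class CountMin:
        """Count-Min sketch: d rows of w counters. A point query is an overestimate;
        the expected excess in any one row is at most (total count)/w, and taking
        the minimum over d rows makes a large excess unlikely."""
        def __init__(self, w=1024, d=5, seed=42):
            rnd = random.Random(seed)
            self.w, self.d = w, d
            self.table = [[0] * w for _ in range(d)]
            self.salts = [rnd.random() for _ in range(d)]   # one hash function per row

        def _cells(self, item):
            return [hash((salt, item)) % self.w for salt in self.salts]

        def update(self, item, count=1):
            for row, col in enumerate(self._cells(item)):
                self.table[row][col] += count

        def estimate(self, item):
            return min(self.table[row][col] for row, col in enumerate(self._cells(item)))

    cm = CountMin()
    for word in ["to", "be", "or", "not", "to", "be"]:
        cm.update(word)
    print(cm.estimate("to"))   # 2, unless hash collisions inflate it slightly
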
Aggregate Query Answering on Possibilistic Data with Cardinality Constraints
Graham Cormode, Entong Shen, Divesh Srivastava, Ting Yu
International Conference on Data Engineering (ICDE),
2012.
[PDF]
[BIB]
Uncertainties in data arise for a number of reasons: when the data
set is
incomplete, contains conflicting information or has been deliberately
perturbed or coarsened to remove sensitive details. An important case
which arises in many real applications
is when the data describes a set of {\em possibilities}, but with
{\em cardinality constraints}. These
constraints represent correlations between tuples encoding, e.g. that
at most two possible records are correct, or that there is an
(unknown) one-to-one mapping between a set of tuples and attribute
values. Although there has been much effort to handle
uncertain data, current systems are not equipped to handle such
correlations, beyond simple mutual exclusion and co-existence
constraints. Vitally, they have little support for efficiently handling
{\em aggregate} queries on such data.
In this paper, we aim to address some of these deficiencies, by
introducing LICM (Linear Integer Constraint Model), which can
succinctly represent many types of tuple correlations, particularly a
class of {\em cardinality constraints}. We motivate and explain the
model with examples from data cleaning and masking sensitive data,
to show that it enables
modeling and querying such data, which was not previously possible. We
develop an efficient strategy to answer conjunctive and aggregate queries
on possibilistic data by describing how to implement relational
operators over data in the model. LICM compactly integrates the
encoding of correlations, query answering and lineage recording. In
combination with off-the-shelf linear integer programming solvers, our
approach provides exact bounds for aggregate queries. Our prototype
implementation demonstrates that query answering with LICM can be
effective and scalable.

A Dataset Search Engine for the Research Document Corpus
Meiyu Lu, Srinivas Bangalore, Graham Cormode, Marios Hadjieleftheriou, Divesh Srivastava
International Conference on Data Engineering (ICDE),
2012.
[PDF]
[BIB]
A key step in validating a proposed idea or system
is to evaluate over a suitable dataset. However, to date there
have been no useful tools for researchers to understand which
datasets have been used for what purpose, or in what prior
work. Instead, they have to manually browse through papers
to find the suitable datasets and their corresponding URLs,
which is laborious and inefficient. To better aid the dataset
discovery process, and provide a better understanding of how
and where datasets have been used, we propose a framework to
effectively identify datasets within the scientific corpus. The key
technical challenges are identification of datasets, and discovery
of the association between a dataset and the URLs where they
can be accessed. Based on this, we have built a user friendly
web-based search interface for users to conveniently explore the
dataset-paper relationships, and find relevant datasets and their
properties.

Tracking Distributed Aggregates over Time-based Sliding Windows (Brief Announcement)
Graham Cormode, Ke Yi
ACM Principles of Distributed Computing (PODC),
2011.
[PDF]
[BIB]
The area of distributed monitoring requires tracking the value of a
function of distributed data as new observations are made.
An important case is when attention is restricted to only a recent
time period, such as the last hour of readings---the sliding window
case.
In this announcement, we outline a novel paradigm for handling such
monitoring problems, which we dub the ``forward/backward'' approach.
This provides clean solutions for several fundamental problems, such
as counting, tracking frequent items, and maintaining order
statistics.
We obtain efficient protocols for these problems that improve on
previous work, and are easy to implement.
Specifically, we obtain optimal $O(\frac{k}{\epsilon} \log (\epsilon n/k))$
communication per window of $n$ updates for tracking counts and heavy
hitters with accuracy $\epsilon$ across $k$ sites; and near-optimal
communication of $O(\frac{k}{\epsilon} \log^2(1/\epsilon)$ $\log (n/k))$
for quantiles.

Structure-Aware Sampling: Flexible and Accurate Summarization
Edith Cohen, Graham Cormode, Nick Duffield
International Conference on Very Large Data Bases (VLDB),
2011.
[PDF]
[BIB]
In processing large quantities of data, a fundamental problem is to
obtain a summary which supports approximate query answering. Random
sampling yields flexible summaries which naturally support subset-sum
queries with unbiased estimators and well-understood confidence
bounds.
Classic sample-based summaries, however, are designed for arbitrary
subset queries and are oblivious to the structure in the set of keys.
The particular structure, such as hierarchy, order, or product space
(multi-dimensional), makes {\em range queries} much more relevant for
most analysis of the data.
Dedicated summarization algorithms for range-sum queries have also been
extensively studied. They can outperform existing sampling schemes in
terms of accuracy on range queries per summary size. Their accuracy,
however, rapidly degrades when, as is often the case, the query spans
multiple ranges.
They are also less flexible---being targeted for range sum
queries alone---and are often quite costly to build and use.
In this paper we propose and evaluate variance optimal sampling
schemes that are {\em structure-aware}.
These summaries improve over
the accuracy of existing {\em structure-oblivious} sampling schemes on
range queries while retaining the benefits of sample-based summaries:
flexible summaries, with high accuracy on
both range queries and arbitrary subset queries.

Structure-Aware Sampling on Data Streams
Edith Cohen, Graham Cormode, Nick Duffield
ACM Conference on Measurement and Modeling of Computer Systems (SIGMETRICS),
2011.
[PDF]
[BIB]
The massive data streams observed in network monitoring, data processing and
scientific studies are typically too large to store.
For many applications over such data, we must obtain compact summaries
of the stream.
These summaries should allow accurate answering of post hoc queries
with estimates which approximate the true answers over the original stream.
The data often has an underlying structure which makes certain subset
queries, in particular {\em range queries}, more relevant than arbitrary
subsets. Applications such as access control, change
detection, and heavy hitters typically involve subsets that are ranges
or unions thereof.
Random sampling is a natural summarization tool,
being easy to implement and flexible to use.
Known sampling methods are
good for arbitrary queries but fail to optimize for the common
case of range queries.
Meanwhile, specialized summarization algorithms have been proposed
for range-sum queries and related problems.
These can outperform sampling given fixed space resources, but lack
its flexibility and simplicity.
Particularly, their accuracy degrades when queries span multiple
ranges.
We define new stream sampling algorithms
with a smooth and tunable trade-off between
accuracy on range-sum queries and arbitrary subset-sum queries.
The technical key is to relax requirements on the variance over all
subsets to enable better performance on the ranges of interest.
This boosts the accuracy on range queries while retaining the prime
benefits of sampling, in particular flexibility and accuracy, with tail
bounds guarantees.
Our experimental study indicates that structure-aware summaries
drastically improve range-sum accuracy with respect to
state-of-the-art stream sampling algorithms and outperform
deterministic methods on range-sum queries and hierarchical heavy
hitter queries.

Semantics of Ranking Queries for Probabilistic Data
Graham Cormode, Jeffrey Jestes, Feifei Li, Ke Yi
IEEE Transactions on Knowledge and Data Engineering,
v23,
#12,
pp 1903--1917,
2011.
[PDF]
[BIB]
Recently, there have been several attempts to
propose definitions and algorithms for ranking queries on
probabilistic data. However, these lack many intuitive
properties of a top-$k$ over deterministic data. We define numerous
fundamental properties, including {\em exact-$k$}, {\em
containment}, {\em unique-rank}, {\em value-invariance}, and {\em
stability}, which are satisfied by ranking queries on certain
data. We argue these properties should also be carefully studied
in defining ranking queries in probabilistic data, and fulfilled by
definition for ranking uncertain data for most applications.
We propose an intuitive new ranking definition based on the
observation that the ranks of a tuple across all
possible worlds represent a well-founded rank distribution. We
study the ranking definitions based on the expectation, the median
and other statistics of this rank distribution for a tuple, and
derive the {\em expected rank, median rank and quantile rank}
correspondingly. We are able to prove that the expected rank, median
rank and quantile rank satisfy all these properties for a ranking
query. We provide efficient solutions to compute such rankings across
the major models of uncertain data, such as attribute-level and
tuple-level uncertainty.
Finally, a comprehensive experimental study
confirms the effectiveness of our approach.

Personal Privacy vs Population Privacy: Learning to Attack Anonymization
Graham Cormode
ACM SIGKDD Conference,
2011.
[PDF]
[BIB]
Over the last decade
great strides have been made in developing techniques to
compute functions privately.
In particular, Differential Privacy gives strong promises about
conclusions that can be drawn about an individual.
In contrast, various syntactic methods for providing privacy (criteria
such as $k$-anonymity and $l$-diversity) have been criticized for
still allowing private information of an individual to be inferred.
In this paper, we consider the ability of an attacker to use data
meeting privacy definitions to build an accurate classifier.
We demonstrate that even under Differential Privacy, such classifiers can be
used to infer ``private'' attributes accurately in realistic data.
We compare this to similar approaches for inference-based attacks on
other forms of anonymized data.
We show how the efficacy of all these attacks can be measured on the
same scale, based on the probability of successfully inferring a
private attribute.
We observe that the
accuracy of inference of private attributes for differentially private
data and $l$-diverse data can be quite similar.

Mergeable Coresets
Pankaj Agarwal, Graham Cormode, Zengfeng Huang, Jeff Phillips, Zhewei Wei, Ke Yi
Third Workshop on Massive Data Algorithmics (MASSIVE),
2011.
[PDF]
[BIB]
We study the {\em mergeability} of data summaries. Informally speaking,
mergeability requires that, given two summaries on two data sets, there is
a way to merge the two summaries into a summary on the two data sets
combined together, while preserving the error and size guarantees. This
property means that the summary can be treated like other algebraic
objects such as sum and max,
which is especially useful for computing summaries on massive distributed
data. Many data summaries are trivially mergeable by construction, most
notably those based on linear transformations. But some other fundamental
ones like those for heavy hitters and quantiles, are not (known to be)
mergeable. In this paper, we demonstrate that these summaries are indeed
mergeable or can be made mergeable after appropriate modifications.
Specifically, we show that for $\epsilon$-approximate heavy hitters, there is a
deterministic mergeable summary of size $O(1/\epsilon)$; for $\epsilon$-approximate
quantiles, there is a deterministic summary of size $O(\frac{1}{\epsilon}
\log(\epsilon n))$ that has a restricted form of mergeability, and a randomized
one of size $O(\frac{1}{\epsilon} \log^{3/2}\frac{1}{\epsilon})$ with full
mergeability. We also extend our results to geometric summaries such as
$\epsilon$-approximations and $\epsilon$-kernels.

Algorithms for distributed functional monitoring
Graham Cormode, S. Muthukrishnan, Ke Yi
ACM Transactions on Algorithms,
v7,
#2,
pp 1--21,
2011.
[PDF]
[BIB]
We study what we call {\em functional monitoring} problems. We have $k$
players each receiving a stream of items, and communicating with a central
coordinator. Let the multiset of items received by player $i$ up until
time $t$ be $A_i(t)$. The coordinator's task is to monitor a given
function $f$ computed over the union of the inputs $\cup_{i} A_i(t)$, {\em
continuously} at all times $t$. The goal is to minimize the number of
bits communicated between the players and the coordinator. Of interest is
the approximate version where the coordinator outputs $1$ if $f \geq \tau$
and $0$ if $f\leq (1-\epsilon)\tau$. This defines the
$(k,f,\tau,\epsilon)$ distributed functional monitoring problem.
Functional monitoring problems are fundamental in distributed systems, in
particular sensor networks, where we must minimize communication; they also
connect to the well studied streaming model and communication complexity.
Yet few formal bounds are known for functional monitoring.
We give upper and lower bounds for the $(k,f,\tau,\epsilon)$ problem for
some of the basic $f$'s. In particular, we study the frequency moments
$F_p$ for $p=0,1,2$. For $F_0$ and $F_1$, we obtain monitoring algorithms
with costs almost the same as their one-shot computation algorithms.
However, for $F_2$ the monitoring problem seems much harder. We give a
carefully constructed multi-round algorithm that uses ``sketch summaries''
at multiple levels of details and solves the $(k,F_2,\tau,\epsilon)$
problem with communication $\tilde{O}(k^2/\epsilon + k^{3/2}/\epsilon^3)$.
Our algorithmic techniques are likely to be useful for other functional
monitoring problems as well.
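
As a worked illustration of the flavour of such protocols, a simple folklore
scheme for $F_1$ (the total count) is sketched below; it is not the $F_2$
algorithm of the paper, and the class names are illustrative. Each of the $k$
sites reports only when its local count has grown by another
$\Delta = \epsilon\tau/(2k)$, so the coordinator's running total is never more
than $\epsilon\tau/2$ below the truth; raising the alarm once the total reaches
$(1-\epsilon/2)\tau$ therefore distinguishes $f \geq \tau$ from
$f \leq (1-\epsilon)\tau$.

    class Site:
        def __init__(self, coordinator, delta):
            self.count, self.reported, self.delta = 0, 0, delta
            self.coordinator = coordinator

        def observe(self, n=1):
            self.count += n
            while self.count - self.reported >= self.delta:   # one message per delta items
                self.reported += self.delta
                self.coordinator.receive(self.delta)

    class Coordinator:
        def __init__(self, tau, eps, k):
            self.total, self.alarm = 0, False
            self.threshold = (1 - eps / 2) * tau
            self.delta = eps * tau / (2 * k)

        def receive(self, amount):
            self.total += amount
            if self.total >= self.threshold:
                self.alarm = True        # the true count certainly exceeds (1 - eps) * tau

    k, tau, eps = 10, 10_000, 0.1
    coord = Coordinator(tau, eps, k)
    sites = [Site(coord, coord.delta) for _ in range(k)]
    for i in range(12_000):              # feed items to the sites round-robin
        sites[i % k].observe()
    print(coord.alarm)                   # True: the global count exceeded tau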

Streaming Graph Computations with a Helpful Advisor
Graham Cormode, Michael Mitzenmacher, Justin Thaler
European Symposium on Algorithms,
2010.
[PDF]
[BIB]
Motivated by the trend to outsource work to commercial cloud computing
services, we consider a variation of the streaming paradigm where a
streaming algorithm can be assisted by a powerful helper that can
provide annotations to the data stream. We extend previous work on
such {\em annotation models} by considering a number of graph
streaming problems. Without annotations, streaming algorithms for
graph problems generally require significant memory; we show that
for many standard problems, including all graph problems that can
be expressed with totally unimodular integer programming formulations,
only constant memory is
needed for single-pass algorithms given linear-sized annotations.
We also obtain a protocol achieving \textit{optimal} tradeoffs between
annotation length and memory usage for matrix-vector multiplication;
this result contributes to a trend of recent research on numerical
linear algebra in streaming models.

Space-optimal Heavy Hitters with Strong Error Bounds
Radu Berinde, Graham Cormode, Piotr Indyk, Martin Strauss
ACM Transactions on Database Systems,
v35,
#4,
2010.
[PDF]
[BIB]
The problem of finding heavy hitters and approximating the frequencies
of items is at the heart of many problems in data stream analysis. It
has been observed that several proposed solutions to this problem can
outperform their worst-case guarantees on real data. This leads to the
question of whether some stronger bounds can be guaranteed. We answer
this in the positive by showing that a class of ``counter-based
algorithms'' (including the popular and very space-efficient Frequent
and SpaceSaving algorithms) provide much stronger approximation
guarantees than previously known. Specifically, we show that errors in
the approximation of individual elements do not depend on the
frequencies of the most frequent elements, but only on the frequency
of the remaining ``tail.''
This shows that counter-based methods are the
most space-efficient (in fact, space-optimal) algorithms having this
strong error bound.
This tail guarantee allows these algorithms to solve the ``sparse
recovery'' problem. Here, the goal is to recover a faithful
representation of the vector of frequencies, $f$. We prove that using
space $O(k)$, the algorithms construct an approximation $f^*$ to the
frequency vector $f$ so that the $L_1$ error $||f-f^*||_1$ is close to the best
possible error $\min_{f'} ||f' - f||_1$, where $f'$ ranges over all vectors
with at most $k$ non-zero entries. This improves the previously best
known space bound of about $O(k \log n)$ for streams without element deletions
(where $n$ is the size of the domain from which stream elements are drawn). Other
consequences of the tail guarantees are results for skewed (Zipfian) data, and
guarantees for accuracy of merging multiple summarized streams.
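
For reference, the SpaceSaving algorithm named above is easy to state; the sketch
below is the standard counter-based algorithm from the literature, while the
paper's contribution is the stronger tail-based error analysis of this class of
algorithms.

    def space_saving(stream, k):
        """SpaceSaving with k counters: any item with true frequency above n/k is
        guaranteed to be monitored, and no estimate overshoots by more than n/k."""
        counters = {}
        for item in stream:
            if item in counters:
                counters[item] += 1
            elif len(counters) < k:
                counters[item] = 1
            else:
                victim = min(counters, key=counters.get)    # evict the smallest counter
                counters[item] = counters.pop(victim) + 1   # new item inherits its count + 1
        return counters

    print(space_saving(list("abracadabra alakazam"), k=4))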

Set Cover Algorithms For Very Large Datasets
Graham Cormode, Howard Karloff, Tony Wirth
ACM Conference on Information and Knowledge Management (CIKM),
2010.
[PDF]
[BIB]
The problem of {\sc Set Cover}---to find the smallest subcollection of
sets that covers some universe---is at the heart of many data and
analysis tasks. It arises in a wide range of settings, including
operations research, machine learning, planning, data quality and data
mining. Although finding an optimal solution is NP-hard, the greedy
algorithm is widely used, and typically finds solutions that are close
to optimal.
However, a direct implementation of the greedy approach, which
picks the set with the largest number of uncovered items at each step,
does not behave well when the input is very large and disk resident. The greedy
algorithm must make many random accesses to disk, which are
unpredictable and costly in comparison to linear scans. In order to
scale {\sc Set Cover} to large datasets, we provide a new
algorithm which finds a solution that is provably close to that of
greedy, but which is much more efficient to implement using modern
disk technology. Our experiments show a ten-fold improvement in
speed on moderately-sized datasets, and an even greater improvement on
larger datasets.
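
The classical greedy rule that the paper's disk-based algorithm approximates is
shown below as a standard in-memory sketch, for comparison; the paper's
contribution is making this behaviour achievable when the sets are too large to
fit in memory.

    def greedy_set_cover(universe, sets):
        """Repeatedly pick the set covering the most still-uncovered elements."""
        uncovered = set(universe)
        solution = []
        while uncovered:
            name, best = max(sets.items(), key=lambda kv: len(kv[1] & uncovered))
            if not best & uncovered:
                raise ValueError("universe is not coverable by the given sets")
            solution.append(name)
            uncovered -= best
        return solution

    sets = {"s1": {1, 2, 3}, "s2": {2, 4}, "s3": {3, 4, 5}, "s4": {5}}
    print(greedy_set_cover({1, 2, 3, 4, 5}, sets))   # ['s1', 's3']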

Privacy in dynamic social networks
Smriti Bhagat, Graham Cormode, Balachander Krishnamurthy, Divesh Srivastava
World Wide Web Conference (WWW),
2010.
[PDF]
[BIB]
Anonymization of social networks before they are published or shared
has become an important research question. Recent work on anonymizing
social networks has looked at privacy preserving techniques for
publishing a single instance of the network. However, social networks
evolve and a single instance is inadequate for analyzing the evolution
of the social network or for performing any longitudinal data
analysis. We study the problem of repeatedly publishing social network
data as the network evolves, while preserving privacy of users.
Publishing multiple instances of the same network independently
has privacy risks, since stitching the information
together may allow an adversary to identify users in the networks.
We propose methods to anonymize a dynamic network such that the privacy of
users is preserved when new nodes and edges are added to the published
network.
These methods make
use of link prediction algorithms to model the evolution of the
social network. Using this predicted graph to perform group-based
anonymization, the loss in privacy caused by new edges can be reduced.
We evaluate the privacy loss on publishing multiple social network instances using our methods.

Prediction Promotes Privacy In Dynamic Social Networks
Smriti Bhagat, Graham Cormode, Balachander Krishnamurthy, Divesh Srivastava
Workshop on Online Social Networks (WOSN),
2010.
[PDF]
[BIB]
Recent work on anonymizing online social networks (OSNs)
has looked at privacy preserving techniques
for publishing a single instance of the network. However, OSNs evolve and
a single instance is inadequate for analyzing their evolution or performing
longitudinal data analysis. We study the problem of repeatedly publishing
OSN data as the network evolves while preserving privacy of users.
Publishing multiple instances independently has privacy risks, since stitching
the information together may allow an adversary to identify users. We provide
methods to anonymize a dynamic network when new nodes and edges are added
to the published network.
These methods
use link prediction algorithms to model the evolution. Using this predicted
graph to perform group-based anonymization, the loss in privacy caused by
new edges can be eliminated almost entirely.
We propose metrics for privacy loss, and evaluate them for publishing multiple OSN instances.

Optimal sampling from distributed streams
Graham Cormode, S. Muthukrishnan, Ke Yi, Qin Zhang
ACM Principles of Database Systems (PODS),
2010.
[PDF]
[BIB]
A fundamental problem in data management is to draw a sample of a large
data set, for approximate query answering, selectivity estimation, and
query planning. With large, streaming data sets, this problem becomes
particularly difficult when the data is shared across multiple
distributed sites. The challenge is to ensure that a sample is drawn
uniformly across the union of the data while minimizing the communication
needed to run the protocol and track parameters of the evolving data. At
the same time, it is also necessary to make the protocol lightweight, by
keeping the space and time costs low for each participant.
In this paper, we present communication-efficient protocols for sampling
(both with and without replacement) from $k$ distributed streams. These
apply to the case when we want a sample from the full streams, and to the
sliding window cases of only the $W$ most recent items, or arrivals
within the last $w$ time units. We show that our protocols are optimal,
not just in terms of the communication used, but also that they use
minimal or near minimal (up to logarithmic factors) time to process each
new item, and space to operate.

Minimizing Minimality and Maximizing Utility: Analyzing Method-based attacks on Anonymized Data
Graham Cormode, Ninghui Li, Tiancheng Li, Divesh Srivastava
International Conference on Very Large Data Bases (VLDB),
2010.
[BIB]
The principle of {\em anonymization} for data sharing has become a very popular paradigm for the
preservation of privacy of the data subjects. Since the introduction of $k$-anonymity, dozens of
methods and enhanced privacy definitions have been proposed. However, over-eager attempts to
minimize the information lost by the anonymization potentially allow private information to be
inferred. Proof-of-concept of this ``minimality attack'' has been demonstrated for a variety of
algorithms and definitions \cite{WFW+09}.\\
In this paper, we provide a comprehensive analysis and study of this attack, and demonstrate that
with care its effect can be almost entirely countered. The attack allows an adversary to increase
his (probabilistic) belief in certain facts about individuals over the data. We show that (a) a
large class of algorithms are not affected by this attack, (b) for a class of algorithms that have
a ``symmetric'' property, the attacker's belief increases by at most a small constant, and (c) even
for an algorithm chosen to be highly susceptible to the attack, the attacker's belief when using
the attack increases by at most a small constant factor. We also provide a series of experiments
that show in all these cases that the confidence about the sensitive value of any individual
remains low in practice, while the published data is still useful for its intended purpose. From
this, we conclude that the impact of such method-based attacks can be minimized.

Methods for finding frequent items in data streams
Graham Cormode, Marios Hadjieleftheriou
The VLDB Journal,
v19,
#1,
pp 3--20,
2010.
[PDF]
[BIB]
The frequent items problem is to process a stream of items and find all items occurring more than a given fraction of the time. It is one of the most heavily studied problems in data stream mining, dating back to the 1980s. Many applications rely directly or indirectly on finding the frequent items, and implementations are in use in large scale industrial systems. However, there has not been much comparison of the different methods under uniform experimental conditions. It is common to find papers touching on this topic in which important related work is mischaracterized, overlooked, or reinvented. In this paper, we aim to present the most important algorithms for this problem in a common framework. We have created baseline implementations of the algorithms and used these to perform a thorough experimental study of their properties. We give empirical evidence that there is considerable variation in the performance of frequent items algorithms. The best methods can be implemented to find frequent items with high accuracy using only tens of kilobytes of memory, at rates of millions of items per second on cheap modern hardware.
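
As a concrete example of the counter-based family compared in this survey, the
following is a minimal Python sketch of the Misra-Gries (Frequent) algorithm
with $k-1$ counters; the interface is illustrative and practical
implementations use more careful data structures.

    def misra_gries(stream, k):
        """Misra-Gries summary: after one pass, every item with true frequency
        greater than len(stream)/k is guaranteed to be among the returned keys,
        and each estimated count undercounts by at most len(stream)/k."""
        counters = {}
        for item in stream:
            if item in counters:
                counters[item] += 1
            elif len(counters) < k - 1:
                counters[item] = 1
            else:
                # Decrement all counters; drop any that reach zero.
                for key in list(counters):
                    counters[key] -= 1
                    if counters[key] == 0:
                        del counters[key]
        return counters

    # 'a' occurs in more than 1/3 of the positions, so it must survive.
    print(misra_gries(list("aabacadaae"), k=3))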

Information Cost Tradeoffs for Augmented Index and Streaming Language Recognition
Amit Chakrabarti, Graham Cormode, Ranga Kondapally, Andrew McGregor
IEEE Foundations of Computer Science (FOCS),
2010.
[PDF]
[BIB]
This paper makes three main contributions to the theory of communication
complexity and stream computation. First, we present new bounds on the
information complexity of {\sc Augmented Index}. In contrast to analogous results for {\sc Index}
by Jain, Radhakrishnan and Sen [\emph{J.~ACM}, 2009], we have to overcome the
significant technical challenge that protocols for {\sc Augmented Index} may violate the
``rectangle property'' due to the inherent input sharing. Second, we use these
bounds to resolve an open problem of Magniez, Mathieu and Nayak [\emph{STOC},
2010] that asked about the multi-pass complexity of recognizing Dyck
languages. This results in a natural separation between the standard
multi-pass model and the multi-pass model that permits reverse passes. Third,
we present the first \emph{passive memory checkers} that verify the
interaction transcripts of priority queues, stacks, and double-ended queues.
We obtain tight upper and lower bounds for these problems, thereby addressing
an important sub-class of the memory checking framework of Blum et al.
[\emph{Algorithmica}, 1994].

Histograms and Wavelets on Probabilistic Data
Graham Cormode, Minos Garofalakis
IEEE Transactions on Knowledge and Data Engineering,
v22,
#8,
pp 1142-1157,
2010.
[PDF]
[BIB]
There is a growing realization that uncertain information is a first-class citizen in modern database management. As such, we need techniques to correctly and efficiently process uncertain data in database systems. In particular, data reduction techniques that can produce concise, accurate synopses of large probabilistic relations are crucial. Similar to their deterministic relation counterparts, such compact probabilistic data synopses can form the foundation for human understanding and interactive data exploration, probabilistic query planning and optimization, and fast approximate query processing in probabilistic database systems. In this paper, we introduce definitions and algorithms for building histogram- and Haar wavelet-based synopses on probabilistic data. The core problem is to choose a set of histogram bucket boundaries or wavelet coefficients to optimize the accuracy of the approximate representation of a collection of probabilistic tuples under a given error metric. For a variety of different error metrics, we devise efficient algorithms that construct optimal or near optimal size B histogram and wavelet synopses. This requires careful analysis of the structure of the probability distributions, and novel extensions of known dynamic-programming-based techniques for the deterministic domain. Our experiments show that this approach clearly outperforms simple ideas, such as building summaries for samples drawn from the data distribution, while taking equal or less time.
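
For background, the paper extends the classical dynamic program for
deterministic V-optimal (sum-squared-error) histograms; a minimal Python
sketch of that deterministic baseline is given below. It does not handle
probabilistic data, and the names are illustrative.

    import itertools

    def v_optimal_histogram(values, B):
        """Classical DP for the B-bucket histogram minimizing sum-squared error
        on a deterministic sequence; O(B * n^2) time using prefix sums."""
        n = len(values)
        pre = list(itertools.accumulate(values, initial=0.0))
        pre2 = list(itertools.accumulate((v * v for v in values), initial=0.0))

        def sse(i, j):  # SSE of approximating values[i:j] by its mean
            s, s2, m = pre[j] - pre[i], pre2[j] - pre2[i], j - i
            return s2 - s * s / m

        INF = float("inf")
        # dp[b][j] = minimum SSE of covering the first j values with b buckets
        dp = [[INF] * (n + 1) for _ in range(B + 1)]
        dp[0][0] = 0.0
        for b in range(1, B + 1):
            for j in range(1, n + 1):
                dp[b][j] = min(dp[b - 1][i] + sse(i, j) for i in range(j))
        return dp[B][n]

    print(v_optimal_histogram([1, 1, 2, 9, 9, 10, 4, 4], B=3))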

Anonymizing bipartite graph data using safe groupings
Graham Cormode, Divesh Srivastava, Ting Yu, Qing Zhang
The VLDB Journal,
v19,
#1,
pp 115--139,
2010.
[PDF]
[BIB]
Private data often comes in the form of associations between entities,
such as customers and products bought from a pharmacy, which are
naturally represented in the form of a large, sparse bipartite graph.
As with tabular data, it is desirable to be able to publish
anonymized versions of such data, to allow others to perform ad hoc
analysis of aggregate graph properties.
However, existing tabular anonymization techniques do not give useful
or meaningful results when applied to graphs: small changes or masking
of the edge structure can radically change aggregate graph properties.
We introduce a new family of anonymizations for bipartite graph
data, called $(k,l)$-groupings. These groupings preserve the
underlying graph structure perfectly, and instead anonymize the
mapping from entities to nodes of the graph. We identify a class
of ``safe'' $(k,l)$-groupings that have provable guarantees to
resist a variety of attacks, and show how to find such safe
groupings. We perform experiments on real bipartite graph data to
study the utility of the anonymized version, and the impact of
publishing alternate groupings of the same graph data. Our
experiments demonstrate that $(k,l)$-groupings offer strong
tradeoffs between privacy and utility.

Algorithms for Next Generation Networks
G. Cormode and M. Thottan,
Springer,
2010.
[BIB]
Since the early 1990s coupled with the widespread deployment of broadband to the home, we have seen remarkable progress in the ease of Internet accessibility to end users. Both commercial and private sectors rely heavily on the availability of the Internet to conduct normal day to day functions. Underpinning this exponential growth in popularity of the Internet are the advances made in the applications of basic algorithms to design and architect the Internet. The most obvious example of these algorithms is the use of search engines to collect and correlate vast amounts of information that is spread throughout the Internet.
With the dawn of this new century, we are now on the verge of expanding the notion of what we mean to communicate. A new generation of netizens are poised to leverage the Internet for a myriad different applications that we have not envisioned thus far. This will require that the Internet be flexible and adapt to accommodate the requirements of next generation applications. To address this challenge, in the United States, the National Science Foundation has initiated a large research project GENI. The goal of GENI is to perform a clean-slate design for a new Internet. In particular, the aim of this project is to rethink the basic design assumptions on which the current Internet is built, with the possibility that to improve flexibility for new services we may arrive at a radically different Internet, beyond what one might imagine from evolving the current network. Given this context of internet research, the purpose of this book is to provide a comprehensive survey of present algorithms and methodologies used in the design and deployment of the Internet. We believe that a thorough understanding of algorithms used by the Internet today is critical to develop new algorithms that will form the basis of the future Internet.
The book is divided into 3 parts dealing with the application of algorithms to different aspects of network design, operations and next generation applications. Part 1 provides an algorithmic basis for the design of networks both at the physical and the service layer. This part is extensive since it considers different physical layer network technologies. The second part of this book covers two important topics of network operations and management. As we know today, network providers have already completed field trials for the 100Gbps network. It should not be long before these capacities become commonplace on the Internet. The challenge of processing packets at such high speeds places tremendous significance on efficient and fast packet processing algorithms. Part 3 of this book discusses algorithmic techniques that form the basis of emerging applications.
In this book we have attempted to provide a flavor of how algorithms have formed the basis of the Internet as we know it today. It is our hope that this book will provide a useful overview of algorithms applied to communication networks, for any student who aspires to do research in network architecture as well as the application of algorithms to communication networks. We believe that for a robust design of the future Internet, it is essential that the architecture be founded on the basis of sound algorithmic principles.

A Near-Optimal Algorithm for Computing the Entropy of a Stream
Amit Chakrabarti, Graham Cormode, Andrew McGregor
ACM Transactions on Algorithms,
v6,
#3,
2010.
[PDF]
[BIB]
We describe a simple algorithm for approximating the empirical entropy
of a stream of $m$ values up to a multiplicative factor of $(1+\epsilon)$ using a single pass, $O(\epsilon^{-2} \log
(\delta^{-1}) \log m)$ words of space, and $O(\log \epsilon^{-1} + \log
\log \delta^{-1} + \log \log m)$ processing time per item in the
stream. Our algorithm is based upon a novel extension of a method
introduced by Alon, Matias, and Szegedy. This improves
over previous work on this problem.
We show a space lower bound of $\Omega({\epsilon^{-2}/ \log^2
(\epsilon^{-1})})$, demonstrating that our algorithm is near-optimal in terms of
its dependency on $\epsilon$.
We show that generalizing to multiplicative-approximation of the $k$-th order entropy requires close to
linear space for $k\geq 1$. In contrast we show that additive-approximation is possible in a single pass using only poly-logarithmic space. Lastly, we show how to compute a multiplicative
approximation to the entropy of a random walk on an undirected graph.
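
For reference, the quantity being approximated is the empirical entropy
$H = -\sum_i (f_i/m) \log (f_i/m)$ of the stream; the exact, linear-space
computation below only fixes the definition and does not reflect the
single-pass algorithm of the paper. Names are illustrative.

    import math
    from collections import Counter

    def empirical_entropy(stream):
        """Exact empirical entropy (in bits), where f_i is the frequency of
        item i and m is the stream length. Uses full (linear) space."""
        counts = Counter(stream)
        m = sum(counts.values())
        return -sum((f / m) * math.log2(f / m) for f in counts.values())

    print(empirical_entropy("abracadabra"))   # entropy of the character stream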

A Manifesto for Modeling and Measurement in Social Media
Graham Cormode, Balachander Krishnamurthy, Walter Willinger
First Monday,
v15,
#9,
2010.
[PDF]
[BIB]
Online Social Networks (OSNs) have been the subject of a great deal of study in recent years. The
majority of this study has used simple models, such as node-and-edge graphs, to describe the data. In
this paper, we argue that such models, which necessarily limit the structures that can be described and
omit temporal information, are insufficient to describe and study OSNs. Instead, we propose that a richer
class of Entity Interaction Network models should be adopted. We outline a checklist of features that
can help build such a model, and apply it to three popular networks (Twitter, Facebook and YouTube) to
highlight important features. We also discuss important considerations for the collection, validation and
sharing of OSN data.

Time-decaying Sketches for Robust Aggregation of Sensor Data
Graham Cormode, Srikanta Tirthapura, Bojian Xu
SIAM Journal on Computing (SICOMP),
v39,
#4,
pp 1309-1339,
2009.
[PDF]
[BIB]
We present a new sketch for summarizing network data. The sketch has
the following properties which make it useful in
communication-efficient aggregation in distributed streaming
scenarios, such as sensor networks: the sketch is
duplicate-insensitive, i.e. re-insertions of the same data will not
affect the sketch, and hence the estimates of aggregates. Unlike
previous duplicate-insensitive sketches for sensor data
aggregation, it is also time-decaying, so that
the weight of a data item in the sketch can decrease with time
according to a user-specified decay function. The sketch can give
provably approximate guarantees for various aggregates of data,
including the sum, median, quantiles, and frequent elements. The size
of the sketch and the time taken to update it are both polylogarithmic
in the size of the relevant data. Further, multiple sketches computed
over distributed data can be combined without loss of accuracy. To our
knowledge, this is the first sketch that combines all the above
properties.

Time-decayed correlated aggregates over data streams
Graham Cormode, Srikanta Tirthapura, Bojian Xu
Statistical Analysis and Data Mining,
v2,
#5-6,
pp 294-310,
2009.
[PDF]
[BIB]
Data stream analysis frequently relies on identifying correlations and posing conditional queries on the data after it has been seen. Correlated aggregates form an important example of such queries, which ask for an aggregation over one dimension of stream elements which satisfy a predicate on another dimension. Since recent events are typically more important than older ones, time decay should also be applied to downweight less significant values. We present space-efficient algorithms as well as space lower bounds for the time-decayed correlated sum, a problem at the heart of many related aggregations. By considering different fundamental classes of decay functions, we separate cases where efficient approximations with relative error or additive error guarantees are possible, from other cases where linear space is necessary to approximate. In particular, we show that no efficient algorithms with relative error guarantees are possible for the popular sliding window and exponential decay models, resolving an open problem. This negative result for the exponential decay holds even if the stream is allowed to be processed in multiple passes. The results are surprising, since efficient approximations are known for other data stream problems under these decay models. This is a step toward better understanding which sophisticated queries can be answered on massive streams using limited memory and computation.

Time-Decayed Correlated Aggregates over Data Streams
Graham Cormode, Srikanta Tirthapura, Bojian Xu
SIAM Conference on Data Mining (SDM),
2009.
[PDF]
[BIB]
Data stream analysis frequently relies on identifying correlations
and posing conditional queries on the data after it has been seen.
Correlated aggregates form an important example of such queries,
which ask for an aggregation over one dimension of stream elements
which satisfy a predicate on another dimension. Since recent
events are typically more important than older ones, time decay
should also be applied to downweight less significant values. We
present space-efficient algorithms as well as space lower bounds
for the time-decayed correlated sum, a problem at the heart of many
related aggregations. By considering different fundamental classes
of decay functions, we separate cases where efficient relative error
or additive error is possible, from other cases where linear space is
necessary to approximate. In particular, we show that no efficient
algorithms are possible for the popular sliding window and exponential
decay models, resolving an open problem. The results are
surprising, since efficient approximations are known for other data
stream problems under these decay models. This is a step towards
better understanding which sophisticated queries can be answered
on massive streams using limited memory and computation.

Space-optimal Heavy Hitters with Strong Error Bounds
Radu Berinde, Graham Cormode, Piotr Indyk, Martin Strauss
ACM Principles of Database Systems (PODS),
2009.
[PDF]
[BIB]
The problem of finding heavy-hitters and approximating the frequencies
of items is at the heart of many problems in data stream analysis. It
has been observed that several proposed solutions to this problem can
outperform their worst-case guarantees on real data. This leads to the
question of whether some stronger bounds can be guaranteed. We answer
this in the positive by showing that a class of ``counter-based
algorithms'' (including the popular and very space-efficient Frequent
and Spacesaving algorithms) provide much stronger approximation
guarantees than previously known. Specifically, we show that errors in
the approximation of individual elements do not depend on the
frequencies of the most frequent elements, but only on the frequency
of the remaining ``tail''.
This shows that counter-based methods are the
most space-efficient (in fact, space-optimal) algorithms having this
strong error bound.
This tail guarantee allows these algorithms to solve the ``sparse
recovery'' problem. Here, the goal is to recover a faithful
representation of the vector of frequencies, $f$. We prove that using
space $O(k)$, the algorithms construct an approximation $f^*$ to the
frequency vector $f$ so that the L1 error $||f-f^*||_1$ is close to the best
possible error $\min_{f'} ||f' - f||_1$, where $f'$ ranges over all vectors
with at most $k$ non-zero entries. This improves the previously best
known space bound of about $O(k \log n)$ for streams without element
deletions. Other consequences of the tail guarantees are results for skewed
(Zipfian) data, and guarantees for accuracy of merging multiple
summarized streams.
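
To make the counter-based class concrete, here is a minimal Python sketch of
the Spacesaving algorithm mentioned above, using $k$ counters; the naive
minimum search shown here is illustrative, whereas practical implementations
use a dedicated structure to find the smallest counter in constant time.

    def space_saving(stream, k):
        """SpaceSaving with k counters: every item with true frequency above
        len(stream)/k is retained, and each stored count overestimates the true
        frequency by at most the smallest counter value."""
        counts = {}
        for item in stream:
            if item in counts:
                counts[item] += 1
            elif len(counts) < k:
                counts[item] = 1
            else:
                # Replace the item with the minimum count, inheriting count + 1.
                victim = min(counts, key=counts.get)
                counts[item] = counts.pop(victim) + 1
        return counts

    print(space_saving(list("aabacadaae"), k=3))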

Small synopses for group-by query verification on outsourced data streams
Ke Yi, Feifei Li, Graham Cormode, Marios Hadjieleftheriou, George Kollios, Divesh Srivastava
ACM Transactions on Database Systems,
v34,
#3,
2009.
[PDF]
[BIB]
Due to the overwhelming flow of information in many data stream
applications, data outsourcing is a natural and effective paradigm
for individual businesses to address the issue of scale. In the standard
data outsourcing model, the data owner outsources streaming data to one or
more third-party servers, which answer queries posed by a potentially
large number of clients on the data owner's behalf. Data outsourcing
intrinsically raises issues of trust, making outsourced query assurance on
data streams a problem with important practical implications. Existing
solutions proposed in this model all build upon cryptographic primitives
such as signatures and collision-resistant hash functions, which only work for
certain types of queries, e.g., simple selection/aggregation queries.
In this paper, we consider another common type of queries, namely, ``{\tt
GROUP BY, SUM}'' queries, which previous techniques fail to support.
Our new solutions are not based on cryptographic primitives, but instead
use algebraic and probabilistic techniques to compute a small synopsis on
the true query result, which is then communicated to the client so as to
verify the correctness of the query result returned by the server. The
synopsis uses a constant amount of space irrespective of the result
size, has an extremely small probability of failure, and can be maintained
using no extra space when the query result changes as elements stream by.
We then generalize our synopsis to allow some tolerance on the number of
erroneous groups, in order to support semantic load shedding on the server.
When the number of erroneous groups is indeed tolerable, the synopsis can
be strengthened so that we can locate and even correct these errors.
Finally, we implement our techniques and perform an empirical evaluation
using live network traffic.
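
The following Python sketch conveys the algebraic flavor of such synopses,
though the exact construction and parameters in the paper may differ: both the
data owner and the client evaluate the product $\prod_i (\alpha - i)^{v_i} \bmod p$
over (group, count) pairs at a secret random $\alpha$, with the owner
maintaining it incrementally as updates stream by, so that an incorrect
claimed result is detected with high probability. All names and constants are
illustrative.

    import random

    P = (1 << 61) - 1          # a large prime modulus (illustrative choice)

    def update_synopsis(synopsis, group, delta, alpha):
        """Fold a stream update (group, delta) into the fingerprint by
        multiplying by (alpha - group)^delta mod P."""
        return (synopsis * pow(alpha - group, delta, P)) % P

    def fingerprint(result, alpha):
        """Fingerprint of a claimed query result {group: count}."""
        fp = 1
        for group, count in result.items():
            fp = (fp * pow(alpha - group, count, P)) % P
        return fp

    # Data owner: pick a secret alpha and maintain the synopsis over the stream.
    alpha = random.randrange(1, P)
    synopsis = 1
    for group, delta in [(3, 2), (7, 1), (3, 1), (9, 4)]:   # stream of updates
        synopsis = update_synopsis(synopsis, group, delta, alpha)

    # Client: verify the server's claimed GROUP BY, SUM answer.
    claimed = {3: 3, 7: 1, 9: 4}
    print(fingerprint(claimed, alpha) == synopsis)   # True iff counts match (w.h.p.)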

Semantics of Ranking Queries for Probabilistic Data and Expected Ranks
Graham Cormode, Feifei Li, Ke Yi
International Conference on Data Engineering (ICDE),
2009.
[PDF]
[BIB]
When dealing with massive quantities of data, top-k
queries are a powerful technique for returning only the k
most relevant tuples for inspection, based on a scoring function.
The problem of efficiently answering such ranking queries has
been studied and analyzed extensively within traditional database
settings. The importance of the top-k is perhaps even greater in
probabilistic databases, where a relation can encode exponentially
many possible worlds. There have been several recent attempts
to propose definitions and algorithms for ranking queries over
probabilistic data. However, these all lack many of the intuitive
properties of a top-k over deterministic data. Specifically, we
define a number of fundamental properties, including exact-k,
containment, unique-rank, value-invariance, and stability, which
are all satisfied by ranking queries on certain data. We argue
that all these conditions should also be fulfilled by any reasonable
definition for ranking uncertain data. Unfortunately, none of the
existing definitions is able to achieve this.
To remedy this shortcoming, this work proposes an intuitive
new approach of expected rank. This uses the well-founded notion
of the expected rank of each tuple across all possible worlds
as the basis of the ranking. We are able to prove that, in
contrast to all existing approaches, the expected rank satisfies
all the required properties for a ranking query. We provide
efficient solutions to compute this ranking across the major
models of uncertain data, such as attribute-level and tuple-level
uncertainty. For an uncertain relation of N tuples, the processing
cost is O(N log N), no worse than simply sorting the relation.
In settings where there is a high cost for generating each tuple
in turn, we show pruning techniques based on probabilistic tail
bounds that can terminate the search early and guarantee that
the top-k has been found. Finally, a comprehensive experimental
study confirms the effectiveness of our approach.
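
To make the expected-rank semantics concrete, the brute-force Python sketch
below enumerates every possible world of a tiny attribute-level uncertain
relation (assuming independent per-tuple score distributions) and averages
each tuple's rank; the paper's algorithms compute the same quantity in
O(N log N) time without enumeration. The toy data and names are illustrative.

    from itertools import product

    def expected_ranks(tuples):
        """tuples: list of discrete score PDFs, each a list of (score, prob).
        Returns each tuple's expected rank (rank 0 = highest score), averaged
        over all possible worlds under independence; ties break by index."""
        n = len(tuples)
        exp_rank = [0.0] * n
        for world in product(*tuples):            # one (score, prob) per tuple
            prob = 1.0
            for _, p in world:
                prob *= p
            scores = [s for s, _ in world]
            order = sorted(range(n), key=lambda i: (-scores[i], i))
            for rank, i in enumerate(order):
                exp_rank[i] += prob * rank
        return exp_rank

    # Three tuples, each with a small discrete score distribution.
    relation = [
        [(100, 0.6), (70, 0.4)],
        [(90, 0.5), (80, 0.5)],
        [(85, 1.0)],
    ]
    print(expected_ranks(relation))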

Probabilistic Histograms for Probabilistic Data
Graham Cormode, Antonios Deligiannakis, Minos Garofalakis, Andrew McGregor
International Conference on Very Large Data Bases (VLDB),
2009.
[BIB]
There is a growing realization that
modern database management systems (DBMSs) must be able
to manage data that contains {\em uncertainties} that are represented in the
form of probabilistic relations.
Consequently, the design of each
core DBMS component must be revisited in
the presence of uncertain and probabilistic information.
In this paper, we study how to build histogram synopses for probabilistic
relations, for the purposes of enabling both DBMS-internal decisions (such as
indexing and query planning), and (possibly, user-facing) approximate query
processing tools.
In contrast to initial work in this area, our {\em probabilistic histograms\/}
retain the key
{\em possible-worlds semantics\/} of probabilistic data, allowing for
more accurate, yet concise, representation of the uncertainty
characteristics of data and query results.
We present a variety of techniques for building optimal
probabilistic histograms, each one tuned to a different choice
of approximation-error metric.
We show that these can be incorporated into a general Dynamic Programming (DP)
framework, which generalizes that used for existing histogram constructions.
The end result is a histogram where each ``bucket'' is approximately represented by a
compact probability distribution function (PDF), which can
be used as the basis for query planning and approximate query answering.
We present novel, polynomial-time algorithms to find optimal probabilistic
histograms for a variety of PDF-error metrics (including
variation distance, sum squared error, max error,
and variants of EMD).
Our experimental study shows that our probabilistic histogram synopses can
accurately capture the key statistical properties of uncertain data, while being
much more compact to store and work with than the original uncertain
relations.

Histograms and Wavelets on Probabilistic Data
Graham Cormode, Minos Garofalakis
International Conference on Data Engineering (ICDE),
2009.
[PDF]
[BIB]
There is a growing realization that uncertain information
is a first-class citizen in modern database management.
As such, we need techniques to correctly and efficiently
process uncertain data in database systems. In particular, data
reduction techniques that can produce concise, accurate synopses
of large probabilistic relations are crucial. Similar to their
deterministic relation counterparts, such compact probabilistic
data synopses can form the foundation for human understanding
and interactive data exploration, probabilistic query planning
and optimization, and fast approximate query processing in
probabilistic database systems.
In this paper, we introduce definitions and algorithms for
building histogram- and Haar wavelet-based synopses on probabilistic
data. The core problem is to choose a set of histogram
bucket boundaries or wavelet coefficients to optimize the accuracy
of the approximate representation of a collection of probabilistic
tuples under a given error metric. For a variety of different
error metrics, we devise efficient algorithms that construct
optimal or near optimal size B histogram and wavelet synopses.
This requires careful analysis of the structure of the probability
distributions, and novel extensions of known dynamic-programming-based
techniques for the deterministic domain.
Our experiments show that this approach clearly outperforms
simple ideas, such as building summaries for samples drawn
from the data distribution, while taking equal or less time.

Forward Decay: A Practical Time Decay Model for Streaming Systems
Graham Cormode, Vladislav Shkapenyuk, Divesh Srivastava, Bojian Xu
International Conference on Data Engineering (ICDE),
2009.
[PDF]
[BIB]
Temporal data analysis in data warehouses and data
streaming systems often uses time decay to reduce the importance
of older tuples, without eliminating their influence, on the results
of the analysis. Exponential time decay is commonly used in
practice, even though other decay functions (e.g., polynomial
decay) have also been identified as useful. We argue that this
is because the usual definitions of time decay are backwards:
the decayed weight of a tuple is based on its age, measured
backward from the current time. Since this age is constantly
changing, such decay is too complex and unwieldy for scalable
implementation.
In this paper, we propose a new class of forward decay
functions based on measuring forward from a fixed point in
time. We show that this model captures the more practical
models already known, such as exponential decay and landmark
windows, but also includes a wide class of other types of time
decay. We provide efficient algorithms to compute a variety of
aggregates and draw samples under forward decay, and show
that these are easy to implement scalably. Further, we provide
empirical evidence that these can be executed in a production
data stream management system with little or no overhead
compared to the undecayed computations. Our implementation
required no extensions to the query language or the DSMS,
demonstrating that forward decay represents a practical model
of time decay for systems that deal with time-based data.
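
A minimal Python sketch of the forward decay idea, with an illustrative decay
function: fix a landmark time $L$ no later than any timestamp, give an item
with timestamp $t_i$ weight $g(t_i - L)/g(t - L)$ at query time $t$, and
observe that the numerator is fixed once the item arrives, so only a running
sum needs to be maintained.

    def forward_decayed_sum(items, query_time, landmark=0.0, g=lambda a: 2.0 ** a):
        """items: iterable of (timestamp, value). The decayed weight of an item
        is g(timestamp - landmark) / g(query_time - landmark); with exponential
        g this coincides with classical exponential decay. Only the running
        numerator changes as items arrive; the division happens at query time."""
        numerator = sum(value * g(ts - landmark) for ts, value in items)
        return numerator / g(query_time - landmark)

    stream = [(1.0, 5.0), (2.0, 3.0), (4.0, 2.0)]
    print(forward_decayed_sum(stream, query_time=5.0))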

Finding the frequent items in streams of data
Graham Cormode, Marios Hadjieleftheriou
Communications of the ACM (CACM),
v52,
#10,
pp 97-105,
2009.
[PDF]
[BIB]
The frequent items problem is to process a stream of items and find all
those which occur more than a given fraction of the time.
It is one of the most heavily studied problems in mining data streams,
dating back to the 1980s.
Many other applications rely directly or indirectly on finding the frequent
items, and implementations are in use in large scale industrial systems.
In this paper, we describe the most important algorithms for
this problem in a common framework.
We place the different solutions in their historical context, and
describe the connections between them, with the aim of clarifying some
of the confusion that has surrounded their properties.
To further illustrate the different properties of the algorithms,
we provide baseline implementations.
This allows us to
give empirical evidence that there is considerable variation in the
performance of frequent item algorithms.
The best methods can be implemented to find frequent items with high
accuracy using only tens of kilobytes of memory, at rates of millions of
items per second on cheap modern hardware.

Estimating the Confidence of Conditional Functional Dependencies
Graham Cormode, Lukasz Golab, Flip Korn, Andrew McGregor, Divesh Srivastava, Xi Zhang
ACM SIGMOD International Conference on Management of Data (SIGMOD),
2009.
[PDF]
[BIB]
Conditional functional dependencies (CFDs) have recently been proposed as
extensions of classical functional dependencies that apply to a certain
subset of the relation, as specified by a \emph{pattern tableau}.
Calculating the support and confidence of a CFD (i.e., the size of the
applicable subset and the extent to which it satisfies the CFD)
gives valuable information about data semantics and data quality.
While computing the support is easier, computing the confidence
exactly is expensive if the relation is large, and estimating it from a
random sample of the relation is unreliable unless the sample is large.
We study how to efficiently estimate the confidence of a CFD
with a small number of passes (one or two) over the input using small space.
Our solutions are based on a variety of sampling and sketching techniques,
and apply when the pattern tableau is known in advance, and also the harder
case when this is given after the data have been seen.
We analyze our algorithms, and show that they can guarantee
a small additive error; we also show that relative error guarantees are not possible.
We demonstrate the power of these methods empirically, with a detailed study over
a mixture of real and synthetic data.
These experiments show that it is possible to estimate the CFD confidence
very accurately with summaries which are much smaller than the size of the data they represent.

Class-based graph anonymization for social network data
Smriti Bhagat, Graham Cormode, Balachander Krishnamurthy, Divesh Srivastava
International Conference on Very Large Data Bases (VLDB),
2009.
[PDF]
[BIB]
The recent rise in popularity of social networks, such as Facebook and
MySpace, has created large quantities of data about interactions
within these networks. Such data contains many private details about
individuals so anonymization is required prior to attempts to make
the data more widely available for scientific research.
Prior work has considered simple graph data to be anonymized
by removing all non-graph information and adding or deleting some edges.
Since
social network data is richer in details about the users and their
interactions, loss of details due to anonymization limits the
possibility for analysis.
We present a new set of techniques for
anonymizing social network data based on grouping the entities into
classes, and masking the mapping between entities and the nodes that
represent them in the anonymized graph. Our techniques allow queries
over the rich data to be evaluated with high accuracy while
guaranteeing resilience to certain types of attack.
To prevent inference of interactions, we rely on a critical
``safety condition'' when forming these classes.
We demonstrate utility via empirical data from social networking
settings. We give examples of complex
queries that may be posed and show that they can be answered over the
anonymized data efficiently and accurately.

Annotations in Data Streams
Amit Chakrabarti, Graham Cormode, Andrew McGregor
International Colloquium on Automata, Languages and Programming (ICALP),
2009.
[PDF]
[BIB]
The central goal of data stream algorithms is to process massive streams
of data using {\em sublinear} storage space.
Motivated by work in the database community on outsourcing database and data stream processing, we ask whether the space usage of such algorithms can be further
reduced by enlisting a more powerful ``helper'' who can {\em annotate}
the stream as it is read. We do not wish to blindly trust the helper, so we
require that the algorithm be convinced of having computed a correct
answer.
We show
upper bounds that achieve a non-trivial tradeoff between the
amount of annotation used and the space required to verify it. We also
prove lower bounds on such tradeoffs, often nearly matching the upper
bounds, via notions related to Merlin-Arthur communication complexity.
Our results cover the classic data stream problems of selection,
frequency moments, and fundamental graph problems such as
triangle-freeness and connectivity.
Our work is also part of a
growing trend --- including recent studies of multi-pass streaming,
read/write streams and randomly ordered streams --- of asking more
complexity-theoretic questions about data stream processing.
It is a recognition that, in addition to practical relevance, the
data stream model
raises many interesting theoretical questions in its own right.

Time-Decaying Aggregates in Out-of-order Streams
Graham Cormode, Flip Korn, Srikanta Tirthapura
ACM Principles of Database Systems (PODS),
2008.
[PDF]
[BIB]
Processing large data streams is now a major topic in data management.
The data involved can be truly massive, and the required
analyses complex. In a stream of sequential events such as stock
feeds, sensor readings, or IP traffic measurements, data tuples pertaining
to recent events are typically more important than older
ones. This can be formalized via time-decay functions, which assign
weights to data based on the age of data. Decay functions
such as sliding windows and exponential decay have been studied
under the assumption of well-ordered arrivals, i.e., data arrives in
non-decreasing order of time stamps. However, data quality issues
are prevalent in massive streams (due to network asynchrony and
delays etc.), and correct arrival order is not guaranteed.
We focus on the computation of decayed aggregates such as
range queries, quantiles, and heavy hitters on out-of-order streams,
where elements do not necessarily arrive in increasing order of
timestamps. Existing techniques such as Exponential Histograms
and Waves are unable to handle out-of-order streams. We give the
first deterministic algorithms for approximating these aggregates
under popular decay functions such as sliding window and polynomial
decay. We study the overhead of allowing out-of-order arrivals
when compared to well-ordered arrivals, both analytically and experimentally.
Our experiments confirm that these algorithms can be
applied in practice, and compare the relative performance of different
approaches for handling out-of-order arrivals.

Summarizing Two-Dimensional Data with Skyline-based Statistical Descriptors
Graham Cormode, Flip Korn, S. Muthukrishnan, Divesh Srivastava
Scientific and Statistical Database Management (SSDBM),
2008.
[PDF]
[BIB]
Much real data consists of more than one dimension,
such as financial transactions (e.g., price $\times$ volume)
and IP network flows (e.g., duration $\times$ numBytes),
and captures relationships between the variables.
For a single dimension, quantiles are intuitive and robust descriptors.
Processing and analyzing such data, particularly in data warehouse or data
streaming settings, requires similarly robust and informative
statistical descriptors that go beyond one-dimension.
Applying quantile methods to summarize a multidimensional
distribution along only singleton attributes
ignores the rich dependence amongst the variables.
In this paper, we present new skyline-based statistical
descriptors for capturing the distributions over pairs of dimensions.
They generalize the notion of quantiles in the individual
dimensions, and also incorporate properties of the joint distribution.
We introduce {\em $\phi$-quantours} and {\em $\alpha$-radials},
which are skyline points over subsets of the data, and propose
{\em ($\phi, \alpha$)-quantiles}, found from the union of these skylines,
as statistical descriptors of two-dimensional distributions.
We present efficient online algorithms for tracking
($\phi,\alpha$)-quantiles on two-dimensional streams
using guaranteed small space.
We identify the principal properties of the proposed descriptors
and perform extensive experiments with synthetic and real IP traffic data
to study the efficiency of our
proposed algorithms.

Robust Lower Bounds for Communication and Stream Computation
Amit Chakrabarti, Graham Cormode, Andrew McGregor
ACM Symposium on Theory of Computing (STOC),
2008.
[PDF]
[BIB]
In this paper we study the communication complexity of evaluating
functions when the input data is distributed (according to some known
distribution) to two or more players. This naturally extends
previous models such as the best-case and worst-case partition models
to a model that captures random
partitions and distributions with redundancy where bits of data may
be known to more than one player.
This allows us to distinguish functions which require significant
communication for almost every allocation of input from those for
which only a few cases are hard.
There is also a strong connection to proving lower-bounds in the data
stream model that are ``robust'' to the ordering of the data.
That is, we prove lower-bounds for when the
order of the data-stream is not chosen adversarially
but rather is determined uniformly (or near-uniformly)
from the set of all permutations.
This streaming model has attracted recent interest, since lower bounds
here give stronger evidence for the inherent hardness of streaming
problems.
Our results include the first average-partition communication
lower-bounds for problems including multi-party set-disjointness and
gap-hamming. Both are tight. We also extend and improve
previous results for a form of pointer-jumping
that is relevant to the problem of selection. Collectively,
these results yield lower-bounds for a
variety of problems in the random-order, data-stream model including
estimating the number of distinct elements, approximating frequency
moments, and quantile estimation.

On Signatures for Communication Graphs
Graham Cormode, Flip Korn, S. Muthukrishnan, Yihua Wu
International Conference on Data Engineering (ICDE),
2008.
[PDF]
[BIB]
Communications between individuals can be represented by (weighted,
multi-)
graphs.
Many applications operate on communication graphs
associated with telephone calls, emails, Instant Messages (IM),
blogs, web forums, e-business relationships and so on.
These applications include identifying repetitive
fraudsters, message board aliases, multihoming of IP addresses, etc.
Tracking such electronic
identities on communication networks can be achieved if we have a
reliable ``signature'' for nodes and activities. While many examples
of ad hoc signatures can be proposed for particular tasks, what is needed
is a systematic study of the principles behind the usage of signatures
in any task.
We develop a formal framework for the use of signatures
on communication graphs and identify three fundamental properties
that are natural to signature schemes: persistence, uniqueness and
robustness. We justify these properties by showing how they impact
a set of applications. We then explore several signature
schemes --- previously defined or new --- in our framework
and evaluate them on real data in terms
of these properties. This provides insights into suitable
signature schemes for desired applications.
Finally, as case studies, we focus on the concrete application
of enterprise network traffic.

Key Differences between Web 1.0 and Web 2.0
Graham Cormode, Balachander Krishnamurthy
First Monday,
v13,
#6,
2008.
[PDF]
[BIB]
Web 2.0 is a buzzword introduced in 2003-04 which is commonly used to encompass various novel
phenomena on the World Wide Web. Although largely a marketing term, some of the key attributes
associated with Web 2.0 include the growth of social networks, bidirectional communication,
various glue technologies, and significant diversity in content types. We are not aware of a
technical comparison between Web 1.0 and 2.0. While most of Web 2.0 runs on the same substrate as
1.0, there are some key differences. We capture those differences and their implications for
technical work in this paper. Our goal is to identify the primary differences leading to the
properties of interest in 2.0 to be characterized. We identify novel challenges due to the
different structures of Web 2.0 sites, richer methods of user interaction, new technologies, and
fundamentally different philosophy. Although a significant amount of past work can be reapplied,
some critical thinking is needed for the networking community to analyze the challenges of this
new and rapidly evolving environment.

How NOT to review a paper: The tools and techniques of the adversarial reviewer
Graham Cormode
SIGMOD Record,
v37,
#4,
pp 100-104,
2008.
[PDF]
[BIB]
There are several useful guides available for how to review a paper in
Computer Science.
These are soberly presented, carefully reasoned and sensibly argued.
As a result, they are not much fun.
So, as a contrast, this note is a checklist of how {\em not} to review
a paper.
It details techniques that are
unethical, unfair, or just plain nasty.
Since in Computer Science we often present arguments about how an
adversary would approach a particular problem, this note
describes the adversary's strategy.

Finding Hierarchical Heavy Hitters in Streaming Data
Graham Cormode, Flip Korn, S. Muthukrishnan, Divesh Srivastava
ACM Transactions on Knowledge Discovery from Data (TKDD),
v1,
#4,
2008.
[PDF]
[BIB]
Data items that arrive online as streams
typically have attributes which take values from one or more
hierarchies (time and geographic location; source and
destination IP addresses; etc.). Providing an aggregate view of such
data is important for summarization, visualization, and analysis.
We develop an aggregate view based on certain
organized sets of large-valued regions (``heavy hitters'')
corresponding to hierarchically discounted frequency counts.
We formally define the notion of {\em Hierarchical Heavy Hitters} (HHHs).
We first consider computing (approximate) HHHs over a data stream
drawn from a single hierarchical attribute.
We formalize the problem and give deterministic algorithms
to find them in a single pass over the input.
In order to analyze a wider range of realistic data streams
(e.g., from IP traffic monitoring applications),
we generalize this problem to multiple dimensions.
Here, the semantics of HHHs are more complex, since a ``child'' node
can have multiple ``parent'' nodes.
We present online algorithms that find approximate HHHs in one pass,
with provable accuracy guarantees.
The product of hierarchical dimensions forms a
mathematical lattice structure.
Our algorithms exploit this structure, and so are able
to track approximate
HHHs using only a small, fixed number of statistics per stored item,
regardless of the number of dimensions.
We show experimentally, using real data, that our
proposed algorithms yield outputs which are very similar
(virtually identical, in many cases) to offline computations
of the exact solutions whereas straightforward heavy hitters
based approaches give significantly inferior answer quality.
Furthermore, the proposed algorithms result in an order of
magnitude savings in data structure size while performing
competitively.
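
As a reference point for the semantics in one dimension, the brute-force
Python sketch below computes exact hierarchical heavy hitters over string
prefixes: working from the longest prefixes upward, a prefix is reported once
the count of its not-yet-covered descendants reaches $\phi N$, and those
descendants are then treated as covered. This is only an offline illustration
of the definition, not the paper's one-pass approximate algorithms; names are
illustrative.

    from collections import Counter

    def exact_hhh(items, phi):
        """items: list of equal-length strings (e.g. zero-padded keys).
        Returns prefixes whose 'discounted' count, i.e. leaves not already
        covered by a reported descendant, is at least phi * len(items)."""
        threshold = phi * len(items)
        leaf_counts = Counter(items)
        covered = Counter()   # count already attributed to reported HHHs
        hhhs = {}
        depth = len(items[0])
        for level in range(depth, -1, -1):         # longest prefixes first
            level_counts = Counter()
            level_covered = Counter()
            for leaf, c in leaf_counts.items():
                level_counts[leaf[:level]] += c
            for prefix, c in covered.items():
                if len(prefix) > level:
                    level_covered[prefix[:level]] += c
            for prefix, total in level_counts.items():
                discounted = total - level_covered[prefix]
                if discounted >= threshold:
                    hhhs[prefix] = discounted
                    covered[prefix] += discounted
        return hhhs

    data = ["00", "00", "01", "01", "01", "10", "11", "12", "13", "20"]
    print(exact_hhh(data, phi=0.3))   # reports "01", the prefix "1", and the root ""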

Finding Frequent Items in Data Streams
Graham Cormode, Marios Hadjieleftheriou
International Conference on Very Large Data Bases (VLDB),
2008.
[PDF]
[BIB]
The frequent items problem is to process a stream of items and find
all items occurring more than a given fraction of the time. It is one
of the most heavily studied problems in data stream mining, dating
back to the 1980s. Many applications rely directly or indirectly
on finding the frequent items, and implementations are in use in
large scale industrial systems. However, there has not been much
comparison of the different methods under uniform experimental
conditions. It is common to find papers touching on this topic in
which important related work is mischaracterized, overlooked, or
reinvented.
In this paper, we aim to present the most important algorithms
for this problem in a common framework. We have created baseline
implementations of the algorithms, and used these to perform a
thorough experimental study of their properties. We give empirical
evidence that there is considerable variation in the performance of
frequent items algorithms. The best methods can be implemented
to find frequent items with high accuracy using only tens of kilobytes
of memory, at rates of millions of items per second on cheap
modern hardware.

Exponentially Decayed Aggregates on Data Streams
Graham Cormode, Flip Korn, Srikanta Tirthapura
International Conference on Data Engineering (ICDE),
2008.
[PDF]
[BIB]
In a massive stream of sequential events such as
stock feeds, sensor readings, or IP traffic measurements, tuples
pertaining to recent events are typically more important than
older ones. It is important to compute various aggregates over
such streams after applying a decay function which assigns
weights to tuples based on their age. We focus on the computation
of exponentially decayed aggregates in the form of quantiles and
heavy hitters. Our techniques are based on extending existing
data stream summaries, such as the q-digest and the ``spacesaving''
algorithm. Our experiments confirm that our methods
can be applied in practice, and have similar space and time costs
to the non-decayed aggregate computation.

Approximation Algorithms for Clustering Uncertain Data
Graham Cormode, Andrew McGregor
ACM Principles of Database Systems (PODS),
2008.
[PDF]
[BIB]
There is an increasing quantity of data with uncertainty arising from
applications such as sensor network measurements, record linkage, and
as output of mining algorithms.
This uncertainty is typically formalized as probability density
functions over tuple values.
Beyond storing and processing such data in a DBMS, it is necessary to
perform other data analysis tasks such as data mining.
We study the core mining problem of {\em clustering} on uncertain
data, and define appropriate natural generalizations of standard
clustering optimization criteria.
Two variations arise, depending on whether a point is automatically
associated with its optimal center, or whether it must be assigned to
a fixed cluster no matter where it is actually located.
For uncertain versions of $k$-means and $k$-median, we show reductions
to their corresponding weighted versions on data with no
uncertainties.
These are simple in the unassigned case, but require some care for the
assigned version.
Our most interesting results are for uncertain $k$-center,
which generalizes both traditional $k$-center and $k$-median objectives.
We show a variety of bicriteria approximation algorithms.
One picks ${poly}(\log n)$ more centers and achieves a
$(1+\epsilon)$ approximation to the best uncertain $k$-centers.
Another picks $2k$ centers and achieves a constant factor
approximation.
Collectively, these results are the first known guaranteed
approximation algorithms for the problems of clustering uncertain
data.

Approximate continuous querying over distributed streams
Graham Cormode, Minos Garofalakis
ACM Transactions on Database Systems,
v33,
#2,
2008.
[BIB]
While traditional database systems optimize for performance
on one-shot query processing, emerging large-scale monitoring
applications require continuous tracking of complex data-analysis
queries over collections of physically-distributed streams.
Thus, effective solutions have to be simultaneously space/time
efficient (at each remote monitor site), communication efficient
(across the underlying communication network), and provide
continuous, guaranteed-quality approximate query answers.
In this paper, we propose novel algorithmic solutions for the
problem of continuously tracking a broad class of complex
aggregate queries in such a distributed-streams setting.
Our tracking schemes maintain approximate query answers with
provable error guarantees, while simultaneously optimizing the
storage space and processing time at each remote site,
and the communication cost across the network.
In a nutshell, our algorithms rely on tracking general-purpose
randomized sketch summaries of local streams at remote sites
along with concise prediction models of local site behavior
in order to produce highly communication- and space/time-efficient
solutions.
The end result is a powerful approximate query tracking framework
that readily incorporates several complex analysis queries
(including distributed join and multi-join aggregates, and
approximate wavelet representations), thus giving the first
known low-overhead tracking solution for such queries in the
distributed-streams model.
Experiments with real data validate our approach, revealing
significant savings over naive solutions as well as our
analytical worst-case guarantees.

Anonymizing Bipartite Graph Data using Safe Groupings
Graham Cormode, Divesh Srivastava, Ting Yu, Qing Zhang
International Conference on Very Large Data Bases (VLDB),
2008.
[PDF]
[BIB]
Private data often comes in the form of associations between entities, such
as customers and products bought from a pharmacy, which are naturally
represented in the form of a large, sparse bipartite graph. As with tabular
data, it is desirable to be able to publish anonymized versions of such data,
to allow others to perform ad hoc analysis of aggregate graph properties.
However, existing tabular anonymization techniques do not give useful or
meaningful results when applied to graphs: small changes or masking of
the edge structure can radically change aggregate graph properties.
We introduce a new family of anonymizations, for bipartite graph data,
called (k, l)-groupings. These groupings preserve the underlying graph
structure perfectly, and instead anonymize the mapping from entities to
nodes of the graph. We identify a class of safe (k, l)-groupings that have
provable guarantees to resist a variety of attacks, and show how to find such
safe groupings. We perform experiments on real bipartite graph data to
study the utility of the anonymized version, and the impact of publishing
alternate groupings of the same graph data. Our experiments demonstrate
that (k, l)-groupings offer strong tradeoffs between privacy and utility.

Algorithms for Distributed, Functional Monitoring
G. Cormode, S. Muthukrishnan, K. Yi
ACM-SIAM Symposium on Discrete Algorithms (SODA),
2008.
[PDF]
[BIB]
We study what we call {\em functional monitoring} problems.
We have
$k$ players each tracking their inputs, say $A_j(t)$ up until time $t$,
and communicating with a central coordinator.
The coordinator's task is to
compute a given function $f$ over the union of the inputs
$\cup_{i} A_j(t)$, {\em continuously} at all times $t$.
The goal is to minimize the number of bits communicated between the
players and the coordinator.
A simple example is when $f$ is simply the sum, and the coordinator
is required to alert when the sum of a distributed
set of values exceeds a given threshold.
Of interest is the approximate version where the coordinator outputs
$1$ if $f \geq \tau$ and $0$ if $f\leq \tau-\epsilon$.
This defines
the $(k,f,\tau,\epsilon)$ distributed, functional monitoring problem.
Functional monitoring problems are fundamental in distributed systems,
in particular sensor networks, where we must minimize communication; they also
connect to problems in communication complexity, communication theory,
and signal processing. Yet few formal bounds are known for
functional monitoring.
We give upper and lower bounds for
the $(k,f,\tau,\epsilon)$ problem for foundational $f$'s.
In particular, we study frequency moments ($F_0, F_1, F_2$).
We show that for the simplest cases, the cost of monitoring a
function can be the same as the cost of its one-time computation.
However, for $F_2$ we show a carefully constructed multi-round
algorithm which uses ``sketch summaries'' at multiple levels of detail
and solves the $(k,F_2,\tau,\epsilon)$ problem
with communication $\tilde{O}(k^2/\epsilon + k^{3/2}/\epsilon^3)$.
Since frequency moment estimation is central to other problems,
our results have immediate applications to histograms, wavelet computations,
and others. Our algorithmic techniques are likely to be useful for other
functional monitoring problems.
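
For intuition, the Python sketch below simulates the simple deterministic
baseline for monitoring $F_1$ (the total count): each of the $k$ sites
notifies the coordinator after every $\lceil \epsilon\tau/k \rceil$ new local
arrivals, so the coordinator's tally trails the true count by less than
$\epsilon\tau + k$ and at most roughly $k/\epsilon$ messages are sent before
the threshold is reached. This is a textbook warm-up rather than the paper's
$F_2$ algorithm, and all names are illustrative.

    import math

    def monitor_f1_sum(arrivals, k, tau, eps):
        """Simulate k sites monitoring whether the total number of arrivals
        has reached tau. Each site sends one message per `slack` new local
        arrivals, where slack = ceil(eps * tau / k), so the coordinator's
        tally lags the truth by less than k * slack <= eps * tau + k, and at
        most about k / eps messages are sent before firing."""
        slack = max(1, math.ceil(eps * tau / k))
        unreported = [0] * k       # arrivals at each site not yet reported
        tally = 0                  # coordinator's count of reported arrivals
        messages = 0
        for n, site in enumerate(arrivals, start=1):
            unreported[site] += 1
            if unreported[site] == slack:
                tally += slack
                unreported[site] = 0
                messages += 1
            if tally + k * (slack - 1) >= tau:   # true count may have hit tau
                return n, messages               # (arrivals seen, messages sent)
        return None, messages

    # Example: 3 sites, threshold 1000, 10% slack; arrivals cycle round-robin.
    arrivals = [i % 3 for i in range(1200)]
    print(monitor_f1_sum(arrivals, k=3, tau=1000, eps=0.1))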

Time-Decaying Sketches for Sensor Data Aggregation
G. Cormode, S. Tirthapura, B. Xu
ACM Principles of Distributed Computing (PODC),
2007.
[PDF]
[BIB]
We present a new sketch for summarizing network data.
The sketch has the following properties which make it useful in
communication-efficient aggregation in distributed streaming
scenarios, such as sensor networks: the sketch is duplicate-insensitive,
i.e., re-insertions of the same data will not affect
the sketch, and hence the estimates of aggregates. Unlike
previous duplicate-insensitive sketches for sensor data aggregation,
it is also time-decaying, so that the weight
of a data item in the sketch can decrease with time according
to a user-specified decay function. The sketch can give
provably approximate guarantees for various aggregates of
data, including the sum, median, quantiles, and frequent elements.
The size of the sketch and the time taken to update
it are both polylogarithmic in the size of the relevant data.
Further, multiple sketches computed over distributed data
can be combined with the same guarantees. To our knowledge,
this is the first sketch that combines all the above
properties.
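
The paper's sketch itself is not reproduced here, but the
duplicate-insensitivity property is easy to illustrate with a standard
k-minimum-values distinct counter: because an item always hashes to the
same value, re-inserting it leaves the summary, and hence the estimate,
unchanged. This Python sketch shows only that one property (no time
decay, no aggregates beyond distinct counting); the hash function and
the constant K = 32 are arbitrary illustrative choices.

import hashlib

K = 32   # sketch size; arbitrary choice for illustration

def h(item):
    # Deterministic hash into (0, 1): identical items always collide.
    digest = hashlib.sha1(str(item).encode()).hexdigest()
    return int(digest, 16) / 16**40

def update(sketch, item):
    # Keep the K smallest hash values seen so far.
    v = h(item)
    if v not in sketch:
        sketch.add(v)
        if len(sketch) > K:
            sketch.remove(max(sketch))

def estimate_distinct(sketch):
    # Standard k-minimum-values estimator for the number of distinct items.
    return len(sketch) if len(sketch) < K else (K - 1) / max(sketch)

sketch = set()
for item in ["a", "b", "a", "c", "a"]:    # re-insertions of "a" have no effect
    update(sketch, item)
print(estimate_distinct(sketch))          # 3 distinct items despite 5 arrivals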

The String Edit Distance Matching Problem with Moves
G. Cormode, S. Muthukrishnan
ACM Transactions on Algorithms,
v3,
#1,
2007.
[PDF]
[BIB]
The edit distance between two strings $S$ and $R$
is defined to be the minimum number of character inserts,
deletes and changes needed to convert $R$ to $S$.
Given a text string $t$ of length $n$, and a pattern string $p$
of length $m$, informally, the string edit distance matching problem is to
compute the smallest edit distance between $p$ and
substrings of $t$.
We relax the problem
so that (a) we allow an additional operation, namely,
{\em substring moves}, and (b) we
allow approximation of this string edit distance.
Our result is a near linear time deterministic
algorithm to produce a factor of $O(\log n \log^* n)$
approximation to the string edit distance with moves.
This is the first
known significantly subquadratic algorithm for a string
edit distance problem in which the distance involves
nontrivial alignments.
Our results are obtained by embedding strings into $L_1$
vector space using a simplified parsing technique we call
{\em Edit Sensitive Parsing} (ESP).

Sketching Probabilistic Data Streams
G. Cormode, M. Garofalakis
ACM SIGMOD International Conference on Management of Data (SIGMOD),
2007.
[PDF]
[BIB]
The management of uncertain, probabilistic data has
recently emerged
as a useful paradigm for dealing with the inherent unreliabilities
of several real-world application domains, including data cleaning,
information integration, and pervasive, multi-sensor computing.
Unlike conventional data sets, a set of probabilistic tuples defines
a probability distribution over an exponential number of possible
worlds (i.e., grounded, deterministic databases). This possible
worlds interpretation allows for clean query semantics but
also raises hard computational problems for probabilistic database
query processors. To further complicate matters, in many scenarios
(e.g., large-scale process and environmental monitoring using multiple
sensor modalities), probabilistic data tuples arrive and need to
be processed in a streaming fashion; that is, using limited memory
and CPU resources and without the benefit of multiple passes over a
static probabilistic database. Such probabilistic data streams raise a
host of new research challenges for stream-processing engines that,
to date, remain largely unaddressed.
In this paper, we propose the first space- and time-efficient algorithms
for approximating complex aggregate queries (including
the number of distinct values and join/self-join sizes) over probabilistic
data streams. Following the possible-worlds semantics,
such aggregates essentially define probability distributions over the
space of possible aggregation results, and our goal is to characterize
such distributions through efficient approximations of their key
moments (such as expectation and variance). Our algorithms offer
strong randomized estimation guarantees while using only sublinear
space in the size of the stream(s), and rely on novel, concise
streaming sketch synopses that extend conventional sketching ideas
to the probabilistic streams setting. Our experimental results verify
the effectiveness of our approach.
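
A tiny illustration of the possible-worlds view (invented for
exposition, not taken from the paper): if a probabilistic stream holds
two tuples that both carry the value $x$, each present independently
with probability $1/2$, then the four possible worlds each have
probability $1/4$, the number of distinct values is $0$ in one world and
$1$ in the other three, and so the expected distinct count is
$E[D] = \frac{1}{4}\cdot 0 + \frac{3}{4}\cdot 1 = 0.75$. The aggregates
above are characterized through exactly such moments of the induced
distribution.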

On Estimating Frequency Moments of Data Streams
S. Ganguly, G. Cormode
Proceedings of RANDOM,
2007.
[PDF]
[BIB]
Space-economical estimation of the $p$th frequency moments, defined as
$F_p = \Sigma_{i=1}^n |f_i|^p$, for $p > 0$, is of interest in
estimating all-pairs distances in a large data
matrix,
machine learning, and in data stream computation.
Random sketches formed by the inner product of the
frequency vector $f_1, \ldots, f_n$ with a suitably chosen random
vector were pioneered by Alon, Matias and Szegedy, and
have since played a
central role in estimating $F_p$ and for data stream computations in
general.
The concept of $p$-stable sketches formed by the inner product of the
frequency
vector with a random vector whose components are drawn from a
$p$-stable distribution, was proposed by Indyk
for estimating $F_p$, for $0 < p < 2$, and has been further studied
by Li.
In this paper, we consider the problem of estimating $F_p$, for $0 <
p < 2$. A disadvantage of the stable sketches technique and its
variants is that they require $O(\frac{1}{\epsilon^2})$
inner-products of the frequency
vector with dense vectors of stable
(or nearly stable) random variables to be maintained.
This means that each stream update can be quite time-consuming.
We present algorithms for estimating $F_p$, for $0 < p < 2$,
that do not require the use of stable sketches or their
approximations.
Our technique is elementary in nature, in that it
uses simple randomization in conjunction with well-known summary
structures for data streams, such as the CM sketch
and the Count sketch structure.
Our algorithms require space
$\tilde{O}(\frac{1}{\epsilon^{2+p}})$ to estimate $F_p$ to within $1 \pm
\epsilon$ factors and require expected time $O(\log F_1 \log
\frac{1}{\delta})$ to process each update. Thus, our technique
trades an $O(\frac{1}{\epsilon^p})$ factor in space for
much more efficient processing of stream updates.
We also present a stand-alone iterative estimator for $F_1$.
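
As a quick numerical reminder of the quantity being estimated (a worked
example, not part of the paper): for a stream over three items with
frequency vector $f = (3, 1, 2)$, $F_1 = 3 + 1 + 2 = 6$ is the stream
length and $F_2 = 9 + 1 + 4 = 14$ is the self-join size, while in the
regime $0 < p < 2$ studied here one has, for example,
$F_{0.5} = \sqrt{3} + \sqrt{1} + \sqrt{2} \approx 4.15$.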

No Blog is an Island: Analyzing Connections Across Information Networks
S. Bhagat, G. Cormode, S. Muthukrishnan, I. Rozenbaum, H. Xue
International Conference on Weblogs and Social Media,
2007.
[PDF]
[BIB]
There are multiple information and social networks among individuals,
from telephone to email, web, blog, instant message (IM) and
chat networks. Prior work has studied each of these individual networks
quite extensively, including telephone networks, postal
network, web communities, and so on. Still, what is of
great interest is how these networks collide and interact with each
other. Each individual has presence in multiple networks and it will
be of interest to understand how the references and presence in one
affects that in the others. Previously such studies would have been
difficult to do since each source of data was often owned by a specific
entity. Blogs now provide a unique, public source of data that
naturally provides visibility into multiple networks. For example,
blog entries cite web pages, blogs have friends and community, as
well as blog profiles that have links to email, IM and other networks.
In this paper, we use the blogs as a starting point to pull in data about
these multiple networks and study how these multiple networks interact
with each other. While many of the connections are within
specific networks, there is still a wealth of information, structure and
form to the connections across these networks, as our analyses show.

Conquering the Divide: Continuous Clustering of Distributed Data Streams
G. Cormode, S. Muthukrishnan, W. Zhuang
International Conference on Data Engineering (ICDE),
2007.
[PDF]
[BIB]
Data is often collected over a distributed network, but
in many cases, is so voluminous that it is impractical and
undesirable to collect it in a central location. Instead, we
must perform distributed computations over the data, guaranteeing
high quality answers even as new data arrives. In
this paper, we formalize and study the problem of maintaining
a clustering of such distributed data that is continuously
evolving. In particular, our goal is to minimize the communication
and computational cost, still providing guaranteed
accuracy of the clustering.
We focus on the k-center clustering, and provide a suite
of algorithms that vary based on which centralized algorithm
they derive from, and whether they maintain a single
global clustering or many local clusterings that can be
merged together. We show that these algorithms can be designed
to give accuracy guarantees that are close to the best
possible even in the centralized case. In our experiments,
we see clear trends among these algorithms, showing that
the choice of algorithm is crucial, and that we can achieve a
clustering that is as good as the best centralized clustering,
with only a small fraction of the communication required to
collect all the data in a single location.
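
For context, the sketch below shows one classical centralized k-center
baseline, Gonzalez' farthest-point heuristic (a 2-approximation). It is
offered only as an example of the kind of centralized algorithm that
such distributed schemes can derive from, not as any algorithm from the
paper; the function name and the toy point set are made up.

import math, random

def k_center_greedy(points, k, dist):
    # Gonzalez' farthest-point heuristic: a classical 2-approximation
    # for the k-center objective on stored data.
    centers = [random.choice(points)]
    while len(centers) < k:
        # Add the point farthest from its nearest chosen center.
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    radius = max(min(dist(p, c) for c in centers) for p in points)
    return centers, radius

pts = [(0, 0), (1, 0), (10, 10), (11, 10), (5, 20)]
centers, radius = k_center_greedy(pts, k=2, dist=math.dist)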

Applying Link-based Classification to Label Blogs
S. Bhagat, G. Cormode, I. Rozenbaum
Joint WEBKDD and SNA-KDD Workshop,
2007.
[PDF]
[BIB]
In analyzing data from social and communication networks, we encounter
the problem of classifying objects where there is an explicit
link structure amongst the objects. We study the problem of inferring
the classification of all the objects from a labeled subset, using
only the link-based information amongst the objects.
We abstract the above as a labeling problem on multigraphs with
weighted edges. We present two classes of algorithms, based on
local and global similarities. Then we focus on multigraphs induced
by blog data, and carefully apply our general algorithms to
specifically infer labels such as age, gender and location associated
with the blog based only on the link-structure amongst them. We
perform a comprehensive set of experiments with real, large-scale
blog data sets and show that significant accuracy is possible from
little or no non-link information, and our methods scale to millions
of nodes and edges.
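
In the spirit of the ``local similarity'' family mentioned above (and
emphatically not the paper's exact algorithms), the toy sketch below
labels each unlabeled node with the weighted-majority label of its
neighbours over a few rounds on a weighted multigraph; the graph, the
seed labels and the number of rounds are invented for illustration.

from collections import defaultdict

def local_majority_labels(edges, seed_labels, rounds=5):
    # edges: (u, v, weight) triples of a weighted multigraph;
    # seed_labels: node -> label for the labeled subset (kept fixed).
    nbrs = defaultdict(list)
    for u, v, w in edges:
        nbrs[u].append((v, w))
        nbrs[v].append((u, w))
    labels = dict(seed_labels)
    for _ in range(rounds):
        for node in nbrs:
            if node in seed_labels:
                continue
            votes = defaultdict(float)
            for other, w in nbrs[node]:
                if other in labels:
                    votes[labels[other]] += w
            if votes:
                labels[node] = max(votes, key=votes.get)
    return labels

edges = [("a", "b", 2.0), ("b", "c", 1.0), ("c", "d", 3.0)]
print(local_majority_labels(edges, {"a": "20s", "d": "30s"}))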

A Near-Optimal Algorithm for Computing the Entropy of a Stream
A. Chakrabarti, G. Cormode, A. McGregor
ACM-SIAM Symposium on Discrete Algorithms (SODA),
2007.
[PDF]
[BIB]
We describe a simple algorithm for computing the empirical entropy of
a stream of $m$ values in a single pass, using $O(\epsilon^{-2} \log
(\delta^{-1}) \log^2 m)$ words of space. Our algorithm is based upon
a novel extension of a method introduced by Alon, Matias, and Szegedy.
We show a space lower bound of $\Omega({\epsilon^{-2}/\log
(1/\epsilon)})$, meaning that our algorithm is near optimal in terms of
its dependency on $\epsilon$. This improves over previous work on this
problem. We show that generalizing to
$k$th order entropy requires close to linear space, and give additive
approximations using our algorithm.
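
For reference, this is just the standard definition rather than the
algorithm: the empirical entropy of a stream of $m$ values in which item
$i$ occurs $m_i$ times is $H = \sum_i \frac{m_i}{m} \log \frac{m}{m_i}$;
for the stream $\langle a, a, b, c \rangle$ with $m = 4$ this gives
$H = \frac{1}{2}\log 2 + \frac{1}{4}\log 4 + \frac{1}{4}\log 4 = 1.5$
bits (logarithms base 2).
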
What's Different: Distributed, Continuous Monitoring of Duplicate-Resilient Aggregates on Data Streams
G. Cormode, S. Muthukrishnan, W. Zhuang
International Conference on Data Engineering (ICDE),
pp 20--31,
2006.
[PDF]
[BIB]
Emerging applications in sensor systems and network-wide
IP traffic analysis present many technical challenges. They
need distributed monitoring and continuous tracking of events.
They have severe resource constraints not only at each site in
terms of per-update processing time and archival space for high-speed
streams of observations, but also crucially, communication
constraints for collaborating on the monitoring task. These
elements have been addressed in a series of recent works.
A fundamental issue that arises is that one cannot make the
uniqueness assumption on observed events which is present in
previous works, since wide-scale monitoring invariably encounters
the same events at different points. For example, within the network
of an Internet Service Provider, packets of the same flow will
be observed in different routers; similarly, the same individual
will be observed by multiple mobile sensors in monitoring wild
animals. Aggregates of interest on such distributed environments
must be resilient to duplicate observations.
We study such duplicate-resilient aggregates that measure the
extent of the duplication (how many unique observations are
there, how many observations are unique) as well as standard
holistic aggregates such as quantiles and heavy hitters over
the unique items. We present accuracy guaranteed, highly
communication-efficient algorithms for these aggregates that
work within the time and space constraints of high speed streams.
We also present results of a detailed experimental study on both
real-life and synthetic data.

Space- and Time-Efficient Deterministic Algorithms for Biased Quantiles over Data Streams
G. Cormode, F. Korn, S. Muthukrishnan, D. Srivastava
ACM Principles of Database Systems (PODS),
2006.
[PDF]
[BIB]
Skew is prevalent in data streams, and should be taken into account
by algorithms that analyze the data.
The problem of finding ``biased quantiles''--- that is, approximate
quantiles which must be more accurate for more extreme values ---
is a framework for summarizing such skewed data on data streams.
We present the first deterministic algorithms for answering biased
quantiles queries accurately with small---sublinear in the input size---
space and time bounds in one pass.
The space bound is near-optimal, and the amortized update cost is
close to constant,
making it practical for handling high speed network data streams.
We demonstrate that not only can we prove theoretical properties of the
algorithm, but that it also uses less space than existing methods in many
practical settings, and is fast to maintain.

Fast Approximate Wavelet Tracking on Streams
G. Cormode, M. Garofalakis, D. Sacharidis
Extending Database Technology,
pp 4--22,
2006.
[PDF]
[BIB]
Recent years have seen growing interest in effective
algorithms for summarizing and querying massive, high-speed
data streams.
Randomized sketch synopses
provide accurate approximations for general-purpose summaries of
the streaming data distribution
(e.g., wavelets).
The focus of existing work has typically been on minimizing
{\em space requirements} of the maintained synopsis
--- however, to effectively support high-speed
data-stream analysis, a crucial practical requirement is to also
optimize: (1) the {\em update time} for incorporating a streaming
data element in the sketch, and (2) the {\em query time} for producing
an approximate summary (e.g., the top wavelet coefficients) from the
sketch.
Such time costs must be small enough to cope with rapid
stream-arrival rates and the real-time querying requirements of typical
streaming applications (e.g., ISP network monitoring).
With cheap and plentiful memory, space is often only a
secondary concern after query/update time costs.
In this paper, we propose the first fast solution to
the problem of tracking wavelet representations of
one-dimensional and multi-dimensional data streams, based on a
novel stream synopsis, the {\em Group-Count Sketch (GCS)}.
By imposing a hierarchical structure of groups over the data and
applying the GCS, our algorithms can quickly recover the most
important wavelet coefficients with guaranteed accuracy.
A tradeoff between query time and update time is established, by
varying the hierarchical structure of groups, allowing the right balance
to be found for a specific data stream.
Experimental analysis confirms this tradeoff, and shows that all our
methods significantly outperform previously known methods
in terms of both update time and query time, while
maintaining a high level of accuracy.

Discrete Methods in Epidemiology
J. Abello and G. Cormode,
DIMACS,
AMS,
v70,
2006.
[PDF]
[BIB]
In general terms, epidemiology deals with populations rather than
individuals. One of its goals is to study the frequency of occurrences
of health related events. It has a major but not exclusive concern
with causes and determinants of disease patterns in populations. The
premise is that a systematic investigation of different populations
can identify causal and preventive factors. Epidemiology is an
observational rather than an experimental science. Sample questions
take the form of:
\begin{itemize}
\item
Does population exposure to $x$ increase the risk of a
disease $w$?
\item
Are dietary supplements $\{x,y,z\}$
beneficial in lowering the risk of malady $w$?
\item
Do behavioral interventions reduce risk behaviors?
\end{itemize}
We have observed that occurrence measures, causal inference and
study designs play prominent roles in the daily endeavors of a
typical epidemiologist. Descriptive and analytical epidemiology are
two overlapping flavors of this discipline.
Descriptive epidemiology attempts to describe patterns of disease
according to spatial and temporal information about the members of a
population. These patterns are described by tabulations or summaries
of surveys and polls or by parametric or non-parametric population
models. Models are in general global descriptions of the major part
of a data set. Patterns on the other hand are local features of the
data that can be described by association rules, mode or gaps in
density functions, outliers, inflection points in regressions,
symptom clusters, geographic hot spots, etc. Some epidemiologists
appear more interested in local patterns rather than in global
structure. This raises questions of how ``realistic'' certain
patterns are.
Analytical Epidemiology attempts to explain and predict the state of a
population's health. A typical goal is to summarize the
relationship between exposure and disease incidence by comparing two
measures of disease frequency. These comparisons may be affected by
chance, bias and by the presence or absence of an effect. This
explains naturally why statistical methods play a major role in
Epidemiology since bias is a central preoccupation of its
practitioners. Bias means a systematic error that results in an
incorrect or invalid estimate of the measure of association. This can
create or mask associations. Selection and information bias are two of
the main bias types. In particular, selection shall be independent of
exposure if the purpose of the study is to explain the relationship
between exposure and disease occurrence. In summary, one of the
central themes in analytical epidemiology is to understand the roles
of bias, chance and real effect in the understanding of populations
health.
To evaluate the role of chance, statistical hypothesis testing and
estimation appear to be the tools of choice. On the other hand,
generative models offer a way to describe infectious disease
dynamics. Since disease patterns are of primary interest, data mining
algorithms and detection of rules for pattern formation have a lot to
offer. Classification and taxonomies are useful tools to develop
predictive models. In general we believe that some questions
addressed by epidemiologists benefit from viewing them in a
mathematical and algorithmic context. This volume is a first attempt
to bridge the gap between the two communities. Its main emphasis is on
discrete methods that have successfully addressed some epidemiological
questions.
We begin by providing introductory chapters, on some of the key
methods from discrete data mining, by a selection of researchers in this
area; and on descriptive epidemiology by Schneider.
These collect, in a digested form, what we believe are among the most
potentially useful concepts in data mining and epidemiology.
Next there are two chapters reporting work in epidemiology that
suggest a discrete, analytical approach: Shannon on
challenges in molecular data analysis, and Hirschman and Damianos on a
system for monitoring news wires for indications of disease outbreaks.
The remainder of the volume draws out further some of the key areas in
the
intersection between epidemiology and discrete methods.
The technique of formal concept analysis, and the amazing
depth of mathematical structure that arises from it is explored in
chapters by
Ozonoff, Pogel and Hannan, and Abello and Pogel.
The dynamics of disease transmission can be modeled in a variety of
ways, but often involves setting up systems of differential equations
to model the ebb and flow of infection, as demonstrated by
Desai, Boily, M\^{a}sse and Anderson, and V\'{a}zquez, in the context
of quite different problems.
Eubank, Kumar, Marathe, Srinivasan and Wang study massive interaction
graphs and give results by a combination of combinatorial methods and
simulation; Abello and Capalbo focus on properties of graphs generated
by
an appropriate random model; while Hartke takes a combinatorial model
of disease spread on tree graphs.
Finally, we see two applications of Support Vector Machines to
epidemiological data sets, from Li, Muchnik and Schneider (using
breast cancer data from the SEER database) and from Fradkin, Muchnik,
Hermans and Morgan (using data on disease in chickens).
Some other potential areas of interest that we have not touched in
this collection relate to patient confidentiality, coding and
cryptography and multiscale inference.

Communication-Efficient Distributed Monitoring of Thresholded Counts
G. Cormode, R. Keralapura, J. Ramimirtham
ACM SIGMOD International Conference on Management of Data (SIGMOD),
2006.
[PDF]
[BIB]
Monitoring is an issue of primary concern in current and
next generation networked systems. For example, the objective of
sensor networks is to monitor their surroundings for a variety of
different applications like atmospheric conditions, wildlife
behavior, and troop movements among others. Similarly, monitoring in
data networks is critical not only for accounting and management,
but also for detecting anomalies and attacks. Such monitoring
applications are inherently continuous and distributed, and must be
designed to minimize the communication overhead that they introduce.
In this context we introduce and study a
fundamental class of problems called ``thresholded counts'' where we
must return the {\em aggregate} frequency count of an event that is
continuously monitored by distributed nodes with a user-specified
accuracy whenever the actual count exceeds a given threshold value.
In this paper we propose to address the problem of thresholded
counts by setting local thresholds at each monitoring node and
initiating communication only when the locally observed data exceeds
these local thresholds. We explore algorithms in two categories:
{\em static thresholds} and {\em adaptive thresholds}.
In the static case, we consider
thresholds based on a linear combination of two alternate
strategies, and show that there exists an optimal blend of the two
strategies that results in minimum communication overhead. We
further show that this optimal blend can be found using a steepest
descent search. In the adaptive case, we propose algorithms that
adjust the local thresholds based on the observed distributions of
updated information in the distributed monitoring system. We use
extensive simulations not only to verify the accuracy of our
algorithms and validate our theoretical results, but also to
evaluate the performance of the two approaches. We find that both
approaches yield significant savings over the naive approach of
performing processing at a centralized location.

Combinatorial Algorithms for Compressed Sensing
G. Cormode, S. Muthukrishnan
SIROCCO,
2006.
[PDF]
[BIB]
In sparse approximation theory, the fundamental problem is to
reconstruct a signal $A \in \mathbb{R}^n$ from linear measurements with
respect to a dictionary of $\psi_i$'s. Recently, there has been focus on
the novel direction of Compressed Sensing [Donoho:04] where the
reconstruction can be done with very few---$O(k \log n)$---linear
measurements over a modified dictionary if the signal is compressible,
that is, its information is concentrated in $k$ coefficients with the
original dictionary. In particular, these results prove that there
exists a single $O(k \log n) \times n$ measurement matrix such that any
such signal can be reconstructed from these measurements, with error at
most $O(1)$ times the worst case error for the class of such signals.
Compressed sensing has generated tremendous excitement both because of
the sophisticated underlying Mathematics and because of its potential
applications.
In this paper, we address outstanding open problems in Compressed
Sensing. Our main result is an explicit construction of a non-adaptive
measurement matrix and the corresponding reconstruction algorithm so
that with a number of measurements polynomial in $k$, $\log n$,
$1/\epsilon$, we can reconstruct compressible signals. This is the
first known polynomial time explicit construction of any such
measurement matrix. In addition, our result improves the error
guarantee from $O(1)$ to $1 + \epsilon$ and improves the reconstruction
time from $\mathrm{poly}(n)$ to $\mathrm{poly}(k \log n)$.
Our second result is a randomized construction of $O(k\,\mathrm{polylog}(n))$
measurements that work for each signal with high probability and give
per-instance approximation guarantees rather than guarantees over the
class of all signals. Previous work on Compressed Sensing does not
provide such per-instance approximation guarantees; our result improves
the best known number of measurements from prior work in other areas,
including Learning Theory, Streaming algorithms and Complexity Theory,
for this case.
Our approach is combinatorial. In particular, we use two parallel sets
of group tests, one to filter and the other to certify and estimate;
the resulting algorithms are quite simple to implement.

What's New: Finding Significant Differences in Network Data Streams
G. Cormode, S. Muthukrishnan
Transactions on Networking,
v13,
#6,
pp 1219-1232,
2005.
[PDF]
[BIB]
Monitoring and analyzing network traffic usage patterns is vital for
managing IP Networks.
An important problem is to provide network managers with information
about changes in traffic, informing them about ``what's new''.
Specifically, we focus on the challenge of finding significantly large
differences in traffic: over time, between interfaces and between
routers.
We introduce the idea of a {\em deltoid}: an item that has a large
difference,
whether the difference is {\em absolute}, {\em relative} or
{\em variational}.
We present novel algorithms for finding the most significant deltoids in high
speed traffic data, and prove that they use small space, very small time per
update, and are guaranteed to find significant deltoids with
pre-specified accuracy.
In experimental evaluation with real network traffic, our algorithms perform well and recover
almost all deltoids.
This is the first work to provide solutions
capable of working over the data with one pass, at network traffic speeds.

What's Hot and What's Not: Tracking Most Frequent Items Dynamically
G. Cormode, S. Muthukrishnan
ACM Transactions on Database Systems,
v30,
#1,
pp 249--278,
2005.
[PDF]
[BIB]
Most database management systems maintain statistics
on the underlying relation. One of the important statistics
is that of the 'hot items' in the relation: those
that appear many times (most frequently, or more than some threshold).
For example, end-biased histograms keep the hot items as part
of the histogram
and are used in selectivity estimation.
Hot items are used as simple outliers in data mining, and
in anomaly detection in many applications.
We present new methods for dynamically
determining the hot items at any time in a relation
which is undergoing deletion operations as well as inserts.
Our methods maintain small space data structures that
monitor the transactions on the relation, and when required, quickly output
all hot items, without rescanning the relation in the database.
With user-specified probability, all hot items are correctly reported.
Our methods rely on ideas from 'group testing'.
They are simple to implement, and have provable
quality, space and time guarantees.
Previously known algorithms for this problem that make similar quality and
performance guarantees can not handle deletions, and those that handle
deletions can not make similar guarantees without rescanning the database. Our
experiments with real and synthetic data show that our
algorithms are accurate in dynamically tracking the hot items
independent of the rate of insertions and deletions.
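
To convey the flavour of the group-testing idea under insertions and
deletions (a simplified illustration only; the actual methods also hash
items into many groups so that every hot item lands in some group where
it is a majority), keep one counter per bit position of the item
identifiers alongside the total count; a single item holding a majority
of the current count can then be decoded bit by bit. The 16-bit
identifier range is an arbitrary choice for this toy example.

BITS = 16  # item identifiers assumed to fit in 16 bits here

class MajorityGroup:
    """Bit-counting group test: supports inserts, deletes and majority decoding."""
    def __init__(self):
        self.total = 0
        self.bit_count = [0] * BITS

    def update(self, item, delta):          # delta = +1 for insert, -1 for delete
        self.total += delta
        for b in range(BITS):
            if (item >> b) & 1:
                self.bit_count[b] += delta

    def decode(self):
        # If some item accounts for more than half of self.total, each of its
        # bits is the majority value of the corresponding bit counter.
        return sum(1 << b for b in range(BITS)
                   if 2 * self.bit_count[b] > self.total)

g = MajorityGroup()
for x in [7, 7, 7, 3, 9]:
    g.update(x, +1)
g.update(9, -1)       # deletions are handled symmetrically
print(g.decode())     # 7, the current majority item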

Summarizing and Mining Skewed Data Streams
G. Cormode, S. Muthukrishnan
SIAM Conference on Data Mining (SDM),
2005.
[PDF]
[BIB]
Many applications generate massive data streams.
Summarizing such massive data requires fast, small space algorithms to
support post-hoc queries and mining.
An important observation is that such streams are rarely uniform, and
real data sources typically exhibit significant skewness.
These are well modeled by Zipf distributions, which are characterized
by a parameter, $z$, that captures the amount of skew.
We present a data stream summary that can answer point queries with
$\epsilon$ accuracy and show that the space needed is only
$O(\epsilon^{-\min\{1,1/z\}})$.
This is the first $o(1/\epsilon)$ space algorithm for this problem,
and we show it is essentially tight for skewed distributions.
We show that the same data structure can also estimate the $L_2$ norm
of the stream in $o(1/\epsilon^2)$ space for $z>1/2$, another
improvement over the existing $\Omega(1/\epsilon^2)$ methods.
We support our theoretical results with an experimental study over a
large variety of real and synthetic data.
We show that significant skew is present in both textual and
telecommunication data.
Our methods give strong accuracy, significantly better than other
methods, and behave exactly in line with their analytic bounds.
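
Unpacking the bound $O(\epsilon^{-\min\{1,1/z\}})$ above: for mildly
skewed data with $z \leq 1$ the exponent is $1$, recovering the usual
$O(1/\epsilon)$ space, whereas for strongly skewed data such as $z = 2$
the space drops to $O(1/\sqrt{\epsilon})$; ignoring constants, that is
roughly $32$ counters instead of $1000$ when $\epsilon = 0.001$.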

Summarizing and Mining Inverse Distributions on Data Streams via Dynamic Inverse Sampling
G. Cormode, S. Muthukrishnan, I. Rozenbaum
International Conference on Very Large Data Bases (VLDB),
pp 25--36,
2005.
[PDF]
[BIB]
Database management systems face the challenge of
dealing with massive data distributions which
arrive at high speeds while there is only small storage
available for managing and mining them.
Emerging data stream management systems approach this
problem by summarizing and mining the distributions
using samples or sketches. However, data distributions
can be ``viewed'' in different ways. For example,
a data stream of integer values can be viewed either
as the {\em forward} distribution $f(x)$, i.e., the
number of occurrences of $x$ in the
stream, or as its inverse, $f^{-1}(i)$, which is
the number of items that appear $i$ times.
While both such ``views'' are equivalent in stored
data systems, over data streams that entail approximations,
they may be significantly different. In other words,
samples and sketches developed for the forward distribution
may be ineffective for summarizing or mining the
inverse distribution. Yet, many applications such as
IP traffic monitoring naturally rely on mining inverse
distributions.
We formalize the problems of managing and mining inverse
distributions and show provable differences between summarizing
the forward distribution vs the inverse distribution.
We present methods for summarizing and mining
inverse distributions of data streams: they rely
on a novel technique to maintain a dynamic sample over the stream with
provable guarantees which can be used for a variety of
summarization tasks (building quantiles or equidepth histograms)
and mining (anomaly detection: finding heavy hitters,
and measuring the number of rare items), all with provable
guarantees on quality of approximations and time/space used by
our streaming methods. We also complement our analytical
and algorithmic results by presenting an experimental
study of the methods over network data streams.
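
The forward/inverse distinction is easy to state exactly on stored
data; the toy snippet below (purely illustrative, and unrelated to the
streaming methods of the paper) computes both views of a small stream.

from collections import Counter

stream = [5, 5, 5, 2, 2, 9]

forward = Counter(stream)             # f(x): how many times value x occurs
inverse = Counter(forward.values())   # f^{-1}(i): how many values occur exactly i times

print(forward)    # Counter({5: 3, 2: 2, 9: 1})
print(inverse)    # Counter({3: 1, 2: 1, 1: 1})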

Substring Compression Problems
G. Cormode, S. Muthukrishnan
ACM-SIAM Symposium on Discrete Algorithms (SODA),
pp 321--330,
2005.
[PDF]
[BIB]
We initiate a new class of string matching problems called
{\em Substring Compression Problems}.
Given a string $S$ that may be preprocessed,
the problem is to quickly find the
compressed representation or the compressed size of any query substring of
$S$
(Substring Compression Query or SCQ) or to find the length $l$
substring of $S$
whose compressed size is the least (Least Compressible Substring or LCS
problem).
Starting from the seminal paper of Lempel and Ziv over 25 years ago,
many different methods have emerged for compressing entire strings.
Determining substring compressibility is a natural variant that is
combinatorially and algorithmically challenging, yet surprisingly
has not been studied before. In addition, compressibility of strings
is emerging as a tool to compare biological sequences and analyze their
information content. However, typically, the compressibility
of the entire sequence is not as informative as that of portions of
the sequences. Thus substring compressibility may be a more suitable
basis for sequence analysis.
We present the first known, nearly optimal algorithms for substring
compression problems---SCQ, LCS and their generalizations---that
are exact or provably approximate.
Our exact algorithms exploit the structure in strings via suffix trees
and
our approximate algorithms rely on new relationships we find between
Lempel-Ziv compression and string parsings.

Space Efficient Mining of Multigraph Streams
G. Cormode, S. Muthukrishnan
ACM Principles of Database Systems (PODS),
pp 271--282,
2005.
[PDF]
[BIB]
The challenge of monitoring massive amounts of data generated by
communication networks has led to the interest in data stream
processing.
We study streams of edges in massive communication multigraphs, defined by
(source, destination) pairs.
The goal is to compute properties of the underlying graph while using small
space (much smaller than the number of communicants),
and to avoid bias introduced because some edges may appear many times,
while others are seen only once.
We give results for three fundamental problems on multigraph degree sequences:
estimating frequency moments of degrees, finding the heavy hitter
degrees, and computing range sums of degree values.
In all cases we are able to show space bounds
for our summarizing algorithms
that are significantly smaller than storing complete information.
We use a variety of data stream methods:
sketches, sampling, hashing and distinct
counting, but a common feature is that we use the approach of
{\em cascaded summaries}: nesting multiple estimation techniques within one
another.
In our experimental study, we see that such summaries are highly
effective, enabling massive multigraph streams to be effectively
summarized to answer queries of interest with high accuracy using only
a small amount of space.

Sketching Streams Through the Net: Distributed Approximate Query Tracking
G. Cormode, M. Garofalakis
International Conference on Very Large Data Bases (VLDB),
pp 13--24,
2005.
[PDF]
[BIB]
While traditional database systems optimize for performance
on one-shot query processing, emerging large-scale monitoring
applications require continuous tracking of complex data-analysis
queries over collections of physically-distributed streams.
Thus, effective solutions have to be simultaneously space/time
efficient (at each remote monitor site), communication efficient
(across the underlying communication network), and provide
continuous, guaranteed-quality approximate query answers.
In this paper, we propose novel algorithmic solutions for the
problem of continuously tracking a broad class of complex
aggregate queries in such a distributed-streams setting.
Our tracking schemes maintain approximate query answers with
provable error guarantees, while simultaneously optimizing the
storage space and processing time at each remote site, as well
as the communication cost across the network.
In a nutshell, our algorithms rely on tracking general-purpose
randomized sketch summaries of local streams at remote sites
along with concise prediction models of local site behavior
in order to produce highly communication- and space/time-efficient
solutions.
The end result is a powerful approximate query tracking framework
that readily incorporates several complex analysis queries
(including distributed join and multi-join aggregates, and
approximate wavelet representations), thus giving the first
known low-overhead tracking solution for such queries in the
distributed-streams model.
Experiments with real data validate our approach, revealing
significant savings over naive solutions as well as our
analytical worst-case guarantees.

Holistic Aggregates in a Networked World: Distributed Tracking of Approximate Quantiles
G. Cormode, M. Garofalakis, S. Muthukrishnan, R. Rastogi
ACM SIGMOD International Conference on Management of Data (SIGMOD),
pp 25--36,
2005.
[PDF]
[BIB]
While traditional database systems optimize for performance on
one-shot queries, emerging large-scale monitoring applications
require continuous tracking of complex aggregates and
data-distribution summaries over collections of
physically-distributed streams.
Thus, effective solutions have to be simultaneously space efficient
(at each remote site), communication efficient (across the underlying
communication network), and provide continuous, guaranteed-quality
estimates.
In this paper, we propose novel algorithmic solutions
for the problem of continuously tracking complex holistic
aggregates in such a distributed-streams setting ---
our primary focus is on approximate quantile summaries, but our
approach is more broadly applicable and can handle other
holistic-aggregate functions (e.g., ``heavy-hitters'' queries).
We present the first known distributed-tracking schemes for
maintaining accurate quantile estimates with provable
approximation guarantees, while simultaneously optimizing
the storage space at each remote site as well as the
communication cost across the network.
In a nutshell, our algorithms employ a combination
of local tracking at remote sites and simple prediction
models for local site behavior in order to produce highly
communication- and space-efficient solutions.
We perform extensive experiments with real and synthetic data
to explore the various tradeoffs and understand the role of
prediction models in our schemes.
The results clearly validate our approach, revealing significant
savings over naive solutions as well as our analytical worst-case
guarantees.

Effective Computation of Biased Quantiles over Data Streams
G. Cormode, F. Korn, S. Muthukrishnan, D. Srivastava
International Conference on Data Engineering (ICDE),
pp 20--31,
2005.
[PDF]
[BIB]
Skew is prevalent in many data sources such as IP traffic
streams. To continually summarize the
distribution of such data, a high-biased set of quantiles
(e.g., 50th, 90th and 99th percentiles) with finer error
guarantees at higher ranks (e.g., errors of 5, 1 and 0.1
percent, respectively) is more useful than uniformly
distributed quantiles (e.g., 25th, 50th and 75th percentiles)
with uniform error guarantees.
In this paper, we address the following two problems.
First, can we compute quantiles with finer
error guarantees for the higher ranks of the data distribution
effectively, using less space and computation time than
computing all quantiles uniformly at the finest error?
Second, if {\em specific\/} quantiles and their error bounds
are requested {\em a priori}, can the necessary space usage and
computation time be reduced?
We answer both questions in the affirmative by formalizing them as the
``high-biased'' quantiles and the ``targeted'' quantiles problems,
respectively, and presenting algorithms with provable guarantees, that
perform
significantly better than previously known solutions for these problems.
We implemented our algorithms in the Gigascope data stream management
system, and evaluated alternate approaches for maintaining the
relevant summary structures. Our experimental results on real and
synthetic IP data streams complement our theoretical analyses, and
highlight the importance of lightweight, non-blocking implementations
when maintaining summary structures over high-speed data streams.
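
One consistent way to read the running example above: with relative
accuracy parameter $\epsilon = 0.1$, a high-biased guarantee allows the
rank of the answer returned for the $\phi$-quantile to be off by about
$\epsilon(1-\phi)n$, i.e. $0.05n$ at the median, $0.01n$ at the 90th
percentile and $0.001n$ at the 99th percentile, matching the 5, 1 and
0.1 percent errors quoted, whereas a uniform guarantee with the same
parameter would permit a $0.1n$ rank error at every quantile.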

An Improved Data Stream Summary: The Count-Min Sketch and its Applications
G. Cormode, S. Muthukrishnan
Journal of Algorithms,
v55,
#1,
pp 58--75,
2005.
[PDF]
[BIB]
We introduce a new sublinear space data structure---the Count-Min Sketch--- for summarizing data streams. Our sketch allows fundamental queries in data stream summarization such as point, range, and inner product queries to be approximately answered very quickly; in addition, it can be applied to solve several important problems in data streams such as finding quantiles, frequent items, etc. The time and space bounds we show for using the CM sketch to solve these problems significantly improve those previously known --- typically from $1/\epsilon^2$ to $1/\epsilon$ in factor.
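
A minimal Count-Min sketch in Python, following the standard
construction from the paper (width $w = \lceil e/\epsilon \rceil$, depth
$d = \lceil \ln(1/\delta) \rceil$, point query by taking the minimum
over rows); the per-row md5 hashing below is purely for illustration,
whereas the analysis assumes pairwise-independent hash functions.

import hashlib, math

class CountMinSketch:
    # Point queries never underestimate, and overestimate by at most
    # eps * (total count) with probability at least 1 - delta.
    def __init__(self, eps=0.01, delta=0.01):
        self.w = math.ceil(math.e / eps)           # width of each row
        self.d = math.ceil(math.log(1 / delta))    # number of rows
        self.table = [[0] * self.w for _ in range(self.d)]

    def _bucket(self, row, item):
        digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.w

    def update(self, item, count=1):
        for row in range(self.d):
            self.table[row][self._bucket(row, item)] += count

    def query(self, item):
        return min(self.table[row][self._bucket(row, item)] for row in range(self.d))

cms = CountMinSketch(eps=0.01, delta=0.01)
for x in ["cat", "dog", "cat"]:
    cms.update(x)
print(cms.query("cat"))   # at least 2, and equal to 2 with high probability
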
What's New: Finding Significant Differences in Network Data Streams
G. Cormode, S. Muthukrishnan
Proceedings of IEEE Infocom,
pp 1534--1545,
2004.
[PDF]
[BIB]
Monitoring and analyzing network traffic usage patterns
is vital for managing IP Networks. An important problem is to provide
network managers with information about changes in traffic, informing them
about ``what's new''. Specifically, we focus on the challenge of finding
significantly large differences in traffic: over time, between interfaces
and between routers. We introduce the idea of a deltoid: an item that has
a large difference, whether the difference is absolute, relative or
variational.
We present novel algorithms for finding the most significant deltoids in
high speed traffic data, and prove that they use small space, very small
time per update, and are guaranteed to find significant deltoids with
pre-specified accuracy. In experimental evaluation with real network
traffic, our algorithms perform well and recover almost all deltoids. This
is the first work to provide solutions capable of working over the data
with one pass, at network traffic speeds.

The Hardness of the Lemmings Game, or Oh no, more NP-Completeness Proofs
G. Cormode
Proceedings of Third International Conference on Fun with Algorithms,
pp 65--76,
2004.
[PDF]
[BIB]
In the computer game `Lemmings', the player must guide a tribe of green-haired
lemming creatures to safety, and save them from an untimely demise.
We formulate the decision problem which is, given a level of the game, to
decide whether it is possible to complete the level (and hence to find
a solution to that level).
Under certain limitations, this can be decided in polynomial
time, but in general the problem is shown to be NP-Hard.
This can hold even if there
is only a single lemming to save, thus showing that it is hard to
approximate the number of saved lemmings to any factor.

How to increase the acceptance ratios of top conferences
G. Cormode, A. Czumaj, S. Muthukrishnan
Proceedings of Third International Conference on Fun with Algorithms,
pp 262--273,
2004.
[PDF]
[BIB]
In the beginning was the pub. This work was triggered by a pub conversation where the
authors observed that many resumes list acceptance ratios of conferences where their
papers appear, boasting the low acceptance ratio. The lower the ratio, the better your paper
looks. The list might look equally impressive if one listed the rejection ratio of
conferences where one's paper was submitted and rejected. We decided to lampoon
rather than lament the effort the PC typically put in: wouldn't the world be better if we
could encourage only high quality submissions and so run top conferences with very
high acceptance ratios? This paper captures our thoughts, and it is best consumed in a
pub (and in color).

Holistic UDAFs at streaming speeds
G. Cormode, F. Korn, S. Muthukrishnan, T. Johnson, O. Spatscheck, D. Srivastava
ACM SIGMOD International Conference on Management of Data (SIGMOD),
pp 35--46,
2004.
[PDF]
[BIB]
Many algorithms have been proposed to approximate holistic
aggregates, such as quantiles and heavy hitters, over data streams.
However, little work has been done to explore what techniques are
required to incorporate these algorithms in a data stream query
processor, and to make them useful in practice.
In this paper, we study the performance implications of using
user-defined aggregate functions (UDAFs) to incorporate selection-based
and sketch-based algorithms for holistic aggregates into a
data stream management system's query processing architecture.
We identify key performance bottlenecks and tradeoffs, and propose
novel techniques to make these holistic UDAFs fast and space-efficient
for use in high-speed data stream applications. We evaluate
performance using generated and actual IP packet data, focusing
on approximating quantiles and heavy hitters. The best of our
current implementations can process streaming queries at OC48
speeds (2x 2.4Gbps).

Electronic Books in Digital Libraries
G. Ozsoyoglu, N. H. Balkir, G. Cormode, Z. M. Ozsoyoglu
IEEE Transactions on Knowledge and Data Engineering,
v16,
#3,
pp 317--331,
2004.
[PDF]
[BIB]
An Electronic Book is an application with a multimedia database of instructional resources, which include hyperlinked text, instructor's audio/video clips, slides, animation, still images, etc., as well as content-based information about these data, and metadata such as annotations, tags, and cross-referencing information. Electronic books on the Internet or on CDs today are not easy to learn from. We propose the use of a multimedia database of instructional resources in constructing and delivering multimedia lessons about topics in an electronic book.
We introduce an electronic book data model containing (a) topic objects and (b) instructional resources, called instruction modules, which are multimedia presentations possibly capturing real-life lectures of instructors. We use the notion of topic prerequisites for topics at different detail levels, to allow electronic book users to request / compose multimedia lessons about topics in the electronic book. We present automated construction of the ``best'' user-tailored lesson (as a multimedia presentation).

Diamond in the Rough: Finding Hierarchical Heavy Hitters in Multi-Dimensional Data
G. Cormode, F. Korn, S. Muthukrishnan, D. Srivastava
ACM SIGMOD International Conference on Management of Data (SIGMOD),
pp 155--166,
2004.
[PDF]
[BIB]
Data items archived in data warehouses or those that arrive online as
streams typically have attributes which take values from multiple
hierarchies (e.g., time and geographic location; source and
destination IP addresses). Providing an aggregate view of such data is
important to summarize, visualize, and analyze. We develop the
aggregate view based on certain hierarchically organized sets of
large-valued regions (``heavy hitters''). Such Hierarchical Heavy
Hitters (HHHs) were previously introduced as a crucial aggregation
technique in one dimension. In order to analyze the wider range of
data warehousing applications and realistic IP data streams, we
generalize this problem to multiple dimensions.
We illustrate and study two variants of HHHs for multi-dimensional
data. In particular, we identify ``overlap'' and ``split'' variants,
depending on how an aggregate computed for a child node in the
multi-dimensional hierarchy is propagated to its parent
element(s). For data warehousing applications, we present offline
algorithms that take multiple passes over the data and produce the
exact HHHs. For data stream applications, we present online
algorithms that find approximate HHHs in one pass, with proven
accuracy guarantees.
We show experimentally, using real and synthetic data, that our
proposed online algorithms yield outputs which are very similar
(virtually identical, in many cases) to their offline
counterparts. The lattice property of the product of hierarchical
dimensions (``diamond'') is crucially exploited in our online
algorithms to track approximate HHHs using only a small, fixed number
of statistics per candidate node, regardless of the number of
dimensions.

An Improved Data Stream Summary: The Count-Min Sketch and its Applications
G. Cormode, S. Muthukrishnan
Proceedings of Latin American Theoretical Informatics (LATIN),
pp 29-38,
2004.
[PDF]
[BIB]
We introduce a new sublinear space data structure---the Count-Min Sketch--- for summarizing data streams. Our sketch allows fundamental queries in data stream summarization such as point, range, and inner product queries to be approximately answered very quickly; in addition, it can be applied to solve several important problems in data streams such as finding quantiles, frequent items, etc. The time and space bounds we show for using the CM sketch to solve these problems significantly improve those previously known --- typically from $1/\epsilon^2$ to $1/\epsilon$ in factor.
Journal version appeared in Journal of Algorithms.

What's Hot and What's Not: Tracking Most Frequent Items Dynamically
G. Cormode, S. Muthukrishnan
ACM Principles of Database Systems (PODS),
pp 296-306,
2003.
[PDF]
[BIB]
Most database management systems maintain statistics on the underlying relation. One of the important statistics is that of the 'hot items' in the relation: those that appear many times (most frequently, or more than some threshold). For example, end-biased histograms keep the hot items as part of the histogram and are used in selectivity estimation. Hot items are used as simple outliers in data mining, and in anomaly detection in networking applications. We present a new algorithm for dynamically determining the hot items at any time in the relation that is undergoing deletion operations as well as inserts. Our algorithm maintains a small space data structure that monitors the transactions on the relation, and when required, quickly outputs all hot items, without rescanning the relation in the database. With user-specified probability, it is able to report all hot items. Our algorithm relies on the idea of ``group testing'', is simple to implement, and has provable quality, space and time guarantees. Previously known algorithms for this problem that make similar quality and performance guarantees can not handle deletions, and those that handle deletions can not make similar guarantees without rescanning the database. Our experiments with real and synthetic data show that our algorithm is remarkably accurate in dynamically tracking the hot items independent of the rate of insertions and deletions.

Stable distributions for stream computations: it's as easy as 0,1,2
G. Cormode
Workshop on Management and Processing of Massive Data Streams at FCRC,
2003.
[PDF]
[BIB]
A surprising number of data stream problems are solved by methods involving computations with stable distributions. This paper will give a short summary of some of these problems, and how the best known solutions depend on use of stable distributions; it also lists some related open problems.
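
As a pocket illustration of the idea for $p = 1$ (the textbook Cauchy,
i.e. 1-stable, sketch of Indyk, not anything specific to this note):
the dot product of the frequency vector with a vector of independent
Cauchy variables is distributed as $\|f\|_1$ times a Cauchy variable,
and since the median of the absolute value of a Cauchy variable is 1,
the median of a few such dot products estimates $\|f\|_1$. The vector,
seed and number of repetitions below are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
n, reps = 1000, 200

f = np.zeros(n)
f[[3, 17, 256]] = [10.0, -4.0, 6.0]     # sparse frequency vector; ||f||_1 = 20

C = rng.standard_cauchy((reps, n))      # 1-stable (Cauchy) random projections
sketch = C @ f                          # each entry ~ ||f||_1 * Cauchy
estimate = np.median(np.abs(sketch))    # median(|Cauchy|) = 1, so this ~ ||f||_1

print(estimate)                         # close to 20
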
Sequence Distance Embeddings
G. Cormode
University of Warwick,
2003.
[PDF]
[BIB]
Sequences represent a large class of fundamental objects in Computer Science - sets, strings, vectors and permutations are considered to be sequences. Distances between sequences measure their similarity, and computations based on distances are ubiquitous: either to compute the distance, or to use distance computation as part of a more complex problem. This thesis takes a very specific approach to solving questions of sequence distance: sequences are embedded into other distance measures, so that distance in the new space approximates the original distance. This allows the solution of a variety of problems including:
\begin{itemize}
\item
Fast computation of short `sketches' in a variety of computing models, which allow sequences to be compared in constant time and space irrespective of the size of the original sequences.
\item
Approximate nearest neighbor and clustering problems, significantly faster than the naïve exact solutions.
\item
Algorithms to find approximate occurrences of pattern sequences in long text sequences in near linear time.
\item
Efficient communication schemes to approximate the distance between, and exchange, sequences in close to the optimal amount of communication.
\end{itemize}
Solutions are given for these problems for a variety of distances, including fundamental distances on sets and vectors; distances inspired by biological problems for permutations; and certain text editing distances for strings. Many of these embeddings are computable in a streaming model where the data is too large to store in memory, and instead has to be processed as and when it arrives, piece by piece. The embeddings are also shown to be practical, with a series of large scale experiments which demonstrate that given only a small space, approximate solutions to several similarity and clustering problems can be found that are as good as or better than those found with prior methods.

Finding Hierarchical Heavy Hitters in Data Streams
G. Cormode, F. Korn, S. Muthukrishnan, D. Srivastava
International Conference on Very Large Data Bases (VLDB),
pp 464-475,
2003.
[PDF]
[BIB]
Aggregation along hierarchies is a critical summary technique in a large variety of online applications including decision support, and network management (e.g., IP clustering, denial-of-service attack monitoring). Despite the amount of recent study that has been dedicated to online aggregation on sets (e.g., quantiles, hot items), surprisingly little attention has been paid to summarizing hierarchical structure in stream data.
The problem we study in this paper is that of finding Hierarchical Heavy Hitters (HHH): given a hierarchy and a fraction $\phi$, we want to find all HHH nodes that have a total number of descendants in the data stream larger than a $\phi$ fraction of the total number of elements in the data stream, after discounting the descendant nodes that are HHH nodes. The resulting summary gives a topological 'cartogram' of the hierarchical data. We present deterministic and randomized algorithms for finding HHHs, which build upon existing techniques by incorporating the hierarchy into the algorithms. Our experiments demonstrate several factors of improvement in accuracy over the straightforward approach, which is due to making the algorithms hierarchy-aware.
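
To pin down the definition on a toy example (an exact computation on
stored data, not the streaming algorithms of the paper; the prefix
hierarchy, counts and threshold are invented): a node is reported as an
HHH if its descendant count, after discounting descendants already
covered by HHH nodes below it, exceeds $\phi N$.

# Toy prefix hierarchy (child -> parent) and leaf counts; "*" is the root.
parent = {"1.2.3.4": "1.2.3", "1.2.3.5": "1.2.3", "1.2.4.1": "1.2.4",
          "1.2.3": "1.2", "1.2.4": "1.2", "1.2": "*"}
leaf_counts = {"1.2.3.4": 5, "1.2.3.5": 4, "1.2.4.1": 2}
order = ["1.2.3.4", "1.2.3.5", "1.2.4.1", "1.2.3", "1.2.4", "1.2", "*"]  # children first

N = sum(leaf_counts.values())    # 11 elements in total
phi = 0.4                        # report nodes covering more than phi * N = 4.4

subtree = dict(leaf_counts)      # total count of elements at or below each node
for node in order[:-1]:
    subtree[parent[node]] = subtree.get(parent[node], 0) + subtree.get(node, 0)

hhh, covered = [], {}            # covered[p]: count below p already claimed by HHHs
for node in order:
    discounted = subtree.get(node, 0) - covered.get(node, 0)
    is_hhh = discounted > phi * N
    if is_hhh:
        hhh.append(node)
    if node != "*":
        # An HHH passes its whole subtree count upwards as already claimed;
        # otherwise pass along whatever its own HHH descendants claimed.
        passed = subtree.get(node, 0) if is_hhh else covered.get(node, 0)
        covered[parent[node]] = covered.get(parent[node], 0) + passed

print(hhh)   # ['1.2.3.4', '1.2']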

Estimating Dominance Norms of Multiple Data Streams
G. Cormode, S. Muthukrishnan
European Symposium on Algorithms,
LNCS,
v2838,
2003.
[PDF]
[BIB]
There is much focus in the algorithms and database communities on
designing tools to manage and mine data streams.
Typically, data streams consist of multiple signals.
Formally, a stream of multiple signals is $(i,a_{i,j})$
where $i$'s correspond to the domain, $j$'s index the different
signals and $a_{i,j}\geq 0$ give the value of the $j$th signal at
point $i$.
We study the problem of finding norms that are cumulative of
the multiple signals in the data stream.
For example, consider the max-dominance norm,
defined as $\Sigma_i \max_j \{a_{i,j}\}$.
It may be thought of as estimating the norm of the ``upper
envelope'' of the multiple signals, or alternatively, as
estimating the norm of the ``marginal'' distribution of tabular data streams.
It is used in applications to estimate the ``worst case influence'' of
multiple processes, for example in IP traffic analysis, electrical
grid monitoring and the financial domain.
In addition, it is a natural measure, generalizing the union of data
streams or counting distinct elements in data streams.
We present the first known data stream algorithms for estimating
max-dominance of multiple signals.
In particular, we use workspace and time-per-item that are both sublinear (in
fact, poly-logarithmic) in the input size. In contrast,
other notions of dominance on streams $a$, $b$ --- min-dominance
($\Sigma_i \min_j \{a_{i,j}\}$),
count-dominance ($|\{i | a_i > b_i\}|$) or relative-dominance
($\Sigma_i a_i/\max\{1,b_i\}$ ) --- are all impossible to estimate
accurately with sublinear space.
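
For concreteness, the snippet below computes the max-dominance norm exactly by storing the whole upper envelope; it illustrates only the quantity being estimated, not the poly-logarithmic-space streaming algorithms of the paper, and the tuple format and example stream are invented for illustration.
\begin{verbatim}
# Exact max-dominance norm: sum_i max_j a_{i,j}.
# Naive baseline that stores the whole upper envelope; the paper estimates
# this in poly-logarithmic space instead.
from collections import defaultdict

def max_dominance(stream):
    envelope = defaultdict(int)
    for i, j, a in stream:                 # updates (domain point, signal id, value)
        envelope[i] = max(envelope[i], a)
    return sum(envelope.values())

# signal 0 = [3, 1, 0], signal 1 = [2, 5, 4]; upper envelope [3, 5, 4] -> 12
stream = [(0, 0, 3), (1, 0, 1), (0, 1, 2), (1, 1, 5), (2, 1, 4)]
print(max_dominance(stream))               # 12
\end{verbatim}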

Comparing Data Streams Using Hamming Norms
G. Cormode, M. Datar, P. Indyk, S. Muthukrishnan
IEEE Transactions on Knowledge and Data Engineering,
v15,
#3,
pp 529--541,
2003.
[PDF]
[BIB]
Massive data streams are now fundamental to many data processing
applications. For example, Internet routers produce large scale
diagnostic data streams. Such streams are rarely stored in
traditional databases, and instead must be processed ``on the fly'' as
they are produced. Similarly, sensor networks produce multiple data
streams of observations from their sensors. There is growing focus on
manipulating data streams, and hence, there is a need to identify
basic operations of interest in managing data streams, and to support
them efficiently.
We propose computation of the Hamming norm as a basic operation of
interest. The Hamming norm formalizes ideas
that are used throughout data processing.
When applied to a single stream, the Hamming norm gives the number of distinct
items that are
present in that data stream, which is a statistic of great interest in
databases. When applied to a pair of streams, the Hamming norm gives
an important measure of (dis)similarity: the number of unequal item
counts in the two streams. Hamming norms have many uses
in comparing data streams.
We present a novel approximation technique for estimating the Hamming norm
for massive data streams; this relies on what we call the ``$l_0$
{\it sketch}''
and we prove its accuracy. We test our approximation method on a large
quantity of synthetic and real stream data, and show that the
estimation is accurate to within a few percentage points.
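
As a point of reference, the snippet below computes the Hamming norm of the difference of two update streams exactly by keeping full counts; it shows the quantity that the $l_0$ sketch approximates in small space. The stream encoding and example are invented for illustration.
\begin{verbatim}
# Exact Hamming norm of the difference of two count streams: the number of
# items whose total counts differ.  Full-space baseline, not the l_0 sketch.
from collections import Counter

def hamming_norm(stream_a, stream_b):
    counts = Counter()
    for item, c in stream_a:
        counts[item] += c
    for item, c in stream_b:
        counts[item] -= c
    return sum(1 for v in counts.values() if v != 0)

# With an empty second stream this is just the number of distinct items.
print(hamming_norm([("x", 2), ("y", 1)], [("x", 2), ("z", 3)]))   # 2: y and z differ
\end{verbatim}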

The String Edit Distance Matching Problem with Moves
G. Cormode, S. Muthukrishnan
ACM-SIAM Symposium on Discrete Algorithms (SODA),
pp 667--676,
2002.
[PDF]
[BIB]
The edit distance between two strings $S$ and $R$
is defined to be the minimum number of character inserts,
deletes and changes needed to convert $R$ to $S$.
Given a text string $t$ of length $n$, and a pattern string $p$
of length $m$, informally, the string edit distance matching problem is to
compute the smallest edit distance between $p$ and
substrings of $t$. A well known dynamic programming algorithm
takes time $O(nm)$ to solve this problem, and it is an important
open problem in Combinatorial Pattern Matching to
significantly improve this bound.
We relax the problem
so that (a) we allow an additional operation, namely,
{\em substring moves}, and (b) we approximate the string edit distance
up to a factor of $O(\log n \log^* n)$. ($\log^* n$ is the
number of times the $\log$ function must be applied to $n$ to produce a constant.)
Our result is a near linear time deterministic
algorithm for this version of the problem.
This is the first
known significantly subquadratic algorithm for a string
edit distance problem in which the distance involves
nontrivial alignments.
Our results are obtained by embedding strings into $L_1$
vector space using a simplified parsing technique we call
{\em Edit Sensitive Parsing} (ESP). This embedding is
approximately distance preserving, and we show many applications
of this embedding to string proximity problems including nearest neighbors,
outliers, and streaming computations with strings.
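
For context, here is the well-known $O(nm)$ dynamic program for the move-free version of the matching problem, i.e. the quadratic baseline that the near-linear approximation improves upon; the function name and example strings are invented for illustration.
\begin{verbatim}
# Classic O(nm) DP: smallest edit distance between pattern p and any
# substring of text t (inserts, deletes, changes; no substring moves).
def edit_distance_matching(p, t):
    m, n = len(p), len(t)
    prev = [0] * (n + 1)                   # a match may start at any text position
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if p[i - 1] == t[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # skip a pattern character
                         cur[j - 1] + 1,     # skip a text character
                         prev[j - 1] + cost) # change or match
        prev = cur
    return min(prev)                       # best over all ending positions in t

print(edit_distance_matching("abc", "xxabdxx"))   # 1 (change 'd' to 'c')
\end{verbatim}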

Fast Mining of Tabular Data via Approximate Distance Computations
G. Cormode, P. Indyk, N. Koudas, S. Muthukrishnan
International Conference on Data Engineering (ICDE),
pp 605--616,
2002.
[PDF]
[BIB]
Tabular data abound in many data stores: traditional
relational databases store tables, and
new applications also generate massive tabular
datasets.
For example, consider the geographic distribution of cell
phone traffic at different base stations across the
country
(which can be represented by a table indexed by
latitude and longitude),
or the evolution of traffic at Internet
routers over time
(which can be represented by a table
indexed by time, source and destination IP addresses).
Detecting similarity patterns in such data
sets (e.g., which geographic regions have similar cell phone
usage distribution, which IP subnet traffic distributions
over time intervals are similar, etc) is of great importance.
Identification of such patterns poses many
conceptual challenges (what is a suitable similarity
distance function for two ``regions'') as well as
technical challenges (how to perform similarity computations
efficiently as massive tables get accumulated over time) that
we address.
We present methods for determining
similar regions in massive tabular data.
Our methods are for computing the ``distance'' between
any two subregions of tabular data: they are approximate,
but highly accurate as we prove mathematically, and they are
fast, running in time nearly linear in the table size. Our methods
are general since these distance computations can be
applied to any mining or
similarity algorithms that use $L_p$ norms.
A novelty of our distance computation procedures is that they
work for any $L_p$ norms --- not only the traditional
$p=2$ or $p=1$, but for all $p\leq 2$; the choice of $p$,
say {\em fractional} $p$,
provides an interesting alternative similarity behavior!
We use our algorithms in a detailed
experimental study of the clustering patterns in real tabular
data obtained from one of AT\&T's data stores and show that our
methods are substantially faster than straightforward
methods while remaining highly accurate, and are able to detect
interesting patterns by varying the value of $p$.
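
The quantity at the heart of these methods is simple to state; the snippet below computes the exact $L_p$ distance between two equally-sized table regions for any $p \leq 2$, including fractional $p$. The paper's contribution is approximating such distances quickly via sketches over massive tables; the function name and toy regions here are invented for illustration.
\begin{verbatim}
# Exact L_p distance between two equally-sized table regions, for any p <= 2
# (including fractional p).  This is the quantity the paper approximates.
import numpy as np

def lp_distance(region_a, region_b, p):
    diff = np.abs(np.asarray(region_a, dtype=float) - np.asarray(region_b, dtype=float))
    return float((diff ** p).sum() ** (1.0 / p))

a = [[4, 0, 1], [2, 2, 0]]
b = [[3, 0, 0], [2, 5, 0]]
for p in (0.5, 1.0, 2.0):                  # smaller p emphasizes many small differences
    print(p, lp_distance(a, b, p))
\end{verbatim}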

Comparing Data Streams Using Hamming Norms
G. Cormode, M. Datar, P. Indyk, S. Muthukrishnan
International Conference on Very Large Data Bases (VLDB),
pp 335--345,
2002.
[PDF]
[BIB]
Massive data streams are now fundamental to many data processing
applications. For example, Internet routers produce large scale
diagnostic data streams. Such streams are rarely stored in
traditional databases, and instead must be processed ``on the fly'' as
they are produced. Similarly, sensor networks produce multiple data
streams of observations from their sensors. There is growing focus on
manipulating data streams, and hence, there is a need to identify
basic operations of interest in managing data streams, and to support
them efficiently.
We propose computation of the Hamming norm as a basic operation of
interest. The Hamming norm formalizes ideas
that are used throughout data processing.
When applied to a single stream, the Hamming norm gives the number of distinct
items that are
present in that data stream, which is a statistic of great interest in
databases. When applied to a pair of streams, the Hamming norm gives
an important measure of (dis)similarity: the number of unequal item
counts in the two streams. Hamming norms have many uses
in comparing data streams.
We present a novel approximation technique for estimating the Hamming norm
for massive data streams; this relies on what we call the ``$l_0$
{\it sketch}''
and we prove its accuracy. We test our approximation method on a large
quantity of synthetic and real stream data, and show that the
estimation is accurate to within a few percentage points.
Journal version
in {\em {IEEE} Transactions on Knowledge and Data
Engineering} 15(3):529--541, 2003.

Permutation Editing and Matching via Embeddings
G. Cormode, S. Muthukrishnan, S. C. Sahinalp
International Colloquium on Automata, Languages and Programming (ICALP),
v2076,
pp 481--492,
2001.
[PDF]
[BIB]
If the genetic maps of two species are modelled as permutations of
(homologous) genes, the number of chromosomal rearrangements
in the form of deletions, block moves, inversions etc. to transform one
such permutation to another can be used as a measure of their evolutionary
distance. Motivated by such scenarios in Computational Biology and
elsewhere, we study problems of
computing distances between permutations
as well as matching permutations in
sequences, and finding the most similar permutation from a collection
(``nearest neighbor'').
We adopt a
general approach: embed permutation distances of relevance into well-known
vector spaces in an approximately distance-preserving manner,
and solve the resulting problems on the well-known spaces.
Our results are as follows:
\begin{itemize}
\item
We present the first known approximately distance preserving
embeddings of these permutation distances into well-known spaces.
\item
Using these embeddings, we obtain several results, including
the first known efficient
solution for approximately solving nearest neighbor
problems with permutations
and
the first known
algorithms for finding permutation distances in the ``data stream''
model.
\item
We consider a novel
class of problems called {\em permutation matching} problems
which are similar to string matching problems,
except that the
pattern is a permutation (rather than a string)
and present linear or near-linear time
algorithms for approximately solving permutation matching problems;
in contrast, the corresponding string problems take significantly
longer.
\end{itemize}
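
To illustrate the general embed-and-compare approach (though not the specific distances studied in the paper), the sketch below maps each permutation to its vector of element positions and compares the vectors under $L_1$; this particular embedding yields the Spearman footrule. The function names and example are invented for illustration.
\begin{verbatim}
# Embed permutations as position vectors, then compare in L1.
# Illustrates "embed, then work in a vector space"; the paper embeds
# biologically motivated distances (e.g. reversals, block moves) instead.
import numpy as np

def position_vector(perm):
    pos = np.empty(len(perm), dtype=int)
    for idx, val in enumerate(perm):       # perm is a permutation of 0..n-1
        pos[val] = idx
    return pos

def footrule_distance(p1, p2):
    return int(np.abs(position_vector(p1) - position_vector(p2)).sum())

print(footrule_distance([0, 1, 2, 3], [1, 0, 3, 2]))   # 4
\end{verbatim}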

Electronic Books in Digital Libraries
G. Ozsoyoglu, N. H. Balkir, G. Cormode, Z. M. Ozsoyoglu
Proceedings of IEEE Advances in Digital Libraries (ADL),
pp 5--14,
2000.
[PDF]
[BIB]
An Electronic Book is an application with a multimedia database of instructional resources, which include hyperlinked text, instructor's audio/video clips, slides, animation, still images, etc., as well as content-based information about these data, and metadata such as annotations, tags, and cross-referencing information. Electronic books on the Internet or on CDs today are not easy to learn from. We propose the use of a multimedia database of instructional resources in constructing and delivering multimedia lessons about topics in an electronic book.
We introduce an electronic book data model containing (a) topic objects and (b) instructional resources, called instruction modules, which are multimedia presentations possibly capturing real-life lectures of instructors. We use the notion of topic prerequisites for topics at different detail levels, to allow electronic book users to request / compose multimedia lessons about topics in the electronic book. We present the automated construction of the ``best'' user-tailored lesson (as a multimedia presentation).

Communication Complexity of Document Exchange
G. Cormode, M. Paterson, S. C. Sahinalp, U. Vishkin
ACM-SIAM Symposium on Discrete Algorithms (SODA),
pp 197--206,
2000.
[PDF]
[BIB]
We have two users, $A$ and $B$, who hold documents $x$ and $y$ respectively.
Neither of the users has any information about the other's document.
They exchange messages
so that $B$ computes $x$; it may be required that $A$ compute
$y$ as well.
Our goal is to design communication protocols with
the main objective of minimizing the total number of bits they exchange;
other objectives are
minimizing the number of rounds and the complexity of internal
computations.
An important notion which determines the efficiency of the
protocols is how one measures the distance between $x$ and $y$.
We consider several metrics for measuring this distance,
namely the Hamming metric, the Levenshtein metric (edit distance),
and a new LZ metric, which is introduced in this paper.
We show how to estimate the distance between $x$ and $y$ using a
single message of logarithmic size.
For each metric, we present the first
communication-efficient protocols, which often
match the corresponding lower bounds.
A consequence of these protocols is a family of error-correcting codes
for these error models which correct up to $d$ errors in $n$ characters
using $O(d \log n)$ bits.
Our most interesting methods use a new
{\em histogram transformation} that we introduce to convert
edit distance to $L_1$ distance.
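
The flavour of the setting can be seen in the toy simulation below: the parties compare short hashes of halves of their documents and recurse only where the hashes disagree, so roughly $O(d \log n)$ short messages locate $d$ differing positions under the Hamming metric. This folklore divide-and-conquer scheme is only an illustration of the model (and is multi-round), not the protocols of the paper; the function names, hash truncation and example strings are invented, and equal-length inputs are assumed.
\begin{verbatim}
# Toy multi-round "document exchange" simulation for the Hamming metric:
# compare short hashes of halves, recurse into halves that differ.
# Illustration of the setting only; not the paper's protocols.
import hashlib

def _h(s):
    return hashlib.sha256(s.encode()).hexdigest()[:8]   # one short "message"

def differing_positions(x, y, lo=0, hi=None, found=None):
    if hi is None:
        hi, found = len(x), []
    if _h(x[lo:hi]) == _h(y[lo:hi]):       # hashes agree: assume segments equal
        return found
    if hi - lo == 1:
        found.append(lo)                   # narrowed down to a single position
        return found
    mid = (lo + hi) // 2
    differing_positions(x, y, lo, mid, found)
    differing_positions(x, y, mid, hi, found)
    return found

print(differing_positions("abcdefgh", "abXdefgY"))   # [2, 7]
\end{verbatim}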

Accurate and Efficient Private Release of Datacubes and Contingency Tables
Graham Cormode, Cecilia Procopiuc, Divesh Srivastava, Grigory Yaroslavtsev
IEEE International Conference on Data Engineering,
2013.
[PDF]
[BIB]
IEEE Copyright
This version of the work is reprinted here with permission of IEEE for your personal use. Not for redistribution. The definitive version was published in IEEE ICDE 2013, 2013-04-08.
{A central problem in releasing information is to do so accurately while providing a privacy guarantee on the output. Recently, there has been much work which guarantees differential privacy on the output, and aims to maximize the utility within this guarantee, for the class of {\em linear queries}, which include basic counting queries, data cubes, and contingency tables. Much work follows a common template: pick a ``strategy'' set of linear queries to apply to the data, then use the noisy answers to these queries to reconstruct the queries of interest. This entails either picking a strategy set that is hoped to be good for the queries, or performing an exhaustive search over the space of all possible strategies.
In this paper, we propose a new approach that balances accuracy and
efficiency: we show how to optimize the accuracy of a given strategy by
answering some strategy queries more accurately than others, based on the target queries. This leads to an efficient optimal noise allocation for many popular strategies, including wavelets, hierarchies, Fourier coefficients and more.
For the important case of marginal queries (equivalently, subsets of the data cube), we show that this strictly improves on previous methods, both analytically and empirically. Our results also extend to ensuring that the returned query answers are consistent with an (unknown) data set at minimal extra cost in
terms of time and noise.
}
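
The primitive underlying all of this is the Laplace mechanism for noisy counts; the sketch below releases a vector of counts with $\epsilon$-differential privacy, assuming each individual affects a single count by at most one. It shows only the baseline being optimized, not the paper's strategy selection or noise allocation, and the function name and example counts are invented for illustration.
\begin{verbatim}
# Baseline epsilon-DP release of counts via the Laplace mechanism,
# assuming each individual changes one count by at most 1 (sensitivity 1).
# The paper optimizes how noise is allocated across a strategy of queries.
import numpy as np

def laplace_counts(true_counts, epsilon, rng=None):
    rng = rng or np.random.default_rng()
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon, size=len(true_counts))
    return np.asarray(true_counts, dtype=float) + noise

print(laplace_counts([120, 45, 7, 0], epsilon=0.5))
\end{verbatim}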

Don’t Let The Negatives Bring You Down: Algorithms for Sampling from Streams of Signed Updates
Graham Cormode, Edith Cohen, Nicholas Duffield
ACM SIGMETRICS,
2012.
[PDF]
[BIB]
ACM Copyright
(c) ACM, 2011. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM SIGMETRICS, 2012-06-30.
{Random sampling is a powerful tool for working with large data. Queries over the full
dataset are replaced by approximate queries over the smaller (and hence easier to store and manipulate) sample. The sample constitutes a flexible summary that supports a wide class of queries. But in many applications, datasets are modified with time, and it is desirable to update samples without requiring access to the full underlying datasets. In this paper, we introduce and analyze novel techniques for sampling over dynamic data, modeled as a stream of modifications to weights associated with each key.
While sampling schemes designed for stream applications can often readily accommodate positive updates to the dataset, much less is known for the case of negative updates, where weights are reduced or items deleted altogether. We primarily consider the turnstile model of streams, and extend classic schemes to incorporate negative updates. Perhaps surprisingly, the modifications to handle negative updates turn out to be natural and seamless extensions of the well-known original positive update-only algorithms. We show that they produce unbiased estimators, and relate their performance to the behavior of corresponding algorithms on insert-only streams with different parameters. A careful analysis is needed to account for the fact that sampling choices for one key now depend on the choices made for other keys.}
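
As a point of comparison, the snippet below is the trivial offline baseline: aggregate the signed updates into final weights, discard keys whose weight has fallen to zero, and draw a weight-proportional sample. The paper's algorithms aim for a comparable outcome in a single pass without storing every key; the function name, the with-replacement sampling and the example stream are choices made only for this illustration.
\begin{verbatim}
# Naive offline baseline: aggregate a turnstile stream of signed updates,
# then sample keys proportionally to their final (positive) weights.
# Not the paper's one-pass algorithms.
import random
from collections import defaultdict

def offline_weighted_sample(updates, k, seed=0):
    weights = defaultdict(float)
    for key, delta in updates:             # delta may be negative (deletions)
        weights[key] += delta
    live = {key: w for key, w in weights.items() if w > 0}
    rng = random.Random(seed)
    keys = list(live)
    k = min(k, len(keys))
    return rng.choices(keys, weights=[live[key] for key in keys], k=k)

updates = [("a", 5), ("b", 2), ("a", -5), ("c", 3), ("b", 1)]
print(offline_weighted_sample(updates, k=2))    # sampled from {"b": 3, "c": 3}
\end{verbatim}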

A Dataset Search Engine for the Research Document Corpus
Graham Cormode, Divesh Srivastava, Srinivas Bangalore, Marios Hadjieleftheriou
ICDE 2012,
2012.
[PDF]
[BIB]
IEEE Copyright
This version of the work is reprinted here with permission of IEEE for your personal use. Not for redistribution. The definitive version was published in ICDE 2012, 2012-04-01.
{A key step in validating a proposed idea or system
is to evaluate it over a suitable data set. However, to date there
have been no useful tools for researchers to understand which
datasets have been used for what purpose, or in what prior
work. Instead, they have to manually browse through papers
to find suitable datasets and their URLs, which is laborious and
inefficient. To better aid the data discovery process, and provide a
better understanding of how and where datasets have been used,
we propose a framework to effectively identify datasets within the
scientific corpus. The key technical challenges are identification
of datasets, and discovery of the association between a dataset
and the URLs where it can be accessed. Based on this, we
have built a user friendly web-based search interface for users
to conveniently explore the dataset-paper relationships, and find
relevant datasets and their properties.}

Structure Aware Sampling on Data Streams
Edith Cohen, Graham Cormode, Nicholas Duffield
ACM SIGMETRICS 2011 Conference,
2011.
[PDF]
[BIB]
ACM Copyright
(c) ACM, 2011. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM SIGMETRICS 2011 Conference, 2011-06-07.
{The massive data streams observed in network monitoring, data processing and
scientific studies are typically too large to store. For many applications over such data, we must obtain compact summaries of the stream. These summaries should allow accurate answering of post hoc queries with estimates which approximate the true answers over the original stream. The data often has an underlying structure which makes certain subset queries, in particular {\em range queries}, more relevant than arbitrary subsets. Applications such as access control, change detection, and heavy hitters typically involve subsets that are ranges
or unions thereof.
Random sampling is a natural summarization tool, being easy to implement and flexible to use. Known sampling methods are good for arbitrary queries but fail to optimize for the common case of range queries. Meanwhile, specialized summarization algorithms have been proposed for range-sum queries and related problems. These can outperform sampling given fixed space resources, but lack
its flexibility and simplicity. Particularly, their accuracy degrades when queries span multiple ranges.
We define new stream sampling algorithms with a smooth and tunable trade-off between accuracy on range-sum queries and arbitrary subset-sum queries. The technical key is to relax requirements on the variance over all subsets to enable better performance on the ranges of interest. This boosts the accuracy on range queries while retaining the prime benefits of sampling, in particular flexibility and accuracy, with tail bounds having optimality guarantees. Our experimental study indicates that structure-aware summaries drastically improve range-sum accuracy with respect to state-of-the-art stream sampling algorithms and outperform deterministic methods on range-sum queries and hierarchical heavy
hitter queries.}
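
The baseline that structure-aware sampling improves on is easy to state: estimate a range-sum from a uniform sample by inverse-probability scaling. The sketch below does exactly that (a Horvitz-Thompson estimate under simple random sampling); it is not the paper's structure-aware scheme, and the function name, data and range are invented for illustration.
\begin{verbatim}
# Range-sum estimation from a plain uniform sample (Horvitz-Thompson scaling).
# This is the structure-oblivious baseline; structure-aware sampling biases
# the sample so that such range queries have lower variance.
import random

def uniform_sample_range_estimate(keyed_weights, lo, hi, sample_size, seed=0):
    items = list(keyed_weights.items())
    sample = random.Random(seed).sample(items, min(sample_size, len(items)))
    p = len(sample) / len(items)           # uniform inclusion probability
    return sum(w for key, w in sample if lo <= key < hi) / p

data = {key: (key % 7) + 1 for key in range(10_000)}
print(uniform_sample_range_estimate(data, 2_000, 3_000, sample_size=500))
print(sum(w for key, w in data.items() if 2_000 <= key < 3_000))   # exact answer
\end{verbatim}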

Continuous Distributed Monitoring: A Short Survey
Graham Cormode
AlMoDep Workshop,
2011.
[PDF]
[BIB]
ACM Copyright
(c) ACM, 2011. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in AlMoDep Workshop, 2011-09-25.
{In the model of continuous distributed monitoring, a number
of observers each see a stream of observations. Their
goal is to work together to compute a function of the union
of their observations. This can be as simple as counting the
total number of observations, or more complex non-linear
functions such as tracking the entropy of the induced distribution.
Assuming that it is too costly to simply centralize
all the observations, it becomes quite challenging to design
solutions which provide a good approximation to the current
answer, while bounding the communication cost of the observers,
and their other resources such as their space usage.
This survey introduces the model, and describes a selection of
results in this setting, from the simple counting problem to
a variety of other functions that have been studied.}
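
The simplest instance, continuously counting the total number of observations, already shows the flavour of the model. The toy simulation below uses a standard threshold idea: each site reports its local count only when it has grown by a factor $(1+\epsilon)$ since its last report, so the coordinator's estimate stays within a $(1+\epsilon)$ factor of the truth while each site sends only $O(\log_{1+\epsilon} n)$ messages. This is a generic illustrative scheme, not any particular algorithm from the survey; the class names, parameters and simulation loop are invented for this example.
\begin{verbatim}
# Toy continuous distributed counting: sites report only when their local
# count grows by a (1 + eps) factor, so the coordinator's running estimate
# is always within a (1 + eps) factor of the true total.
import random

class Coordinator:
    def __init__(self):
        self.last, self.messages = {}, 0
    def update(self, site, value):
        self.last[site] = value
        self.messages += 1
    def estimate(self):
        return sum(self.last.values())

class CountingSite:
    def __init__(self, coordinator, eps):
        self.count, self.reported, self.eps, self.coord = 0, 0, eps, coordinator
    def observe(self):
        self.count += 1
        if self.count >= (1 + self.eps) * self.reported:
            self.coord.update(self, self.count)
            self.reported = self.count

coord = Coordinator()
sites = [CountingSite(coord, eps=0.1) for _ in range(5)]
for _ in range(100_000):
    random.choice(sites).observe()
print(coord.estimate(), "vs true", 100_000, "using", coord.messages, "messages")
\end{verbatim}
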
Streaming Graph Computations with a Helpful Advisor
Graham Cormode, Michael Mitzenmacher, Justin Thaler
European Symposium on Algorithms (ESA),
2010.
[BIB]

Prediction Promotes Privacy In Dynamic Social Networks
Graham Cormode, Divesh Srivastava, Balachander Krishnamurthy, Smriti Bhagat
3rd Workshop on Online Social Networks,
2010.
[PDF]
[BIB]
USENIX Copyright
The definitive version was published in Proceedings of WOSN 2010, USENIX, 2010-06-22.
{Recent work on anonymizing online social networks (OSNs) has looked at privacy preserving techniques for
publishing a single instance of the network. However,
OSNs evolve and a single instance is inadequate for analyzing
their evolution or performing longitudinal data
analysis. We study the problem of repeatedly publishing
OSN data as the network evolves while preserving
privacy of users. Publishing multiple instances independently
has privacy risks, since stitching the information
together may allow an adversary to identify users. We
provide methods to anonymize a dynamic network when
new nodes and edges are added to the published network.
These methods use link prediction algorithms to model
the evolution. Using this predicted graph to perform
group-based anonymization, the loss in privacy caused
by new edges can be reduced almost entirely. We propose
metrics for privacy loss, and evaluate them for publishing
multiple OSN instances.}
