
180 Park Ave - Building 103
Florham Park, NJ
Statistical Distortion: Consequences of Data Cleaning
Tamraparni Dasu, Ji Loh
VLDB Foundation,
2012.
[PDF]
[BIB]
VLDB Foundation Copyright
The definitive version was published in PVLDB Vol. 5 (VLDB 2012), 2012-08-30.
{We introduce the notion of "statistical distortion"
as an essential metric for measuring the effectiveness of data cleaning
strategies.
We use this metric to propose a widely applicable yet scalable experimental framework
for evaluating data cleaning strategies along three dimensions:
glitch improvement, statistical distortion and cost-related criteria.
Existing metrics focus on glitch improvement and cost, but not on the
statistical impact of data cleaning strategies.
We illustrate our framework on real world data, with a comprehensive suite of
experiments and analyses.}
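The core idea, that a cleaning strategy can pull the data away from the true underlying distribution even while it removes glitches, can be sketched in a few lines. The Kolmogorov-Smirnov statistic and the synthetic data below are illustrative choices of mine, not the metric or experiments from the paper:

```python
import bisect
import random

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    def ecdf(xs, v):  # fraction of xs that are <= v (xs is sorted)
        return bisect.bisect_right(xs, v) / len(xs)
    return max(abs(ecdf(a, v) - ecdf(b, v)) for v in set(a) | set(b))

random.seed(0)
reference = [random.gauss(0, 1) for _ in range(1000)]            # glitch-free process
dirty = [random.gauss(0, 1) for _ in range(1000)] + [50.0] * 20  # 2% gross errors

# Two candidate cleaning strategies with similar "glitch improvement":
cleaned_drop = [x for x in dirty if abs(x) < 10]        # delete the outliers
cleaned_clip = [min(max(x, -3.0), 3.0) for x in dirty]  # winsorize to [-3, 3]

# Statistical distortion: distance of the cleaned data from the reference
print("distortion (drop):", round(ks_statistic(cleaned_drop, reference), 3))
print("distortion (clip):", round(ks_statistic(cleaned_clip, reference), 3))
```

Two strategies that repair the same glitches can leave very different distributional footprints, which is exactly the dimension this metric adds to glitch-count and cost criteria.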
Effect of Data Repair on Mining Network Streams
Ji Loh, Tamraparni Dasu
ICDM 2012,
2012.
[PDF]
[BIB]
IEEE Copyright
This version of the work is reprinted here with permission of IEEE for your personal use. Not for redistribution. The definitive version was published in ICDM 2012, 2012-12-15.
{Data quality issues have special implications in network data. Data glitches are
propagated rapidly along pathways dictated by the hierarchy and topology of the
network. In this paper, we use temporal data from a vast data network to study data glitches and
their effect on network monitoring tasks such as anomaly detection. We demonstrate
the consequences of
cleaning the data, and develop targeted and
customized cleaning strategies by exploiting the network hierarchy. }
Data Glitches: Monsters in your Data
Tamraparni Dasu
Handbook, Reference book,
2012.
[PDF]
[BIB]
Unsupervised Clustering of Multidimensional Distributions using Earth Mover Distance
Simon Urbanek, Tamraparni Dasu, Shankar Krishnan, David Applegate
ACM KDD,
2011.
[PDF]
[BIB]
ACM Copyright
(c) ACM, 20XX. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM KDD, 2011-08-01.
{Multidimensional distributions are often used in data mining to describe and summarize different features of large datasets. It is natural to look for distinct classes in such datasets by clustering the data. A common approach entails the use of methods like k-means clustering. However, the k-means method inherently relies on the Euclidean metric in the embedded space and does not account for additional topology underlying the distribution.
In this paper, we propose using Earth Mover Distance (EMD) to compare multidimensional distributions. For an n-bin histogram, the EMD is based on a solution to the transportation problem with time complexity O(n^3 log n). To mitigate the high computational cost of EMD, we propose an approximation that reduces the cost to linear time.
Other notions of distance, such as the information theoretic Kullback-Leibler divergence and the statistical χ2 distance, account only for the correspondence between bins with the same index, do not use information across bins, and are sensitive to bin size. A cross-bin distance measure like EMD is not affected by binning differences and meaningfully matches the perceptual notion of “nearness”.
Our technique is simple, efficient and practical for clustering distributions. We demonstrate the use of EMD on a practical application of clustering over 400,000 anonymous mobility usage patterns which are defined as distributions over a manifold. EMD allows us to represent inherent relationships in this space. We show that EMD allows us to successfully cluster even sparse signatures and we compare the results with other clustering methods. Given the large size of our dataset, a fast approximation is crucial for this application.}
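For one-dimensional histograms, EMD has a well-known closed form: the L1 distance between cumulative histograms, computable in a single linear pass. The sketch below shows only that special case, not the paper's multidimensional approximation:

```python
def emd_1d(p, q):
    """Earth Mover Distance between two 1-D histograms with equal total mass.

    In 1-D the transportation problem collapses to the L1 distance between
    cumulative sums, so the cost is O(n) in the number of bins rather than
    O(n^3 log n) for a general transportation solver.
    """
    assert len(p) == len(q)
    emd, carry = 0.0, 0.0
    for pi, qi in zip(p, q):
        carry += pi - qi   # mass that must be moved past this bin boundary
        emd += abs(carry)
    return emd

# Moving one unit of mass by one bin costs exactly 1, by two bins costs 2:
print(emd_1d([1, 0, 0], [0, 1, 0]))  # 1.0
print(emd_1d([1, 0, 0], [0, 0, 1]))  # 2.0
```

Unlike a bin-by-bin divergence, the running `carry` lets cost accumulate with the distance mass actually travels, which is the cross-bin behavior the abstract contrasts with KL divergence and χ2.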

Robustness of Stream Mining Algorithms
Tamraparni Dasu, Gina-Maria Pomann, Shankar Krishnan
IDA 2011,
2011.
[PDF]
[BIB]
Springer Copyright
The definitive version was published in IDA 2011, 2011-12-31.
{Stream mining is a challenging problem that has attracted
considerable attention in the last decade. As a result there are
numerous algorithms for mining data streams, from summarizing
and analyzing, to change and anomaly
detection. However, most research focuses on proposing, adapting or
improving algorithms and studying their computational performance.
For a practitioner of stream mining, there is very little guidance on
choosing a technology suited for a particular task or application.
In this paper, we address the practical aspect of choosing a
suitable algorithm
by drawing on the statistical properties of "power" and
"robustness". For the purpose of illustration, we focus on
change detection algorithms (CDAs). We define
an objective performance measure, "streaming power",
and use it to explore the robustness of three different algorithms.
The measure is
comparable for disparate algorithms,
and provides a common framework for comparing and evaluating
change detection algorithms on any data set
in a meaningful fashion. We demonstrate the framework on real-world
applications and on synthetic
data.
In addition, we present a
repository of data streams for the community to test change detection
algorithms for streaming data. }
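The kind of measure the paper defines can be illustrated with a generic sketch: run a change detector over many simulated streams with a known change point and report the fraction of runs in which the change is flagged. The CUSUM detector and the power estimate below are standard textbook choices of mine, not the paper's definition of streaming power:

```python
import random

def cusum_detect(stream, k=0.5, h=8.0):
    """One-sided CUSUM for an upward mean shift: index of first alarm, or None."""
    s = 0.0
    for i, x in enumerate(stream):
        s = max(0.0, s + x - k)   # k is the drift allowance, h the alarm threshold
        if s > h:
            return i
    return None

def empirical_power(n_runs=200, change_at=100, shift=1.0):
    """Fraction of streams where a mean shift at `change_at` is detected after it."""
    random.seed(1)
    hits = 0
    for _ in range(n_runs):
        stream = [random.gauss(0, 1) for _ in range(change_at)]
        stream += [random.gauss(shift, 1) for _ in range(100)]
        alarm = cusum_detect(stream)
        if alarm is not None and alarm >= change_at:  # pre-change alarms are misses
            hits += 1
    return hits / n_runs

print("empirical power at shift=1.0:", empirical_power())
```

Because the estimate is just a detection rate under a controlled change, it is comparable across disparate algorithms, which is the property the abstract emphasizes.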

Discovery of Complex Glitch Patterns: A Novel Approach to Quantitative Data Cleaning
Laure Berti-Equille, Tamraparni Dasu, Divesh Srivastava
IEEE International Conference on Data Engineering (ICDE),
2011.
[BIB]
{Data quality mining (DQM) is the use of data mining techniques to
detect, quantify, explain, and correct data quality problems. Current
DQM approaches focus on addressing each category of data
glitch separately. However, in real-life data, different types of data
glitches co-occur in complex patterns. These patterns and interactions
between glitches offer valuable clues for developing effective
data cleaning strategies that are more informed than blind, predefined
strategies.
In this paper, we propose a novel data quality mining framework
for the comprehensive definition, detection and cleaning of complex,
multi-type data glitches. We exploit the distributions and interaction
of different types of glitches to develop data-driven cleaning
strategies that offer significant advantages over blind strategies.
We develop a statistically rigorous framework for glitch scoring,
strategy evaluation and the selection of an optimal strategy from
the space of candidate quantitative cleaning strategies. We demonstrate
the efficacy and scalability of our framework on very large
real and synthetic data sets.}
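The notion of co-occurring, multi-type glitches can be made concrete with a toy sketch: assign each record a per-field glitch signature and tally which signatures recur. The glitch types, fields, and scoring below are my own illustration, not the paper's statistical framework:

```python
import math
from collections import Counter

def glitch_signature(record, bounds):
    """Per-field glitch pattern: M = missing, O = out of bounds, '.' = clean."""
    sig = []
    for field, (lo, hi) in zip(record, bounds):
        if field is None or (isinstance(field, float) and math.isnan(field)):
            sig.append("M")
        elif not (lo <= field <= hi):
            sig.append("O")
        else:
            sig.append(".")
    return "".join(sig)

bounds = [(0, 120), (0, 250)]   # plausible ranges for, say, (age, weight)
records = [(25, 70), (None, 300), (None, 300), (130, 80), (40, None)]
patterns = Counter(glitch_signature(r, bounds) for r in records)
print(patterns.most_common())   # co-occurring glitch types surface as one pattern
```

A pattern like "MO" (missing age together with an out-of-range weight) recurring across records is exactly the kind of interaction that can steer a cleaning strategy instead of treating each glitch type in isolation.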
Method For Automated Detection Of Data Glitches In Large Data Sets,
September 28, 2010
Embodiments of the invention allow the efficient detection of glitches in a set of data. In one embodiment, a set of datapoints is received. The datapoints are transformed into a transformed space, and the transformed space is segmented into a plurality of regions. For each datapoint, a time series is generated representing the trajectory of the transformed datapoint through regions of the segmented transformed space. Data corresponding to transformed datapoints whose trajectories exhibit an unusual pattern are transmitted.
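A toy sketch of the trajectory idea described above: segment the (here, untransformed) value space into regions, map each datapoint's time series to a sequence of region indices, and flag sequences that are rare. The identity transform, the binning, and the rarity rule are simplifications of mine; the patent's embodiments are more general:

```python
from collections import Counter

def region_of(value, edges):
    """Index of the segmented region a transformed value falls into."""
    for i, edge in enumerate(edges):
        if value < edge:
            return i
    return len(edges)

def rare_trajectories(series_by_point, edges, max_count=1):
    """Flag points whose region trajectory occurs at most `max_count` times."""
    trajs = {pid: tuple(region_of(v, edges) for v in vals)
             for pid, vals in series_by_point.items()}
    counts = Counter(trajs.values())
    return [pid for pid, t in trajs.items() if counts[t] <= max_count]

edges = [10, 20]   # segment the space into 3 regions: <10, 10-20, >=20
series = {
    "a": [5, 15, 25],   # low -> mid -> high
    "b": [6, 14, 26],   # same trajectory as "a"
    "c": [25, 5, 25],   # jumps between extremes: an unusual pattern
}
print(rare_trajectories(series, edges))  # ['c']
```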
Monitoring Complex Data Feeds Through Ensemble Testing,
June 29, 2010
Managing and monitoring multiple complex data feeds is a major challenge for data mining tasks in large corporations and scientific endeavors alike. The invention describes an effective method for flagging abnormalities in data feeds using an ensemble of statistical tests that may be used on complex data feeds. The tests in the ensemble are chosen such that the speed and ability to deliver real time decisions are not compromised.
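The ensemble idea can be sketched with a few cheap, independently computed tests over a window of a feed, with an alarm raised on enough votes. The specific tests, thresholds, and voting rule below are illustrative assumptions of mine, not the patented method:

```python
import statistics

def mean_shift_test(window, baseline_mean, baseline_sd, z=3.0):
    """Flag if the window mean drifts more than z baseline standard errors."""
    se = baseline_sd / (len(window) ** 0.5)
    return abs(statistics.fmean(window) - baseline_mean) > z * se

def missing_rate_test(window, max_rate=0.05):
    return sum(v is None for v in window) / len(window) > max_rate

def range_test(window, lo, hi):
    return any(v is not None and not (lo <= v <= hi) for v in window)

def feed_alarm(window, votes_needed=1):
    """Ensemble: cheap single-pass tests run independently; alarm on enough votes."""
    clean = [v for v in window if v is not None]
    tests = [
        mean_shift_test(clean, baseline_mean=50.0, baseline_sd=5.0),
        missing_rate_test(window),
        range_test(window, lo=0.0, hi=100.0),
    ]
    return sum(tests) >= votes_needed

normal = [50.0 + ((i % 5) - 2) for i in range(100)]  # hovers around the baseline
broken = [50.0] * 90 + [None] * 10                   # 10% missing values
print(feed_alarm(normal), feed_alarm(broken))
```

Each test is a single pass over the window, so adding tests to the ensemble widens coverage without compromising the real-time decision the abstract calls for.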