180 Park Ave - Building 103

Florham Park, NJ

http://www.research.att.com/~flip

Subject matter expert in data mining, data quality

On Repairing Structural Problems in Semi-structured Data

Philip Korn, Barna Saha, Divesh Srivastava, Shanshan Ying

VLDB 2013.

VLDB Foundation Copyright

The definitive version was published in Very Large Data Bases, 2013-08-26.

Semi-structured data such as XML are popular for data interchange and storage. However, many XML documents have improper nesting, where open- and close-tags are unmatched. Since some semi-structured data (e.g., LaTeX) have a flexible grammar, and since many XML documents lack an accompanying DTD or XSD, we focus on computing a syntactic repair via edit distance.
To solve this problem, we propose a dynamic programming algorithm that takes cubic time. While this algorithm is not scalable, well-formed substrings of the data can be pruned to enable faster computation. Unfortunately, there are still cases where the dynamic program can be very expensive; hence, we give branch-and-bound algorithms based on various combinations of two heuristics, called MinCost and MaxBenefit, which trade off between accuracy and efficiency. Finally, we experimentally demonstrate the performance of these algorithms on real data.
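The cubic-time dynamic program can be illustrated on a simplified model in which all tags are abstracted to a single bracket type. The sketch below is a toy version of that idea, not the paper's algorithm (it omits the well-formed-substring pruning and the MinCost/MaxBenefit heuristics); `min_repair_cost` is an invented name.

```python
def min_repair_cost(s):
    """Minimum tag insertions/deletions to make s well-formed, with tags
    abstracted to one bracket type: '(' = open-tag, ')' = close-tag.
    Interval DP in O(n^3) time -- a toy version of the cubic repair, without
    the paper's pruning of well-formed substrings."""
    n = len(s)
    if n == 0:
        return 0
    dp = [[0] * n for _ in range(n)]     # dp[i][j]: min edits for s[i..j]
    for i in range(n):
        dp[i][i] = 1                     # a lone tag costs one edit
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            best = float('inf')
            if s[i] == '(' and s[j] == ')':      # ends match for free
                best = dp[i + 1][j - 1] if length > 2 else 0
            for k in range(i, j):                # or split and repair halves
                best = min(best, dp[i][k] + dp[k + 1][j])
            dp[i][j] = best
    return dp[0][n - 1]

print(min_repair_cost("(()"))   # 1: insert one close-tag
```

The split term in the recurrence is what makes the computation cubic; pruning already-well-formed substrings attacks exactly that cost.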

Data Auditor: Exploring Data Quality and Semantics using Pattern Tableaux

Lukasz Golab, Howard Karloff, Philip Korn, Divesh Srivastava

2010.
[PDF]
[BIB]

We present Data Auditor, a tool for exploring data quality and semantics. Data Auditor provides a variety of semantic rule (constraint) classes for asserting consistency and completeness properties of the data. It can be used to test hypotheses by instantiating rules over supplied sets of attributes, to see if such rules hold or fail (to a required degree) on a minimum fraction of the data. Regions of the data for which the rule is satisfied (alternatively, violated) are concisely and meaningfully summarized in a tableau.
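The tableau idea can be sketched roughly as follows: fix a set of antecedent attributes, group tuples by their values on those attributes, and keep the value patterns over which a supplied rule holds with enough confidence and support. All names below (`tableau`, `min_conf`, `min_support`, the toy zip-to-city rule) are illustrative assumptions, not the Data Auditor interface.

```python
from collections import defaultdict

def tableau(rows, attrs, rule, min_conf=0.9, min_support=2):
    """Group tuples by their values on `attrs`; keep each constant pattern
    whose group satisfies `rule` on at least min_conf of its tuples and has
    at least min_support tuples.  Invented names, not the Data Auditor API."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[a] for a in attrs)].append(row)
    result = []
    for pattern, members in groups.items():
        holds = sum(1 for r in members if rule(r))
        if len(members) >= min_support and holds / len(members) >= min_conf:
            result.append(dict(zip(attrs, pattern)))
    return result

rows = [
    {"state": "NJ", "zip": "07932", "city": "Florham Park"},
    {"state": "NJ", "zip": "07932", "city": "Florham Park"},
    {"state": "NY", "zip": "10001", "city": "Albany"},
    {"state": "NY", "zip": "10001", "city": "New York"},
]
lookup = {"07932": "Florham Park", "10001": "New York"}
rule = lambda r: r["city"] == lookup[r["zip"]]       # toy zip -> city check
print(tableau(rows, ["state"], rule, min_conf=1.0))  # [{'state': 'NJ'}]
```

Lowering `min_conf` admits patterns where the rule holds only approximately, which is how a tableau can summarize both satisfying and violating regions.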

Computing Time-Decayed Aggregates Under Smooth Decay Functions,
July 9, 2013
Aggregates are calculated from a data stream in which data is sent as a sequence of tuples, each tuple comprising an item identifier and a timestamp indicating when the tuple was transmitted. The tuples may arrive at a data receiver out-of-order; that is, the order in which the tuples arrive is not necessarily the order of their corresponding timestamps. In calculating aggregates, more recent data may be given more weight by a decay function, which is a function of the timestamp associated with the tuple and the current time. The statistical characteristics of the tuples are summarized by a set of linear data summaries, generated such that only a single linear data summary falls between a set of boundaries calculated from the decay function and a set of timestamps. Aggregates are calculated from the set of linear data summaries.
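For intuition, the decay-weighting step can be sketched with an exponential (smooth) decay function. Because each weight depends only on the tuple's timestamp and the current time, out-of-order arrival does not change the answer. This is a toy illustration of the weighting only, not the patented linear-summary construction; `decayed_sum` and `half_life` are invented names.

```python
import math

def decayed_sum(tuples, now, half_life):
    """Sum of tuple weights under exponential decay of age.  The weight of
    (item, ts) depends only on ts and `now`, so arrival order is irrelevant.
    Toy weighting step only; `half_life` is an assumed parameterization."""
    lam = math.log(2) / half_life
    return sum(math.exp(-lam * (now - ts)) for _, ts in tuples)

stream = [("a", 100.0), ("b", 90.0), ("c", 95.0)]      # out of timestamp order
print(decayed_sum(stream, now=100.0, half_life=10.0))  # 1 + 0.5 + 2**-0.5
```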

Computing Time-Decayed Aggregates In Data Streams,
March 5, 2013
Aggregates are calculated from a data stream in which data is sent as a sequence of tuples, each tuple comprising an item identifier and a timestamp indicating when the tuple was transmitted. The tuples may arrive out-of-order; that is, the order in which the tuples arrive is not necessarily the order of their corresponding timestamps. In calculating aggregates, more recent data may be given more weight by multiplying each tuple by a decay function, which is a function of the timestamp associated with the tuple and the current time. The tuples are recorded in a quantile-digest data structure. Aggregates are calculated from the data stored in the quantile-digest data structure.
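A quantile-digest is a compact summary over a fixed integer universe: counts sit on the nodes of a binary partition tree, and light nodes are merged upward so the structure stays small while quantile queries remain approximately answerable. The sketch below is a generic toy q-digest (invented class name, simplified compression rule), not the patented decayed-aggregate algorithm.

```python
from collections import defaultdict

class QDigest:
    """Toy quantile-digest over integers in [0, 2**levels): node (level, idx)
    covers a dyadic range; compression merges light node/sibling/parent
    triples upward so only about O(k log U) nodes survive."""
    def __init__(self, levels, k):
        self.levels = levels                 # leaves sit at this depth
        self.k = k                           # compression parameter
        self.n = 0
        self.counts = defaultdict(int)       # (level, index) -> count

    def insert(self, value):
        self.counts[(self.levels, value)] += 1
        self.n += 1
        if self.n % self.k == 0:
            self._compress()

    def _compress(self):
        thresh = self.n // self.k
        for level in range(self.levels, 0, -1):
            for node in [nd for nd in self.counts if nd[0] == level]:
                lv, idx = node
                sibling, parent = (lv, idx ^ 1), (lv - 1, idx >> 1)
                total = (self.counts.get(node, 0) + self.counts.get(sibling, 0)
                         + self.counts.get(parent, 0))
                if 0 < total <= thresh:      # light: fold into the parent
                    self.counts[parent] = total
                    self.counts.pop(node, None)
                    self.counts.pop(sibling, None)

    def quantile(self, q):
        """Approximate q-quantile: sweep nodes by right endpoint (deeper
        nodes first on ties) and report the range max where rank crosses."""
        def key(kv):
            (level, idx), _ = kv
            return ((idx + 1) << (self.levels - level), -level)
        rank = 0
        for (level, idx), c in sorted(self.counts.items(), key=key):
            rank += c
            if rank >= q * self.n:
                return ((idx + 1) << (self.levels - level)) - 1
        return (1 << self.levels) - 1

qd = QDigest(levels=4, k=50)                 # k large: no compression fires here
for v in range(16):
    qd.insert(v)
print(qd.quantile(0.5))                      # exact median of 0..15 -> 7
```

Smaller `k` compresses more aggressively, trading quantile accuracy for space.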

System And Method For Generating Statistical Descriptors For A Data Stream,
February 2, 2010
Described is a system and method for receiving a data stream of multi-dimensional items, collecting a sample of the data stream having a predetermined number of items and dividing the sample into a plurality of subsamples, each subsample corresponding to a single dimension of each of the predetermined number of items. A query is then executed on a particular item in at least two of the subsamples to generate data for the corresponding subsample. This data is combined into a single value.
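The sample/subsample/combine pipeline described above can be sketched as: reservoir-sample the stream, project the sample onto each dimension to form the subsamples, answer the query in each subsample, and combine the per-dimension answers into a single value. Taking the minimum as the combination step is an illustrative assumption, as are all names below.

```python
import random

def sample_and_estimate(data, item, sample_size, seed=0):
    """Reservoir-sample the multi-dimensional items, split the sample into
    one subsample (projection) per dimension, estimate the queried item's
    frequency in each subsample, and combine the per-dimension answers into
    a single value -- here by taking the minimum (an illustrative choice)."""
    rng = random.Random(seed)
    sample = []
    for i, tup in enumerate(data):           # standard reservoir sampling
        if len(sample) < sample_size:
            sample.append(tup)
        elif (j := rng.randrange(i + 1)) < sample_size:
            sample[j] = tup
    subsamples = list(zip(*sample))          # one subsample per dimension
    per_dim = [sub.count(item[d]) / len(sample)
               for d, sub in enumerate(subsamples)]
    return min(per_dim)

data = [(1, 'a'), (1, 'b'), (2, 'a'), (1, 'a')]
print(sample_and_estimate(data, (1, 'a'), sample_size=4))   # 0.75
```

Each per-dimension frequency upper-bounds the joint frequency, so the minimum is a natural single-value combination for a point query.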

System and Method For Identifying Hierarchical Heavy Hitters In A Multidimensional Environment,
September 15, 2009
A method including receiving a plurality of elements of a data stream; storing a multi-dimensional data structure in a memory, said multi-dimensional data structure storing the plurality of elements as a hierarchy of nodes, each node having a frequency count corresponding to the number of elements stored therein; comparing the frequency count of each node to a threshold value based on a total number of the elements stored in the nodes; identifying each node for which the frequency count is at least as great as the threshold value as a hierarchical heavy hitter (HHH) node; and propagating the frequency count of each non-HHH node to its corresponding parent node.

Multidimensional substring selectivity estimation using set hashing of cross-counts,
May 18, 2004
An approach for multidimensional substring selectivity estimation utilizes set hashing to generate cross-counts as needed, instead of storing cross-counts for the most frequently co-occurring substrings. Set hashing is a Monte Carlo technique that is used to succinctly represent the set of tuples containing a given substring. Then, any combination of set hashes will yield a cross-count when intersected. Thus, the set hashing technique is useful in three-, four- and other multidimensional situations, since only an intersection function is required.
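The set-hashing idea can be sketched with min-hashing: each set is represented by one minimum per random hash function, the fraction of agreeing minima estimates the Jaccard resemblance J = |A∩B| / |A∪B|, and the cross-count follows from |A∩B| = J(|A| + |B|) / (1 + J). The affine hash family and all names below are assumptions for illustration, not the construction from the patent.

```python
import random

P = (1 << 61) - 1    # Mersenne prime for an affine hash family (an assumption)

def make_hashers(num, seed=7):
    rng = random.Random(seed)
    # h(x) = (a*x + b) mod P; default arguments bind a, b at creation time
    return [(lambda x, a=rng.randrange(1, P), b=rng.randrange(P): (a * x + b) % P)
            for _ in range(num)]

def set_hash(s, hashers):
    """Succinct Monte Carlo representation of a set: one min-hash per function."""
    return [min(h(x) for x in s) for h in hashers]

def cross_count(sig_a, sig_b, size_a, size_b):
    """Fraction of agreeing coordinates estimates J = |A∩B| / |A∪B|;
    solving J = I / (|A| + |B| - I) for I recovers the cross-count."""
    j = sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
    return j * (size_a + size_b) / (1 + j)

hashers = make_hashers(200)
a, b = set(range(100)), set(range(50, 150))  # true cross-count: 50
est = cross_count(set_hash(a, hashers), set_hash(b, hashers), len(a), len(b))
print(round(est))                            # an estimate near 50
```

Because set hashes of any two (or more) sets can be intersected this way on demand, only the per-set signatures need to be stored, which is what makes the technique attractive in three and more dimensions.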