
180 Park Ave - Building 103
Florham Park, NJ
http://www2.research.att.com/~johnsont/
Subject matter expert in data warehouses, data streams, stream warehouses, and data quality
I received a B.S. in Mathematics from the Johns Hopkins University in 1986, and a Ph.D. in Computer Science from the Courant Institute of NYU in 1990, receiving the Janet Fabri award for best dissertation in the department. From 1990 through 1996, I was an Assistant Professor, then Associate Professor, in the Computer and Information Science and Engineering department at the University of Florida. In 1996 I joined the Database Research department of AT&T Labs - Research, and have been here since. In 2004 I received an AT&T Science and Technology Award for my work on the Bellman database browser. I have authored two books, "Distributed Operating Systems & Algorithms" (with Randy Chow) and "Exploratory Data Mining and Data Cleaning" (with Tamraparni Dasu).
DataDepot,
DataDepot and its associated toolchain enable the fast construction and easy maintenance of very large real-time stream warehouses.
Bistro Data Feed Management System
Vladislav Shkapenyuk, Divesh Srivastava, Theodore Johnson
ACM SIGMOD Conference, 2011.
ACM Copyright
(c) ACM, 2011. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in the ACM SIGMOD Conference, 2011-06-12.
Data feed management is a critical component of many data intensive applications that depend on reliable data delivery to support real-time data collection, correlation and analysis. Data is typically collected from a wide variety of sources and organizations, using a range of mechanisms - some data are streamed in real time, while other data are obtained at regular intervals or collected in an ad hoc fashion. Individual applications are forced to make separate arrangements with feed providers, learn the structure of incoming files, monitor data quality, and trigger any processing necessary. The Bistro data feed manager, designed and implemented at AT&T Labs - Research, simplifies and automates this complex task of data feed management: efficiently handling incoming raw files, identifying data feeds and distributing them to remote subscribers.
Bistro supports a flexible specification language to define logical data feeds using the naming structure of physical data files, and to identify feed subscribers. Based on the specification, Bistro matches data files to feeds, performs file normalization and compression, efficiently delivers files, and notifies subscribers using a trigger mechanism. We describe our feed analyzer that discovers the naming structure of incoming data files to detect new feeds, dropped feeds, feed changes, or lost data in an existing feed. Bistro is currently deployed within AT&T Labs and is responsible for the real-time delivery of over 100 different raw feeds, distributing data to several large-scale stream warehouses.
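The idea of defining logical feeds over the naming structure of physical files can be sketched as follows. This is only an illustration, not Bistro's actual specification language: the feed names, file-name patterns, and attributes below are all invented.

```python
import re

# Hypothetical feed specifications: each logical feed is defined by a
# regular expression over physical file names, with named groups capturing
# feed attributes (source, timestamp). All names here are invented.
FEED_SPECS = {
    "router_stats": re.compile(r"^router_(?P<source>\w+)_(?P<ts>\d{8})\.csv(\.gz)?$"),
    "alarm_log":    re.compile(r"^alarms-(?P<ts>\d{8}T\d{6})\.txt$"),
}

def match_feed(filename):
    """Return (feed_name, attributes) for the first matching spec, else None."""
    for feed, pattern in FEED_SPECS.items():
        m = pattern.match(filename)
        if m:
            return feed, m.groupdict()
    return None

print(match_feed("router_nyc_20110612.csv.gz"))
```

A matcher like this also supports the feed analyzer's monitoring role: a file that matches no spec may signal a new feed, and a gap in an expected attribute sequence (e.g. timestamps) may signal lost data.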
Methods And Apparatus For Querying Multiple Data Streams,
February 14, 2012
A method and apparatus for querying multiple streams of data traffic are disclosed. In one embodiment, the method specifies interfaces or data sources in accordance with their properties. In turn, the method then accepts a query for querying multiple data streams and automatically constructs a set of queries in accordance with at least one specified property, where the set of queries is to be applied over the multiple data streams.
Optimized Field Unpacking For A Data Stream Management System,
February 7, 2012
Two methods and a computer-readable medium for obtaining information using field group unpacking functions are disclosed. The first method identifies an optimized unpacking function from the field group unpacking functions and uses it to unpack a field associated with the data stream. The second method also identifies an optimized unpacking function, then applies a prefilter associated with the optimized unpacking function before using it to unpack a field associated with the data stream. The computer-readable medium contains instructions that, when executed by a computing device, identify an optimized unpacking function from the field group unpacking functions and use it to unpack a field associated with the data stream.
Method And Apparatus For Packet Analysis In A Network,
January 17, 2012
A method and system are disclosed for extracting useful statistics and information and for removing a processing module, based on that information, to enhance a run-time system on a network interface card. The run-time system module feeds information derived from a network packet to processing modules, which process the information and generate output such as condensed statistics about the packets traveling through the network. The run-time system can be enhanced to include facilities for removing processing modules without replacing the run-time system module.
Complex Dependencies For Efficient Data Warehouse Updates,
February 1, 2011
The invention relates to a method of updating a data storage system. The method updates a raw database using an input data stream, based on an input temporal value associated with the input data stream and a raw temporal value associated with the raw database. The method includes updating a derived database associated with the data storage system using the updated raw database, based on the input temporal value, a derived temporal value and a user-defined relationship, the derived temporal value being associated with the derived database. The invention also relates to a computer-readable medium including instructions whose execution by at least one computing device updates a data storage system. The invention further relates to a data storage system comprising a raw database, a derived database and a computing device operatively coupled to both.
Method And Apparatus For Packet Analysis In A Network,
November 9, 2010
A method and system for monitoring traffic in a data communication network and for extracting useful statistics and information is disclosed.
Method For Automated Detection Of Data Glitches In Large Data Sets,
September 28, 2010
Embodiments of the invention allow the efficient detection of glitches in a set of data. In one embodiment, a set of datapoints is received. The datapoints are transformed into a transformed space, and the transformed space is segmented into a plurality of regions. For each datapoint, a time series is generated representing the trajectory of the transformed datapoint through regions of the segmented transformed space. Data corresponding to transformed datapoints whose trajectories exhibit an unusual pattern are transmitted.
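The pipeline described above (transform, segment, track region trajectories, flag unusual patterns) can be illustrated with a minimal sketch. The z-score transform, region boundaries, and jump threshold below are invented for the example and are not the patent's actual choices.

```python
import statistics

# Illustrative sketch: transform each series into z-score space, segment
# that space into regions at fixed boundaries, follow each series'
# trajectory through the regions, and flag trajectories with large jumps.

def flag_glitches(datapoints, boundaries, max_jump=0):
    """Flag series whose region trajectory jumps more than max_jump per step."""
    def region(x):
        for i, b in enumerate(boundaries):
            if x < b:
                return i
        return len(boundaries)
    flagged = {}
    for name, series in datapoints.items():
        mu = statistics.mean(series)
        sd = statistics.pstdev(series) or 1.0
        traj = [region((x - mu) / sd) for x in series]   # transformed trajectory
        if any(abs(a - b) > max_jump for a, b in zip(traj, traj[1:])):
            flagged[name] = traj
    return flagged

points = {"cpu": [1, 2, 2, 3, 50, 3], "mem": [5, 5, 6, 5, 6, 5]}
print(flag_glitches(points, boundaries=[-2.0, 2.0]))   # only "cpu" is flagged
```

The spike in "cpu" moves its trajectory into a new region for one step, while "mem" stays in a single region throughout, which is the kind of pattern-based discrimination the embodiment describes.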
Query-Aware Sampling Of Data Streams,
May 19, 2009
A system, method and computer-readable medium provide for assigning sampling methods to each input stream for arbitrary query sets in a data stream management system. The method embodiment comprises splitting all query nodes in a query directed acyclic graph (DAG) having multiple parent nodes into sets of independent nodes having a single parent, computing a grouping set for every node in each set of independent nodes, reconciling each parent node with each child node in each set of independent nodes, reconciling between multiple child nodes that share a parent node, and generating a final grouping set for at least one node describing how to sample an input stream for that node.
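The first step, splitting multi-parent query nodes into independent single-parent copies, can be sketched as below. This is a hypothetical simplification: it duplicates each shared subquery (and its subtree) once per consuming path, using path-based names to guarantee the single-parent property.

```python
# Sketch: expand a query DAG (node -> operand/child nodes) into a forest in
# which every node has exactly one parent. A shared subquery node is copied,
# together with its subtree, once per parent; the path-based name of each
# copy records which parent chain it belongs to.

def split_dag(dag, roots):
    forest = {}
    def visit(node, path):
        name = "/".join(path + [node])           # one copy per parent chain
        forest[name] = [visit(c, path + [node]) for c in dag.get(node, [])]
        return name
    for r in roots:
        visit(r, [])
    return forest

# shared subquery A feeds both queries Q1 and Q2, so it is split in two
dag = {"Q1": ["A"], "Q2": ["A"], "A": ["S"]}
print(split_dag(dag, roots=["Q1", "Q2"]))
```

Once every node has a single parent, a grouping set can be computed per node and then reconciled up and across the tree, as the abstract describes.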
Method and apparatus for packet analysis in a network,
November 11, 2008
A method and system for monitoring traffic in a data communication network and for extracting useful statistics and information is disclosed. In accordance with an embodiment of the invention, a network interface card has a run-time system and one or more processing blocks executing on the network interface. The run-time system module feeds information derived from a network packet to the processing modules which process the information and generate output such as condensed statistics about the packets traveling through the network.
Method and apparatus for packet analysis in a network,
January 16, 2007
A method and system for monitoring traffic in a data communication network and for extracting useful statistics and information is disclosed. In accordance with an embodiment of the invention, a network interface card has a run-time system and one or more processing blocks executing on the network interface. The run-time system module feeds information derived from a network packet to the processing modules which process the information and generate output such as condensed statistics about the packets traveling through the network.
Method and system for squashing a large data set,
March 25, 2003
Apparatus and method for summarizing an original large data set with a representative data set. The data elements in both the original data set and the representative data set have the same variables, but there are significantly fewer data elements in the representative data set. Each data element in the representative data set has an associated weight, representing the degree of compression. There are three steps for constructing the representative data set. First, the original data elements are partitioned into separate bins. Second, moments of the data elements partitioned in each bin are calculated. Finally, the representative data set is generated by finding data elements and associated weights having substantially the same moments as the original data set.
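The three construction steps above can be shown in a deliberately simplified one-dimensional sketch. A real implementation would match higher-order moments, possibly with several representatives per bin; here each bin is represented by its mean with weight equal to the bin count, which preserves the first moment exactly.

```python
# Minimal 1-D squashing sketch: (1) partition the data into bins,
# (2) compute per-bin moments, (3) emit one weighted representative per
# bin whose weighted sum matches the bin's first moment.

def squash(data, bin_edges):
    bins = {}
    for x in data:
        for i, (lo, hi) in enumerate(zip(bin_edges, bin_edges[1:])):
            if lo <= x < hi:
                bins.setdefault(i, []).append(x)
                break
    # each representative is (value, weight); weight records the compression
    return [(sum(xs) / len(xs), len(xs)) for _, xs in sorted(bins.items())]

data = [1, 2, 2, 3, 10, 11, 12]
reps = squash(data, bin_edges=[0, 5, 15])
# the weighted representative set preserves the sum of the original data
assert sum(v * w for v, w in reps) == sum(data)
```

With seven original points squashed to two weighted representatives, any downstream analysis that consumes weighted data sees the same first moment at a fraction of the size.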
Method and apparatus for querying a cube forest data structure,
July 23, 2002
A device and method are disclosed for using a data structure known as a cube forest for use in a batch-load-then-read-intensively system. The device and method significantly improve the time to execute a bit vector query. Hierarchically split cube forests provide a method for efficiently duplicating information, and can be optimized to reduce update and storage costs. Cube forests including hierarchically split cube forests are most appropriate for read-intensive, update-rarely-and-in-large-batches multidimensional applications in an off-the-shelf (low cost) hardware environment. A method and an apparatus for querying a cube forest for aggregates are disclosed herein.
System and method for storage media group parity protection,
May 21, 2002
A system and method for storage medium group parity protection stores data files and related parity information asynchronously on an array of storage media. Data files can be stored asynchronously, or synchronously in stripes as in RAIT technology, but related parity information is stored asynchronously with respect to the data files. Regions of the storage media are preferably organized into protection groups for which parity information is generated. Parity information is generated on line as data files are stored and maintained in active memory. Once a protection group is filled, the parity information is migrated to more permanent backup storage. As one example, regions of an array of N storage media can constitute a protection group, and once the regions in the protection group are filled with data, parity data for the protection group is migrated from active memory to more permanent backup storage.
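The parity arithmetic behind a protection group can be sketched as follows; the group size and region size are invented constants, and the "storage media" are just byte strings. Regions written to the media are XORed into an in-memory parity buffer, which would be migrated to backup storage once the group fills.

```python
from functools import reduce

# Sketch of group parity protection: XOR all regions of a protection group
# into a single parity region; a lost region is rebuilt by XORing the
# parity with the surviving regions.

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def group_parity(regions):
    """Parity accumulated as regions are written (here, all at once)."""
    return reduce(xor, regions)

regions = [bytes([v] * 8) for v in (1, 2, 4, 8)]   # one region per medium
parity = group_parity(regions)                      # migrated when group fills

# medium 1 fails: rebuild its region from the parity and the survivors
rebuilt = reduce(xor, [parity] + [r for i, r in enumerate(regions) if i != 1])
assert rebuilt == regions[1]
```

Because XOR is associative and commutative, the parity can be accumulated asynchronously, in whatever order file regions arrive, which is the property the asynchronous design above relies on.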
Method and apparatus for loading data into a cube forest data structure,
December 25, 2001
A device and method are disclosed for loading data into and updating a data structure known as a cube forest for use in a batch-load-then-read-intensively system. The device and method perform loading and updating functions efficiently. Hierarchically split cube forests provide a method for efficiently duplicating information, and can be optimized to reduce update and storage costs. Cube forests including hierarchically split cube forests are most appropriate for read-intensive, update-rarely-and-in-large-batches multidimensional applications in an off-the-shelf (low cost) hardware environment. A method and an apparatus for loading data into and updating a cube forest are disclosed herein.
System and method for storage media group parity protection,
September 11, 2001
A system and method for storage medium group parity protection stores data files and related parity information asynchronously on an array of storage media. Data files can be stored asynchronously, or synchronously in stripes as in RAIT technology, but related parity information is stored asynchronously with respect to the data files. Regions of the storage media are preferably organized into protection groups for which parity information is generated. Parity information is generated on line as data files are stored and maintained in active memory. Once a protection group is filled, the parity information is migrated to more permanent backup storage. As one example, regions of an array of N storage media can constitute a protection group, and once the regions in the protection group are filled with data, parity data for the protection group is migrated from active memory to more permanent backup storage.
Fault-Tolerant Storage System,
April 17, 2001
The present invention is a storage system, and method of operation thereof, which provides improved performance over standard RAID-5 without increasing vulnerability to single-disk drive failures. The storage system comprises a processor and a plurality of data storage devices, coupled to the processor, operable to store a plurality of data stripes, each data stripe comprising a plurality of data blocks and a parity block, each data storage device operable to store one data block or the parity block of each data stripe. The storage system stores a dirty stripe bit vector of a data stripe. When an update to a data block in the data stripe is received, an image of the data block as it was when the dirty stripe bit vector was generated is stored. The data block is updated and an image of the updated data block is stored. When a failure of one of the plurality of data storage devices is detected, a bitwise exclusive-OR of the image of the data block as it was when the dirty stripe bit vector was generated and the image of the updated data block is computed to form an intermediate result. The parity block of the data stripe is read and a bitwise exclusive-OR of the intermediate result and the parity block is generated. The generated parity block is written and a parity rebuild is performed on the data stripe using the new parity block.
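The recovery arithmetic described above can be checked with a small worked example; the byte values are invented, but the XOR chain follows the abstract: before-image XOR after-image gives the update delta, and delta XOR stale parity gives the up-to-date parity for the rebuild.

```python
# Worked example of the dirty-stripe parity repair described above.

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

old_block = bytes([0b1010] * 4)     # image when the dirty stripe bit was set
new_block = bytes([0b0110] * 4)     # image after the update
stale_parity = bytes([0b1111] * 4)  # parity still reflecting old_block

delta = xor(old_block, new_block)       # intermediate result
new_parity = xor(stale_parity, delta)   # parity consistent with new_block
print(new_parity.hex())
```

Keeping both images means only the stripes marked in the dirty stripe bit vector need this repair after a crash, rather than a full parity scan of the array.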
Coarse Indexes for a Data Warehouse,
April 10, 2001
A coarse database index, and a system and method of use therefor, that quickly indicates which data partitions of a table contain a given key. Once the target data partitions are located, the exact record locations can be found using traditional indexes. The coarse indexes take little space, and can be updated and searched quickly. The coarse index is used in conjunction with a database comprising a plurality of data partitions. Each data partition includes data, including a plurality of key values of at least one key, and at least one dense index referencing the data. The coarse index indexes the plurality of key values according to the data partitions containing each key value. The coarse index includes a first bitmap, which is preferably arranged in key value major format. The coarse index may also include a second bitmap, which is preferably arranged in data partition major format. The second bitmap may be transformed from data partition major format to key value major format. The first and second bitmap partitions may be compressed.
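A key-value-major coarse index can be sketched with one bitmap per key, one bit per partition; the class and key names below are invented for illustration, and compression and the partition-major bitmap are omitted.

```python
# Minimal sketch of a coarse index in key-value-major format: for each key,
# an integer bitmap records which data partitions may contain that key.
# A lookup narrows the search to those partitions; a dense per-partition
# index would then locate the exact records.

class CoarseIndex:
    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self.bitmap = {}                          # key -> bitmap of partitions

    def insert(self, key, partition):
        self.bitmap[key] = self.bitmap.get(key, 0) | (1 << partition)

    def partitions_for(self, key):
        bits = self.bitmap.get(key, 0)
        return [p for p in range(self.num_partitions) if bits & (1 << p)]

idx = CoarseIndex(num_partitions=4)
idx.insert("ATT", 0)
idx.insert("ATT", 2)
idx.insert("NYU", 3)
print(idx.partitions_for("ATT"))   # narrows the search to partitions 0 and 2
```

Inserts only set a bit, which is why such an index stays cheap to update even under the large batch loads a warehouse sees.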
Method and apparatus for analyzing co-evolving time sequences,
April 25, 2000
An analyzer system that analyzes a plurality of co-evolving time sequences to, for example, perform correlation or outlier detection on the time sequences. The plurality of co-evolving time sequences comprises a delayed time sequence and one or more known time sequences; the goal is to predict the delayed value given the available information. Each time sequence has a present value and (N-1) past values, where N is the number of samples (time-ticks) of each time sequence. The analyzer system receives the plurality of co-evolving time sequences and determines a window size (w). The analyzer then assigns the delayed time sequence as the dependent variable, and assigns as independent variables the present values of a subset of the known time sequences together with the past values of that subset and of the delayed time sequence, where past values delayed by up to w steps are considered. The analyzer then forms an equation comprising the dependent variable and the independent variables, and solves it using a least squares method. The delayed time sequence is then determined using the solved equation.
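A tiny instance of this setup can be worked by hand: with window w = 1, one known sequence x, and delayed sequence y, regress y[t] on x[t] and y[t-1] by least squares. The data and the direct 2x2 normal-equation solve below are an invented illustration, not the patent's method of solution.

```python
# Least-squares fit of the delayed sequence y[t] on the independent
# variables [x[t], y[t-1]] (window w = 1), solving the normal equations
# A^T A b = A^T y directly for the two coefficients.

def fit(x, y):
    rows = [(x[t], y[t - 1]) for t in range(1, len(y))]   # design matrix rows
    target = y[1:]
    s11 = sum(a * a for a, _ in rows)
    s12 = sum(a * b for a, b in rows)
    s22 = sum(b * b for _, b in rows)
    t1 = sum(a * v for (a, _), v in zip(rows, target))
    t2 = sum(b * v for (_, b), v in zip(rows, target))
    det = s11 * s22 - s12 * s12
    return ((s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]   # y depends only on x, so y[t-1] gets ~0 weight
b_x, b_prev = fit(x, y)
print(b_x, b_prev)
```

Since y here is exactly 2x, the fit recovers a coefficient of 2 on the present value of x and 0 on the delayed sequence's own past, the kind of relationship the analyzer would then use to predict the delayed value.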
AT&T Fellow, 2010.
For outstanding contributions in database and data stream processing systems.