
Bistro Data Feed Management System
Vladislav Shkapenyuk, Divesh Srivastava, Theodore Johnson
ACM SIGMOD Conference, 2011.
ACM Copyright
(c) ACM, 2011. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM SIGMOD Conference, 2011-06-12.
Data feed management is a critical component of many data intensive applications that depend on reliable data delivery to support real-time data collection, correlation and analysis. Data is typically collected from a wide variety of sources and organizations, using a range of mechanisms - some data are streamed in real time, while other data are obtained at regular intervals or collected in an ad hoc fashion. Individual applications are forced to make separate arrangements with feed providers, learn the structure of incoming files, monitor data quality, and trigger any processing necessary. The Bistro data feed manager, designed and implemented at AT&T Labs-Research, simplifies and automates this complex task of data feed management: efficiently handling incoming raw files, identifying data feeds and distributing them to remote subscribers.
Bistro supports a flexible specification language to define logical data feeds using the naming structure of physical data files, and to identify feed subscribers. Based on the specification, Bistro matches data files to feeds, performs file normalization and compression, efficiently delivers files, and notifies subscribers using a trigger mechanism. We describe our feed analyzer that discovers the naming structure of incoming data files to detect new feeds, dropped feeds, feed changes, or lost data in an existing feed. Bistro is currently deployed within AT&T Labs and is responsible for the real-time delivery of over 100 different raw feeds, distributing data to several large-scale stream warehouses.
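The core loop described in the abstract - match each incoming physical file to a logical feed defined over the file-naming structure, then trigger that feed's subscribers - can be pictured with the minimal sketch below. It is an illustration only, assuming a hypothetical regex-based FeedSpec and a notify callback; it is not Bistro's actual specification language, matching engine, or delivery pipeline.

```python
# Illustrative sketch of matching raw files to logical feeds by file-name
# pattern and triggering subscribers; FeedSpec, its fields, and the notify
# callback are assumptions, not Bistro's actual interfaces.
import re
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class FeedSpec:
    name: str                                      # logical feed name (hypothetical)
    pattern: str                                   # regex over physical file names (assumption)
    subscribers: List[Callable[[str, str], None]] = field(default_factory=list)

class FeedMatcher:
    def __init__(self, specs: List[FeedSpec]):
        # Pre-compile each feed's file-name pattern.
        self.specs = [(spec, re.compile(spec.pattern)) for spec in specs]

    def handle_incoming(self, filename: str) -> None:
        """Match a raw file against each feed spec and notify its subscribers."""
        for spec, regex in self.specs:
            if regex.fullmatch(filename):
                for notify in spec.subscribers:
                    notify(spec.name, filename)    # simplified trigger mechanism
                return
        print(f"unmatched file: {filename}")       # candidate input for a feed analyzer

# Hypothetical feed and subscriber.
specs = [FeedSpec(name="router_logs",
                  pattern=r"router_\d{8}_\d{4}\.csv\.gz",
                  subscribers=[lambda feed, f: print(f"deliver {f} to subscribers of {feed}")])]
FeedMatcher(specs).handle_incoming("router_20110612_0030.csv.gz")
```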
Method For Scheduling Updates In A Streaming Data Warehouse, May 28, 2013
A method for scheduling atomic update jobs to a streaming data warehouse includes allocating execution tracks for executing the update jobs. The tracks may be assigned a portion of available processor utilization and memory. A database table may be associated with a given track. An update job directed to the database table may be dispatched to the given track for the database table, when the track is available. When the track is not available, the update job may be executed on a different track. Furthermore, pending update jobs directed to common database tables may be combined and separated in certain transient conditions.
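A minimal sketch of that dispatch rule follows: an update for a table goes to the table's home track when it is free, falls back to another free track otherwise, and pending updates to the same table are combined before dispatch. The Track, UpdateJob, and Scheduler classes and the merging policy are assumptions made for illustration, not the patented method.

```python
# Hedged sketch of track-based update dispatch; the classes and the
# combine-before-dispatch policy are assumptions, not the claimed method.
from collections import defaultdict
from typing import Dict, List, Optional

class UpdateJob:
    def __init__(self, table: str, rows: List[dict]):
        self.table = table
        self.rows = rows

class Track:
    def __init__(self, track_id: int, cpu_share: float, mem_mb: int):
        self.track_id = track_id
        self.cpu_share = cpu_share      # portion of processor utilization
        self.mem_mb = mem_mb            # portion of memory
        self.busy = False

class Scheduler:
    def __init__(self, tracks: List[Track], home_track: Dict[str, int]):
        self.tracks = {t.track_id: t for t in tracks}
        self.home_track = home_track                  # table -> preferred track
        self.pending: Dict[str, List[UpdateJob]] = defaultdict(list)

    def submit(self, job: UpdateJob) -> None:
        # Pending update jobs directed to the same table accumulate here.
        self.pending[job.table].append(job)

    def dispatch(self, table: str) -> Optional[int]:
        jobs = self.pending.pop(table, [])
        if not jobs:
            return None
        # Combine pending jobs for the table into one atomic update.
        merged = UpdateJob(table, [row for job in jobs for row in job.rows])
        home = self.tracks[self.home_track[table]]
        track = home if not home.busy else next(
            (t for t in self.tracks.values() if not t.busy), None)
        if track is None:
            self.pending[table] = [merged]            # no free track; keep pending
            return None
        track.busy = True
        # ... execute `merged` on `track`, then mark the track free again ...
        return track.track_id

# Example: two updates to the same table are combined and run on the home track.
tracks = [Track(0, cpu_share=0.5, mem_mb=4096), Track(1, cpu_share=0.5, mem_mb=4096)]
sched = Scheduler(tracks, home_track={"flows": 0})
sched.submit(UpdateJob("flows", [{"ts": 1}]))
sched.submit(UpdateJob("flows", [{"ts": 2}]))
print(sched.dispatch("flows"))                        # -> 0 (home track was free)
```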
Query-Aware Sampling Of Data Streams, January 31, 2012
A system, method and computer-readable medium provide for assigning sampling methods to each input stream for arbitrary query sets in a data stream management system. The method embodiment comprises splitting all query nodes in a query directed acyclic graph (DAG) having multiple parent nodes into sets of independent nodes having a single parent, computing a grouping set for every node in each set of independent nodes, reconciling each parent node with each child node in each set of independent nodes, reconciling between multiple child nodes that share a parent node, and generating a final grouping set for at least one node describing how to sample an input stream for that node.
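As a rough illustration of the splitting and reconciliation steps, the sketch below copies any operator consumed by more than one downstream query so each copy has a single parent, then pushes grouping attributes down by intersection to arrive at a per-node grouping set. The QueryNode fields and the intersection-based reconciliation rule are assumptions made for the example, not the claimed method.

```python
# Hedged sketch of DAG splitting and grouping-set reconciliation; QueryNode
# and the intersection rule are assumptions, not the claimed method.
from typing import Dict, List, Optional, Set

class QueryNode:
    def __init__(self, name: str, group_by: Set[str], children: List["QueryNode"]):
        self.name = name
        self.group_by = group_by        # grouping attributes of this query node
        self.children = children        # operators whose output this node consumes

def split_multi_parent_nodes(roots: List[QueryNode]) -> List[QueryNode]:
    """Copy shared operators so every node in the result has a single parent."""
    def clone(node: QueryNode) -> QueryNode:
        return QueryNode(node.name, set(node.group_by),
                         [clone(child) for child in node.children])
    return [clone(root) for root in roots]

def reconcile(node: QueryNode, parent_set: Optional[Set[str]] = None) -> Dict[str, Set[str]]:
    """Push grouping sets down: a child keeps only attributes its parent also groups on."""
    own = node.group_by if parent_set is None else node.group_by & parent_set
    result = {node.name: own}
    for child in node.children:
        result.update(reconcile(child, own))
    return result

# A source operator feeding two aggregations is split into two single-parent
# copies; each copy's final grouping set describes how its input stream may be
# sampled group-wise without biasing that query.
src = QueryNode("src", {"ip", "port", "ts"}, [])
q1 = QueryNode("q1", {"ip"}, [src])
q2 = QueryNode("q2", {"ip", "port"}, [src])
for root in split_multi_parent_nodes([q1, q2]):
    print(reconcile(root))
```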
Query-Aware Sampling Of Data Streams, May 19, 2009
A system, method and computer-readable medium provide for assigning sampling methods to each input stream for arbitrary query sets in a data stream management system. The method embodiment comprises splitting all query nodes in a query directed acyclic graph (DAG) having multiple parent nodes into sets of independent nodes having a single parent, computing a grouping set for every node in each set of independent nodes, reconciling each parent node with each child node in each set of independent nodes, reconciling between multiple child nodes that share a parent node, and generating a final grouping set for at least one node describing how to sample an input stream for that node.