
180 Park Ave - Building 103, Rm B157
Florham Park, NJ
www.research.att.com/~lgolab
Subject matter expert in data stream processing, data warehousing, and data quality
Scheduling to Minimize Staleness and Stretch in Real-Time Data Warehouses
Lukasz Golab, MohammadHossein Bateni, Mohammad Hajiaghayi, Howard Karloff
Theory of Computing Systems Journal,
2011.
[PDF]
[BIB]
Copyright Springer Science+Business Media. The definitive version was published in Theory of Computing Systems Journal, 2011-07-01.
{We study scheduling algorithms for loading data feeds into real-time data
warehouses, which are used in applications such as IP network monitoring, online financial
trading, and credit card fraud detection. In these applications, the warehouse
collects a large number of streaming data feeds that are generated by external
sources and arrive asynchronously. Data for each table in the warehouse are generated at a
constant rate, different tables possibly at different rates.
For each data feed, the arrival of new data triggers an \emph{update} that seeks to append
the new data to the corresponding table; if multiple updates are pending for the
same table, they are batched together before being loaded.
At time $\tau$, if a table has been updated with information up to time $r \le \tau$,
its \emph{staleness} is defined as $\tau - r$.
Our first objective is to schedule the updates on one or more processors in a way that
minimizes the total staleness.
In order to ensure fairness, our second objective is to limit the maximum ``stretch'',
which we define (roughly) as the
ratio between the time an update waits until it is fully processed
and the length of the update.
In contrast to earlier work proving the nonexistence of constant-competitive algorithms
for related scheduling problems, we prove that \emph{any} online nonpreemptive algorithm,
no processor of which is ever voluntarily idle, incurs a staleness at most a constant
factor larger than an obvious lower bound on total staleness (provided that the processors
are sufficiently fast).
We give a constant-stretch algorithm,
provided that the processors are sufficiently fast,
for the \emph{quasiperiodic model},
in which tables can be clustered into a few groups such that the update frequencies
within each group vary by at most a constant factor.
Finally, we show that our constant-stretch algorithm is also
constant-competitive (subject to the same proviso on processor speed)
in the quasiperiodic model with respect to total weighted staleness, where tables
are assigned weights that reflect their priorities.}
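As a concrete illustration of the two metrics above, here is a minimal sketch (the names and simulation structure are my own, not the paper's algorithm) that measures total staleness and stretch for a single table whose updates are loaded in batches:

```python
# Illustrative sketch: total staleness and stretch for one table.
# All function and variable names are hypothetical, not from the paper.

def total_staleness(finished, horizon, dt=0.01):
    """finished: list of (finish_time, data_up_to) pairs, sorted by
    finish_time. At time tau the table reflects data up to time r, the
    largest data_up_to among updates finished by tau; staleness at tau
    is tau - r. We integrate staleness over [0, horizon) with step dt."""
    total, r, i, t = 0.0, 0.0, 0, 0.0
    while t < horizon:
        while i < len(finished) and finished[i][0] <= t:
            r = max(r, finished[i][1])  # a loaded update advances freshness
            i += 1
        total += (t - r) * dt
        t += dt
    return total

def stretch(arrival, finish, length):
    """Roughly the paper's notion of stretch: the time an update waits
    until it is fully processed, divided by the update's own length."""
    return (finish - arrival) / length
```

For example, an update carrying data up to time 1.0 that finishes loading at time 1.0 leaves the table stale only before it is loaded and again as new unloaded data accumulates afterward.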

Consistency in a Stream Warehouse
Lukasz Golab, Theodore Johnson
CIDR conference,
2011.
[PDF]
[BIB]
{A stream warehouse is a Data Stream Management System (DSMS) that stores a very long history, e.g. years or decades; or equivalently a data warehouse that is continuously loaded. A stream warehouse enables queries that seamlessly range from real-time alerting and diagnostics to long-term data mining. However, continuously loading data from uncontrolled sources into a real-time stream warehouse introduces a new consistency problem: users want results in as timely a fashion as possible, but ``stable'' results often require lengthy synchronization delays. In this paper we develop a theory of consistency for stream warehouses that allows for multiple consistency levels, we show how to restrict query answers to a given consistency level, and we show how warehouse maintenance can be optimized using knowledge of the consistency levels required by materialized views.}
Stream Data Management
Lukasz Golab, M. Tamer Ozsu
Morgan & Claypool, Synthesis Lectures on Data Management,
2010.
[BIB]
{Many applications process high volumes of streaming data. Examples include Internet traffic analysis, sensor networks, Web server and
error log mining, financial tickers and on-line trading, real-time mining of telephone call records or credit card transactions, tracking the
GPS coordinates of moving objects, and analyzing the results of scientific experiments. In general, a data stream is a data set that is
produced incrementally over time, rather than being available in full before its processing begins. Of course, completely static
data sets are rare in practice, and even traditional databases may be updated over time. However, new problems arise when processing
unbounded streams in nearly real time. In this lecture, we survey these problems and their solutions.}
Data Auditor: Exploring Data Quality and Semantics using Pattern Tableaux
Lukasz Golab, Howard Karloff, Philip Korn, Divesh Srivastava
2010.
[PDF]
[BIB]
{We present Data Auditor, a tool for exploring data quality and semantics. Data Auditor provides a variety of semantic rule (constraint) classes for asserting consistency and completeness properties of the data. It can be used to test hypotheses by instantiating rules over supplied sets of attributes, to see if such rules hold or fail (to a required degree) on a minimum fraction of the data. Regions of the data for which the rule is satisfied (alternatively, violated) are concisely and meaningfully summarized in a tableau. }
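To make the tableau idea concrete, here is a small sketch (this is an illustration of the general pattern-tableau concept, not Data Auditor's actual interface or algorithm; all names are hypothetical) that evaluates a simple completeness rule per pattern and keeps the patterns meeting a confidence threshold:

```python
# Illustrative sketch of a pattern tableau: group rows by pattern
# attributes, compute the fraction satisfying a rule, and keep patterns
# meeting a confidence threshold. Names are hypothetical.
from collections import defaultdict

def pattern_tableau(rows, pattern_attrs, rule, min_conf):
    """rows: list of dicts; pattern_attrs: attributes whose values define
    a pattern; rule: predicate on a row; min_conf: minimum fraction of a
    pattern's rows that must satisfy the rule to enter the tableau."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[a] for a in pattern_attrs)
        groups[key].append(row)
    tableau = []
    for key, members in groups.items():
        conf = sum(rule(r) for r in members) / len(members)
        if conf >= min_conf:  # pattern satisfies the rule "to a degree"
            tableau.append((key, round(conf, 2)))
    return tableau
```

For instance, with a rule asserting that the `zip` attribute is non-null and a 90% confidence threshold, the tableau concisely summarizes which `state` values satisfy the completeness property.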