
180 Park Ave - Building 103
Florham Park, NJ
Scheduling to Minimize Staleness and Stretch in Real-Time Data Warehouses
Lukasz Golab, MohammadHossein Bateni, Mohammad Hajiaghayi, Howard Karloff
Theory of Computing Systems Journal,
2011.
[PDF]
[BIB]
Springer Science+Business Media Copyright
The definitive version was published in Theory of Computing Systems Journal. , 2011-07-01
{We study scheduling algorithms for loading data feeds into real time data
warehouses, which are used in applications such as IP network monitoring, online financial
trading, and credit card fraud detection. In these applications, the warehouse
collects a large number of streaming data feeds that are generated by external
sources and arrive asynchronously. Data for each table in the warehouse are generated at a
constant rate, different tables possibly at different rates.
For each data feed, the arrival of new data triggers an emph{update} that seeks to append
the new data to the corresponding table; if multiple updates are pending for the
same table, they are batched together before being loaded.
At time $ au$, if a table has been updated with information up to time $r le au$,
its emph{staleness} is defined as $ au - r$.
Our first objective is to schedule the updates on one or more processors in a way that
minimizes the total staleness.
In order to ensure fairness, our second objective is to limit the maximum ``stretch'',
which we define (roughly) as the
ratio between the duration of time an update waits till it is
finished being processed,
and the length of the update.
In contrast to earlier work proving the nonexistence of constant-competitive algorithms
for related scheduling problems, we prove that emph{any} online nonpreemptive algorithm,
no processor of which is ever voluntarily idle, incurs a staleness at most a constant
factor larger than an obvious lower bound on total staleness (provided that the processors
are sufficiently fast).
We give a constant-stretch algorithm,
provided that the processors are sufficiently fast,
for the emph{quasiperiodic model},
in which tables can be clustered into a few groups such that the update frequencies
within each group vary by at most a constant factor.
Finally, we show that our constant-stretch algorithm is also
constant-competitive (subject to the same proviso on processor speed)
in the quasiperiodic model with respect to total weighted staleness, where tables
are assigned weights that reflect their priorities.}

Capacitated Metric Labeling
Howard Karloff, Mohammad Hajiaghayi, Matthew Andrews, Ankur Moitra
ACM-SIAM Symposium on Discrete Algorithms (SODA),
2011.
[PDF]
[BIB]
SIAM Copyright
The definitive version was published in ACM-SIAM Symposium on Discrete Algorithms (SODA) , 2011-01-23
{We introduce
Capacitated Metric Labeling. As in Metric Labeling, we
are given a weighted graph G = (V, E), a label set L, a
semimetric d_L on this label set, and an assignment cost
function phi: V x L -> R, and the goal, as
in Metric Labeling, is to find an assignment f: V->L
that minimizes a particular two-cost function,
but in Capacitated Metric Labeling with the additional restriction that each label t_i receive at most l_i nodes.
We give a O(log |V|)-approximation algorithm when the number of labels is
fixed. We prove that it is impossible to approximate the value
of an instance of Capacitated Metric Labeling to within
any finite factor, if P!=NP.
Furthermore, we prove that any finite-ratio polynomial-time
approximation algorithm must multiplicatively violate the label
capacities by Omega((log |L|)^(1/2-eps)).
}
Data Auditor: Exploring Data Quality and Semantics using Pattern Tableaux
Lukasz Golab, Howard Karloff, Philip Korn, Divesh Srivastava
2010.
[PDF]
[BIB]
{We present Data Auditor, a tool for exploring data quality and semantics. Data Auditor provides a variety of semantic rule (constraint) classes for asserting consistency and completeness properties of the data. It can be used to test hypotheses by instantiating rules over supplied sets of attributes, to see if such rules hold or fail (to a required degree) on a minimum fraction of the data. Regions of the data for which the rule is satisfied (alternatively, violated) are concisely and meaningfully summarized in a tableau. }
Improved Approximation Algorithms For Label Cover Problems
Howard Karloff, Mohammad Hajiaghayi, Moses Charikar
European Symposium on Algorithms,
2009.
[PDF]
[BIB]
Springer-Verlag Copyright
The definitive version was published in European Symposium on Algorithms , 2009-09-07
{}
A Technique for Improving Approximation Algorithms for Prize-Collecting Problems
Aaron Archer, Howard Karloff, Mohammad Hajiaghayi, MohammadHossein Bateni
2008.
[PDF]
[BIB]
Compressing Rectilinear Pictures And Minimizing Access Control Lists,
June 7, 2011
A geometric model is considered for the problem of minimizing access control lists (ACLs) in network routers. A colored rectilinear pattern is created within an initially white rectangular canvas, and the basic operation is to choose a subrectangle and paint it a single color, overwriting all previous colors in the rectangle. The method operates on rectangular rule lists (RRLs) and access control lists (ACLs) in which all rectangles are strips that extend either the full length or the full height of the canvas. A polynomial-time algorithm optimally constructs such patterns when, as in the ACL application, the only colors are black and white (permit or deny). That algorithm is complemented by a significantly faster approximation algorithm that is guaranteed to be no worse than 3/2 optimal.
Method For Selecting Nodes In A Network,
April 27, 2010
Given a set of network nodes B that are sought to be monitored, and a set of potential monitoring nodes, a subset M of the monitoring nodes is chosen that insures monitoring each node b in B with a pair of nodes m.sub.i and m.sub.j such that no node except b is on both any shortest path from b to m.sub.i and on any shortest path from b to m.sub.j. Some of the nodes in M are chosen in a first step by identifying a subset of B having nodes b that are "t-good" nodes, choosing a subset of potential monitoring nodes as First Partner nodes, and choosing a corresponding subset of potential monitoring nodes as Second Partner nodes. Others are chosen in a second step that handles nodes b that are not "t-good," using a greedy algorithm.
ACM Fellow, 2011.
For contributions to the design and analysis of algorithms