180 Park Ave - Building 103

Florham Park, NJ

http://public.research.att.com/~barna/

Subject matter expert in Algorithms, Probabilistic Methods, Databases

Her research interests span several areas of Theoretical Computer Science (TCS) and Databases. Specifically, within TCS, she works on algorithm design and analysis, hardness of approximation, probabilistic methods, and combinatorial optimization. In Databases, her major focus is on research related to data quality, data integration, and distributed data management, such as in the cloud.

On Repairing Structural Problems in Semi-structured Data

Philip Korn, Barna Saha, Divesh Srivastava, Shanshan Ying

VLDB,
2013.
[PDF]
[BIB]

VLDB Foundation Copyright

The definitive version was published in Very Large Databases, 2013 (2013-08-26).

Semi-structured data such as XML are popular for data interchange and storage. However, many XML documents have improper nesting where open- and close-tags are unmatched. Since some semi-structured data (e.g., LaTeX) have a flexible grammar, and since many XML documents lack an accompanying DTD or XSD, we focus on computing a syntactic repair via the edit distance.

To solve this problem, we propose a dynamic programming algorithm which takes cubic time. While this algorithm is not scalable, well-formed substrings of the data can be pruned to enable faster computation. Unfortunately, there are still cases where the dynamic program can be very expensive; hence, we give branch-and-bound algorithms based on various combinations of two heuristics, called MinCost and MaxBenefit, that trade off between accuracy and efficiency. Finally, we experimentally demonstrate the performance of these algorithms on real data.
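The flavor of the cubic dynamic program can be sketched with a simplified interval DP: given a sequence of open/close tags, compute the minimum number of single-tag insertions or deletions needed to make the sequence well-nested. This is only an illustration of the technique, not the paper's algorithm (which handles full edit distance and pruning of well-formed substrings); the tag representation is an assumption.

```python
from functools import lru_cache

def min_repair_cost(tags):
    """Interval DP (cubic time): minimum number of single-tag
    insertions/deletions to make a tag sequence well-nested.
    tags: list of (name, is_open) pairs, e.g. ("a", True) for <a>."""
    n = len(tags)

    @lru_cache(maxsize=None)
    def cost(i, j):  # repair cost of tags[i..j], inclusive
        if i > j:
            return 0  # an empty sequence is already well-formed
        # Option 1: delete tags[i] (equivalently, insert its mate) at cost 1.
        best = 1 + cost(i + 1, j)
        name_i, open_i = tags[i]
        name_j, open_j = tags[j]
        # Option 2: tags[i] and tags[j] form a matching open/close pair.
        if open_i and not open_j and name_i == name_j:
            best = min(best, cost(i + 1, j - 1))
        # Option 3: split the interval and repair the halves independently.
        for k in range(i, j):
            best = min(best, cost(i, k) + cost(k + 1, j))
        return best

    return cost(0, n - 1)
```

For example, `<a><b></a>` needs one edit (insert `</b>` or delete `<b>`), while a properly nested `<a><b></b></a>` needs none.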

Less is More: Selecting Sources Wisely for Integration

Divesh Srivastava, Xin Dong, Barna Saha

VLDB,
2013.
[PDF]
[BIB]

VLDB Foundation Copyright

The definitive version was published in Very Large Databases, 2012 (2013-08-26).

We are often thrilled by the abundance of information surrounding us and wish to integrate data from as many sources as possible. However, understanding, analyzing, and using these data are often hard. Too much data can introduce a huge integration cost, such as expenses for purchasing data and resources for integration and cleaning. Furthermore, including low-quality data can even deteriorate the quality of integration results instead of bringing the desired quality gain. Thus, "the more the better" does not always hold for data integration, and often "less is more".

In this paper, we study how to select a subset of sources before integration such that we can balance the quality of integrated data and integration cost. Inspired by the Marginalism principle in economic theory, we wish to integrate a new source only if its marginal gain, often a function of improved integration quality, is higher than the marginal cost, associated with data-purchase expense and integration resources. As a first step towards this goal, we focus on data fusion tasks, where the goal is to resolve conflicts from different sources. We propose a randomized solution for selecting sources for fusion and show empirically its effectiveness and scalability on both real-world data and synthetic data.
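The Marginalism criterion can be sketched with a simple greedy loop: admit a source only while its marginal gain exceeds its marginal cost. This is a toy illustration under assumed gain/cost functions, not the paper's randomized algorithm; the per-source quality values and the coverage-style gain model are made up for the example.

```python
def select_sources(candidates, gain, cost):
    """Greedy sketch of the Marginalism principle: add a source only if
    its marginal gain exceeds its marginal cost. gain/cost are
    caller-supplied set functions (hypothetical here)."""
    selected = []
    # Consider sources in decreasing order of standalone net benefit (a heuristic).
    for s in sorted(candidates, key=lambda s: gain([s]) - cost([s]), reverse=True):
        marginal_gain = gain(selected + [s]) - gain(selected)
        marginal_cost = cost(selected + [s]) - cost(selected)
        if marginal_gain > marginal_cost:
            selected.append(s)
    return selected

# Hypothetical setup: each source has an accuracy; gain has diminishing
# returns (probability at least one selected source is correct), cost is flat.
quality = {"A": 0.9, "B": 0.5, "C": 0.1}

def gain(sources):
    p_all_wrong = 1.0
    for s in sources:
        p_all_wrong *= 1 - quality[s]
    return 10 * (1 - p_all_wrong)

def cost(sources):
    return 1.5 * len(sources)  # flat integration cost per source
```

With these numbers, only source "A" is selected: once it is in, the marginal gain of adding "B" or "C" (0.5 and 0.1) falls below the 1.5 marginal cost, illustrating "less is more".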

Set Cover Revisited: Hypergraph Cover with Hard Capacities

Barna Saha, Samir Khuller

International Colloquium on Automata, Languages, and Programming (ICALP),
2012.
[PDF]
[BIB]

Springer-Verlag Copyright

The definitive version was published in 2012 (2012-07-13).

In this paper, we consider generalizations of classical covering problems to handle hard capacities. In the hard capacitated set cover problem, each set additionally has a covering capacity which we are not allowed to exceed; in other words, after picking a set, we may cover at most a specified number of elements. Based on the classical results by Wolsey, an O(log n) approximation follows for this problem.

Chuzhoy and Naor [FOCS 2002] first studied the special case of unweighted vertex cover with hard capacities and developed an elegant 3-approximation for it based on rounding a natural LP relaxation. This was subsequently improved to a 2-approximation by Gandhi et al. [ICALP 2003]. These results are surprising in light of the fact that weighted vertex cover with hard capacities is at least as hard to approximate as set cover; hence the unweighted problem is separated from the weighted version.

The set cover hardness precludes the possibility of a constant factor approximation for the hard-capacitated vertex cover problem on weighted graphs. However, it was not known whether a better than logarithmic approximation is possible on unweighted multigraphs, i.e., graphs that may contain parallel edges. Neither the approach of Chuzhoy and Naor, nor the follow-up work of Gandhi et al., can handle the case of multigraphs. In fact, achieving a constant factor approximation for the hard-capacitated vertex cover problem on unweighted multigraphs was posed as an open question in Chuzhoy and Naor's work. In this paper, we resolve this question by providing the first constant factor approximation algorithm for the vertex cover problem with hard capacities on unweighted multigraphs. Previous works also cannot handle hypergraphs, which is analogous to considering set systems where each element belongs to at most f sets. We give an O(f) approximation algorithm for this problem. Further, we extend these works to consider partial covers.
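The classical greedy that underlies the O(log n) bound mentioned above can be sketched as follows: repeatedly pick the set that covers the most still-uncovered elements without exceeding its capacity, using each set at most once. This is an illustration of the baseline greedy, not the paper's constant-factor or O(f) algorithms; the concrete sets and capacities are made up.

```python
def greedy_capacitated_cover(universe, sets, capacity):
    """Greedy for set cover with hard capacities: at each step pick the
    set covering the most uncovered elements within its capacity.
    sets: dict name -> set of elements; capacity: dict name -> int.
    Each set may be used at most once (a hard capacity constraint)."""
    uncovered = set(universe)
    available = dict(sets)
    solution = []  # list of (set_name, elements_assigned_to_it)
    while uncovered:
        best = max(available,
                   key=lambda s: min(len(available[s] & uncovered), capacity[s]))
        can_cover = min(len(available[best] & uncovered), capacity[best])
        if can_cover == 0:
            raise ValueError("remaining elements cannot be covered")
        # Assign up to capacity[best] uncovered elements to this set
        # (sorted for determinism in this sketch).
        assigned = set(sorted(available[best] & uncovered)[:capacity[best]])
        uncovered -= assigned
        solution.append((best, assigned))
        del available[best]  # hard capacity: the set cannot be reused
    return solution
```

Note the capacity cap is what distinguishes this from uncapacitated greedy: a large set may still cover only a few elements per pick, which is exactly what defeats the simpler LP-rounding approaches on multigraphs.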