
180 Park Ave - Building 103
Florham Park, NJ
Accurate and Efficient Private Release of Datacubes and Contingency Tables
Graham Cormode, Cecilia Procopiuc, Divesh Srivastava, Grigory Yaroslavtsev
IEEE International Conference on Data Engineering,
2013.
[PDF]
[BIB]
IEEE Copyright
This version of the work is reprinted here with permission of IEEE for your personal use. Not for redistribution. The definitive version was published in 2012 , 2013-04-08
{A central problem in releasing information is to do so accurately while providing a privacy guarantee on the output. Recently, there has been much work which guarantees differential privacy on the output, and aims to maximize the utility within this guarantee, for the class of {em linear queries}, which include basic counting queries, data cubes, and contingency tables. Much work follows a common template: pick a ``strategy'' set of linear queries to apply to the data, then use the noisy answers to these queries to reconstruct the queries of interest. This entails either picking a strategy set that is hoped to be good for the queries, or by performing an exhaustive search over the space of all possible strategies.
In this paper, we propose a new approach that balances accuracy and
efficiency: we show how to optimize the accuracy of a given strategy by
answering some strategy queries more accurately than others, based on the target queries. This leads to an efficient optimal noise allocation for many popular strategies, including wavelets, hierarchies, Fourier coefficients and more.
For the important case of marginal queries (equivalently, subsets of the data cube), we show that this strictly improves on previous methods, both analytically and empirically. Our results also extend to ensuring that the returned query answers are consistent with an (unknown) data set at minimal extra cost in
terms of time and noise.
}

Summary Graphs for Relational Database Schemas
Cecilia Procopiuc, Divesh Srivastava, Xiaoyan Yang
VLDB Conference,
2011.
[PDF]
[BIB]
VLDB Foundation Copyright
The definitive version was published in Very Large Databases, 2011. , 2011-08-29
{Increasingly complex databases need ever more sophisticated tools
to help users understand their schemas and interact with the data.
Existing tools fall short of either providing the ``big picture,''
or of presenting useful connectivity information.
In this paper we define summary graphs, a novel approach for
summarizing schemas. Given a set of user-specified query tables,
the summary graph automatically computes the most relevant tables
and joins for that query set. The output preserves the most
informative join paths between the query tables, while meeting
size constraints. In the process, we define a novel
information-theoretic measure over join edges. Unlike most
subgraph extraction work, we allow metaedges (i.e., edges in the
transitive closure) to help reduce output complexity. We prove
that the problem is NP-Hard, and solve it as an integer program.
Our extensive experimental study shows that our method returns
high-quality summaries under independent quality measures.}

Automatic Discovery of Attributes in Relational Databases
Meihui Zhang, Beng Chin Ooi, Divesh Srivastava, Cecilia Procopiuc, Marios Hadjieleftheriou
ACM SIGMOD 2011,
2011.
[PDF]
[BIB]
ACM Copyright
(c) ACM, 2011. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM SIGMOD 2011 , 2011-06-12, http://www.acm.org/.
{In this work we try to cluster relational columns into attributes, i.e., to identify strong relationships between columns
based on the common properties and characteristics of the values they contain.
For example, identifying whether a certain set of columns refers
to telephone numbers versus social security numbers, or names of customers versus names of employees.
Traditional relational database schema languages use very limited primitive data types and
simple foreign key constraints to express relationships between columns. Object oriented schema
languages allow the definition of custom data types, but still, certain relationships between
columns might be unknown at design time or they might appear only in a particular database instance.
Nevertheless, these relationships are an invaluable tool for schema matching, and generally for
better understanding and working with the data. Here, we introduce data oriented solutions (we do not consider solutions that
assume the existence of any external knowledge), that use statistical measures to identify strong relationships
between the values of a set of columns. Interpreting the
database as a graph where nodes correspond to database columns and edges correspond to column
relationships, we decompose the graph into connected components and cluster sets of columns into
attributes. To test the quality of our solutions, we also provide a
comprehensive experimental evaluation using real and synthetic datasets.
}
Rectangular Layouts and Contact Graphs
Adam L. Buchsbaum, Emden R. Gansner, Cecilia Magdalena Procopiuc, Suresh Venkatasubramanian
CoRR,
vabs/cs/0611107,
2006.
[BIB]
Method And System For Facility Location Optimization,
Tue Jan 29 14:43:59 EST 2013
Systems and methods for optimization of facility locations are disclosed, for example, wireless telecommunications facility locations. Among a plurality of cluster points, corresponding to wireless customers, optimal locations are determined for a predetermined number of cluster centers, each cluster center having a predetermined cluster radius that defines a cluster area. Among a plurality of cluster points, optimal facility locations are determined for a variable number of cluster centers, each cluster center having a minimum acceptable economic value.
Computer Systems, Methods And Computer Program Products For Dta Anonymization For Aggregate Query Answering,
Tue Feb 07 16:09:16 EST 2012
Computer program products are provided for anonymizing a database that includes tuples. A respective tuple includes at least one quasi-identifier and sensitive attributes associated with the quasi-identifier. These computer program products include computer readable program code that is configured to (k,e)-anonymize the tuples over a number k of different values in a range e of values, while preserving coupling at least two of the sensitive attributes to one another in the sets of attributes that are anonymized to provide a (k,e)-anonymized database. Related computer systems and methods are also provided.