
180 Park Ave - Building 103
Florham Park, NJ
http://www2.research.att.com/~lunadong/
Subject matter expert in databases, information integration, data cleaning, Web search, and personal information management
Xin Luna Dong is a researcher in the Data Management Department at AT&T Labs - Research. She received her Ph.D. in Computer Science and Engineering at the University of Washington. Before coming to the United States, she obtained an M.S. in Computer Science at Peking University, and a B.S. in Computer Science at Nankai University in China.
Chronos: Facilitating History Discovery by Linking Temporal Records
Divesh Srivastava, Xin Dong, Pei Li, Haidong Wang, Christina Tziviskou, Xiaoguang Liu, Andrea Maurino
VLDB,
2012.
[PDF]
[BIB]
VLDB Foundation Copyright
The definitive version was published in Very Large Databases, 2012 (2012-08-27).
Many data sets contain temporal records over a long period of time; each record is associated with a time stamp and describes some aspects of a real-world entity at that particular time. From such data, users often wish to search for entities in a particular period and understand the history of one entity or all entities in the data set. A major challenge for enabling such search and exploration is to identify records that describe the same real-world entity over a long period of time; however, linking temporal records is hard because the values that describe an entity can evolve over time (e.g., a person can move from one affiliation to another).

We demonstrate the Chronos system, which offers users a useful tool for finding real-world entities over time and understanding the history of entities in the bibliography domain. The core of Chronos is a temporal record-linkage algorithm that is tolerant to value evolution over time. Our algorithm can obtain an F-measure of over 0.9 in linking author records and fix errors made by DBLP. We show how Chronos allows users to explore the history of authors, and how it helps users understand our linkage results by comparing our results with those of existing systems, highlighting differences in the results, explaining our decisions to users, and answering "what-if" questions.

We Challenge You to Certify Your Updates
Su Chen, Xin Dong, Divesh Srivastava, Laks V. S. Lakshmanan
ACM SIGMOD 2011,
2011.
[PDF]
[BIB]
ACM Copyright
(c) ACM, 2011. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM SIGMOD 2011 (2011-06-12).
Correctness of data residing in a database is vital. While integrity constraint enforcement can often ensure data consistency, it is inadequate to protect against updates that involve careless, unintentional errors, e.g., whether a specified update to an employee's record was for the intended employee. We propose a novel approach that is complementary to existing integrity enforcement techniques, to guard against such erroneous updates.

Our approach is based on (a) updaters providing an update certificate with each database update, and (b) the database system verifying the correctness of the update certificate provided before performing the update. We formalize a certificate as a (challenge, response) pair, and characterize good certificates as those that are easy for updaters to provide and, when correct, give the system enough confidence that the update was indeed intended. We present algorithms that efficiently enumerate good challenges, without exhaustively exploring the search space of all challenges. We experimentally demonstrate that (i) databases have many good challenges, (ii) these challenges can be efficiently identified, (iii) certificates can be quickly verified for correctness, (iv) under natural models of an updater's knowledge of the database, update certificates catch a high percentage of the erroneous updates without imposing undue burden on the updaters performing correct updates, and (v) our techniques are robust across a wide range of challenge parameter settings.
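The (challenge, response) verification step described in the abstract can be sketched as follows. This is a minimal illustration under assumed names, not the paper's algorithm: the employee table, field names, and the `certified_update` helper are all hypothetical.

```python
# Minimal sketch of an update certificate check (hypothetical schema).
# A certificate pairs a challenge (a question about the current record)
# with the updater's response; the update is applied only if the
# response matches the database's current state.

employees = {
    101: {"name": "Alice Smith", "dept": "Sales", "salary": 70000},
    102: {"name": "Bob Jones", "dept": "IT", "salary": 80000},
}

def certified_update(db, emp_id, field, new_value, challenge_field, response):
    """Apply the update only if the response matches the current value of
    the challenge field for the intended record."""
    record = db.get(emp_id)
    if record is None or record[challenge_field] != response:
        return False  # certificate verification failed; reject the update
    record[field] = new_value
    return True

# Correct certificate: the updater knows the intended employee's department.
assert certified_update(employees, 101, "salary", 75000, "dept", "Sales")
# Careless update aimed at the wrong employee: the certificate is rejected.
assert not certified_update(employees, 102, "salary", 75000, "dept", "Sales")
```

A good challenge, in the paper's sense, is one the legitimate updater answers easily but a careless update is unlikely to satisfy by accident.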

Online Data Fusion
Xuan Liu, Xin Dong, Beng Chin Ooi, Divesh Srivastava
VLDB Conference,
2011.
[PDF]
[BIB]
VLDB Foundation Copyright
The definitive version was published in Very Large Databases, 2011 (2011-08-29).
The Web contains a significant volume of structured data in various domains, but much of it is dirty and erroneous, and errors can be propagated through copying. While data integration techniques allow querying structured data on the Web, they take the union of the answers retrieved from different sources and can thus return conflicting information. Data fusion techniques, on the other hand, aim to find the true values, but are designed for offline data aggregation and can take a long time.

This paper proposes the first online data fusion system. It starts by returning answers from the first probed source, and refreshes the answers as it probes more sources and applies fusion techniques to the retrieved data. For each returned answer, it shows the likelihood that the answer is correct, and stops retrieving data for it after gaining enough confidence that data from the unprocessed sources are unlikely to change the answer. We address key problems in building such a system and show empirically that the system can start returning correct answers quickly and terminate fast without sacrificing the quality of the answers.
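The probe-refresh-terminate loop described above can be sketched with plain majority voting. The paper's fusion models are more sophisticated (they weigh source accuracy and copying); `online_fuse` and its simple stopping rule here are illustrative assumptions only.

```python
def online_fuse(sources):
    """Probe sources in order, refresh the leading answer after each probe,
    and stop early once unprocessed sources cannot overturn the leader.
    `sources` is the ordered list of values each source reports for one query.
    Returns (answer, number_of_sources_probed)."""
    if not sources:
        return None, 0
    votes = {}
    for i, value in enumerate(sources):
        votes[value] = votes.get(value, 0) + 1
        leader = max(votes, key=votes.get)
        runner_up = max((c for v, c in votes.items() if v != leader), default=0)
        remaining = len(sources) - i - 1
        # Termination test: even if every remaining source voted for the
        # runner-up, the current leader could not be overtaken.
        if votes[leader] - runner_up > remaining:
            return leader, i + 1
    return leader, len(sources)

# Three agreeing sources suffice; the last two are never probed.
print(online_fuse(["NYC", "NYC", "NYC", "Boston", "NYC"]))  # ('NYC', 3)
```

The refresh step is what makes the system "online": after each probe it can already display the current leader together with a confidence estimate, rather than waiting for all sources.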

Linking Temporal Records
Xin Dong, Divesh Srivastava, Pei Li, Andrea Maurino
VLDB Conference,
2011.
[BIB]
VLDB Foundation Copyright
The definitive version was published in Very Large Databases, 2011 (2011-08-29).
Many data sets contain temporal records over a long period of time; each record is associated with a time stamp and describes some aspects of a real-world entity at that particular time (e.g., author information in DBLP). In such cases, we often wish to identify records that describe the same entity over time, enabling interesting longitudinal data analysis. However, existing record linkage techniques ignore the temporal information and can fall short for temporal data.

This paper studies linking temporal records. First, we apply time decay to capture the effect of elapsed time on entity value evolution. Second, instead of comparing each pair of records locally, we propose clustering methods that consider the time order of the records and make global decisions. Experimental results show that our algorithms significantly outperform traditional linkage methods on various temporal data sets.
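As a rough illustration of the time-decay idea, the sketch below penalizes a value mismatch less as the time gap between two records grows, since values such as affiliation legitimately evolve. The record fields, the `half_life` parameter, and the scoring are assumptions, not the paper's actual decay model.

```python
import math

def decayed_similarity(rec1, rec2, half_life=5.0):
    """Similarity of two timestamped author records. A name mismatch counts
    fully; an affiliation mismatch is discounted by a decay factor that
    shrinks with the time gap (halving every `half_life` years), so an
    affiliation change across many years is weak evidence of a non-match."""
    gap = abs(rec1["year"] - rec2["year"])
    decay = math.exp(-gap * math.log(2) / half_life)  # 1.0 at gap 0
    sims = []
    for field in ("name", "affiliation"):
        agree = 1.0 if rec1[field] == rec2[field] else 0.0
        if field == "affiliation" and not agree:
            sims.append(1.0 - decay)  # disagreement matters less over time
        else:
            sims.append(agree)
    return sum(sims) / len(sims)

same_year = decayed_similarity(
    {"name": "X. Dong", "affiliation": "UW", "year": 2004},
    {"name": "X. Dong", "affiliation": "AT&T", "year": 2004})
far_apart = decayed_similarity(
    {"name": "X. Dong", "affiliation": "UW", "year": 2000},
    {"name": "X. Dong", "affiliation": "AT&T", "year": 2010})
# An affiliation conflict in the same year is strong negative evidence;
# the same conflict ten years apart is much weaker.
assert far_apart > same_year
```

The paper's second contribution, temporally ordered global clustering, would then group records using such decayed scores rather than deciding each pair in isolation.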
Dependency Between Sources in Truth Discovery
Tue May 29 16:10:34 EDT 2012
A method and system for truth discovery may implement a methodology that accounts for accuracy of sources and dependency between sources. The methodology may be based on Bayesian probability calculus for determining which data object values published by sources are likely to be true. The method may be recursive with respect to dependency, accuracy, and actual truth discovery for a plurality of sources.
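The recursive interplay between source accuracy and truth estimates described in the patent abstract can be sketched as a simple iterative weighted-voting loop. This omits the dependency (copying) analysis and the Bayesian probability calculus entirely; the `truth_discovery` function, its initialization, and its update rules are illustrative assumptions.

```python
def truth_discovery(claims, n_iter=10):
    """claims: {source: {object: value}}. Alternately estimate the most
    probable value per object (weighted vote by source accuracy) and each
    source's accuracy (fraction of its values agreeing with the current
    truth estimates). Returns (truths, accuracy)."""
    accuracy = {s: 0.8 for s in claims}  # assumed prior accuracy
    truths = {}
    for _ in range(n_iter):
        # Truth step: each object's value scores the sum of the accuracies
        # of the sources that claim it.
        scores = {}
        for s, objs in claims.items():
            for o, v in objs.items():
                scores.setdefault(o, {}).setdefault(v, 0.0)
                scores[o][v] += accuracy[s]
        truths = {o: max(vals, key=vals.get) for o, vals in scores.items()}
        # Accuracy step: how often does each source agree with the truths?
        for s, objs in claims.items():
            correct = sum(1 for o, v in objs.items() if truths[o] == v)
            accuracy[s] = correct / len(objs) if objs else 0.5
    return truths, accuracy

claims = {"s1": {"a": 1, "b": 2},
          "s2": {"a": 1, "b": 2},
          "s3": {"a": 9, "b": 2}}
truths, acc = truth_discovery(claims)
print(truths)  # {'a': 1, 'b': 2} -- s3's outlier value for 'a' is outvoted
```

Handling dependency, the patent's key addition, would further discount votes from sources that appear to copy their values from others, so a copied falsehood repeated by many sources does not masquerade as the majority truth.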