
Accurate and Efficient Private Release of Datacubes and Contingency Tables
Graham Cormode, Cecilia Procopiuc, Divesh Srivastava, Grigory Yaroslavtsev
IEEE International Conference on Data Engineering,
2013.
[PDF]
[BIB]
IEEE Copyright
This version of the work is reprinted here with permission of IEEE for your personal use. Not for redistribution. The definitive version was published in ICDE 2013. 2013-04-08
{A central problem in releasing information is to do so accurately while providing a privacy guarantee on the output. Recently, there has been much work which guarantees differential privacy on the output, and aims to maximize the utility within this guarantee, for the class of linear queries, which include basic counting queries, data cubes, and contingency tables. Much work follows a common template: pick a ``strategy'' set of linear queries to apply to the data, then use the noisy answers to these queries to reconstruct the queries of interest. This entails either picking a strategy set that is hoped to be good for the queries, or performing an exhaustive search over the space of all possible strategies.
In this paper, we propose a new approach that balances accuracy and
efficiency: we show how to optimize the accuracy of a given strategy by
answering some strategy queries more accurately than others, based on the target queries. This leads to an efficient optimal noise allocation for many popular strategies, including wavelets, hierarchies, Fourier coefficients and more.
For the important case of marginal queries (equivalently, subsets of the data cube), we show that this strictly improves on previous methods, both analytically and empirically. Our results also extend to ensuring that the returned query answers are consistent with an (unknown) data set at minimal extra cost in
terms of time and noise.
}
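The budget-splitting idea behind optimized noise allocation can be sketched in a few lines (a minimal sketch: the proportional-to-weight budget split, the weights themselves, and the unit-sensitivity counting queries are illustrative assumptions, not the paper's actual optimizer):

```python
import math
import random

_rng = random.Random(0)

def laplace_noise(scale):
    # Inverse-CDF sampling from Laplace(0, scale).
    u = _rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def answer_strategy_queries(true_answers, weights, epsilon=1.0):
    # Split the total privacy budget epsilon across the strategy queries
    # in proportion to (hypothetical) importance weights: queries that
    # matter more for reconstructing the targets get a larger budget
    # share, hence a smaller Laplace scale (sensitivity 1 assumed).
    total = sum(weights)
    return [ans + laplace_noise(total / (epsilon * w))
            for ans, w in zip(true_answers, weights)]
```

Heavily weighted strategy queries are simply answered with a proportionally smaller noise scale; the target queries are then reconstructed from these unevenly noised answers.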

Dense subgraph maintenance under streaming edge weight updates for real-time story identification
Divesh Srivastava, Albert Angel, Nick Koudas, Nikos Sarkas
VLDB 2012,
2012.
[PDF]
[BIB]
VLDB Foundation Copyright
The definitive version was published in Very Large Databases, 2012, Volume 5, Issue 6. 2012-08-15
{Recent years have witnessed an unprecedented proliferation of social
media. People around the globe author, every day, millions
of blog posts, micro-blog posts, social network status updates, etc.
This rich stream of information can be used to identify, on an ongoing
basis, emerging stories, and events that capture popular attention.
Stories can be identified via groups of tightly-coupled real-world
entities, namely the people, locations, products, etc., that are
involved in the story. The sheer scale, and rapid evolution of the
data involved necessitate highly efficient techniques for identifying
important stories at every point of time.
The main challenge in real-time story identification is the maintenance
of dense subgraphs (corresponding to groups of tightly coupled
entities) under streaming edge weight updates (resulting
from a stream of user-generated content). This is the first work
to study the efficient maintenance of dense subgraphs under such
streaming edge weight updates. For a wide range of definitions
of density, we derive theoretical results regarding the magnitude
of change that a single edge weight update can cause. Based on
these, we propose a novel algorithm, DYNDENS, which outperforms
adaptations of existing techniques to this setting, and yields
meaningful results. Our approach is validated by a thorough experimental
evaluation on large-scale real and synthetic datasets.}
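For the simplest style of density definition considered in such work (total internal edge weight divided by the number of nodes, an illustrative choice), the effect of a single edge weight update is easy to see: a change of delta to one internal edge shifts the density by exactly delta / |S|:

```python
def density(nodes, edge_weights):
    # Total weight of edges with both endpoints inside the node set,
    # divided by the number of nodes. This is one of many density
    # definitions the paper parameterizes over; it is used here only
    # to illustrate the bounded effect of a single update.
    inside = sum(w for (u, v), w in edge_weights.items()
                 if u in nodes and v in nodes)
    return inside / len(nodes)
```

Bounds of this form are what make incremental maintenance feasible: a single streaming update can only move the density of any fixed subgraph by a small, predictable amount.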

Chronos: Facilitating History Discovery by Linking Temporal Records
Divesh Srivastava, Xin Dong, Pei Li, Haidong Wang, Christina Tziviskou, Xiaoguang Liu, Andrea Maurino
VLDB,
2012.
[PDF]
[BIB]
VLDB Foundation Copyright
The definitive version was published in Very Large Databases, 2012. 2012-08-27
{Many data sets contain temporal records over
a long period of time; each record is associated with
a time stamp and describes some aspects of a real-world
entity at that particular time. From such data,
users often wish to search for entities in a particular period
and understand the history of one entity or all entities in
the data set. A major challenge for enabling such search and exploration
is to identify records that describe the same real-world
entity over a long period of time; however, linking
temporal records is hard given that the values that
describe an entity can evolve over time (e.g., a person
can move from one affiliation to another).
We demonstrate the Chronos system,
which offers users a useful tool for finding real-world
entities over time and understanding the history of entities
in the bibliography domain. The core of Chronos
is a temporal record-linkage algorithm, which is tolerant
to value evolution over time. Our algorithm can obtain an F-measure of
over 0.9 in linking author records and fix errors made by DBLP.
We show how Chronos allows users to explore the history of authors,
and how it helps users understand our linkage results
by comparing our results with those of existing systems,
highlighting differences in the results,
explaining our decisions to users, and answering
``what-if" questions.}

A Dataset Search Engine for the Research Document Corpus
Graham Cormode, Divesh Srivastava, Srinivas Bangalore, Marios Hadjieleftheriou
ICDE 2012,
2012.
[PDF]
[BIB]
IEEE Copyright
This version of the work is reprinted here with permission of IEEE for your personal use. Not for redistribution. The definitive version was published in ICDE 2012. 2012-04-01
{A key step in validating a proposed idea or system
is to evaluate it over a suitable data set. However, to date there
have been no useful tools for researchers to understand which
datasets have been used for what purpose, or in what prior
work. Instead, they have to manually browse through papers
to find suitable datasets and their URLs, which is laborious and
inefficient. To better aid the data discovery process, and provide a
better understanding of how and where datasets have been used,
we propose a framework to effectively identify datasets within the
scientific corpus. The key technical challenges are identification
of datasets, and discovery of the association between a dataset
and the URLs where it can be accessed. Based on this, we
have built a user friendly web-based search interface for users
to conveniently explore the dataset-paper relationships, and find
relevant datasets and their properties.}

We Challenge You to Certify Your Updates
Su Chen, Xin Dong, Divesh Srivastava, Laks V. S. Lakshmanan
ACM SIGMOD 2011,
2011.
[PDF]
[BIB]
ACM Copyright
(c) ACM, 2011. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM SIGMOD 2011. 2011-06-12.
{Correctness of data residing in a database is vital. While integrity constraint enforcement can often ensure data consistency, it is inadequate to protect against updates that involve careless, unintentional errors, e.g., whether a specified update to an employee's record was for the intended employee. We propose a novel approach that is complementary to existing integrity enforcement techniques, to guard against such erroneous updates.
Our approach is based on (a) updaters providing an update certificate with each database update, and (b) the database system verifying the correctness of the update certificate provided before performing the update. We formalize a certificate as a (challenge, response) pair, and characterize good certificates as those that are easy for updaters to provide and, when correct, give the system enough confidence that the update was indeed intended. We present algorithms that efficiently enumerate good challenges, without exhaustively exploring the search space of all challenges. We experimentally demonstrate that (i) databases have many good challenges, (ii) these challenges can be efficiently identified, (iii) certificates can be quickly verified for correctness, (iv) under natural models of an updater's knowledge of the database, update certificates catch a high percentage of the erroneous updates without imposing undue burden on the updaters performing correct updates, and (v) our techniques are robust across a wide range of challenge parameter settings.}
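A toy version of the (challenge, response) check conveys the idea (hypothetical table layout and attribute names; the paper enumerates and scores good challenges rather than fixing one):

```python
def verify_certificate(table, key, challenge):
    # challenge is an (attribute, claimed_value) pair: the updater
    # proves they are looking at the intended record by correctly
    # stating the current value of some other attribute of it.
    attr, claimed = challenge
    record = table.get(key)
    return record is not None and record.get(attr) == claimed

def apply_update(table, key, attr, new_value, challenge):
    # Perform the update only when the certificate checks out.
    if not verify_certificate(table, key, challenge):
        return False
    table[key][attr] = new_value
    return True
```

An updater who has targeted the wrong record is unlikely to answer the challenge correctly, so the erroneous update is rejected before it touches the data.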

Summary Graphs for Relational Database Schemas
Cecilia Procopiuc, Divesh Srivastava, Xiaoyan Yang
VLDB Conference,
2011.
[PDF]
[BIB]
VLDB Foundation Copyright
The definitive version was published in Very Large Databases, 2011. 2011-08-29
{Increasingly complex databases need ever more sophisticated tools
to help users understand their schemas and interact with the data.
Existing tools fall short of either providing the ``big picture''
or presenting useful connectivity information.
In this paper we define summary graphs, a novel approach for
summarizing schemas. Given a set of user-specified query tables,
the summary graph automatically computes the most relevant tables
and joins for that query set. The output preserves the most
informative join paths between the query tables, while meeting
size constraints. In the process, we define a novel
information-theoretic measure over join edges. Unlike most
subgraph extraction work, we allow metaedges (i.e., edges in the
transitive closure) to help reduce output complexity. We prove
that the problem is NP-Hard, and solve it as an integer program.
Our extensive experimental study shows that our method returns
high-quality summaries under independent quality measures.}

Online Data Fusion
Xuan Liu, Xin Dong, Beng Chin Ooi, Divesh Srivastava
VLDB Conference,
2011.
[PDF]
[BIB]
VLDB Foundation Copyright
The definitive version was published in Very Large Databases, 2011. 2011-08-29
{The Web contains a significant volume of structured data in various
domains, but a lot of data are dirty and erroneous, and they can be
propagated through copying. While data integration techniques allow
querying structured data on the Web, they take the union of the
answers retrieved from different sources and can thus return conflicting
information. Data fusion techniques, on the other hand, aim
to find the true values, but are designed for offline data aggregation
and can take a long time.
This paper proposes the first online data fusion system. It starts
with returning answers from the first probed source, and refreshes
the answers as it probes more sources and applies fusion techniques
on the retrieved data. For each returned answer, it shows the likelihood
that the answer is correct, and stops retrieving data for it after
gaining enough confidence that data from the unprocessed sources
are unlikely to change the answer. We address key problems in
building such a system and show empirically that the system can
start returning correct answers quickly and terminate fast without
sacrificing the quality of the answers.}
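The early-stopping rule can be illustrated with plain vote counting over probed sources (a simplification: the actual system weighs sources by accuracy and copying relationships when computing its confidence):

```python
from collections import Counter

def online_fuse(sources, key):
    # Probe sources in order; after each probe, stop as soon as the
    # leading value's margin over the runner-up exceeds the number of
    # unprocessed sources, since further votes cannot overturn it.
    votes = Counter()
    for probed, src in enumerate(sources, start=1):
        votes[src[key]] += 1
        ranked = votes.most_common(2)
        lead = ranked[0][1] - (ranked[1][1] if len(ranked) > 1 else 0)
        if lead > len(sources) - probed:
            return ranked[0][0], probed  # answer, sources probed
    return votes.most_common(1)[0][0], len(sources)
```

Even this crude rule shows the payoff: once the answer is safe, the remaining sources are never retrieved, so correct answers can be finalized long before all sources are processed.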

On Query Result Diversification
Marios Hadjieleftheriou, Divesh Srivastava, Marcos Vieira, Humberto Razente, Maria Barioni, Caetano Traina Jr., Vassilis J. Tsotras
IEEE ICDE 2011,
2011.
[BIB]
{In this paper we describe a general framework for
evaluation and optimization of methods for diversifying query
results. In these methods, an initial ranking answer set produced
by a query is used to construct a result set, where elements
are ranked with respect to relevance and diversity features, i.e.,
the retrieved elements should be as relevant as possible to the
query, and, at the same time, the result set should be as diverse
as possible. While addressing relevance is relatively simple, and
has been heavily studied, diversity is a harder problem to solve.
One major contribution of this paper is that, using the above
framework, we adapt, implement and evaluate several existing
methods for diversifying query results. We also propose two
new approaches, namely the Greedy with Marginal Contribution
(GMC) and the Greedy Randomized with Neighborhood Expansion
(GNE) methods. Both methods iteratively construct a result
set using a scoring function that ranks candidate elements using
not only relevance and diversity to the existing result set, but also
accounts for diversity against the remaining candidates. Another
major contribution of this paper is that we present the first
thorough experimental evaluation of the various diversification
techniques implemented in a common framework. We examine
the methods' performance with respect to precision, running
time and quality of the result. Our experimental results show
that while the proposed methods have higher running times,
they achieve precision very close to the optimal, while also
providing the best result quality. While GMC is deterministic, the
randomized approach (GNE) can achieve better result quality if
the user is willing to trade off running time.}
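The greedy construction can be sketched as follows (a simplified marginal-contribution score: GMC's actual function also scores diversity against the remaining candidates, and GNE adds randomized neighborhood expansion):

```python
def greedy_diversify(candidates, relevance, dist, k, lam=0.5):
    # At each step, pick the candidate with the best mix of relevance
    # and distance to the results selected so far; lam trades off the
    # two objectives (lam=1 is pure relevance ranking).
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def score(c):
            div = min((dist(c, s) for s in selected), default=1.0)
            return lam * relevance[c] + (1.0 - lam) * div
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected
```

On a toy number line, the second pick skips a highly relevant but near-duplicate candidate in favor of a distant one, which is exactly the relevance/diversity tension the framework evaluates.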

Linking Temporal Records
Xin Dong, Divesh Srivastava, Pei Li, Andrea Maurino
VLDB Conference,
2011.
[BIB]
VLDB Foundation Copyright
The definitive version was published in Very Large Databases, 2011. 2011-08-29
{Many data sets contain temporal records over a long period of time;
each record is associated with a time stamp and describes some aspects
of a real-world entity at that particular time (e.g., author information
in DBLP). In such cases, we often wish to identify records
that describe the same entity over time and so be able to enable interesting
longitudinal data analysis. However, existing record linkage
techniques ignore the temporal information and can fall short
for temporal data.
This paper studies linking temporal records. First, we apply time
decay to capture the effect of elapsed time on entity value evolution.
Second, instead of comparing each pair of records locally, we
propose clustering methods that consider time order of the records
and make global decisions. Experimental results show that our algorithms
significantly outperform traditional linkage methods on
various temporal data sets.}
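The effect of time decay on record comparison can be sketched with a fixed exponential decay (illustrative: the attribute names and half-life are assumptions, and the paper derives decay behavior from the data rather than fixing it):

```python
import math

def decayed_similarity(rec1, rec2, half_life=5.0):
    # Compare two timestamped records attribute by attribute, with a
    # mismatch counting against a match only to the extent that the
    # time gap is small: values evolve, so disagreement across a large
    # gap is weak evidence that the records describe different entities.
    gap = abs(rec1["year"] - rec2["year"])
    decay = math.exp(-math.log(2.0) * gap / half_life)
    score = 0.0
    for attr in ("name", "affiliation"):
        score += 1.0 if rec1[attr] == rec2[attr] else -decay
    return score
```

Under this scoring, an affiliation change over a decade barely hurts the match, while the same change within a year is strong counter-evidence.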

DivDB: A System for Diversifying Query Results
Marcos R. Vieira, Humberto L. Razente, Maria C. N. Barioni, Marios Hadjieleftheriou, Divesh Srivastava, Caetano Traina Jr., Vassilis J. Tsotras
VLDB Conference,
2011.
[PDF]
[BIB]
VLDB Foundation Copyright
The definitive version was published in Very Large Databases, 2011. 2011-08-29
{With the availability of very large databases, an exploratory
query can easily lead to a vast answer set, typically based on
an answer's relevance (e.g., top-k, tf-idf) to the user query.
Navigating through such an answer set requires huge effort
and users give up after perusing through the first few answers,
thus some interesting answers hidden further down the
answer set can easily be missed. An approach to address
this problem is to present the user with the most diverse
among the answers based on some diversity criterion.
In this demonstration we present DivDB, a system we built
to provide query result diversification both for advanced and
novice users. For the experienced users, who may want to
test the performance of existing and new algorithms, we
provide an SQL-based extension to formulate queries with
diversification. As for the novice users, who may be more
interested in the result rather than how to tune the various
algorithms' parameters, the DivDB system allows the user
to provide a ``hint'' to the optimizer on speed vs. quality of
result. Moreover, novice users can use an interface to dynamically
change the tradeoff value between relevance and
diversity in the result, and thus visually inspect the result as
they interact with this parameter. This is a great feature for
the end user because finding a good tradeoff value is a very
hard task that depends on several variables (e.g., query
parameters, evaluation algorithms, and dataset properties).
In this demonstration we show a study of the DivDB system
with two image databases that contain many images of the
same object under different settings (e.g., different camera
angle). We show how DivDB helps users iteratively
inspect diversification in the query result, without the need
to know how to tune the many different parameters of the
several existing algorithms in the DivDB system.}

Discovery of Complex Glitch Patterns: A Novel Approach to Quantitative Data Cleaning
Laure Berti-Equille, Tamraparni Dasu, Divesh Srivastava
IEEE International Conference on Data Engineering (ICDE),
2011.
[BIB]
{Data quality mining (DQM) is the use of data mining techniques to
detect, quantify, explain, and correct data quality problems. Current
DQM approaches focus on addressing each category of data
glitch separately. However, in real-life data, different types of data
glitches co-occur in complex patterns. These patterns and interactions
between glitches offer valuable clues for developing effective
data cleaning strategies that are more informed than blind, predefined
strategies.
In this paper, we propose a novel data quality mining framework
for the comprehensive definition, detection and cleaning of complex,
multi-type data glitches. We exploit the distributions and interaction
of different types of glitches to develop data-driven cleaning
strategies that offer significant advantages over blind strategies.
We develop a statistically rigorous framework for glitch scoring,
strategy evaluation and the selection of an optimal strategy from
the space of candidate quantitative cleaning strategies. We demonstrate
the efficacy and scalability of our framework on very large
real and synthetic data sets.}

Bistro Data Feed Management System
Vladislav Shkapenyuk, Divesh Srivastava, Theodore Johnson
ACM SIGMOD Conference,
2011.
[PDF]
[BIB]
ACM Copyright
(c) ACM, 2011. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM SIGMOD Conference. 2011-06-12.
{Data feed management is a critical component of many data intensive applications that depend on reliable data delivery to support real-time data collection, correlation and analysis. Data is typically collected from a wide variety of sources and organizations, using a range of mechanisms - some data are streamed in real time, while other data are obtained at regular intervals or collected in an ad hoc fashion. Individual applications are forced to make separate arrangements with feed providers, learn the structure of incoming files, monitor data quality, and trigger any processing necessary. The Bistro data feed manager, designed and implemented at AT&T Labs- Research, simplifies and automates this complex task of data feed management: efficiently handling incoming raw files, identifying data feeds and distributing them to remote subscribers.
Bistro supports a flexible specification language to define logical data feeds using the naming structure of physical data files, and to identify feed subscribers. Based on the specification, Bistro matches data files to feeds, performs file normalization and compression, efficiently delivers files, and notifies subscribers using a trigger mechanism. We describe our feed analyzer that discovers the naming structure of incoming data files to detect new feeds, dropped feeds, feed changes, or lost data in an existing feed. Bistro is currently deployed within AT&T Labs and is responsible for the real-time delivery of over 100 different raw feeds, distributing data to several large-scale stream warehouses.
}

Automatic Discovery of Attributes in Relational Databases
Meihui Zhang, Beng Chin Ooi, Divesh Srivastava, Cecilia Procopiuc, Marios Hadjieleftheriou
ACM SIGMOD 2011,
2011.
[PDF]
[BIB]
ACM Copyright
(c) ACM, 2011. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM SIGMOD 2011. 2011-06-12. http://www.acm.org/.
{In this work we try to cluster relational columns into attributes, i.e., to identify strong relationships between columns
based on the common properties and characteristics of the values they contain.
For example, identifying whether a certain set of columns refers
to telephone numbers versus social security numbers, or names of customers versus names of employees.
Traditional relational database schema languages use very limited primitive data types and
simple foreign key constraints to express relationships between columns. Object oriented schema
languages allow the definition of custom data types, but still, certain relationships between
columns might be unknown at design time or they might appear only in a particular database instance.
Nevertheless, these relationships are an invaluable tool for schema matching, and generally for
better understanding and working with the data. Here, we introduce data-oriented solutions (we do not consider solutions that
assume the existence of any external knowledge) that use statistical measures to identify strong relationships
between the values of a set of columns. Interpreting the
database as a graph where nodes correspond to database columns and edges correspond to column
relationships, we decompose the graph into connected components and cluster sets of columns into
attributes. To test the quality of our solutions, we also provide a
comprehensive experimental evaluation using real and synthetic datasets.
}
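The graph-decomposition step can be sketched with Jaccard similarity over value sets as the statistical measure (an illustrative stand-in for the measures developed in the paper) and union-find for the connected components:

```python
def cluster_columns(columns, threshold=0.5):
    # columns maps column name -> list of values. Build a graph with an
    # edge between two columns when the Jaccard similarity of their
    # value sets exceeds the threshold, then return the connected
    # components as candidate attributes.
    names = list(columns)
    parent = {n: n for n in names}

    def find(x):  # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            sa, sb = set(columns[a]), set(columns[b])
            if len(sa & sb) / len(sa | sb) > threshold:
                parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), set()).add(n)
    return [frozenset(g) for g in groups.values()]
```

Columns drawing values from the same underlying domain end up in one component, while columns of a different domain form their own singleton cluster.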

Data Auditor: Exploring Data Quality and Semantics using Pattern Tableaux
Lukasz Golab, Howard Karloff, Philip Korn, Divesh Srivastava
2010.
[PDF]
[BIB]
{We present Data Auditor, a tool for exploring data quality and semantics. Data Auditor provides a variety of semantic rule (constraint) classes for asserting consistency and completeness properties of the data. It can be used to test hypotheses by instantiating rules over supplied sets of attributes, to see if such rules hold or fail (to a required degree) on a minimum fraction of the data. Regions of the data for which the rule is satisfied (alternatively, violated) are concisely and meaningfully summarized in a tableau. }

CPM: Adaptive VoD with Cooperative Peer Assist and Multicast
Kadangode Ramakrishnan, Vijay Gopalakrishnan, Rittwik Jana, Divesh Srivastava, Bobby Bhattacharjee
2008.
[PDF]
[BIB]
{We present CPM, a unified approach that exploits server multicast, assisted by peer downloads, to provide efficient video-on-demand (VoD) in a service provider environment. We describe our architecture and show how CPM is designed to dynamically adapt to a wide range of situations including highly different peer-upload bandwidths, content popularity, user request arrival patterns (including flash-crowds), video library size, and subscriber population. We demonstrate the effectiveness of CPM using simulations (based on an actual implementation codebase) across the range of situations described above and show that CPM does significantly better than traditional unicast, different forms of multicast, as well as peer-to-peer schemes. Along with synthetic parameters, we augment our experiments using data from a deployed VoD service to evaluate the performance of CPM.}

Multicast architecture with adaptive dual-state
Kadangode Ramakrishnan, Divesh Srivastava, Michael Rabinovich, Yin Zhang, Tae Won Cho
2007.
[PDF]
[BIB]
{Multicast is an approach that uses network and server resources efficiently to distribute information to groups. As networks evolve to become information-centric, users will increasingly demand publish-subscribe based access to fine-grained information, and multicast will need to evolve to (i) manage an increasing number of groups, with a distinct group for each piece of distributable content; (ii) support persistent group membership, as group activity can vary over time, with intense activity at some times, and infrequent (but still critical) activity at others. These requirements raise scalability challenges that are not met by today's multicast techniques. }

Adaptive Processing Of Top-K Queries In Nested Structure Arbitrary Markup Language Such As XML,
Tue Apr 02 17:25:42 EDT 2013
A method of adaptively evaluating a top-k query involves (1204) forming servers having respective server queues storing candidate answers, processing (1322) the candidate answers, and (1232) providing a top-k set as a query evaluation. Processing includes (1402) adaptively choosing a winning server to whose queue a current candidate answer should be sent; (1404) sending the current candidate answer to the winning server's queue; (1334) adaptively choosing a next candidate answer to process from the winning server's queue; (1336) computing a join between the current candidate answer and next candidate answers at the winning server, so as to produce a new current candidate answer; and (1338) updating the top-k set with the new current candidate answer only if a score of the new current candidate answer exceeds a score of a top-k answer in a top-k set. A method of calculating scores for candidate answers is also provided.
Method And Apparatus For Using Tag Topology,
Tue Nov 20 16:12:25 EST 2012
A method and apparatus for using tag topology for enhancing search capabilities, e.g., searching over the web, are disclosed. For example, the present method receives a user query containing a search term from a user. The method then generates a search result containing at least one entity, wherein the at least one entity is found based on a plurality of user provided tags that are associated with the at least one entity.
Methods And Systems To Store State Used To Forward Multicast Traffic,
Tue Oct 23 16:12:06 EDT 2012
Methods and systems are described to store state used to forward multicast traffic. The system includes a receiving module to receive a request to add a first node to a membership tree. The membership tree includes a first plurality of nodes associated with a multicast group. The system further includes a processing module to identify a second node in the first plurality of nodes and to communicate a node identifier that identifies the first node over a network to the second node. The node identifier is to be stored at the second node to add the first node to the membership tree. The node identifier is further to be stored in the membership tree exclusively at the second node to enable the second node to forward the multicast traffic to the first node.
Systems And Associated Computer Program Products That Disguise Partitioned Data Structures Using Transformations Having Targeted Distributions,
Tue Jun 26 16:10:52 EDT 2012
A data structure that includes at least one partition containing non-confidential quasi-identifier microdata and at least one other partition containing confidential microdata is formed. The partitioned confidential microdata is disguised by transforming the confidential microdata to conform to a target distribution. The disguised confidential microdata and the quasi-identifier microdata are combined to generate a disguised data structure. The disguised data structure is used to carry out statistical analysis and to respond to a statistical query that is directed to the use of confidential microdata. In this manner, the privacy of the confidential microdata is preserved.
Method And Apparatus For Rapid Identification Of Column Heterogeneity,
Tue May 08 16:10:22 EDT 2012
A method and apparatus for rapid identification of column heterogeneity in databases are disclosed. For example, the method receives data associated with a column in a database. The method computes a cluster entropy for the data as a measure of data heterogeneity and then determines whether said data is heterogeneous in accordance with the cluster entropy.
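The cluster-entropy idea can be sketched by clustering values on a coarse character-class signature (an illustrative clustering; the method described covers more general value clusterings):

```python
import math
from collections import Counter

def cluster_entropy(values):
    # Group a column's values by a coarse type signature (digit,
    # letter, or symbol per character) and compute the entropy of the
    # cluster-size distribution: near zero for a homogeneous column,
    # higher when distinct value types are mixed in one column.
    def signature(v):
        return "".join("d" if c.isdigit() else "a" if c.isalpha() else "s"
                       for c in v)
    counts = Counter(signature(v) for v in values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

A column of phone numbers scores near zero, while a column that silently mixes numbers and free-text names scores high, flagging it as heterogeneous.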
Join Paths Across Multiple Databases,
Tue May 08 16:10:21 EDT 2012
Methods, systems and computer instructions on computer readable media are disclosed for optimizing a query, including a first join path, a second join path, and an optimizer, to efficiently provide high quality information from large, multiple databases. The methods and systems include evaluating a schema graph identifying the join paths between a field X and a field Y, and a value X=x, to identify the top-few values of Y=y that are reachable from a specified X=x value when using the join paths. Each data path that instantiates the schema join paths can be scored and evaluated as to the quality of the data with respect to specified integrity constraints to alleviate data quality problems. Agglomerative scoring methodologies can be implemented to compute high quality information in the form of a top-few answers to a specified problem as requested by the query.
Database Analysis Using Clusters,
Tue Apr 17 16:09:59 EDT 2012
A method for mapping relationships in a database results in a cluster graph. A representative sample of records in each of a plurality of tables in the database is analyzed for nearest neighbor join edges instantiated by the record. Records with corresponding nearest neighbor join edges are grouped into clusters. Cluster pairs which share a join relationship between two tables are identified. A weighting may be applied to cluster pairs based on the number of records for the cluster pair. Meaningful cluster pairs above a weighted threshold may be ordered according to table and displayed as a cluster graph. Analyses of the cluster graph may reveal important characteristics of the database.
Method And Apparatus For Associating Metadata With Data,
Tue Mar 27 16:09:38 EDT 2012
Method and apparatus for associating at least one query expression to an original database table is described. In one example, a metadata table is added to a database, wherein at least one portion of the metadata table comprises the at least one query expression. Afterwards, the at least one query expression is associated to at least one value from at least one tuple belonging to a data table of the database.
System And Method For Distributed Content Transformation,
Tue Mar 27 16:09:37 EDT 2012
A distributed transformation network provides delivery of content from a content publisher to a content recipient. Content from the content publisher is received at an entry node of the distributed transformation network and transmitted to a transformation node in the distributed transformation network. The content is transformed according to publisher, recipient or network administrator specifications and transmitting to delivery nodes which deliver the transformed content to the content recipient. The published content may be in an XML-based format and transformed into an XML-related format or any other structured language format as desired in the provided specification.
Method And System For Pattern Matching Having Holistic Twig Joins,
Tue Feb 14 16:09:21 EST 2012
A method of query pattern matching uses a chain of linked stacks to compactly represent partial results to root-to-leaf query paths, which are then composed to obtain matches for the twig pattern.
Multiple Aggregations Over Data Streams,
Tue Feb 14 16:09:20 EST 2012
A system for a data stream management system includes a filter transport aggregate for a high speed input data stream with a plurality of packets each packet comprising attributes. The system includes an evaluation system to evaluate the high speed input data stream and partitions the packets into groups the attributes and a table, wherein the table stores the attributes of each packets using a hash function. A phantom query is used to define partitioned groups of packets using attributes other than those used to group the packets for solving user queries without performing the user queries on the high speed input data stream.
Method Of Pattern Searching,
Tue Feb 14 16:09:20 EST 2012
Structural join mechanisms provide efficient query pattern matching. In one embodiment, tree-merge mechanisms are provided. In another embodiment, stack-tree mechanisms are provided.
System And Method For Generating Statistical Descriptors For A Data Stream,
Tue Jan 24 16:09:05 EST 2012
Described is a system and method for receiving a data stream of multi-dimensional items, collecting a sample of the data stream having a predetermined number of items and dividing the sample into a plurality of subsamples, each subsample corresponding to a single dimension of each of the predetermined number of items. A query is then executed on a particular item in at least two of the subsamples to generate data for the corresponding subsample. This data is combined into a single value.
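One plausible reading of the subsampling step, sketched under the assumptions that "dimension" means a coordinate of each item and that the per-subsample query results are combined by taking a minimum; the abstract does not specify the combination rule, so both are assumptions.

```python
def split_subsamples(sample):
    """Split a sample of d-dimensional items into d one-dimensional
    subsamples, one per coordinate."""
    d = len(sample[0])
    return [[item[k] for item in sample] for k in range(d)]

def rank(subsample, v):
    """Example per-subsample query: the fraction of values at most v."""
    return sum(1 for x in subsample if x <= v) / len(subsample)

sample = [(1, 10), (2, 20), (3, 30)]
subs = split_subsamples(sample)        # [[1, 2, 3], [10, 20, 30]]
combined = min(rank(subs[0], 2), rank(subs[1], 20))   # a single value
```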
Method And Apparatus For Ranked Join Indices,
Tue Sep 20 16:02:15 EDT 2011
A method and apparatus for ranked join indices includes a solution providing performance guarantees for top-k join queries over two relations, when preprocessing to construct a ranked join index for a specific join condition is permitted. The concepts of ranking join indices presented herein are also applicable in the case of a single relation. In this case, the concepts herein provide a solution to the top-k selection problem with monotone linear functions, having guaranteed worst case search performance for the case of two ranked attributes and arbitrary preference vectors.
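The brute-force semantics a ranked join index must reproduce can be stated directly; the index itself precomputes orderings to avoid this full scan, and the weights below are just an example preference vector.

```python
import heapq

def top_k(items, weights, k):
    """Score each item by a monotone linear function of its two ranked
    attributes and return the k highest-scoring items."""
    a, b = weights
    return heapq.nlargest(k, items, key=lambda t: a * t[0] + b * t[1])

print(top_k([(0.9, 0.1), (0.5, 0.5), (0.2, 0.8)], (1.0, 2.0), 2))
# [(0.2, 0.8), (0.5, 0.5)]
```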
Method Of Pattern Searching,
Tue Sep 06 16:02:14 EDT 2011
Structural join mechanisms provide efficient query pattern matching. In one embodiment, tree-merge mechanisms are provided. In another embodiment, stack-tree mechanisms are provided.
System And Method For Providing Structure And Content Scoring For XML,
Tue Aug 23 16:02:13 EDT 2011
A system, method and computer readable medium are disclosed. The method embodiment relates to a method of computing scores of candidate answers to a database query. The method comprises receiving a database query, assigning a first score to a match to the query, the first score being associated with the relative importance of an individual keyword in a collection of documents based on all structural and content predicates in the query, assigning a second score to the match, the second score being associated with the relative importance of a keyword in an individual document, and using the assigned first score and second score to compute an answer score for the query.
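The abstract does not give the scoring formula; a tf-idf-style combination is one natural instantiation of a collection-level first score and a document-level second score. The names and the multiplicative combination below are assumptions, not the patented formula.

```python
import math

def answer_score(doc_freq, n_docs, term_freq, doc_len):
    """Hypothetical combination: the first score grows as the keyword gets
    rarer in the collection, the second as it gets more frequent within
    the matched document; the answer score multiplies them."""
    first = math.log(1 + n_docs / doc_freq)   # collection-level importance
    second = term_freq / doc_len              # document-level importance
    return first * second
```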
Routing XML Queries,
Tue Aug 16 16:02:11 EDT 2011
A vast amount of information currently accessible over the Web, and in corporate networks, is stored in a variety of databases, and is being exported as XML data. However, querying this totality of information in a declarative and timely fashion is problematic because this set of databases is dynamic, and a common schema is difficult to maintain. The present invention provides a solution to the problem of issuing declarative, ad hoc XPath queries against such a dynamic collection of XML databases, and receiving timely answers. There are proposed decentralized architectures, under the open and the agreement cooperation models between a set of sites, for processing queries and updates to XML data. Each site consists of XML data nodes (which export their data as XML, and also pose queries) and one XML router node (which manages the query and update interactions between sites). The architectures differ in the degree of knowledge individual router nodes have about data nodes containing specific XML data. There is therefore provided a method for accessing data over a wide area network comprising: providing a decentralized architecture comprising a plurality of data nodes each having a database, a query processor and a path index, and a plurality of router nodes each having a routing state, maintaining a routing state in each of the router nodes, broadcasting routing state updates from each of the databases to the router nodes, and routing path queries to each of the databases by accessing the routing state.
Meta-Data Indexing For XPath Location Steps,
Tue Jul 12 16:02:11 EDT 2011
In accordance with a method of encoding meta-data associated with tree-structured data, a first set of elements of a plurality of elements in the tree-structured data is associated explicitly with explicit meta-data levels, and a second set of elements of the plurality of elements is associated by inheritance with the explicit meta-data levels of the closest ancestor elements in the first set of elements. The plurality of elements is packed into a plurality of leaf nodes of an index structure. The plurality of leaf nodes is merged into a plurality of non-leaf nodes until a root non-leaf node is generated. The plurality of non-leaf nodes of the index structure is associated with indicators representing ranges of the explicit meta-data levels in the packed first set of elements, such that explicit meta-data level ranges of descendant non-leaf nodes are subsets of explicit meta-data level ranges of ancestor non-leaf nodes.
Set Similarity Selection Queries At Interactive Speeds,
Tue Apr 05 16:02:01 EDT 2011
The similarity between a query set comprising query set tokens and a database set comprising database set tokens is determined by a similarity score. The database sets belong to a data collection set, which contains all database sets from which information may be retrieved. If the similarity score is greater than or equal to a user-defined threshold, the database set has information relevant to the query set. The similarity score is calculated with an inverse document frequency method (IDF) similarity measure independent of term frequency. The document frequency is based at least in part on the number of database sets in the data collection set and the number of database sets which contain at least one query set token. The length of the query set and the length of the database set are normalized.
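A minimal sketch of a term-frequency-independent IDF similarity with length normalization; the exact weighting and normalization used in the patent may differ, and the cosine-style normalization here is an assumption.

```python
import math

def idf_similarity(query_set, db_set, doc_freq, n_sets):
    """Score the shared tokens by log(N / df), ignoring term frequency,
    then normalize by the lengths of the two sets."""
    score = sum(math.log(n_sets / doc_freq[t]) for t in query_set & db_set)
    return score / (math.sqrt(len(query_set)) * math.sqrt(len(db_set)))
```

A query set and a database set are considered relevant to each other when this score reaches a user-defined threshold.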
Method And System For Performing Queries On Data Streams,
Tue Mar 08 16:01:53 EST 2011
A method and system for performing a data stream query. A data stream query requiring a join operation on multiple data streams is approximated without performing the join operation. It is determined whether conditions of the query are proper to accurately approximate the join operation, and if the conditions are proper the join operation is approximated. The join operation is approximated by independently aggregating values of the data streams and comparing the independently aggregated values.
Method And Apparatus For Optimizing Queries Under Parametric Aggregation Constraints,
Tue Mar 08 16:01:53 EST 2011
The present invention relates to a method and apparatus for optimizing queries. The present invention discloses an efficient method for providing answers to queries under parametric aggregation constraints.
Meta-Data Indexing For XPath Location Steps,
Tue Dec 07 15:50:49 EST 2010
Techniques are disclosed that efficiently support the querying of meta-data in XML documents. The techniques include efficiently identifying XML elements along each location step in an XPath query that satisfy range constraints on ordered meta-data. The techniques include generating an inheritance meta-data index in which actual meta-data levels are associated only with elements for which a value is explicitly specified and associating non-leaf nodes of the index structure with inherited meta-data levels and inheritance source nodes. The techniques may be used with navigation-based and join-based XPath evaluation strategies.
Method And Apparatus For Packet Analysis In A Network,
Tue Nov 09 15:50:00 EST 2010
A method and system for monitoring traffic in a data communication network and for extracting useful statistics and information is disclosed.
Method And Apparatus For Optimizing Queries Under Parametric Aggregation Constraints,
Tue Feb 23 15:50:23 EST 2010
The present invention relates to a method and apparatus for optimizing queries. The present invention discloses an efficient method for providing answers to queries under parametric aggregation constraints.
Routing XML Queries,
Tue Feb 16 15:50:22 EST 2010
A vast amount of information currently accessible over the Web, and in corporate networks, is stored in a variety of databases, and is being exported as XML data. However, querying this totality of information in a declarative and timely fashion is problematic because this set of databases is dynamic, and a common schema is difficult to maintain. The present invention provides a solution to the problem of issuing declarative, ad hoc XPath queries against such a dynamic collection of XML databases, and receiving timely answers. There are proposed decentralized architectures, under the open and the agreement cooperation models between a set of sites, for processing queries and updates to XML data. Each site consists of XML data nodes (which export their data as XML, and also pose queries) and one XML router node (which manages the query and update interactions between sites). The architectures differ in the degree of knowledge individual router nodes have about data nodes containing specific XML data. There is therefore provided a method for accessing data over a wide area network comprising: providing a decentralized architecture comprising a plurality of data nodes each having a database, a query processor and a path index, and a plurality of router nodes each having a routing state, maintaining a routing state in each of the router nodes, broadcasting routing state updates from each of the databases to the router nodes, and routing path queries to each of the databases by accessing the routing state.
Method And Apparatus For Ranked Join Indices,
Tue Feb 16 15:50:21 EST 2010
A method and apparatus for ranked join indices includes a solution providing performance guarantees for top-k join queries over two relations, when preprocessing to construct a ranked join index for a specific join condition is permitted. The concepts of ranking join indices presented herein are also applicable in the case of a single relation. In this case, the concepts herein provide a solution to the top-k selection problem with monotone linear functions, having guaranteed worst case search performance for the case of two ranked attributes and arbitrary preference vectors.
System And Method For Generating Statistical Descriptors For A Data Stream,
Tue Feb 02 15:50:19 EST 2010
Described is a system and method for receiving a data stream of multi-dimensional items, collecting a sample of the data stream having a predetermined number of items and dividing the sample into a plurality of subsamples, each subsample corresponding to a single dimension of each of the predetermined number of items. A query is then executed on a particular item in at least two of the subsamples to generate data for the corresponding subsample. This data is combined into a single value.
Join Paths Across Multiple Databases,
Tue Dec 08 15:38:00 EST 2009
Methods, systems and computer instructions on computer readable media are disclosed for optimizing a query, including a first join path, a second join path, and an optimizer, to efficiently provide high quality information from large, multiple databases. The methods and systems include evaluating a schema graph identifying the join paths between a field X and a field Y, and a value X=x, to identify the top-few values of Y=y that are reachable from a specified X=x value when using the join paths. Each data path that instantiates the schema join paths can be scored and evaluated as to the quality of the data with respect to specified integrity constraints to alleviate data quality problems. Agglomerative scoring methodologies can be implemented to compute high quality information in the form of a top-few answers to a specified problem as requested by the query.
Multiple Aggregations Over Data Streams,
Tue Dec 08 15:38:00 EST 2009
A system for a data stream management system includes a filter transport aggregate for a high-speed input data stream with a plurality of packets, each packet comprising attributes. The system includes an evaluation system that evaluates the high-speed input data stream and partitions the packets into groups by their attributes, and a table, wherein the table stores the attributes of each packet using a hash function. A phantom query is used to define partitioned groups of packets using attributes other than those used to group the packets, for solving user queries without performing the user queries on the high-speed input data stream.
Method And Systems For Content Access And Distribution,
Tue Nov 24 15:38:50 EST 2009
Distribution of content between publishers and consumers is accomplished using an overlay network that may make use of XML language to facilitate content identification. The overlay network includes a plurality of routers that may be in communication with each other and the publishers and consumers on the Internet. Content and queries are identified by content descriptors that are routed from the originator to a nearest router in the overlay network. The nearest router, for each unique content descriptor, generates a hash identification of the content descriptor which is used by remaining routers in the overlay network to provide the appropriate functions with respect to the content descriptor. In particular, this allows all routers in the overlay network except the nearest router to properly route content without processing every content descriptor.
System and Method For Identifying Hierarchical Heavy Hitters In A Multidimensional Environment,
Tue Sep 15 15:38:46 EDT 2009
A method including receiving a plurality of elements of a data stream; storing a multi-dimensional data structure in a memory, said multi-dimensional data structure storing the plurality of elements as a hierarchy of nodes, each node having a frequency count corresponding to the number of elements stored therein; comparing the frequency count of each node to a threshold value based on the total number of elements stored in the nodes; identifying each node for which the frequency count is at least as great as the threshold value as a hierarchical heavy hitter (HHH) node; and propagating the frequency count of each non-HHH node to its corresponding parent node.
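A bottom-up sketch of the propagation rule on a prefix hierarchy, with items as tuples (e.g. IP address octets) and the parent obtained by dropping the last component; the streaming, memory-bounded aspects of the patented structure are omitted.

```python
from collections import defaultdict

def hhh(items, phi):
    """Count leaves, then walk the hierarchy bottom-up: a node whose
    (propagated) count reaches phi * N is reported as a hierarchical
    heavy hitter; a non-HHH node's count rolls up to its parent prefix."""
    n = len(items)
    counts = defaultdict(int)
    for it in items:
        counts[it] += 1
    result = {}
    for depth in range(max(len(k) for k in counts), 0, -1):
        for node in [k for k in counts if len(k) == depth]:
            c = counts.pop(node)
            if c >= phi * n:
                result[node] = c            # report as HHH
            else:
                counts[node[:-1]] += c      # propagate to parent
    root = counts.pop((), 0)
    if root >= phi * n:
        result[()] = root                   # () is the hierarchy root
    return result

print(hhh([(1, 1), (1, 1), (1, 2), (2, 3)], 0.5))
# {(1, 1): 2, (): 2}
```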
Method For Using Query Templates In Directory Caches,
Tue Apr 21 15:38:38 EDT 2009
The present invention discloses the use of generalized queries, referred to as query templates, obtained by generalizing individual user queries, as the semantic basis for low overhead, high benefit directory caches for handling declarative queries. Caching effectiveness can be improved by maintaining a set of generalizations of queries and admitting such generalizations into the cache when their estimated benefits are sufficiently high. In a preferred embodiment of the invention, the admission of query templates into the cache can be done in what is referred to by the inventors as a "revolutionary" fashion--followed by stable periods where cache admission and replacement can be done incrementally in an evolutionary fashion. The present invention can lead to considerably higher hit rates and lower server-side execution and communication costs than conventional caching of directory queries--while keeping the client-side computational overheads comparable to query caching.
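The generalize-and-cache idea can be sketched as follows; the query representation, the wildcard template form, and the containment test are illustrative assumptions rather than the patented admission policy.

```python
def generalize(query):
    """Templates obtained by wildcarding one attribute of the query."""
    for attr in query:
        t = dict(query)
        t[attr] = "*"
        yield t

class TemplateCache:
    """Caches answers to generalized queries (templates); a template
    answers any later query that refines it, so that query is evaluated
    locally against the cached rows instead of at the server."""
    def __init__(self):
        self.entries = []                # list of (template, cached rows)

    def admit(self, template, rows):
        self.entries.append((template, rows))

    def lookup(self, query):
        for t, rows in self.entries:
            if all(t.get(a) in ("*", v) for a, v in query.items()):
                return [r for r in rows
                        if all(r.get(a) == v for a, v in query.items())]
        return None                      # cache miss: go to the server
```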
Method And System For Pattern Matching Having Holistic Twig Joins,
Tue Jan 27 15:38:35 EST 2009
A method of query pattern matching uses a chain of linked stacks to compactly represent partial results to root-to-leaf query paths, which are then composed to obtain matches for the twig pattern.
Method and system for pattern matching having holistic twig joins,
Tue Nov 18 18:13:16 EST 2008
A method of query pattern matching uses a chain of linked stacks to compactly represent partial results to root-to-leaf query paths, which are then composed to obtain matches for the twig pattern.
Method and apparatus for packet analysis in a network,
Tue Nov 11 18:13:14 EST 2008
A method and system for monitoring traffic in a data communication network and for extracting useful statistics and information is disclosed. In accordance with an embodiment of the invention, a network interface card has a run-time system and one or more processing blocks executing on the network interface. The run-time system module feeds information derived from a network packet to the processing modules which process the information and generate output such as condensed statistics about the packets traveling through the network.
Method of pattern searching,
Tue Nov 11 18:13:14 EST 2008
Structural join mechanisms provide efficient query pattern matching. In one embodiment, tree-merge mechanisms are provided. In another embodiment, stack-tree mechanisms are provided.
Method of performing approximate substring indexing,
Tue Oct 28 18:13:06 EDT 2008
Approximate substring indexing is accomplished by decomposing each string in a database into overlapping positional q-grams, sequences of a predetermined length q, and containing information regarding the position of each q-gram within the string (i.e., 1st q-gram, 4th q-gram, etc.). An index is then formed of the tuples of the positional q-gram data (such as, for example, a B-tree index or a hash index). Each query applied to the database is similarly parsed into a plurality of positional q-grams (of the same length), and a candidate set of matches is found. Position-directed filtering is used to remove the candidates which have the q-grams in the wrong order and/or too far apart to form a verified output of matching candidates. If errors are permitted (defined in terms of an edit distance between each candidate and the query), an edit distance calculation can then be performed to produce the final set of matching strings.
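The decomposition and index construction can be sketched directly; the position-directed filtering and edit-distance verification stages are omitted.

```python
from collections import defaultdict

def positional_qgrams(s, q):
    """Decompose a string into overlapping q-grams tagged with their
    starting position."""
    return [(i, s[i:i + q]) for i in range(len(s) - q + 1)]

def build_index(strings, q):
    """Inverted index from each q-gram to (string id, position) tuples;
    a B-tree or hash index would play this role on disk."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for pos, gram in positional_qgrams(s, q):
            index[gram].append((sid, pos))
    return index

print(positional_qgrams("abcd", 2))
# [(0, 'ab'), (1, 'bc'), (2, 'cd')]
```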
Phrase matching in documents having nested-structure arbitrary (document-specific) markup,
Tue Apr 08 18:12:43 EDT 2008
A method of searching a document having nested-structure document-specific markup (such as Extensible Markup Language (XML)) involves receiving a query that designates at least (A) a phrase to be matched in a phrase matching process, and (B) a selective designation of at least a tag or annotation that is to be ignored during the phrase matching process. The method further involves deriving query-specific indices based on query-independent indices that were created specific to each document, and carrying out the phrase matching process using the query-specific indices on the document having the nested-structure document-specific markup.
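A direct (index-free) sketch of matching a phrase while ignoring only the designated tags; the patent's query-specific indices achieve this without rescanning the document, so the regular-expression stripping here is just an illustration of the semantics.

```python
import re

def phrase_match(doc, phrase, ignore_tags):
    """Delete only the designated tags (open and close forms, with or
    without attributes), then look for the phrase in what remains."""
    for tag in ignore_tags:
        doc = re.sub(r"</?%s(\s[^>]*)?>" % re.escape(tag), "", doc)
    return phrase in doc

print(phrase_match("a <b>quick</b> fox", "a quick", ["b"]))
# True: the <b> tags are ignored, so the phrase spans them
```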
Method and system for pattern matching having holistic twig joins,
Tue May 15 18:12:02 EDT 2007
A method of query pattern matching uses a chain of linked stacks to compactly represent partial results to root-to-leaf query paths, which are then composed to obtain matches for the twig pattern.
Method and apparatus for ranked join indices,
Tue Feb 27 18:11:56 EST 2007
A method and apparatus for ranked join indices includes a solution providing performance guarantees for top-k join queries over two relations, when preprocessing to construct a ranked join index for a specific join condition is permitted. The concepts of ranking join indices presented herein are also applicable in the case of a single relation. In this case, the concepts herein provide a solution to the top-k selection problem with monotone linear functions, having guaranteed worst case search performance for the case of two ranked attributes and arbitrary preference vectors.
Method and apparatus for packet analysis in a network,
Tue Jan 16 18:11:49 EST 2007
A method and system for monitoring traffic in a data communication network and for extracting useful statistics and information is disclosed. In accordance with an embodiment of the invention, a network interface card has a run-time system and one or more processing blocks executing on the network interface. The run-time system module feeds information derived from a network packet to the processing modules which process the information and generate output such as condensed statistics about the packets traveling through the network.
Method of performing approximate substring indexing,
Tue Mar 07 18:11:01 EST 2006
Approximate substring indexing is accomplished by decomposing each string in a database into overlapping positional q-grams, sequences of a predetermined length q, and containing information regarding the position of each q-gram within the string (i.e., 1st q-gram, 4th q-gram, etc.). An index is then formed of the tuples of the positional q-gram data (such as, for example, a B-tree index or a hash index). Each query applied to the database is similarly parsed into a plurality of positional q-grams (of the same length), and a candidate set of matches is found. Position-directed filtering is used to remove the candidates which have the q-grams in the wrong order and/or too far apart to form a verified output of matching candidates. If errors are permitted (defined in terms of an edit distance between each candidate and the query), an edit distance calculation can then be performed to produce the final set of matching strings.
Distributed evaluation of directory queries using a topology cache,
Tue Dec 27 18:10:44 EST 2005
A technique for performing query evaluation on distributed directories utilizes the creation of a topology cache defining the hierarchical relationship between the various directory servers (i.e., identifying subordinate and superior knowledge references associated with each directory server in the system). The created topology cache is then stored at each directory server, and forwarded to a client upon submitting a query to the system. Using the topology cache information at the client, a distributed query evaluation plan can be developed for use with complex queries, such as hierarchical and aggregate queries.
Method For Using Query Templates In Directory Caches,
Tue Jun 07 18:10:23 EDT 2005
The present invention discloses the use of generalized queries, referred to as query templates, obtained by generalizing individual user queries, as the semantic basis for low overhead, high benefit directory caches for handling declarative queries. Caching effectiveness can be improved by maintaining a set of generalizations of queries and admitting such generalizations into the cache when their estimated benefits are sufficiently high. In a preferred embodiment of the invention, the admission of query templates into the cache can be done in what is referred to by the inventors as a revolutionary fashion—followed by stable periods where cache admission and replacement can be done incrementally in an evolutionary fashion. The present invention can lead to considerably higher hit rates and lower server-side execution and communication costs than conventional caching of directory queries—while keeping the client-side computational overheads comparable to query caching.
Method For Effective Indexing Of Partially Dynamic Documents,
Tue Aug 12 18:08:49 EDT 2003
A method more efficiently indexes dynamic documents. The method adjusts the frequency with which dynamic documents are retrieved taking into account the extent to which the document varies between its most recent retrievals. Furthermore, the method selects portions of the document to be indexed based on the substance of the differences between recently retrieved copies.
Messaging system with application-defined states,
Tue Aug 27 18:08:25 EDT 2002
A messaging system in which a core messaging infrastructure stores and manages messaging attributes, but applications external to the core infrastructure define and modify most attributes. Attribute types may be easily defined or modified, the manner in which attribute values are obtained may be easily defined or modified, and the entity types to which attributes are assigned may be easily defined or modified. The messaging system includes a plurality of messaging entities, such as messages, folders, and users, a plurality of attributes associated with the messaging entities, and a plurality of applications. Each application is operable to examine and modify at least some of the messaging entities and attributes. An application selection device is operable to examine at least some of the messaging entities and at least some of the attributes and to select an application to be invoked, from among the plurality of applications, based on values of the examined messaging entities and attributes. An application invocation device invokes the selected application. The applications may define and modify a type of an attribute and/or may define and modify an association of an attribute with a messaging entity.
Declarative message addressing,
Tue Aug 20 18:08:25 EDT 2002
A messaging system, and method of operation thereof, which supports combinations of directory and mailing list addressing mechanisms. Intended message recipients are specified as declarative addresses, which may include combinations of directory and mailing list information. The messaging system includes a messaging server and an address resolution module. The messaging server receives a message from a sender system and transmits the message to the recipient system. The address resolution module, which is coupled to the messaging server, receives a declarative address associated with the message, resolves the declarative address into at least one messaging address and transmits the at least one messaging address to the messaging server. In one embodiment, a database system may be coupled to the address resolution module to allow address resolution based on information stored in a database. The address resolution module generates a database query based on the declarative address and transmits the generated query to a database system. The database system receives a database query, retrieves at least one messaging address specified by the query and transmits the retrieved at least one messaging address to the address resolution module.
Method for effective indexing of partially dynamic documents,
Tue Aug 13 18:08:24 EDT 2002
A method more efficiently indexes dynamic documents. The method adjusts the frequency with which dynamic documents are retrieved taking into account the extent to which the document varies between its most recent retrievals. Furthermore, the method selects portions of the document to be indexed based on the substance of the differences between recently retrieved copies.
Method and apparatus for substring selectivity estimation,
Tue Jun 04 18:08:12 EDT 2002
A method for estimating string-occurrence probability in a database comprises receiving a first probability of occurrence for each maximal substring from a plurality of maximal substrings of the string; obtaining an overall probability of occurrence; receiving a probability of occurrence for a maximal overlap of each maximal substring in the plurality of maximal substrings; obtaining a normalization factor; and dividing the overall probability of occurrence by the normalization factor to obtain the estimate.
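The estimate can be written out concretely, assuming the normalization factor is the product of the probabilities of the maximal overlaps between consecutive maximal substrings (the longest suffix of one that is a prefix of the next); the probabilities themselves are supplied externally, e.g. from a pruned suffix tree.

```python
def mo_estimate(prob, maximal_substrings):
    """Maximal-overlap estimate: multiply the probabilities of the maximal
    substrings, then divide by the probabilities of the overlaps between
    consecutive ones (the normalization factor)."""
    overall = 1.0
    for s in maximal_substrings:
        overall *= prob[s]
    norm = 1.0
    for a, b in zip(maximal_substrings, maximal_substrings[1:]):
        # longest suffix of a that is also a prefix of b
        ov = max((a[i:] for i in range(1, len(a)) if b.startswith(a[i:])),
                 key=len, default="")
        if ov:
            norm *= prob[ov]
    return overall / norm

# e.g. with prob = {"abc": 0.1, "bcd": 0.2, "bc": 0.5}, the estimate for
# ["abc", "bcd"] is 0.1 * 0.2 / 0.5 (the overlap "bc" is divided out).
```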
Method of clustering electronic documents in response to a search query,
Tue Mar 26 18:07:36 EST 2002
A method of presenting clusters of documents in response to a search query where the documents within a cluster are determined to be related to one another. This relationship is assessed by comparing documents which match one or more terms in the query to determine the extent to which the documents have commonality with respect to terms appearing infrequently in the collection of documents. As a consequence, the cluster of documents represents a response or query result that is split across multiple documents. In a further variation the cluster can be constituted by a structured document and an unstructured document.
Method for providing more informative results in response to a search of electronic documents,
Tue Jan 08 18:07:20 EST 2002
A method provides a more informative result to a user in connection with the search for documents in a database. In particular, the method provides augmented addresses, in the Internet environment augmented universal resource locators, which include an indication of a document attribute which may be of interest to the user. Such attributes may include an indication of the language of the document (e.g., English or Japanese) or the popularity of the document.
Declarative Message Addressing,
Tue Apr 03 18:07:01 EDT 2001
A messaging system, and method of operation thereof, which supports combinations of directory and mailing list addressing mechanisms. Intended message recipients are specified as declarative addresses, which may include combinations of directory and mailing list information. The messaging system includes a messaging server and an address resolution module. The messaging server receives a message from a sender system and transmits the message to the recipient system. The address resolution module, which is coupled to the messaging server, receives a declarative address associated with the message, resolves the declarative address into at least one messaging address and transmits the at least one messaging address to the messaging server. In one embodiment, a database system may be coupled to the address resolution module to allow address resolution based on information stored in a database. The address resolution module generates a database query based on the declarative address and transmits the generated query to a database system. The database system receives a database query, retrieves at least one messaging address specified by the query and transmits the retrieved at least one messaging address to the address resolution module.
Method of clustering electronic documents in response to a search query,
Tue Dec 26 18:06:00 EST 2000
A method of presenting clusters of documents in response to a search query where the documents within a cluster are determined to be related to one another. This relationship is assessed by comparing documents which match one or more terms in the query to determine the extent to which the documents have commonality with respect to terms appearing infrequently in the collection of documents. As a consequence, the cluster of documents represents a response or query result that is split across multiple documents. In a further variation the cluster can be constituted by a structured document and an unstructured document.
Method for using region-sets to focus searches in hierarchical structures,
Tue Oct 17 18:06:52 EDT 2000
A method improves a search in a hierarchical structure by focusing the search to selected regions within the structure. The method defines one or more region-sets and uses the region-set(s) as either a filter for the results of a key-word search or an integrated part of a search engine to increase the efficiency of the search engine. The method also provides for dynamic creation of new region-sets from existing region-sets using a prescribed set of operators.
Method for improving the results of a search in a structured database,
Tue Jun 06 18:05:32 EDT 2000
A method enhances the presentation of search results from a structured database. In accordance with the method, a search query including two or more attribute/value pairs is presented to a system. The system then identifies a plurality of records which each minimally match the search query. Each document or record in the plurality of identified records is assigned a weight based on at least two factors: the extent to which the record matches the entire search query; and the relative frequency with which the attribute/value pair that matches the given record matches the records of the remainder of the structured database. The plurality of records that minimally match the search query are then identified to the requester in ranked order based on the assigned weights.
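An illustrative weighting that reflects the two stated factors, using a log-inverse-frequency weight per matched attribute/value pair; the abstract does not give the actual weight function, so the formula and names below are assumptions.

```python
import math

def rank_records(records, query_pairs):
    """Weight each record by how much of the query it matches, counting
    rarer attribute/value pairs (those matching fewer records) more."""
    n = len(records)
    freq = {(a, v): sum(1 for r in records if r.get(a) == v)
            for a, v in query_pairs}
    def weight(rec):
        return sum(math.log(n / freq[(a, v)])
                   for a, v in query_pairs
                   if rec.get(a) == v and freq[(a, v)])
    return sorted(records, key=weight, reverse=True)

records = [{"color": "red", "size": "L"},
           {"color": "red", "size": "M"},
           {"color": "blue", "size": "L"}]
ranked = rank_records(records, [("color", "red"), ("size", "L")])
# the record matching both pairs ranks first
```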
Method for providing more informative results in response to a search of electronic documents,
Tue May 30 18:05:32 EDT 2000
A method provides a more informative result to a user in connection with the search for documents in a database. In particular, the method provides augmented addresses, in the Internet environment augmented universal resource locators, which include an indication of a document attribute which may be of interest to the user. Such attributes may include an indication of the language of the document (e.g., English or Japanese) or the popularity of the document.
Sender-paid electronic messaging,
Tue Apr 04 18:05:31 EDT 2000
The present invention is a messaging system, and method of operation thereof, which provides message recipients with control over the delivery of messages and charges the cost of a message to the sender of the message. A message is received at a messaging server from a sender system, the message including an indication of a recipient system. A notification message is transmitted to the recipient system, allowing the message recipient to determine whether they desire the message to be delivered. If so, an activation message is received from the recipient system and the message is transmitted to the recipient system. A charge for the message is assessed to the sender of the message. The message is stored in the messaging server until the activation message is received. At least a portion of the assessed charge may be credited or debited to the recipient of the message. The message may include any type of electronic information, such as text, graphics, video and audio information, and may be encrypted or unencrypted.
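The store-until-activated, charge-the-sender flow can be sketched as a small state machine. This is a minimal illustration under assumed names (`MessagingServer`, `send`, `activate`); it omits notification transport, encryption, and recipient crediting.

```python
class MessagingServer:
    """Sketch of the sender-paid flow: messages are held at the server and
    the sender is charged only when the recipient activates delivery."""

    def __init__(self, price=1.0):
        self.price = price
        self.pending = {}   # message_id -> (sender, recipient, body)
        self.charges = {}   # sender -> total assessed charges
        self.next_id = 0

    def send(self, sender, recipient, body):
        """Store the message; nothing is delivered or charged yet.
        The returned id stands in for the notification to the recipient."""
        mid = self.next_id
        self.next_id += 1
        self.pending[mid] = (sender, recipient, body)
        return mid

    def activate(self, mid):
        """Recipient opts in: deliver the stored message and charge the sender."""
        sender, recipient, body = self.pending.pop(mid)
        self.charges[sender] = self.charges.get(sender, 0.0) + self.price
        return body
```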
Cost-based maintenance of materialized views,
Tue Feb 15 18:05:30 EST 2000
A method of incrementally maintaining a first materialized view of data in a database, by means of an additional materialized view, first determines whether a cost in time of incrementally maintaining the first materialized view with the additional materialized view is less than the cost of incrementally maintaining the first materialized view without the additional materialized view. The method creates the additional materialized view only if the cost in time is less therewith. Determining whether the cost of employing an additional materialized view is less includes using an expression directed acyclic graph that corresponds to the first materialized view. Another method of determining whether the cost is less includes pruning an expression directed acyclic graph to produce a single expression tree, and using the single expression tree to determine whether the cost is less. Both the expression directed acyclic graph and the single expression tree contain equivalence nodes. One or more possible materialized views are selected by marking the equivalence nodes, and materializing one or more views corresponding to the marked equivalence nodes. One or more possible materialized views are also selected by determining which of the views, if materialized, would result in a lowest cost of incrementally maintaining the first materialized view. The method is also used to reduce the cost in time of maintaining a first materialized view employed to check an integrity constraint of the database.
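The selection step at the end — materialize the candidate auxiliary view that would yield the lowest maintenance cost for the first view, and only if it beats maintaining the first view alone — can be sketched as follows. The cost model itself (DAG-based estimation) is abstracted away into precomputed numbers here; names are illustrative.

```python
def choose_auxiliary_view(candidate_costs, base_cost):
    """candidate_costs: {view_name: estimated cost in time of maintaining the
    first materialized view if this auxiliary view were also materialized}.
    base_cost: cost of maintaining the first view with no auxiliary view.
    Returns the best candidate, or None when no candidate lowers the cost."""
    if not candidate_costs:
        return None
    best = min(candidate_costs, key=candidate_costs.get)
    return best if candidate_costs[best] < base_cost else None
```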
Method of calculating tuples for data cubes,
Tue Nov 16 18:05:25 EST 1999
A method and apparatus for calculating data cubes are shown in which a data set is partitioned into memory-sized data fragments and cuboid tuples are calculated from the data fragments. A search lattice of the data cube is used as a basis for ordering calculations of lower-dimensional cuboids in the data cube. Identification of a minimum number of paths through the lattice sufficient to traverse all nodes in the lattice is achieved by iteratively making two duplicates of all paths in a lower-dimensional space, distributing a new attribute to the first duplicate, moving end points from paths of the second duplicate to the corresponding path in the first duplicate, and merging the first and second duplicates. Joint work with Columbia University.
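One plausible reading of the duplicate / distribute-attribute / move-endpoint / merge construction is a symmetric chain decomposition of the subset lattice, which is known to yield a minimum path cover. The sketch below implements that reading; the function name and the exact labeling of the two duplicates are my assumptions, not the patent's.

```python
def symmetric_chains(n):
    """Minimum set of paths (chains) covering all subsets of {0,...,n-1} in
    the cuboid lattice. Each step duplicates the existing paths, extends one
    copy with the new attribute, and moves that copy's endpoint so the two
    copies merge into valid chains."""
    if n == 0:
        return [[frozenset()]]
    chains = []
    for c in symmetric_chains(n - 1):
        extended = [s | {n - 1} for s in c]   # duplicate with the new attribute
        # Move the endpoint of the extended duplicate onto the original path.
        chains.append(c + [extended[-1]])
        if len(extended) > 1:
            chains.append(extended[:-1])
    return chains
```

For n = 3 this produces 3 paths that together visit all 8 cuboids, with consecutive cuboids on a path differing by exactly one attribute, so each path can be computed by a sequence of single-attribute aggregations.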
Method for using region-sets to focus searches in hierarchical structures,
Tue Oct 19 18:05:24 EDT 1999
A method improves a search in a hierarchical structure by focusing the search on selected regions within the structure. The method defines one or more region-sets and uses the region-set(s) either as a filter for the results of a keyword search or as an integrated part of a search engine, to increase the efficiency of the search engine. The method also provides for dynamic creation of new region-sets from existing region-sets using a prescribed set of operators.
Method for effective indexing of partially dynamic documents,
Tue Sep 21 01:05:24 EDT 1999
A method more efficiently indexes dynamic documents. The method adjusts the frequency with which dynamic documents are retrieved taking into account the extent to which the document varies between its most recent retrievals. Furthermore, the method selects portions of the document to be indexed based on the substance of the differences between recently retrieved copies.
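The retrieval-frequency adjustment can be sketched as a simple feedback rule: re-fetch sooner when recent copies of a document differ substantially, and back off when they are nearly identical. The thresholds and doubling/halving rule below are my assumptions, not values from the patent.

```python
def next_interval(current_interval, change_fraction,
                  min_interval=3600, max_interval=30 * 86400):
    """Adjust the re-retrieval interval (seconds) from the fraction of the
    document that changed between its two most recent retrievals."""
    if change_fraction > 0.2:        # document varies substantially: fetch sooner
        interval = current_interval / 2
    elif change_fraction < 0.01:     # effectively static between fetches: back off
        interval = current_interval * 2
    else:
        interval = current_interval
    return max(min_interval, min(max_interval, interval))
```

The second part of the method, indexing only the changed portions, would then operate on the diff of the two most recent copies rather than the whole document.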
Method and system for using materialized views to evaluate queries involving aggregation,
Tue Apr 27 18:05:10 EDT 1999
Data warehouses are of considerable interest to AT&T, for supporting both traditional telephony services and Internet offerings such as WorldNet. Analyzing the large volumes of data stored in a warehouse using aggregate SQL queries is extremely useful for understanding and improving the quality of services supported by the warehouse. However, the sheer volume of detailed transaction data makes it impossible to store detailed records on disk over a long period of time. It is more feasible to maintain aggregated materialized views (for example, total revenue on a monthly basis for each customer) on disk, and the detailed transaction data on slower media such as tape. Our invention presents novel algorithms for determining whether the materialized views on disk can be used to speed up the computation of the answers to the aggregate SQL queries posed by the analysts, instead of using only the detailed transaction data to answer the queries.
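A simplified version of the core test — can an aggregate materialized view answer a query instead of the detailed data on tape — checks that the query groups by a subset of the view's grouping attributes and that the aggregate re-aggregates correctly. This is a sketch of one sufficient condition, not the patent's full matching algorithm.

```python
def view_answers_query(view_groupby, query_groupby, agg):
    """True when an aggregate view (e.g. monthly revenue per customer) can be
    rolled up to answer a coarser query. Requires the query's grouping
    attributes to be a subset of the view's, and a re-aggregatable function
    (AVG must instead be derived from SUM and COUNT)."""
    return set(query_groupby) <= set(view_groupby) and agg in {"SUM", "COUNT", "MIN", "MAX"}
```

For example, a view grouped by (customer, month) with SUM(revenue) can answer total revenue per customer, but a view grouped only by customer cannot answer a per-month query: the detail has already been aggregated away.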
ACM Fellow, 2011.
For contributions to query processing in data management systems.