
180 Park Ave - Building 103
Florham Park, NJ
Structure Aware Sampling on Data Streams
Edith Cohen, Graham Cormode, Nicholas Duffield
ACM Sigmetrics 2011 Conference,
2011.
[PDF]
[BIB]
ACM Copyright
(c) ACM, 2011. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM SIGmetrics 2011 Conference , 2011-06-07.
{The massive data streams observed in network monitoring, data processing and
scientific studies are typically too large to store. For many applications over such data, we must obtain compact summaries of the stream. These summaries should allow accurate answering of post hoc queries with estimates which approximate the true answers over the original stream. The data often has an underlying structure which makes certain subset queries, in particular {em range queries}, more relevant than arbitrary subsets. Applications such as access control, change detection, and heavy hitters typically involve subsets that are ranges
or unions thereof.
Random sampling is a natural summarization tool, being easy to implement and flexible to use. Known sampling methods are good for arbitrary queries but fail to optimize for the common case of range queries. Meanwhile, specialized summarization algorithms have been proposed for range-sum queries and related problems. These can outperform sampling giving fixed space resources, but lack
its flexibility and simplicity. Particularly, their accuracy degrades when queries span multiple ranges.
We define new stream sampling algorithms with a smooth and tunable trade-off between accuracy on range-sum queries and arbitrary subset-sum queries. The technical key is to relax requirements on the variance over all subsets to enable better performance on the ranges of interest. This boosts the accuracy on range queries while retaining the prime benefits of sampling, in particular flexibility and accuracy, with tail bounds having optimality guarantees. Our experimental study indicates that structure-aware summaries drastically improve range-sum accuracy with respect to state-of-the-art stream sampling algorithms and outperform deterministic methods on range-sum queries and hierarchical heavy
hitter queries.}

Efficient stream sampling for variance-optimal estimation of subset sums
Edith Cohen, Nicholas Duffield, Carsten Lund, Mikkel Thorup, Haim Kaplan
SIAM Journal on Computing,
2011.
[PDF]
[BIB]
SIAM (Society fir Industrial and Applied Mathematics) Copyright
The definitive version was published in SIAM Journal on Computing. , 2011-06-01
{From a high volume stream of weighted items, we want to maintain a generic sample of a certain limited size k that we can later use to estimate the total weight of arbitrary subsets. This is the classic context of on-line reservoir sampling, thinking of the generic sample as a reservoir. We present an efficient reservoir sampling scheme, VAROPTk, that dominates all previous schemes in terms of estimation quality. VAROPTk provides variance optimal unbiased estimation of subset sums. More precisely, if we have seen n items of the stream, then for any subset size m, our scheme based on k samples minimizes the average variance over all subsets of size m. In fact, the optimality is against any off-line scheme with
k samples tailored for the concrete set of items seen. In addition to optimal average variance, our scheme provides tighter worst-case bounds on the variance of particular subsets than previously possible. It is efficient, handling each new item of the stream in O(log k) time. Finally, it is particularly well suited for
combination of samples from different streams in a distributed setting.
}
Method And Apparatus For Improving End To End Performance Of A Data Network,
Tue Nov 29 16:06:40 EST 2011
A request is received at a resource server for a first resource, the request accompanied by a proxy filter. A second resource is identified based on the proxy filter and based on a relationship between the first resource and the second resource. The first resource and information regarding the second resource is provided to a network interface for communication to a proxy server.
Sampling And Analyzing Packets In A Network,
Tue Dec 14 15:05:20 EST 2010
The preferred embodiments of the present invention can include sampling packets transmitted over a network based on the content of the packets. If a packet is sampled, the sampling unit can add one or more fields to the sampled packet that can include a field for a number of bytes contained in the packet, a packet count, a flow count, a sampling type, and the like. The sampled packets can be analyzed to discern desired information from the packets. The additional fields that are added to the sampled packets can be used during the analysis.
Algorithms And Estimators For Summarization Of Unaggregated Data Streams,
Tue Jul 27 15:04:14 EDT 2010
The invention relates to streaming algorithms useful for obtaining summaries over unaggregated packet streams and for providing unbiased estimators for characteristics, such as, the amount of traffic that belongs to a specified subpopulation of flows. Packets are sampled from a packet stream and aggregated into flows and counted by implementation of: (a) Adaptive Sampled NetFlow (ANF), and adjusted weight (A.sup.A.sup.NF) of a flow (f) is calculated as follows: A.sup.A.sup.NF(f)=i(f)/p'; i(f) being the number of packets counted for a flow f, and p' being the sampling rate at end of a measurement period; or (b) Adaptive Sample-and-Hold (ASH), and adjusted weight (A.sup.A.sup.SH) of a flow (f) is calculated as follows: A.sup.A.sup.SH(f)=i(f)+(1-p')/p'; i(f) being the number of packets counted for a flow f, and p' being the sampling rate at end of a measurement period.
Algorithms And Estimators For Summarization Of Unaggregated Data Streams,
Tue Jun 29 15:04:09 EDT 2010
The invention relates to streaming algorithms useful for obtaining summaries over unaggregated packet streams and for providing unbiased estimators for characteristics, such as, the amount of traffic that belongs to a specified subpopulation of flows. Packets are sampled from a packet stream and aggregated into flows and counted by implementation of Adaptive Sample-and-Hold (ASH) or Adaptive NetFlow (ANF), adjusting the sampling rate based on a quantity of flows to obtain a sketch having a predetermined size, the sampling rate being adjusted in steps; and transferring the count of aggregated packets from SRAM to DRAM and initializing the count in SRAM following adjustment of the sampling rate.
Method and apparatus for efficient routing of variable traffic,
Tue Jun 17 18:12:54 EDT 2008
A method and apparatus for provide highly efficient traffic routing for a wide range of possible traffic matrices (TM) in an intra-domain network. That routing optimally balances the traffic loads over a range of traffic matrices so as to minimize the deviation for any particular traffic matrix from the optimal routing. Such a routing provides a guaranteed performance ratio against the best possible network routing. The invention utilizes a method of optimally configuring a traffic network based on solving a linear program to obtain the optimal routing, and then configuring the routing on the network accordingly.
Method and apparatus for improving end to end performance of a data network,
Tue Jun 15 18:09:52 EDT 2004
A method and apparatus provide improved cache coherency and more effective caching operations without placing an undue burden on network links. A proxy receives a request for a resource and then, depending on information in the proxy cache, generates a resource request for transmission to a resource server. The proxy appends a proxy filter to the request. The resource server maintains one or more volumes of resources based on some predetermined criterion that can be either static or dynamic in nature. Upon receipt of the request and the proxy filter the resource server generates a request response and a piggybacked list of additional resources selected from the volume with which the requested resource is associated.
Method For Preconnecting To A Server On A Network,
Tue Aug 12 18:08:50 EDT 2003
A method is disclosed for reducing user-perceived latency due to the time required to establish a connection to a server in a network. In accordance with the present invention, an open connection is established to a set of servers, there being some probability the user will contact one of the servers in the near future. This is referred to as preconnecting or prefetching the connection. In the context of a Web client-server network, a list of likely servers can be deduced from links on a current Web page a user is looking at or from a more sophisticated analysis of the user's browsing habits. When the user requests a resource from one of the identified servers, the network connection has already been established, thereby reducing latency and improving service quality, especially for higher bandwidth clients for whom the delay is most noticeable. In contrast to conventional document prefetching, preconnecting does not hog network bandwidth or consume cache space, and hence can be used with much less scrutiny. Moreover, the technique can be implemented in Web browsers without protocol modifications or changes to Web server code.
Method and apparatus for improving end to end performance of a data network,
Tue Dec 11 18:07:18 EST 2001
A method and apparatus provide improved cache coherency and more effective caching operations without placing an undue burden on network links. A proxy receives a request for a resource and then, depending on information in the proxy cache, generates a resource request for transmission to a resource server. The proxy appends a proxy filter to the request. The resource server maintains one or more volumes of resources based on some predetermined criterion that can be either static or dynamic in nature. Upon receipt of the request and the proxy filter the resource server generates a request response and a piggybacked list of additional resources selected from the volume with which the requested resource is associated.
Retrieval System and Method,
Tue Sep 07 18:05:15 EDT 1999
Many pattern recognition tasks, including estimation, classification, and the finding of similar objects, make use of linear models. For example, many text retrieval systems represent queries as linear functions, and retrieve documents whose vector representation has a high dot product with the query. The fundamental operation in such tasks is the computation of the dot product between a query vector and a large database of instance vectors. Often instance vectors which have high dot products with the query are of interest. The invention relates to a random sampling based retrieval system that can identify, for any given query vector, those instance vectors which have large dot products, while avoiding explicit computation of all dot products.
IEEE Communications Society William R. Bennett Prize, 2007.