
180 Park Ave - Building 103
Florham Park, NJ
http://www2.research.att.com/~mthorup/
Tabulation Based 5-independent Hashing with Applications to Linear Probing and Second Moment Estimation
Mikkel Thorup, Yin Zhang
SIAM Journal on Computing,
2012.
[PDF]
[BIB]
SIAM Copyright
The definitive version was published in 2012 . , Volume 41, 2012-04-25
{}
The Power of Simple Tabulation Hashing
Mihai Patrascu, Mikkel Thorup
Proceedings of ACM STOC'11,
2011.
[PDF]
[BIB]
ACM Copyright
(c) ACM, 2011. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in Proceedings of ACM STOC'11. , 2011-06-06.
{Randomized algorithms are often enjoyed for their simplicity, but the
hash functions used to yield the desired theoretical guarantees are
often neither simple nor practical. Here we show that the simplest
possible tabulation hashing provides unexpectedly strong guarantees.
The scheme itself dates back to Carter and Wegman (STOC'77). Keys are
viewed as consisting of $c$ characters. We initialize $c$ tables $T_1,
dots, T_c$ mapping characters to random hash code. A key
$x=(x_1, dots, x_q)$ is hashed to $T_1[x_1] oplus cdots oplus
T_c[x_c]$, where $oplus$ denotes xor.
While this scheme is not even 4-independent, we show that it provides
many of the guarantees that are normally obtained via higher
independence, e.g., Chernoff-type concentration, min-wise hashing for
estimating set intersection, and cuckoo hashing.
}

Efficient stream sampling for variance-optimal estimation of subset sums
Edith Cohen, Nicholas Duffield, Carsten Lund, Mikkel Thorup, Haim Kaplan
SIAM Journal on Computing,
2011.
[PDF]
[BIB]
SIAM (Society fir Industrial and Applied Mathematics) Copyright
The definitive version was published in SIAM Journal on Computing. , 2011-06-01
{From a high volume stream of weighted items, we want to maintain a generic sample of a certain limited size k that we can later use to estimate the total weight of arbitrary subsets. This is the classic context of on-line reservoir sampling, thinking of the generic sample as a reservoir. We present an efficient reservoir sampling scheme, VAROPTk, that dominates all previous schemes in terms of estimation quality. VAROPTk provides variance optimal unbiased estimation of subset sums. More precisely, if we have seen n items of the stream, then for any subset size m, our scheme based on k samples minimizes the average variance over all subsets of size m. In fact, the optimality is against any off-line scheme with
k samples tailored for the concrete set of items seen. In addition to optimal average variance, our scheme provides tighter worst-case bounds on the variance of particular subsets than previously possible. It is efficient, handling each new item of the stream in O(log k) time. Finally, it is particularly well suited for
combination of samples from different streams in a distributed setting.
}

Don't Rush into a Union: Take Time to Find Your Roots
Mihai Patrascu, Mikkel Thorup
Proceedings of ACM STOC'11,
2011.
[PDF]
[BIB]
ACM Copyright
(c) ACM, 2011. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in Proceedings of ACM STOC'11. , 2011-06-06.
{We present a new threshold phenomenon in data structure lower bounds
where slightly reduced update times lead to exploding query
times. Consider incremental connectivity, letting $t_u$ be the time to
insert an edge and $t_q$ be the query time. For $t_u = Omega(t_q)$,
the problem is equivalent to the well-understood emph{union--find}
problem: $proc{InsertEdge}(s,t)$ can be implemented by
$proc{Union}(proc{Find}(s), proc{Find}(t))$. This gives worst-case
time $t_u = t_q = O(lg n / lglg n)$ and amortized $t_u = t_q =
O(alpha(n))$.
By contrast, we show that if $t_u = o(lg n / lglg n)$, the query
time explodes to $t_q ge n^{1-o(1)}$. In other words, if the data
structure doesn't have time to find the roots of each disjoint set
(tree) during edge insertion, there is no effective way to organize
the information!
For amortized complexity, we demonstrate a new inverse-Ackermann type
trade-off in the regime $t_u = o(t_q)$.
A similar lower bound is given for fully dynamic connectivity, where
an update time of $o(lg n)$ forces the query time to be
$n^{1-o(1)}$. This lower bound allows for amortization and Las Vegas
randomization, and comes close to the known $O(lg n cdot (lglg
n)^{O(1)})$ upper bound.
}
Optimal Combination Of Sampled Network Measurements,
Tue Sep 27 16:06:11 EDT 2011
Two regularized estimators that avoid the pathologies associated with variance estimation are disclosed. The regularized variance estimator adds a contribution to estimated variance representing the likely error, and hence ameliorates the pathologies of estimating small variances while at the same time allowing more reliable estimates to be balanced in the convex combination estimator. The bounded variance estimator employs an upper bound to the variance which avoids estimation pathologies when sampling probabilities are very small.
Variance-Optimal Sampling-Based Estimation Of Subset Sums,
Tue Aug 23 16:06:03 EDT 2011
The present invention relates to a method of obtaining a generic sample of an input stream. The method is designated as VAROPT.sub.k. The method comprises receiving an input stream of items arriving one at a time, and maintaining a sample S of items i. The sample S has a capacity for at most k items i. The sample S is filled with k items i. An n.sup.th item i is received. It is determined whether the n.sup.th item i should be included in sample S. If the n.sup.th item i is included in sample S, then a previously included item i is dropped from sample S. The determination is made based on weights of items without distinguishing between previously included items i and the n.sup.th item i. The determination is implemented thereby updating weights of items i in sample S. The method is repeated until no more items are received.
Methods And Apparatus To Bound Network Traffic Estimation Error For Multistage Measurement Sampling And Aggregation,
Tue Aug 02 16:05:49 EDT 2011
Methods and apparatus to bound network traffic estimation error for multistage measurement sampling and aggregation are disclosed. An example method disclosed herein comprises determining a hierarchical sampling topology representative of multiple data sampling and aggregation stages, the hierarchical sampling topology comprising a plurality of nodes connected by a plurality of edges, each node corresponding to at least one of a data source and a data aggregation operation, and each edge corresponding to a data sampling operation characterized by a generalized sampling threshold, selecting a first generalized sampling threshold from a set of generalized sampling thresholds associated with a respective set of edges originating at a respective set of descendent nodes of a target node undergoing network traffic estimation, and transforming a measured sample of network traffic into a confidence interval for a network traffic estimate associated with the target node using the first generalized sampling threshold and an error parameter.
System And Method For Analyzing Data Traffic,
Tue Dec 28 15:05:28 EST 2010
A method of analyzing data traffic includes receiving a request at a data analysis system to store a string related to header information associated with a data packet. The method also includes applying a hash function to the string, thereby obtaining a 32-bit intermediate, and applying another hash function to the 32-bit intermediate, thereby obtaining a hash number. Further, the method includes storing the string in an array position corresponding to the hash number, when the array position is empty.
Sampling And Analyzing Packets In A Network,
Tue Dec 14 15:05:20 EST 2010
The preferred embodiments of the present invention can include sampling packets transmitted over a network based on the content of the packets. If a packet is sampled, the sampling unit can add one or more fields to the sampled packet that can include a field for a number of bytes contained in the packet, a packet count, a flow count, a sampling type, and the like. The sampled packets can be analyzed to discern desired information from the packets. The additional fields that are added to the sampled packets can be used during the analysis.
Algorithms And Estimators For Summarization Of Unaggregated Data Streams,
Tue Jul 27 15:04:14 EDT 2010
The invention relates to streaming algorithms useful for obtaining summaries over unaggregated packet streams and for providing unbiased estimators for characteristics, such as, the amount of traffic that belongs to a specified subpopulation of flows. Packets are sampled from a packet stream and aggregated into flows and counted by implementation of: (a) Adaptive Sampled NetFlow (ANF), and adjusted weight (A.sup.A.sup.NF) of a flow (f) is calculated as follows: A.sup.A.sup.NF(f)=i(f)/p'; i(f) being the number of packets counted for a flow f, and p' being the sampling rate at end of a measurement period; or (b) Adaptive Sample-and-Hold (ASH), and adjusted weight (A.sup.A.sup.SH) of a flow (f) is calculated as follows: A.sup.A.sup.SH(f)=i(f)+(1-p')/p'; i(f) being the number of packets counted for a flow f, and p' being the sampling rate at end of a measurement period.
Algorithms And Estimators For Summarization Of Unaggregated Data Streams,
Tue Jun 29 15:04:09 EDT 2010
The invention relates to streaming algorithms useful for obtaining summaries over unaggregated packet streams and for providing unbiased estimators for characteristics, such as, the amount of traffic that belongs to a specified subpopulation of flows. Packets are sampled from a packet stream and aggregated into flows and counted by implementation of Adaptive Sample-and-Hold (ASH) or Adaptive NetFlow (ANF), adjusting the sampling rate based on a quantity of flows to obtain a sketch having a predetermined size, the sampling rate being adjusted in steps; and transferring the count of aggregated packets from SRAM to DRAM and initializing the count in SRAM following adjustment of the sampling rate.
Method And Apparatus For Providing Composite Link Assignment In Network Design,
Tue Oct 06 16:08:06 EDT 2009
A method and apparatus for composite link assignment are provided such that network capacity is sufficient to handle all the traffic (e.g., load) while an objective function, e.g., the total cost of the capacity is minimized. The present method receives a plurality of weights for a plurality of arcs and a load for the network. An objective function is selected for minimization, where the present method then determines the composite link assignment to handle the load while the objective function is minimized. In one embodiment, the composite link assignment comprises a plurality of different link types for the plurality of arcs.
Method And Apparatus For Updating A Shortest Path Graph,
Tue Sep 22 16:08:04 EDT 2009
A method and apparatus for updating a shortest path graph or a shortest path tree are disclosed. For example, an arc weight is changed for an arc in the network, where a plurality of affected nodes in the network is determined. The distance of each of the affected nodes is determined, where a subset of the plurality of affected nodes is then placed in a heap. One aspect of the present invention is that not all the affected nodes are placed in the heap. In one embodiment, the present reduced heap approach only applies the Dijkstra's algorithm to those affected nodes whose distances change in a smaller amount that the change in the arc weight. In turn, the shortest path graph or the shortest path tree is updated in accordance with the affected nodes placed in the heap.
Optimal Combination Of Sampled Measurements,
Tue May 19 16:07:00 EDT 2009
Two regularized estimators that avoid the pathologies associated with variance estimation are disclosed. The regularized variance estimator adds a contribution to estimated variance representing the likely error, and hence ameliorates the pathologies of estimating small variances while at the same time allowing more reliable estimates to be balanced in the convex combination estimator. The bounded variance estimator employs an upper bound to the variance which avoids estimation pathologies when sampling probabilities are very small.
Apparatus for size-dependent sampling for managing a data network,
Tue Nov 20 18:12:28 EST 2007
The present invention provides apparatus for sampling data flows in a data network in order to estimate a total data volume in the network. Sampling the data flows in the data network reduces the network resources that must be expended by the network to support the associated activity. The present invention enables the service provider of the data network to control sampled volumes in relation to the desired accuracy. The control can be either static or can be dynamic for cases in which the data volumes are changing as a function of time.
Methods and systems for optimizing network traffic,
Tue Feb 27 18:11:56 EST 2007
The present invention provides methods and systems for evaluating network traffic. By generating successive sets of weights relating to a performance surface using a variety of heuristic techniques, and then evaluating the weights using a piece-wise linear cost function, a number of performance minima can be found. By continuously searching the performance surface, a champion minimum can be extracted. Searching the performance surface can be quickly and efficiently accomplished using a variety functions such as an anti-cycling function, an impatience function, a dynamic graph technique and a diversity process.
Method and apparatus for size-dependent sampling for managing a data network,
Tue Jul 18 18:11:25 EDT 2006
The present invention provides a method and apparatus for sampling data flows in a data network in order to estimate a total data volume in the network. Sampling the data flows In the data network reduces the network resources that must be expended by the network to support the associated activity. The present Invention enables the service provider of the data network to control sampled volumes in relation to the desired accuracy. The control can be either static or can be dynamic for cases in which the data volumes are changing as a function of time
Methods And Systems For Fast Optimization Of Network Traffic,
Tue Dec 07 18:10:13 EST 2004
The present invention provides methods and systems for evaluating network traffic. By generating successive sets of weights relating to a performance surface using a variety of heuristic techniques, and then evaluating the weights using a piece-wise linear cost function, a number of performance minima can be found. By continuously searching the performance surface, a champion minimum can be extracted. Searching the performance surface can be quickly and efficiently accomplished using a variety functions such as an anti-cycling function, an impatience function, a dynamic graph technique and a diversity process.
Mathematical Association of America (MAA) David P. Robbins Prize, 2011.
For impressive results in discrete mathematics in the paper "Maximum Overhang" that solved a 150-year-old problem: How far can a stack of identical blocks hang over the edge of a table?
AT&T Fellow, 2010.
For outstanding innovation in algorithms, including advanced hashing and sampling techniques applied to AT&T's Internet traffic analysis and speech services.
Royal Danish Academy of Science & Letters, 2006.
ACM Fellow, 2005.
For contributions to algorithms and data structures.