Date: 2011-09-30
Time: 15:30-16:30
Location: BURN 1205
Abstract:
Streaming data is ubiquitous in a wide range of areas from engineering and information technology, finance, and commerce, to atmospheric physics, and earth sciences. The online approximation of properties of data streams is of great interest, but this approximation process is hindered by the sheer size of the data and the speed at which it is generated. Data stream algorithms typically allow only one pass over the data, and maintain sub-linear representations of the data from which target properties can be inferred with high efficiency.
In this talk we consider the online approximation of two important characterizations of data streams: cardinality and empirical Shannon entropy. We assume that the number of distinct elements observed in the stream is prohibitively large, so that the vector of cumulative quantities cannot be stored on main computer memory for fast and efficient access. We focus on two techniques that use pseudo-random variates to form low-dimensional data sketches (using hashing and random projections), and derive estimators of the cardinality and empirical entropy. We discuss various properties of our estimators such as relative asymptotic efficiency, recursive computability, and error and complexity bounds. Finally, we present results on simulated data and seismic measurements from a volcano.
Speaker
Ioana A. Cosma is a Postdoctoral Fellow in the Statistical Laboratoty at Cambridge University, England. She holds a Ph.D. in Statistics from the University of Oxford.
References:
Peter Clifford and Ioana A. Cosma (2011) “A statistical analysis of probabilistic counting algorithms” (to appear in the Scandinavian Journal of Statistics, preprint on arXiv:0801.3552).
Peter Clifford and Ioana A. Cosma (2009)“A simple sketching algorithm for entropy estimation” (in preparation, preprint on arXiv:0908.3961).