MIRI Seminar on Data Streams, Spring 2014 Lab 1: HyperLogLog and CMSketch ========================================== Part 1: HyperLogLog Download HyperLog in a programming language you are comfortable with. I found: c++: https://github.com/hideo55/cpp-HyperLogLog/blob/master/src/hyperloglog.hpp Java: https://github.com/addthis/stream-lib/tree/master/src/main/java/com/clearspring/analytics/stream/cardinality Python: https://pypi.python.org/pypi/hyperloglog/0.0.8 Ruby: https://rubygems.org/gems/hyperloglog Perl: http://search.cpan.org/~hideakio/Algorithm-HyperLogLog-0.20/lib/Algorithm/HyperLogLog.pm JavaScript: http://cnpmjs.org/package/hyperloglog node.js: https://www.npmjs.org/package/streamcount and this which I don't know (tell me if you do): https://github.com/eclesh/hyperloglog/blob/master/hyperloglog.go To do: Study relative error achieved at a large count (say 10^6 - 10^7, or more) for a given space (= number of registers * number of bits per register). What fractions of the runs give >10%, >5%, >2%, >1%, >0.5%, relative error? Focus on the case of shortest registers. Part 2: CMSketch Download CMSketch. - Implementation by the authors in C: http://www.cs.rutgers.edu/~muthu/massdal-code-index.html - c++: https://github.com/alabid/countminsketch - Java: https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/frequency/CountMinSketch.java - javascript: https://github.com/KurtRMueller/CM-Sketch - node.js: https://www.npmjs.org/package/count-min-sketch Feed it with the following distributions. For each one, first guess (from then analysis) then verify for which elements in the set of items we get a reasonable frequency estimation. Use a fixed delta (e.g. 0.1) and epsilon (e.g. 0.01). - Uniform distribution on n items - Sum of two uniforms: a*[m]+(1-a)*[n], for m < n, 0 <= a <= 1 - distribution f_i proportional i^{a} (a < -1) - Exponential distribution f_i proportional to a^i, 0 < a < 1 (warning: unbounded range, use min(exponential(a),n-1)) Routines to generate these distributions are in distrib.h (in c++ only, sorry) together with a main that checks them.