Streaming Analytics

Minos Garofalakis, Director, Institute for the Management of Information Systems (IMIS), “Athena” Research and Innovation Centre, Athens, Greece  and Professor, School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece

Effective Big Data analytics need to rely on algorithms for querying and analyzing massive, continuous data streams (that is, data that is seen only once and in a fixed order) with limited memory and CPU-time resources. Such streams arise naturally in emerging large-scale event monitoring applications; for instance, network-operations monitoring in large ISPs, where usage information from numerous network devices needs to be continuously collected and analyzed for interesting trends and real-time reaction to different scenarios (e.g., hotspots or DDoS attacks). In addition to memory- and time-efficiency concerns, the inherently distributed nature of such applications also raises important communication-efficiency issues, making it critical to carefully optimize the use of the underlying communication infrastructure. This course will provide an overview of some key algorithmic tools for supporting effective, real-time analytics over streaming data. Our primary focus will be on small-space sketch synopses for approximating continuous data streams, and their applicabilty in both centralized and distributed settings.

Syllabus:

1. Introduction and Motivation

2. Data Streaming Models and Mathematical Tools

3. Basic Algorithmic Tools for Data Streams

    * Reservoir Sampling

    * Bag Synopses: AMS and CountMin Sketches

    * Set Synopses: FM Sketches and Distinct Sampling

4. The Sliding Window Model and Exponential Histograms

5. Distributed Data Streaming

    * Basic Models and Techniques

    * The Geometric Method and Convex Safe Zones

6. Conclusions and Looking Forward

7. Hands-on Experience with Streaming Tools

References:

Surveys/Monographs:

1. Graham Cormode, Minos Garofalakis, Peter J. Haas, and Chris Jermaine. “Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches”, Foundations and Trends in Databases 4(1-3): 1-294 (2012)

2. Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi (Eds.). “Data-Stream Management — Processing High-Speed Data Streams”, Springer-Verlag, New York (Data-Centric Systems and Applications Series), July 2016 (ISBN 978-3-540-28607-3).

Papers:

1. Noga Alon, Yossi Matias, Mario Szegedy: The Space Complexity of Approximating the Frequency Moments. ACM STOC 1996.

2. Noga Alon, Phillip B. Gibbons, Yossi Matias, Mario Szegedy: Tracking Join and Self-Join Sizes in Limited Storage. ACM PODS 1999.

3. Graham Cormode, S. Muthukrishnan: An Improved Data Stream Summary: The Count-Min Sketch and Its Applications. LATIN 2004.

4. Phillip B. Gibbons: Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports. VLDB 2001.

5. Mayur Datar, Aristides Gionis, Piotr Indyk, Rajeev Motwani: Maintaining Stream Statistics over Sliding Windows. SIAM J. on Computing 31(6), 2002.

6. Graham Cormode, Minos N. Garofalakis: Approximate continuous querying over distributed streams. ACM Trans. Database Syst. 33(2), 2008.

7. Izchak Sharfman, Assaf Schuster, Daniel Keren: A geometric approach to monitoring threshold functions over distributed data streams. ACM SIGMOD Conference 2006.

8. Minos N. Garofalakis, Daniel Keren, Vasilis Samoladas: Sketch-based Geometric Monitoring of Distributed Stream Queries. PVLDB 6(10), 2013.

9. Arnon Lazerson, Izchak Sharfman, Daniel Keren, Assaf Schuster, Minos N. Garofalakis, Vasilis Samoladas: Monitoring Distributed Streams using Convex Decompositions. PVLDB 8(5), 2015.