Benchmarking Parallel and Heterogeneous Stream Data Processing Systems

Supervisor: Pratyush Agnihotri
KOM-ID: KOM-M-0770

Distributed Stream Processing Systems (DSPS) process data in real time to extract meaningful information. Large companies such as Twitter, Facebook, and Alibaba rely on DSPS for real-time data analytics, for example click analytics and credit card fraud detection. Queries in a DSPS are usually long-running and typically handle very high workloads of millions or even billions of events per second. In such scenarios, parallelism plays an important role in making a DSPS scalable: the input stream is either partitioned into multiple sub-streams or duplicated, and is then processed by multiple operator instances. However, existing DSPS lack benchmarks of the parallel processing capability of operators under varying workloads, event rates, and queries.
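The two ways of feeding parallel operator instances mentioned above can be illustrated with a minimal sketch in plain Python. It is not tied to any particular DSPS; the function names `partition_by_key` and `duplicate`, the `(key, value)` event layout, and the choice of a CRC32 hash are assumptions made purely for illustration.

```python
from collections import defaultdict
from zlib import crc32


def partition_by_key(events, num_instances):
    """Split a stream into disjoint sub-streams, one per operator instance.

    `events` is an iterable of (key, value) pairs (hypothetical layout).
    """
    substreams = defaultdict(list)
    for key, value in events:
        # A stable hash sends every event with the same key to the same instance.
        instance = crc32(key.encode("utf-8")) % num_instances
        substreams[instance].append((key, value))
    return dict(substreams)


def duplicate(events, num_instances):
    """Give every operator instance its own full copy of the stream."""
    events = list(events)
    return {instance: list(events) for instance in range(num_instances)}


if __name__ == "__main__":
    clicks = [("user1", 1), ("user2", 2), ("user1", 3), ("user3", 4)]
    print(partition_by_key(clicks, 2))
    print(duplicate(clicks, 2))
```

Partitioning keeps the per-instance load low but requires a key that the query can be split on; duplication suits cases where different operators need the whole stream.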

Research Goals:

  • Literature study of benchmarking parallel stream processing.
  • Implementing window- and pane-based parallelization strategies.
  • Evaluating the performance for various benchmarks.
  • Displaying the performance evaluation result on the front end.
  • Possible DSPS to be used: Apache Flink.
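To give an idea of what a pane-based strategy could look like, the following is a minimal sketch in plain Python, not Apache Flink code and not necessarily the approach the thesis will take. It assumes integer, non-negative event timestamps, a sum aggregate, and sliding windows that start at multiples of the slide; the function name `pane_based_sliding_sum` is hypothetical. Each window is cut into non-overlapping panes of length gcd(window size, slide), every pane is aggregated once, and windows are then assembled from pane results. The per-pane step is the part that could be distributed across parallel instances; here it runs sequentially for simplicity.

```python
from math import gcd


def pane_based_sliding_sum(events, window_size, slide):
    """Sum values per sliding window by combining pre-aggregated panes.

    `events` is an iterable of (timestamp, value) pairs with non-negative
    integer timestamps (an assumption of this sketch). Returns a list of
    (window_start, sum) for windows [window_start, window_start + window_size).
    """
    pane_size = gcd(window_size, slide)

    # Step 1: aggregate each event into its pane exactly once.
    # Panes are independent, so this step could run in parallel.
    panes = {}
    for timestamp, value in events:
        pane_index = timestamp // pane_size
        panes[pane_index] = panes.get(pane_index, 0) + value
    if not panes:
        return []

    # Step 2: build every window from its consecutive panes.
    panes_per_window = window_size // pane_size
    panes_per_slide = slide // pane_size
    last_pane = max(panes)
    results = []
    first_pane = 0
    while first_pane <= last_pane:
        total = sum(panes.get(first_pane + i, 0) for i in range(panes_per_window))
        results.append((first_pane * pane_size, total))
        first_pane += panes_per_slide
    return results


if __name__ == "__main__":
    stream = [(0, 1), (1, 2), (2, 3), (3, 4), (5, 5)]
    print(pane_based_sliding_sum(stream, window_size=4, slide=2))
```

With a window of 4 and a slide of 2, the pane length is 2, so each event is aggregated once even though it belongs to two overlapping windows; for the sample stream the result is `[(0, 10), (2, 12), (4, 5)]`.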