Learned Parallel Stream Processing using Zero-Shot Cost Models
January 01, 2023
Motivation
In Distributed Stream Processing Systems (DSPS), queries are usually long-running. They typically deal with very high workloads of millions or even billions of events per second. Under such scenarios, parallelism plays an important role in providing scalability for DSPS. One of the core decisions is to determine the right parallelization degree to process a high load of events and queries while meeting high-throughput and low-latency requirements. In this context, parallelization can be controlled by defining the degree of parallelism for each operator in the operator graph (see the sketch after the questions below). The main questions we want to investigate here are:
- What is the right parallelism degree for query processing? (What is the performance in terms of end-to-end latency for parallel stream processing? To what extent can the model be used for seen and unseen behaviour? What are the limitations of the models, and in which directions can they be extended?)
- How to parallelize the processing of DSPS operators?
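To make the notion of per-operator parallelism concrete, the following is a minimal sketch using Flink's DataStream API. The pipeline, the socket source on localhost:9999, and the chosen parallelism values are illustrative assumptions rather than part of the thesis task; the point is only that each operator in the dataflow graph can be assigned its own degree of parallelism via `setParallelism`.

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class PerOperatorParallelismSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Default degree of parallelism for operators that do not override it (assumed value).
        env.setParallelism(4);

        env.socketTextStream("localhost", 9999)      // hypothetical text source
            .flatMap((String line, Collector<String> out) -> {
                for (String word : line.split("\\s+")) {
                    out.collect(word);               // tokenize each line into words
                }
            })
            .returns(Types.STRING)
            .setParallelism(8)                       // flatMap runs with 8 parallel instances
            .map(String::toLowerCase)
            .returns(Types.STRING)
            .setParallelism(2)                       // map runs with 2 parallel instances
            .print()
            .setParallelism(1);                      // sink kept at a single instance

        env.execute("per-operator parallelism sketch");
    }
}
```

In this sketch the per-operator degrees are hand-picked; deciding such degrees automatically, e.g., with a learned zero-shot cost model instead of manual tuning, is exactly the decision space this topic investigates.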
Prerequisites
There are no hard prerequisites for this topic. However, it is preferable if you have:
- Good programming skills in Java
- Good understanding of the concepts of machine learning
- Understanding of the concepts of communication networks and big data processing engines, e.g., Apache Flink, Kafka
Reference Literature
- Three steps is all you need: fast, accurate, automatic scaling decisions for distributed streaming dataflows.
- A Comprehensive Survey on Parallelization and Elasticity in Stream Processing
- Resource Management and Scheduling in Distributed Stream Processing Systems: A Taxonomy, Review, and Future Directions
- One Model to Rule them All: Towards Zero-Shot Learning for Databases (Benjamin Hilprecht and Carsten Binnig)
- Apache Flink documentation: nightlies.apache.org/flink/flink-docs-master/docs/learn-flink/overview/
Keywords: Big Data, Parallel Stream Processing
Research Area(s):
Tutor: Agnihotri