The documents distributed by this server have been provided by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, not withstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.

Generalized Cost-Based Job Scheduling in Very Large Heterogeneous Cluster Systems

Author:Wasiur KhudaBukhsh, Sounak Kar, Bastian Alt, Amr Rizk, Heinz Koeppl
Date:May 2020
Kind:Article - use for journal articles only
Journal:IEEE Transactions on Parallel and Distributed Systems
Keywords:Job Scheduling, performance evaluation, mean-field limit
Research Area(s):Network Mechanisms & QoS
Abstract:We study job assignment in large, heterogeneous resource-sharing clusters of servers with finite buffers. This load balancing problem arises naturally in today's communication and big data systems, such as Amazon Web Services, Network Service Function Chains, and Stream Processing. Arriving jobs are dispatched to a server, following a load balancing policy that optimizes a performance criterion such as job completion time. Our contribution is a randomized CBS policy in which the job assignment is driven by general cost functions of the server queue lengths. Beyond existing schemes, such as the JSQ, the power of d or the SQ(d) and the capacity-weighted JSQ, the notion of CBS yields new application-specific policies such as hybrid locally uniform JSQ. As today's data center clusters have thousands of servers, exact analysis of CBS policies is tedious. In this work, we derive a scaling limit when the number of servers grows large, facilitating a comparison of various CBS policies with respect to their transient as well as steady-state behavior. A byproduct of our derivations is the relationship between the queue filling proportions and the server buffer sizes, which cannot be obtained from infinite buffer models. Finally, we provide extensive numerical evaluations and discuss several applications including multi-stage systems.
Full paper (pdf)

[Export this entry to BibTeX]