BOA Constrictor: Squeezing Performance out of GPUs in the Cloud via Budget-Optimal Allocation

Download here

Zhouzi Li, Cindy Zhu, Arpan Mukhopadhyay, Mor Harchol-Balter, Benjamin Berg (Under submission)

Abstract: The past decade has seen a dramatic increase in demand for GPUs to train Machine Learning (ML) models. Because it is prohibitively expensive for most organizations to build and maintain a large GPU cluster, organizations instead rent GPUs from cloud providers. Renting, however, forces users to trade off cost against performance. To navigate this tradeoff, we develop BOA Constrictor, a new scheduler for ML training jobs that uses a Budget-Optimal Allocation (BOA) policy to squeeze the highest level of performance out of a cloud-deployed GPU cluster under a fixed budget constraint. Our BOA policy can be computed efficiently for any budget level, and therefore gives users the optimal tradeoff between cost and performance at every price point. For a given budget, we demonstrate that BOA Constrictor reduces average job completion time (JCT) by 1.6× in small-scale implementation experiments, and by 2× in detailed, large-scale simulations, compared to state-of-the-art heuristic-based schedulers.
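
To make the setting concrete, below is a minimal, hypothetical sketch of the kind of budget-constrained GPU allocation problem the abstract describes. This is not the paper's BOA policy: the concave speedup model, the greedy marginal-gain rule, and all names and parameters are illustrative assumptions of mine, chosen only to show how a fixed hourly budget limits the GPUs that can be split across training jobs.

```python
# Toy sketch only -- NOT the BOA policy from the paper. Assumptions (mine):
# each job j has remaining work w_j (GPU-hours) and a concave speedup
# s_j(k) = k**alpha_j when given k GPUs; every GPU costs `price` per hour;
# we greedily hand out GPUs to whichever job gains the most from one more,
# until the hourly budget is exhausted.

def speedup(k: int, alpha: float) -> float:
    """Concave speedup from k GPUs (diminishing returns as k grows)."""
    return k ** alpha if k > 0 else 0.0

def greedy_allocation(jobs, budget: float, price: float):
    """Assign whole GPUs one at a time by largest marginal reduction in
    a job's remaining runtime, subject to price * total_gpus <= budget."""
    alloc = {name: 0 for name, _, _ in jobs}
    for _ in range(int(budget // price)):  # total GPUs the budget affords
        best, best_gain = None, 0.0
        for name, work, alpha in jobs:
            k = alloc[name]
            # Runtime with k vs. k+1 GPUs; a job holding 0 GPUs never
            # finishes, so its first GPU yields an unboundedly large gain.
            old = work / speedup(k, alpha) if k > 0 else float("inf")
            new = work / speedup(k + 1, alpha)
            if old - new > best_gain:
                best, best_gain = name, old - new
        if best is None:
            break
        alloc[best] += 1
    return alloc

if __name__ == "__main__":
    # (name, remaining work in GPU-hours, speedup exponent alpha)
    jobs = [("bert", 40.0, 0.9), ("resnet", 10.0, 0.7), ("gpt", 80.0, 0.95)]
    # $24/hour budget at $3/GPU-hour leaves 8 GPUs to divide among 3 jobs.
    print(greedy_allocation(jobs, budget=24.0, price=3.0))
```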