SwitchFlow: Preemptive Multitasking for Deep Learning
Date
2021-12-10
Author
Wu, Xiaofeng
Rao, Jia
Chen, Wei
Huang, Heng
Ding, Chris
Huang, Hang
Abstract
Accelerators, such as GPUs, are a scarce resource in deep learning (DL). Sharing a GPU effectively and efficiently improves hardware utilization as well as the experience of users, who may otherwise wait hours for a long training job to finish before gaining access to the GPU. Spatial and temporal multitasking on GPUs has been studied in the literature, but popular deep learning frameworks, such as TensorFlow and PyTorch, lack support for GPU sharing among multiple DL models, which are typically represented as computation graphs, heavily optimized by the underlying DL libraries, and run on a complex pipeline spanning the CPU and GPU. Our study shows that GPU kernels spawned from computation graphs can barely execute simultaneously on a single GPU, and that time slicing may lead to low GPU utilization.
This paper presents SwitchFlow, a scheduling framework for DL multitasking. It centers on two designs. First, instead of scheduling a computation graph as a whole, SwitchFlow schedules its subgraphs and prevents subgraphs from different models from running simultaneously on a GPU. This reduces interference and eliminates out-of-memory errors. Moreover, subgraphs running on different devices can overlap with each other, leading to a more efficient execution pipeline.
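As a concrete illustration of this first design, consider the minimal Python sketch below. It is not SwitchFlow's actual implementation; the Subgraph class, DEVICE_LOCKS table, and run_subgraph helper are invented here to show how per-device mutual exclusion serializes subgraphs from different models on a GPU while letting subgraphs on different devices overlap.

import threading

class Subgraph:
    # A schedulable unit: one slice of a model's computation graph.
    def __init__(self, model_id, device, run_fn):
        self.model_id = model_id  # owning model
        self.device = device      # placement, e.g. "gpu:0" or "cpu:0"
        self.run_fn = run_fn      # callable that executes the subgraph

# One lock per device: subgraphs from different models are serialized
# on the same GPU, avoiding interference and out-of-memory errors.
DEVICE_LOCKS = {"gpu:0": threading.Lock(), "cpu:0": threading.Lock()}

def run_subgraph(sg):
    with DEVICE_LOCKS[sg.device]:
        sg.run_fn()

# Subgraphs placed on different devices still overlap: model A's CPU
# preprocessing proceeds while model B's subgraph occupies the GPU.
a_pre = Subgraph("A", "cpu:0", lambda: print("A: preprocess on CPU"))
b_fwd = Subgraph("B", "gpu:0", lambda: print("B: forward pass on GPU"))
threads = [threading.Thread(target=run_subgraph, args=(sg,))
           for sg in (a_pre, b_fwd)]
for t in threads:
    t.start()
for t in threads:
    t.join()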
Second, SwitchFlow maintains multiple versions of each subgraph. This allows subgraphs to be migrated across devices at low cost, thereby enabling low-latency preemption.
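The second design can be sketched in the same hypothetical style. The MultiVersionSubgraph class and its build_fn parameter below are assumptions standing in for the framework step that compiles a subgraph for a given device; the point is that keeping a pre-built version per device turns migration into a pointer switch rather than a recompilation.

class MultiVersionSubgraph:
    def __init__(self, build_fn, devices=("gpu:0", "cpu:0")):
        # One pre-built executable per device, so migration never pays
        # a compilation cost on the critical path.
        self.versions = {d: build_fn(d) for d in devices}
        self.active = "gpu:0"

    def run(self, inputs):
        return self.versions[self.active](inputs)

    def preempt_to(self, device):
        # Migration reduces to switching the active version, which is
        # what enables low-latency preemption.
        self.active = device

# Example: a training subgraph yields the GPU to an inference request.
train_step = MultiVersionSubgraph(lambda dev: (lambda x: f"ran on {dev}: {x}"))
print(train_step.run("batch-0"))   # ran on gpu:0: batch-0
train_step.preempt_to("cpu:0")     # a high-priority job takes the GPU
print(train_step.run("batch-1"))   # ran on cpu:0: batch-1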
Results on representative DL models show that SwitchFlow achieves
up to an order of magnitude lower tail latency for inference requests
collocated with a training job.