SwitchFlow: Preemptive Multitasking for Deep Learning

Wu, Xiaofeng; Rao, Jia; Wei, Chen; Huang, Heng; Ding, Chris; Huang, Hang

dc.contributor.author	Wu, Xiaofeng
dc.contributor.author	Rao, Jia
dc.contributor.author	Wei, Chen
dc.contributor.author	Huang, Heng
dc.contributor.author	Ding, Chris
dc.contributor.author	Huang, Hang
dc.date.accessioned	2023-07-25T17:27:02Z
dc.date.available	2023-07-25T17:27:02Z
dc.date.issued	2021-12-10
dc.identifier.uri	http://hdl.handle.net/10106/31596
dc.description.abstract	Accelerators, such as GPUs, are a scarce resource in deep learning (DL). Sharing GPUs effectively and efficiently improves hardware utilization as well as the experience of users, who may otherwise wait for hours for a long training job to finish before they can access a GPU. Spatial and temporal multitasking on GPUs has been studied in the literature, but popular deep learning frameworks, such as TensorFlow and PyTorch, lack support for GPU sharing among multiple DL models, which are typically represented as computation graphs, heavily optimized by underlying DL libraries, and run on a complex pipeline spanning CPU and GPU. Our study shows that GPU kernels spawned from computation graphs can barely execute simultaneously on a single GPU, and that time slicing may lead to low GPU utilization. This paper presents SwitchFlow, a scheduling framework for DL multitasking. It centers on two designs. First, instead of scheduling a computation graph as a whole, SwitchFlow schedules its subgraphs and prevents subgraphs from different models from running simultaneously on a GPU. This results in less interference and eliminates out-of-memory errors. Moreover, subgraphs running on different devices can overlap with each other, leading to a more efficient execution pipeline. Second, SwitchFlow maintains multiple versions of each subgraph. This allows subgraphs to be migrated across devices at a low cost, thereby enabling low-latency preemption. Results on representative DL models show that SwitchFlow achieves up to an order of magnitude lower tail latency for inference requests collocated with a training job.	en_US
dc.language.iso	en_US	en_US
dc.publisher	ACM	en_US
dc.subject	Deep learning framework, preemption scheduling, systems for machine learning	en_US
dc.title	SwitchFlow: Preemptive Multitasking for Deep Learning	en_US
dc.type	Article	en_US

Files in this item

Name: 3464298.3493391.pdf
Size: 1.473Mb
Format: PDF
Description: Journal Article

This item appears in the following Collection(s)

Articles published in Association for Computing Machinery (ACM) Journals - DO NOT EDIT