dc.contributor.author | Wu, Xiaofeng | |
dc.contributor.author | Rao, Jia | |
dc.contributor.author | Wei, Chen | |
dc.contributor.author | Huang, Heng | |
dc.contributor.author | Ding, Chris | |
dc.contributor.author | Huang, Hang | |
dc.date.accessioned | 2023-07-25T17:27:02Z | |
dc.date.available | 2023-07-25T17:27:02Z | |
dc.date.issued | 2021-12-10 | |
dc.identifier.uri | http://hdl.handle.net/10106/31596 | |
dc.description.abstract | Accelerators, such as GPUs, are a scarce resource in deep learning (DL). Sharing GPUs effectively and efficiently improves hardware utilization as well as the experience of users, who may otherwise wait for hours to access a GPU until a long training job is done. Spatial and temporal multitasking on GPUs have been studied in the literature, but popular deep learning frameworks, such as TensorFlow and PyTorch, lack support for GPU sharing among multiple DL models, which are typically represented as computation graphs, heavily optimized by underlying DL libraries, and run on a complex pipeline spanning CPU and GPU. Our study shows that GPU kernels spawned from computation graphs can barely execute simultaneously on a single GPU, and that time slicing may lead to low GPU utilization.
This paper presents SwitchFlow, a scheduling framework for DL multitasking. It centers on two designs. First, instead of scheduling a computation graph as a whole, SwitchFlow schedules its subgraphs and prevents subgraphs from different models from running simultaneously on a GPU. This results in less interference and the elimination of out-of-memory errors. Moreover, subgraphs running on different devices can overlap with each other, leading to a more efficient execution pipeline. Second, SwitchFlow maintains multiple versions of each subgraph. This allows subgraphs to be migrated across devices at a low cost, thereby enabling low-latency preemption. Results on representative DL models show that SwitchFlow achieves up to an order of magnitude lower tail latency for inference requests collocated with a training job. | en_US |
dc.language.iso | en_US | en_US |
dc.publisher | ACM | en_US |
dc.subject | Deep learning framework, preemption scheduling, systems for machine learning | en_US |
dc.title | SwitchFlow: Preemptive Multitasking for Deep Learning | en_US |
dc.type | Article | en_US |
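The two designs named in the abstract can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not SwitchFlow's implementation: the `Job` class, the `simulate` function, the integer priorities, and the fixed one-subgraph-per-step timing are all hypothetical and chosen for illustration. It shows a GPU that runs at most one subgraph per step, and a lower-priority job that continues on a CPU version of its next subgraph while a higher-priority job holds the GPU.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """A model expressed as an ordered list of subgraph names (hypothetical)."""
    name: str
    subgraphs: list
    priority: int          # larger value wins the GPU
    next_idx: int = 0      # index of the next subgraph to execute

    def done(self):
        return self.next_idx >= len(self.subgraphs)

    def pop_next(self):
        sg = self.subgraphs[self.next_idx]
        self.next_idx += 1
        return sg


def simulate(initial_jobs, arrivals):
    """Toy scheduler. `arrivals` maps a step number to a Job arriving then.

    Returns a trace of (step, device, job name, subgraph name) tuples.
    Assumption: every subgraph takes exactly one step on either device.
    """
    trace = []
    active = list(initial_jobs)
    step = 0
    while active or any(s >= step for s in arrivals):
        if step in arrivals:
            active.append(arrivals[step])
        # Highest priority first; sorted() is stable, so ties keep arrival order.
        ranked = sorted(active, key=lambda j: -j.priority)
        if ranked:
            # The GPU runs exactly one subgraph, from one model, per step.
            gpu_job = ranked[0]
            trace.append((step, "gpu", gpu_job.name, gpu_job.pop_next()))
            # The runner-up is not blocked: it runs the CPU version of its
            # next subgraph, overlapping with the GPU work.
            if len(ranked) > 1:
                cpu_job = ranked[1]
                trace.append((step, "cpu", cpu_job.name, cpu_job.pop_next()))
        active = [j for j in active if not j.done()]
        step += 1
    return trace


if __name__ == "__main__":
    train = Job("train", ["t0", "t1", "t2", "t3"], priority=0)
    infer = Job("infer", ["i0", "i1"], priority=1)
    for entry in simulate([train], {1: infer}):
        print(entry)
```

In this sketch the inference job takes over the GPU at the first subgraph boundary after it arrives, instead of waiting for the whole training graph to finish, which is the intuition behind the low-latency preemption the abstract describes.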
Files in this item
- Name: 3464298.3493391.pdf
- Size: 1.473Mb
- Format: PDF
- Description: Journal Article