dc.contributor.advisor: Jiang, Hong
dc.creator: Wu, Xiaofeng
dc.date.accessioned: 2023-06-14T17:04:54Z
dc.date.available: 2023-06-14T17:04:54Z
dc.date.created: 2023-05
dc.date.issued: 2023-05-01
dc.date.submitted: May 2023
dc.identifier.uri: http://hdl.handle.net/10106/31210
dc.description.abstract: This thesis addresses the challenges of utilization, efficiency, and scalability faced by deep learning systems, which are essential for high-performance training and serving of deep learning models. Deep learning systems play a critical role in developing accurate and complex models for applications such as image recognition, natural language understanding, and speech recognition. This research focuses on understanding and developing deep learning systems that encompass data preprocessing, resource management, multi-tenancy, and distributed model training, and it proposes several solutions to improve the performance, scalability, and efficiency of deep learning applications. First, we introduce SwitchFlow, a scheduling framework that addresses the limitations of popular deep learning frameworks in supporting GPU sharing and multi-tasking. Second, we propose Atom, a distributed training framework for large language models that uses decentralized training to reduce communication costs and increase scalability; we discuss the challenges of decentralized training and present the design and implementation of Atom. Finally, we introduce PerFect, a method that pre-trains a model on repetitive (cached) data to improve data preprocessing efficiency and then fine-tunes it to reach the desired accuracy. These approaches significantly improve the performance, scalability, and efficiency of deep learning applications. Specifically, SwitchFlow reduces interference and eliminates out-of-memory errors by scheduling subgraphs rather than whole computation graphs, and it allows subgraphs running on different devices to overlap, yielding a more efficient execution pipeline. Atom achieves high training throughput and fault tolerance in a decentralized environment, enabling the training of massive-scale models on affordable hardware such as consumer-class GPUs and Ethernet. PerFect improves the throughput of the data preprocessing stage and reaches the desired accuracy when reusing cached data, without requiring additional hardware or third-party libraries. The proposed frameworks and solutions are evaluated with representative deep learning models, and the results demonstrate their effectiveness and scalability. Overall, this thesis contributes to the development of deep learning systems and provides practical solutions to the challenges of utilization, efficiency, and scalability, making deep learning applications more accessible and efficient for a wider range of users.
dc.format.mimetype: application/pdf
dc.language.iso: en_US
dc.subject: Optimization
dc.subject: Resource utilization
dc.subject: Efficiency
dc.subject: Scalability
dc.subject: Deep learning systems
dc.title: Optimizing Resource Utilization, Efficiency and Scalability in Deep Learning Systems
dc.type: Thesis
dc.date.updated: 2023-06-14T17:04:54Z
thesis.degree.department: Computer Science and Engineering
thesis.degree.grantor: The University of Texas at Arlington
thesis.degree.level: Doctoral
thesis.degree.name: Doctor of Philosophy in Computer Science
dc.type.material: text