
dc.contributor.advisor: Che, Hao
dc.contributor.advisor: Lei, Yu (Jeff)
dc.creator: Li, Zhongwei
dc.date.accessioned: 2020-01-10T21:09:19Z
dc.date.available: 2020-01-10T21:09:19Z
dc.date.created: 2019-12
dc.date.issued: 2019-12-06
dc.date.submitted: December 2019
dc.identifier.uri: http://hdl.handle.net/10106/28850
dc.description.abstract: Performance evaluation and resource provisioning are two of the most critical factors for designers of distributed systems in modern warehouse-scale data centers. The ever-increasing volume of data in recent years has pushed many businesses to move their computing tasks to the cloud, which offers benefits including lower system management and maintenance costs and better scalability. As a result, most of the prominent emerging workloads are data-intensive and call for scaling out to a large number of servers for parallel processing. Two questions follow: what factors affect system scaling performance, and how can tasks be scheduled efficiently onto distributed computing resources? This dissertation introduces a new performance model to address the former question and an effective hierarchical job scheduler to address the latter. The major contribution of this dissertation is a new performance modeling approach designed for data-intensive applications, developed in two phases: 1) the In-Proportion and Scale-Out-induced scaling model (IPSO), and 2) the Unified Scaling model for Big data Analytics (USBA). The first model builds on traditional performance models, including both Amdahl's and Gustafson's laws. We demonstrate why these classic models are inadequate in today's parallel computing environments and how the IPSO model may fill the gap. In the second phase, we extend IPSO to today's multi-stage workloads; the resulting model can be readily adopted for modeling data analytics applications running on the Spark platform. Both models are supported by our evaluations on well-known benchmarks and by evidence from other publications. To the best of our knowledge, IPSO is the first variation of the classic Amdahl's model that can be directly applied to modern data-intensive applications.
A lightweight tool is also developed at the end of this research, which can be used to generate IPSO inputs or as a Spark application log analyzer. The tool is developed as an open-source project and is available in a public repository. The second contribution of this dissertation is Pigeon, a job scheduler we propose for modern data centers. Pigeon is a distributed, hierarchical job scheduler based on a two-layer design. It relieves the load on the centralized schedulers widely adopted in data centers by quickly dispatching incoming tasks to selected nodes known as masters, and then ensures efficient task execution by enforcing its unique queuing mechanism on these masters. Pigeon minimizes the chance of head-of-line blocking for short jobs and avoids starvation for long jobs, and in our evaluations it outperforms Sparrow (a distributed scheduler) and Eagle (a hybrid scheduler). Pigeon is also an open-source tool available in a public repository. This dissertation is presented in an article-based format and includes three research papers. The first chapter introduces the contents of the dissertation. The second chapter reports our performance evaluation model (IPSO). The third chapter reports the extension of IPSO to multi-stage workloads (USBA). The fourth chapter reports our work on the Pigeon scheduler. Finally, the fifth chapter concludes the work and outlines plans for future research.
dc.format.mimetype: application/pdf
dc.subject: Data-intensive
dc.subject: Big data
dc.subject: Performance modeling
dc.subject: Resource provisioning
dc.subject: Job scheduler
dc.subject: HPC
dc.subject: Distributed systems
dc.title: PERFORMANCE MODELING AND RESOURCE PROVISIONING FOR DATA-INTENSIVE APPLICATIONS
dc.type: Thesis
dc.degree.department: Computer Science and Engineering
dc.degree.name: Doctor of Philosophy in Computer Science
dc.date.updated: 2020-01-10T21:11:29Z
thesis.degree.department: Computer Science and Engineering
thesis.degree.grantor: The University of Texas at Arlington
thesis.degree.level: Doctoral
thesis.degree.name: Doctor of Philosophy in Computer Science
dc.type.material: text

