Vision
Distributed systems in datacenters and the cloud are growing in complexity, with systems now comprising hundreds to thousands of services. Managing performance across the entire system is difficult due to the non-intuitive performance interactions among the services. My research takes a three-pronged approach to managing performance: developing tools and techniques for automated performance debugging; designing automated performance management systems that are efficient and sustainable; and developing new approaches for testing and evaluating system performance. The direct impact of my research is in developing new methodologies for managing performance and identifying performance issues, which will improve both system efficiency and the efficiency of software engineering. These techniques will benefit software developers who are proficient in their code base but are not necessarily performance experts; system designers who are looking to improve system efficiency and sustainability through better resource management; and systems researchers who need to run realistic experiments to study the performance of their prototypes.
Projects
- Performance debugging [details]
Performance debugging is one of the most challenging types of debugging. Gartner estimates that the root cause of a performance problem takes, on average, a week to diagnose. Large corporations can hire specialized performance engineers to do this work, which is expensive and not scalable. My research will automate the performance diagnosis process to make it accessible to a broader base of engineers who may not have this specialized knowledge. Our first work, tprof (SoCC 2021), demonstrates how to analyze distributed systems tracing data to automatically identify performance issues, and our code is open-sourced at https://github.com/lexiangh/tprof. Our tools will enable engineers to be more effective in investigating complex performance issues, and their efforts will be focused on fixing bugs rather than diagnosing them.
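tprof's actual analysis exploits the hierarchical structure of traces; the sketch below only illustrates the core idea of grouping trace spans by operation and ranking them by aggregate latency to surface likely bottlenecks. The trace format (a list of dicts with `op` and `duration_ms` fields) is a simplification for illustration, not tprof's schema.

```python
from collections import defaultdict

def rank_operations(spans):
    """Group spans by operation name and rank by total latency.

    `spans` is a list of dicts with illustrative keys "op" and
    "duration_ms"; real tracing data (e.g., Jaeger/Zipkin output)
    is richer and hierarchical.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for span in spans:
        totals[span["op"]] += span["duration_ms"]
        counts[span["op"]] += 1
    # Rank by total time spent, likeliest bottlenecks first; also report
    # mean latency per operation.
    return sorted(
        ((op, totals[op], totals[op] / counts[op]) for op in totals),
        key=lambda row: row[1],
        reverse=True,
    )

trace = [
    {"op": "db.query", "duration_ms": 120.0},
    {"op": "db.query", "duration_ms": 80.0},
    {"op": "render", "duration_ms": 30.0},
]
ranking = rank_operations(trace)
```

Here the database operation dominates total time, so it would be ranked first as the candidate to investigate.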
- Metastable failures [details]
Metastable failures are a newly defined class of failures where a bad feedback cycle causes the system to get persistently stuck in an overloaded state. Even after the initial cause of the overload is fixed, the system is unable to recover due to some sustaining effect that perpetuates the overload. This type of failure has caused many of the major catastrophic outages at cloud providers such as Amazon AWS, Microsoft Azure, and Google Cloud, leading to millions of dollars of lost revenue. Building on our HotOS 2021 paper, our OSDI 2022 paper defines and establishes this class of failures and provides examples of it in practice. We have open-sourced some simplified examples of metastable failures at https://github.com/lexiangh/Metastability.
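The feedback cycle can be illustrated with a toy discrete-time model in which unserved requests are retried with slight amplification: a brief capacity dip plays the role of the trigger, and retries act as the sustaining effect. All parameters below are illustrative and not taken from the papers.

```python
def simulate(capacity, offered_load, trigger_at, trigger_len, steps, retry_amp=1.1):
    """Toy discrete-time model of a retry-driven metastable failure.

    Each step, the server handles up to `capacity` requests; unserved
    requests are retried next step with amplification `retry_amp` (the
    sustaining effect). A temporary 50% capacity drop is the trigger.
    """
    backlog = 0.0
    history = []
    for t in range(steps):
        cap = capacity * 0.5 if trigger_at <= t < trigger_at + trigger_len else capacity
        demand = offered_load + backlog          # new requests plus retries
        served = min(cap, demand)
        backlog = max(0.0, (demand - served) * retry_amp)
        history.append(backlog)
    return history

# With a brief capacity dip, retries keep demand above capacity and the
# backlog never drains, even long after capacity recovers.
metastable = simulate(capacity=100, offered_load=90, trigger_at=20, trigger_len=10, steps=100)
# Without the trigger, the same offered load is handled with no backlog.
baseline = simulate(capacity=100, offered_load=90, trigger_at=20, trigger_len=0, steps=100)
```

The model is bistable: below a backlog threshold the system drains and recovers, but once the trigger pushes the backlog past that threshold, retries alone sustain the overload.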
- Evaluating system performance [details]
An important part of systems research and evaluation is being able to test systems under realistic settings. However, without direct access to production systems, one must rely upon reproducing the effects in experimental testbeds. This is typically done in an ad hoc manner for each project, and we as a research community often fall into evaluation pitfalls. My research will identify these pitfalls and introduce tools and techniques for performing realistic replay-based experiments that are repeatable and reproducible. Our first work, TraceSplitter (EuroSys 2021), investigates the problem of downscaling the load in a workload trace and introduces a new method for accurate downscaling. Our code is open-sourced at https://github.com/smsajal/TraceSplitter.
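As a minimal illustration of the problem space, the sketch below implements naive random sampling, a simple baseline rather than TraceSplitter's algorithm: it preserves the average arrival rate in expectation but can distort burst structure, which is one of the pitfalls such a tool must avoid.

```python
import random

def downscale_by_sampling(arrivals, scale, seed=0):
    """Naive trace downscaling: keep each request with probability `scale`.

    `arrivals` is a list of request timestamps. This is a baseline for
    illustration, not TraceSplitter's method: the mean rate is scaled
    correctly in expectation, but bursts can be thinned unevenly.
    """
    rng = random.Random(seed)  # seeded for repeatable experiments
    return [t for t in arrivals if rng.random() < scale]

arrivals = [i * 0.01 for i in range(10_000)]  # synthetic 100 req/s trace
half = downscale_by_sampling(arrivals, scale=0.5)
```

Note that even this trivial baseline needs a fixed seed to be repeatable, echoing the broader point about reproducible replay-based evaluation.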
- Designing heterogeneous systems [details]
I am working on designing heterogeneous systems that improve performance and/or cost. Heterogeneous systems allow for optimized designs that take advantage of the strengths of each type of component. We have worked on a variety of projects, such as taking advantage of GPU, CPU, and networking resources to improve Deep Neural Network (DNN) inference workloads (SIGMETRICS 2023, SYSTOR 2022). Our work is open-sourced at https://github.com/minus-one/splitrpc and https://github.com/minus-one/sir_plus. We have also considered how heterogeneity more broadly can improve system design, particularly in the context of cloud computing. For example, our research demonstrates how to combine fast and slow servers to minimize tail latency under a fixed cost budget (WWW 2020). We have also studied how to efficiently manage a high degree of heterogeneity from the perspective of a cloud provider (OSDI 2023). The success of these works will demonstrate how intentionally taking advantage of heterogeneity can improve performance and cost across a variety of contexts.
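As a toy stand-in for this kind of optimization, the sketch below enumerates fast/slow server mixes under a cost budget and maximizes total service rate, a crude proxy for the real tail-latency objective in the WWW 2020 work; all costs and rates here are hypothetical.

```python
def best_mix(budget, fast_cost, slow_cost, fast_rate, slow_rate):
    """Enumerate fast/slow server mixes under a cost budget and pick the
    mix with the highest total service rate. A toy illustration only;
    the actual work optimizes tail latency, not raw throughput.
    """
    best = (0, 0, 0.0)
    for n_fast in range(budget // fast_cost + 1):
        remaining = budget - n_fast * fast_cost
        n_slow = remaining // slow_cost
        rate = n_fast * fast_rate + n_slow * slow_rate
        if rate > best[2]:
            best = (n_fast, n_slow, rate)
    return best

# Budget of 10 cost units: fast servers cost 3 and serve 4 req/s,
# slow servers cost 1 and serve 1 req/s.
mix = best_mix(budget=10, fast_cost=3, slow_cost=1, fast_rate=4, slow_rate=1)
```

Even this crude model shows why mixing makes sense: leftover budget after buying fast servers is still useful when spent on slow ones.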
- Quality of Service (QoS) support for tail latency SLOs [details]
Meeting tail latency (e.g., 99.9th percentile) Service Level Objectives (SLOs) is important for many user-facing applications. Our IOFlow (SOSP 2013) paper introduces a QoS architecture for controlling congestion via rate limiting and prioritization of storage and network I/O. Our PriorityMeister (SoCC 2014) paper addresses how to automatically configure priorities and rate limits to meet tail latency SLOs using a Deterministic Network Calculus (DNC) analysis. Our SNC-Meister (SoCC 2016) paper shows significant improvements in admission control when using a probabilistic analysis called Stochastic Network Calculus (SNC) instead of DNC, which is a worst-case analysis. We are the first to build a computer system based on SNC, and our code is publicly available at: https://github.com/timmyzhu/SNC-Meister. Our WorkloadCompactor (SoCC 2017) paper studies how to jointly optimize rate limit parameter configuration with workload placement to minimize the number of servers needed to meet tail latency SLOs. Our code is publicly available at: https://github.com/timmyzhu/WorkloadCompactor.
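The rate-limiting primitive that these systems configure can be sketched as a token bucket with a sustained rate and a burst allowance, the two parameters a PriorityMeister-style analysis chooses per workload. This is an illustrative sketch, not code from IOFlow or its successors.

```python
class TokenBucket:
    """Token-bucket rate limiter: sustained `rate` tokens/sec, bursts up
    to `burst` tokens. Illustrative sketch of the primitive only."""

    def __init__(self, rate, burst):
        self.rate = float(rate)
        self.burst = float(burst)
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0

    def allow(self, now, cost=1.0):
        # Refill tokens for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

tb = TokenBucket(rate=10, burst=5)
# A burst of 5 requests at t=0 passes; the 6th is rejected until refill.
burst_results = [tb.allow(0.0) for _ in range(6)]
later = tb.allow(0.1)  # 0.1 s * 10 tokens/s = 1 token refilled
```

The (rate, burst) pair is exactly the kind of knob that the DNC/SNC analyses in these papers set automatically rather than leaving to manual tuning.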
- Cluster scheduling on heterogeneous resources [details]
With heterogeneous resources come new questions in scheduling. For example, is it better to statically partition specialized resources, or to dynamically schedule across a heterogeneous mixture of resources? When dynamically scheduling heterogeneous resources, should the scheduler wait for specialized resources to become available in the future or use slower alternative resources that are immediately available? Our TetriSched (EuroSys 2016) paper introduces a new cluster scheduler that optimizes when and where to run jobs so as to improve performance in heterogeneous clusters.
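TetriSched makes this decision globally via an optimization over space and time; as a toy illustration of the underlying trade-off for a single job, one can compare expected completion times. All parameters below are hypothetical.

```python
def choose_resource(wait_est, fast_runtime, slow_runtime):
    """Decide between waiting `wait_est` seconds for a fast resource
    (e.g., a GPU) vs. starting immediately on a slower one. A toy,
    single-job version of the trade-off TetriSched optimizes globally.
    """
    wait_then_fast = wait_est + fast_runtime   # completion if we wait
    start_slow_now = slow_runtime              # completion if we start now
    return "wait_for_fast" if wait_then_fast < start_slow_now else "use_slow_now"

# Short queue for the fast resource: waiting wins.
short_wait = choose_resource(wait_est=30, fast_runtime=60, slow_runtime=120)
# Long queue: the slow-but-available resource wins.
long_wait = choose_resource(wait_est=100, fast_runtime=60, slow_runtime=120)
```

The real problem is harder because wait-time estimates are uncertain and jobs compete for the same future slots, which is why a global space-time optimization helps.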
- Autoscaling [details]
Autoscaling is a useful technique for adapting resource utilization to load. In my CacheScale (HotCloud 2012) work, I investigate techniques for elastically scaling memcached resources to reduce the cost of cloud web services. As an alternative to autoscaling memcached servers, our SOFTScale (Middleware 2012) work performs cycle-stealing on memcached servers to help deal with bursts of work during periods of low load. I have also investigated autoscaling resources to meet deadlines for multi-phase batch jobs during an internship at Google.
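A minimal sketch of threshold-based autoscaling, the simplest form of the technique: add a server when utilization runs high, remove one when it runs low. The thresholds are illustrative defaults and not taken from CacheScale, which must additionally handle cache warm-up when scaling memcached.

```python
def autoscale(servers, utilization, low=0.3, high=0.7, min_servers=1):
    """Threshold autoscaler: scale out above `high` average utilization,
    scale in below `low`, never dropping under `min_servers`.
    Thresholds are illustrative, not from any of the cited systems."""
    if utilization > high:
        return servers + 1
    if utilization < low and servers > min_servers:
        return servers - 1
    return servers

scaled_out = autoscale(servers=4, utilization=0.9)   # overloaded: add one
scaled_in = autoscale(servers=4, utilization=0.2)    # idle: remove one
held = autoscale(servers=4, utilization=0.5)         # in band: no change
floor = autoscale(servers=1, utilization=0.1)        # respects the minimum
```

The gap between `low` and `high` provides hysteresis so that small load fluctuations do not cause the system to oscillate between scaling out and in.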
Grants
Thanks to my sponsors for supporting my research through the following grants:
- NSF-2239291: CAREER: Auto-generated experimentation for performance diagnosis of distributed systems
- NSF-2324858: Collaborative Research: DESC: Type I: Extending lifetimes of partially broken machines to repurpose e-waste
- NSF-1909004: CNS Core: Small: A Multi-Stakeholder Integrated Approach to Reduce Tail Latency Using Heterogeneity
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.