Tracking and mitigating tail latency in data centers

High tail latency has been identified as one of the key challenges facing modern data center design.


As data centers grow to support large-scale Internet services, the impact of scale, complexity, and ongoing operational changes can adversely affect tail latency, which is the 95th or 99th percentile response latency of a service. High tail latency has been identified as one of the key challenges facing modern data center design as it results in poor user experiences, particularly for interactive services such as web search and social networks.
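To make the definition concrete, here is a minimal sketch of how a 95th or 99th percentile latency could be computed from a batch of response times. It uses the simple nearest-rank method; the function name `percentile` and the sample values in `latencies_ms` are made up for illustration and do not come from the Treadmill paper, which relies on more careful statistical procedures.

```python
import math


def percentile(samples, p):
    """Return the nearest-rank p-th percentile of samples (0 < p <= 100).

    This is the smallest observed value such that at least p percent of
    the samples are less than or equal to it.
    """
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    # 1-based rank of the percentile value in the sorted list.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical response times in milliseconds: most requests are fast,
# but two slow outliers dominate the tail.
latencies_ms = [12, 11, 13, 12, 14, 11, 12, 13, 250, 12,
                11, 13, 12, 15, 12, 11, 13, 12, 14, 900]

print("median:", percentile(latencies_ms, 50))  # 12
print("p95:", percentile(latencies_ms, 95))     # 250
print("p99:", percentile(latencies_ms, 99))     # 900
```

In this made-up sample the median is 12 ms while the 99th percentile is 900 ms, which illustrates why a service can look fast on average and still deliver a poor experience to a noticeable fraction of requests.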

A research team made up of CSE PhD candidate Yunqi Zhang, David Meisner (CSE MSE 2009, PhD 2012) of Facebook, Prof. Lingjia Tang, and Prof. Jason Mars has developed a modular load-testing platform for data centers that is designed to help measure and mitigate tail latency. Called Treadmill, it is described in their paper, “Treadmill: Attributing the Source of Tail Latency through Precise Load Testing and Statistical Inference.”

Mitigating tail latency is extremely challenging, in part because it is hard to measure. Accurately measuring tail latency without disrupting the production system is highly desirable but particularly difficult, because large-scale Internet service workloads involve more systems and resources (e.g., distributed server-side software, network connections, etc.) than traditional single-server workloads. Current state-of-the-art load testing systems have several design pitfalls that often lead to misleading conclusions, which in turn can cause unnecessary resource over-provisioning and unexplained performance regressions.

In addition to accurately measuring tail latency, attributing the source of tail latency to different hardware and software causes is also challenging. Treadmill addresses both of these challenges, overcoming the pitfalls of existing tools. The researchers used their methodology to evaluate the impact of common server hardware features with Facebook production workloads on production hardware, producing superior results, particularly in capturing complicated and counter-intuitive performance behaviors. By tuning the hardware features as suggested by the attribution, they reduced the 99th-percentile latency by 43% and its variance by 93%.

Treadmill is now deployed and widely used in Facebook’s production services. Treadmill is also open-sourced and available to the general public and industry practitioners.


Prof. Lingjia Tang received her PhD in Computer Science at The University of Virginia in 2012 and joined the faculty at Michigan in 2013. Before joining U-M, she was a research faculty member in the CSE department at The University of California, San Diego. She has authored a number of papers that received best paper awards and had publications chosen for IEEE Micro’s Top Picks in 2012 and 2016.

Prof. Jason Mars received his PhD in Computer Science at The University of Virginia in 2012 and joined the faculty at Michigan in 2013. Before joining U-M, he was an assistant professor in the CSE department at The University of California, San Diego. He has also served as a visiting scientist at Google, where he investigated opportunities to improve the efficiency of Google’s backend infrastructure. He has received numerous honors and awards, including the UVA Preuss Faculty Scholar Appointment, the UVA Research Award, and a Google Research Award in 2013. He received an NSF CAREER Award in 2015 and had a publication chosen for IEEE Micro’s Top Picks in 2016.