Five papers by CSE researchers at NSDI 2025

CSE researchers are presenting new research in the area of networked and distributed systems, including slow-fault tolerance, programmable traffic control, and cloud-based deep learning.

CSE-affiliated authors are presenting five papers at the 22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI). One of the top conferences in computer systems, NSDI is a leading venue for new research on the design, implementation, and evaluation of networked and distributed systems. This year’s event is being held in Philadelphia, PA on April 28-30.

New research by CSE authors at NSDI focuses on several topics within the area of networked systems, including enhancing slow-fault tolerance in distributed systems, unlocking ECMP (equal-cost multi-path) programmability for precise traffic control, and resilient communication in cloud-based deep learning. Additional studies explore secure password pre-authentication with content delivery networks and optimizing multi-WAN transport for 5G networks. The papers being presented are as follows, with the names of CSE-affiliated authors in bold:

One-Size-Fits-None: Understanding and Enhancing Slow-Fault Tolerance in Modern Distributed Systems
Ruiming Lu, Yunchi Lu, Yuxuan Jiang, Guangtao Xue, Peng Huang

Abstract: Recent studies have shown that various hardware components exhibit fail-slow behavior at scale. However, the characteristics of distributed software’s tolerance of such slow faults remain ill-understood. This paper presents a comprehensive study that investigates the characteristics and current practices of slow-fault tolerance in modern distributed software. We focus on the fundamentally nuanced nature of slow faults. We develop a testing pipeline to systematically introduce diverse slow faults, measure their impact under different workloads, and identify the patterns. Our study shows that even small changes can lead to dramatically different reactions. While some systems have added slow-fault handling mechanisms, they are mostly controlled by static thresholds, which can hardly accommodate the highly sensitive and dynamic characteristics. To address this gap, we design ADR, a lightweight library to use within system code and make fail-slow handling adaptive. Evaluation shows ADR significantly reduces the impact of slow faults.

The authors’ slow-fault tolerance testing pipeline, which proceeds from the initial phase through warm-up and slow-fault injection to recovery.
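The contrast the abstract draws between static thresholds and adaptive handling can be illustrated with a minimal Python sketch. This is not ADR's design or API, which the abstract does not describe; the class names `StaticDetector` and `AdaptiveDetector`, the latency values, and the 100 ms cutoff are all hypothetical. The adaptive variant borrows the smoothed mean and deviation estimator used for TCP retransmission timeouts (gains of 1/8 and 1/4, and a multiplier of 4) purely as a familiar example of a threshold that tracks the workload.

```python
class StaticDetector:
    """Flags any latency above a fixed cutoff (hypothetical baseline)."""

    def __init__(self, threshold_ms):
        self.threshold_ms = threshold_ms

    def is_slow(self, latency_ms):
        return latency_ms > self.threshold_ms


class AdaptiveDetector:
    """Flags latencies above mean + k * deviation of recent healthy samples.

    Illustrative only: the estimator follows the TCP RTO style of smoothing,
    not any mechanism taken from the paper.
    """

    def __init__(self, alpha=0.125, beta=0.25, k=4.0, warmup=5):
        self.alpha, self.beta, self.k, self.warmup = alpha, beta, k, warmup
        self.mean = None
        self.dev = 0.0
        self.samples = 0

    def is_slow(self, latency_ms):
        if self.mean is None:
            # First sample seeds the estimator and is never flagged.
            self.mean, self.dev, self.samples = latency_ms, latency_ms / 2, 1
            return False
        slow = (self.samples >= self.warmup
                and latency_ms > self.mean + self.k * self.dev)
        if not slow:
            # Learn only from healthy samples so a fault does not become "normal".
            err = latency_ms - self.mean
            self.mean += self.alpha * err
            self.dev += self.beta * (abs(err) - self.dev)
            self.samples += 1
        return slow


light_load = [10, 11, 9, 10, 10, 11, 9, 10]          # ms, light workload
heavy_load = [200, 210, 190, 200, 205, 195, 200, 200]  # ms, heavy workload

static = StaticDetector(threshold_ms=100.0)
adaptive_light = AdaptiveDetector()
adaptive_heavy = AdaptiveDetector()

light_flags = [adaptive_light.is_slow(x) for x in light_load]
heavy_flags = [adaptive_heavy.is_slow(x) for x in heavy_load]
fault_seen_by_adaptive = adaptive_light.is_slow(30)  # 3x slowdown under light load
fault_seen_by_static = static.is_slow(30)
```

In this toy setup, the fixed 100 ms cutoff misses a threefold slowdown under the light workload and flags every healthy request under the heavy one, while the adaptive detector flags only the injected 30 ms sample.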

Unlocking ECMP Programmability for Precise Traffic Control
Yadong Liu, Yunming Xiao, Xuan Zhang, Weizhen Dang, Huihui Liu, Xiang Li, Zekun He, Jilong Wang, Aleksandar Kuzmanovic, Ang Chen, Congcong Miao

Abstract: ECMP (equal-cost multi-path) has become a fundamental mechanism in data centers, which distributes flows along multiple equivalent paths based on their hash values. Randomized distribution optimizes for the aggregate case, spreading load across flows over time. However, there exists a class of important Precise Traffic Control (PTC) tasks that are at odds with ECMP randomness. For instance, if an end host perceives that its flows are traversing a problematic switch/link, it often needs to change their paths before a fix can be rolled out. With randomized hashing, existing solutions resort to modifying flow tuples; since hashing mechanisms are unknown and they vary across switches/vendors, it may take many trials before yielding a new path. Many other similar cases exist where precise and timely response is critical to the network.

We propose programmable ECMP (P-ECMP), a programming model, compiler, and runtime that provides precise traffic control. P-ECMP leverages an oft-ignored feature, ECMP groups, which allows for a constrained set of capabilities that are nonetheless sufficiently expressive for our tasks. An operator supplies high-level descriptions of their topology and policies, and our compiler generates PTC configurations for each switch. End hosts can reconfigure specific flows to use different PTC policies precisely and quickly, addressing a range of important use cases. We have evaluated P-ECMP using simulation at scale, and deployed one use case to a real-world data center that serves live user traffic.
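To make the re-pathing problem described in the abstract concrete, here is a minimal Python sketch of hash-based ECMP path selection at a single switch. It is not the authors' P-ECMP system. The hash (CRC32 via `zlib.crc32`), the seed parameter, the example flow, and the function names `ecmp_next_hop` and `trials_to_avoid` are illustrative assumptions; real switches use vendor-specific hash functions that end hosts generally cannot see.

```python
import zlib


def ecmp_next_hop(flow, num_paths, seed=0):
    """Pick a path index by hashing the flow tuple (hypothetical CRC32 hash)."""
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key, seed) % num_paths


def trials_to_avoid(flow, bad_path, num_paths, seed=0, max_trials=1000):
    """Change the source port until the flow hashes away from bad_path.

    Returns (number of trials, new flow tuple), or (None, None) if no
    alternative was found within max_trials.
    """
    src_ip, dst_ip, src_port, dst_port, proto = flow
    for trial in range(1, max_trials + 1):
        candidate = (src_ip, dst_ip, src_port + trial, dst_port, proto)
        if ecmp_next_hop(candidate, num_paths, seed) != bad_path:
            return trial, candidate
    return None, None


# Hypothetical flow whose current path crosses a problematic link.
flow = ("10.0.0.1", "10.0.1.1", 40000, 443, "tcp")
current_path = ecmp_next_hop(flow, num_paths=4)
trials, new_flow = trials_to_avoid(flow, bad_path=current_path, num_paths=4)
```

With one switch the host only has to guess until the hash lands elsewhere. Across several hops, each with its own unknown hash, a guessed tuple must avoid the faulty element at every hop at once, which is why the abstract notes that many trials may be needed.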

OptiReduce: Resilient and Tail-Optimal AllReduce for Distributed Deep Learning in the Cloud
Ertza Warraich, Omer Shabtai, Khalid Manaa, Shay Vargaftik, Yonatan Piasetzky, Matty Kadosh, Lalith Suresh, Muhammad Shahbaz

Abstract: We present OptiReduce, a new collective-communication system for the cloud with bounded, predictable completion times for deep-learning jobs in the presence of varying computation (stragglers) and communication (congestion and gradient drops) variabilities. OptiReduce exploits the inherent resiliency and the stochastic nature of distributed deep-learning (DDL) training and fine-tuning to work with approximated (or lost) gradients—providing an efficient balance between (tail) performance and the resulting accuracy of the trained models.

Exploiting this domain-specific characteristic of DDL, OptiReduce introduces (1) mechanisms (e.g., unreliable bounded transport with adaptive timeout) to improve the DDL jobs’ tail execution time, and (2) strategies (e.g., Transpose AllReduce and Hadamard Transform) to mitigate the impact of gradient drops on model accuracy. Our evaluation shows that OptiReduce achieves 70% and 30% faster time-to-accuracy (TTA), on average, when operating in shared, cloud environments (e.g., CloudLab) compared to Gloo and NCCL, respectively.

OptiReduce improves latency over previous methods such as Ring AllReduce by reducing the number of rounds using an incast parameter and by bounding the path delay. In the accompanying figure, Ring AllReduce is drawn with seven columns of steps and OptiReduce with five.
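The abstract names the Hadamard Transform as one way to soften the effect of gradient drops. The following Python sketch shows the general idea on a toy vector and is not OptiReduce's implementation: the helper names `fwht`, `drop`, and `max_abs_error`, the example gradient, and the choice to model a lost packet as a single zeroed coordinate are all assumptions made for illustration.

```python
import math


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform (its own inverse).

    The input length must be a power of two.
    """
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


def drop(values, index):
    """Model a lost packet by zeroing one transmitted coordinate."""
    out = list(values)
    out[index] = 0.0
    return out


def max_abs_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


# Toy gradient with one dominant entry (hypothetical values).
gradient = [8.0, 0.5, -0.25, 0.0, 0.1, -0.1, 0.3, 0.0]

# Sending raw values: losing coordinate 0 wipes out the dominant entry.
plain_error = max_abs_error(gradient, drop(gradient, 0))

# Sending transformed values: the same loss is spread over every entry.
encoded = fwht(gradient)
decoded = fwht(drop(encoded, 0))
hadamard_error = max_abs_error(gradient, decoded)
```

Here the raw transmission loses the entire 8.0 entry, while the transformed transmission ends up with an error of roughly 1.07 on each coordinate. The benefit depends on the gradient's energy being concentrated in a few entries; the sketch also leaves out any randomization or quantization a production system might add.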

Efficient Multi-WAN Transport for 5G with OTTER
Mary Hogan, Gerry Wan, Yiming Qiu, Sharad Agarwal, Ryan Beckett, Rachee Singh, Paramvir Bahl

Abstract: In the ongoing cloudification of 5G, software network functions (NFs) are replacing fixed-function network hardware, allowing 5G network operators to leverage the benefits of cloud computing. The migration of NFs and their management to the cloud causes 5G traffic to traverse an operator’s wide-area network (WAN) to the cloud WAN that hosts the datacenters (DCs) running 5G NFs and applications. However, achieving end-to-end (E2E) performance for 5G traffic across two WANs is hard. Placing 5G flows across two WANs with different performance and reliability characteristics, edge and DC resource constraints, and interference from other flows is different and more challenging than single-WAN traffic engineering. We address this challenge and show that orchestrating E2E paths across a multi-WAN overlay allows us to achieve, on average, 13% more throughput, 15% less RTT, 45% less jitter, or reduce average loss from 0.06% to under 0.001%. We implement our multi-WAN 5G flow placement in a scalable optimization prototype that allocates 26%–45% more bytes on the network than greedy baselines while also satisfying the service demands of more flows.

PreAcher: Secure and Practical Password Pre-Authentication by Content Delivery Networks
Shihan Lin, Suting Chen, Yunming Xiao, Yanqi Gu, Aleksandar Kuzmanovic, Xiaowei Yang

Abstract: In today’s Internet, websites widely rely on password authentication for user logins. However, the intensive computation required for password authentication exposes web servers to Application-layer DoS (ADoS) attacks that exploit the login interfaces. Existing solutions fail to simultaneously prevent such ADoS attacks, preserve password secrecy, and maintain good usability. In this paper, we present PreAcher, a system architecture that incorporates third-party Content Delivery Networks (CDNs) into the password authentication process and offloads the authentication workload to CDNs without divulging the passwords to them. At the core of PreAcher is a novel three-party authentication protocol that combines Oblivious Pseudorandom Function (OPRF) and Locality-Sensitive Hashing (LSH). This protocol allows CDNs to pre-authenticate users and thus filter out ADoS traffic without compromising password security. Our evaluations demonstrate that PreAcher significantly enhances the resilience of web servers against ADoS attacks and preserves password security while introducing acceptable overheads. Notably, PreAcher can be deployed immediately by websites alone today, without modifications to client software or CDN infrastructure. We release the source code of PreAcher to facilitate its deployment and future research.