
OptiReduce
OptiReduce is a resilient and tail-optimal collective-communication framework for distributed deep learning in the cloud. It integrates seamlessly with PyTorch's Distributed Data Parallel (DDP) training to optimize communication performance in high tail-latency environments.
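Because OptiReduce plugs into PyTorch DDP, existing training loops do not need to change. Below is a minimal sketch of such a loop; it uses the stock `gloo` backend as a stand-in, since the exact backend or registration step OptiReduce uses is not shown here (see the installation guide for the real setup).

```python
# Minimal DDP training loop that OptiReduce is designed to slot into.
# The backend string below is illustrative; OptiReduce's actual hookup
# is described in the installation guide.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # Standard PyTorch distributed setup. OptiReduce replaces the
    # allreduce used for gradient synchronization during backward().
    dist.init_process_group(backend="gloo")  # backend name is a placeholder

    model = nn.Linear(1024, 1024)
    ddp_model = DDP(model)
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    inputs = torch.randn(32, 1024)
    targets = torch.randn(32, 1024)

    for _ in range(10):
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(ddp_model(inputs), targets)
        loss.backward()  # gradients are allreduced here by the backend
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched in the usual way, e.g. `torchrun --nproc_per_node=2 train.py`, with the collective backend swapped for OptiReduce per the setup instructions.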
Why OptiReduce?
🚀 Faster Training
- Eliminates bottlenecks from stragglers and slow workers
- Optimizes performance in high tail-latency environments
- Reduces training time, cost, and resource usage
💡 Key Features
- Bounded-Loss Reliability: Speeds up training by tolerating a small, bounded amount of gradient loss
- Tail Latency Optimization: Efficiently handles network variability
- PyTorch Integration: Seamless integration with PyTorch DDP
🔧 Technical Highlights
- Compatible with Mellanox ConnectX NICs
- Efficient memory pool management
- Zero-copy operations for maximum performance
Choose your path with OptiReduce:
👉 Quick Start
📚 Learn More
Research
OptiReduce was presented at NSDI '25. Read our paper:
📄 OptiReduce: Resilient and Tail-Optimal AllReduce for Distributed Deep Learning in the Cloud
Support
Need help? Here are your options:
- Review all the documentation
- Open an issue
- Review the installation guide for setup help