Using OptiReduce
This guide explains how to use OptiReduce with PyTorch Distributed Data Parallel (DDP) training. OptiReduce integrates seamlessly with PyTorch's DDP using the Gloo backend.
Prerequisites
Before using OptiReduce, ensure you have:
Network Setup
- Mellanox ConnectX NIC (recommended)
- Or two NICs: one for regular TCP traffic and one DPDK-compatible NIC for OptiReduce
- DPDK v20.11 (installed automatically with OptiReduce)
System Configuration
- Hugepages configuration (16GB total)
- At least 4 dedicated CPU cores for OptiReduce
Configuration
1. Hugepages Setup
Configure 16GB of hugepages using one of these methods:
Using 1GB Hugepages (Recommended)
# Add to GRUB_CMDLINE_LINUX in /etc/default/grub
default_hugepagesz=1G hugepagesz=1G hugepages=16
# Update and reboot
sudo update-grub
sudo reboot
Using 2MB Hugepages
# Add to GRUB_CMDLINE_LINUX in /etc/default/grub
default_hugepagesz=2M hugepagesz=2M hugepages=8192
# Update and reboot
sudo update-grub
sudo reboot
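If a reboot is inconvenient, 2 MB hugepages can usually also be reserved at runtime. Note this is a generic Linux alternative rather than part of the OptiReduce setup, and it may under-allocate if memory is already fragmented:
# Reserve 8192 x 2 MB pages (16GB) without rebooting
sudo sysctl -w vm.nr_hugepages=8192
# Confirm how many pages were actually allocated
cat /proc/sys/vm/nr_hugepages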
Verify configuration:
cat /proc/meminfo | grep Huge
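With the 1 GB configuration, the output should look roughly like the following (16 pages of 1048576 kB; with 2 MB pages expect 8192 pages of 2048 kB instead — exact lines vary by kernel version):
HugePages_Total:      16
HugePages_Free:       16
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:    1048576 kB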
2. DPDK Configuration
Create a DPDK configuration file (dpdk.cfg) mapping IP addresses to MAC addresses for all nodes:
192.168.100.10=AA:BB:CC:DD:EE:FF
192.168.100.11=AA:BB:CC:DD:EE:00
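If you need to look up a node's MAC address for this file, the standard iproute2 tools will show it (the interface name ens17 below is simply the example name used later in this guide):
# Link-layer (MAC) address of the interface
ip -br link show dev ens17
# IPv4 address assigned to it
ip -br addr show dev ens17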
3. Environment Variables
Set these required environment variables:
# Enable OptiReduce
export GLOO_ALGO=Optireduce
# Network interface to use
export GLOO_SOCKET_IFNAME="ens17" # Use your DPDK-enabled NIC name
# Path to config file (default: ./dpdk.cfg)
export GLOO_DPDK_CONFIG="/path/to/dpdk.cfg"
# Core offset for DPDK threads (requires 4 cores)
export GLOO_DPDK_THREADS_OFFSET=11
# Timeout for allreduce operations (milliseconds)
export GLOO_DPDK_TIMEOUT=10000
# Enable sender-side timeouts (Optional: off by default)
export GLOO_DPDK_SEND_TIMER=true
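Before launching, it is worth confirming that the chosen offset leaves four spare cores on the machine. Assuming the offset marks the first of the four dedicated cores, GLOO_DPDK_THREADS_OFFSET=11 implies cores 11-14 must exist:
# Total logical cores; GLOO_DPDK_THREADS_OFFSET + 4 must not exceed this value
nproc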
Basic Usage
Here's how to use OptiReduce with PyTorch DDP:
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_optireduce():
    # Set OptiReduce environment variables
    os.environ["GLOO_ALGO"] = "Optireduce"
    os.environ["GLOO_DPDK_CONFIG"] = "/path/to/dpdk.cfg"
    os.environ["GLOO_SOCKET_IFNAME"] = "ens17"

    # Initialize process group
    dist.init_process_group(backend="gloo")

def main():
    # Setup OptiReduce
    setup_optireduce()

    # Create model
    model = YourModel()

    # CRITICAL: Must set bucket_cap_mb=1350
    # OptiReduce supports only 2 concurrent buckets, so set bucket_cap_mb to a large value
    model = DDP(model, bucket_cap_mb=1350)  # DO NOT CHANGE THIS VALUE

    # Your training loop here
    ...

if __name__ == "__main__":
    main()
Important
The current implementation supports only two concurrent buckets. You must set bucket_cap_mb=1350 (or a large value) when creating your DDP model. Failing to do so can lead to crashes.
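The script above can be started with any standard PyTorch launcher; the environment variables from the Configuration section still need to be set on every node. As a purely illustrative sketch (the script name train.py, the node count, and the rank-0 address 10.0.0.1 are placeholders, not part of OptiReduce), a two-node launch with torchrun could look like:
# On node 0
torchrun --nnodes=2 --nproc_per_node=1 --node_rank=0 \
    --master_addr=10.0.0.1 --master_port=29500 train.py

# On node 1
torchrun --nnodes=2 --nproc_per_node=1 --node_rank=1 \
    --master_addr=10.0.0.1 --master_port=29500 train.py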
Running Training and Performance Evaluation
For training with OptiReduce, ready-made scripts for several models (VGG19, BERT, BART, RoBERTa, GPT2) are provided in our benchmark repository.
To evaluate performance:
- Follow our benchmarking guide
- Use provided scripts to simulate different network conditions
- Compare OptiReduce with other communication schemes
Next Steps
- See Installation Instructions for setup details
- Review Technical Details for architecture information
- Check Benchmarks for performance evaluation guide