Technical Details
This document describes the technical architecture and implementation details of OptiReduce.
Architecture Overview
OptiReduce integrates with PyTorch's DistributedDataParallel (DDP) through the Gloo backend, implementing a tail-optimized DPDK-based communication layer.
Component Stack
```mermaid
graph TD
    A[PyTorch DDP] --> B[Gloo Backend]
    B --> C[OptiReduce Layer]
    C --> D[DPDK Runtime]
    D --> E[Network Hardware]
```
Core Components
PyTorch Integration
OptiReduce seamlessly integrates with PyTorch through:
- Standard DDP interface
- Gloo backend integration
- Simple activation via `GLOO_ALGO=OptiReduce`
DPDK Communication Layer
Our DPDK layer provides:
- Zero-copy packet processing
- Direct NIC access with kernel bypass
- Multiple dedicated rings (see the setup sketch below):
    - 4 RX rings for receiving
    - 1 TX ring for transmission
- Optimized ring buffers:
    - RX ring size: 8192 entries
    - TX ring size: 128 entries
- Optimized memory pool management
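As a rough illustration, the queue layout above could be brought up with the standard DPDK ethdev API along the following lines. This is a minimal sketch, not OptiReduce's actual initialization code: the function name, port configuration, and mempool handling are assumptions.

```c
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define NUM_RX_RINGS 4      /* 4 RX rings for receiving */
#define NUM_TX_RINGS 1      /* 1 TX ring for transmission */
#define RX_RING_SIZE 8192   /* RX descriptors per ring */
#define TX_RING_SIZE 128    /* TX descriptors */

/* Illustrative port bring-up: 4 RX queues, 1 TX queue, one shared mbuf pool. */
static int setup_port(uint16_t port_id, struct rte_mempool *mbuf_pool)
{
    struct rte_eth_conf port_conf = {0};
    int ret = rte_eth_dev_configure(port_id, NUM_RX_RINGS, NUM_TX_RINGS, &port_conf);
    if (ret < 0)
        return ret;

    for (uint16_t q = 0; q < NUM_RX_RINGS; q++) {
        ret = rte_eth_rx_queue_setup(port_id, q, RX_RING_SIZE,
                                     rte_eth_dev_socket_id(port_id), NULL, mbuf_pool);
        if (ret < 0)
            return ret;
    }

    ret = rte_eth_tx_queue_setup(port_id, 0, TX_RING_SIZE,
                                 rte_eth_dev_socket_id(port_id), NULL);
    if (ret < 0)
        return ret;

    return rte_eth_dev_start(port_id);  /* kernel-bypass I/O starts here */
}
```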
Packet Structure
OptiReduce uses a custom packet format:
```
+----------------+-------------+------------+-------------+------------------+
|  Ethernet HDR  |   IP HDR    |  UDP HDR   |  OptiReduce |     Payload      |
|                |             |            |     HDR     |                  |
+----------------+-------------+------------+-------------+------------------+
```
The OptiReduce header contains:
```c
struct rte_ult_hdr {
    uint64_t offset;   // Offset in the buffer
    uint16_t counter;  // Message counter
    uint16_t timeout;  // Timeout value
    uint16_t length;   // Payload length
    size_t rank;       // Sender rank
    bool last;         // Last packet indicator
};
```
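Because the OptiReduce header sits directly behind the Ethernet, IP, and UDP headers, the receive path can locate it at a fixed offset into each packet. The helper below is a minimal sketch of that offset arithmetic, assuming IPv4 and DPDK's standard header types; the function name is illustrative and not part of OptiReduce's API.

```c
#include <rte_ether.h>
#include <rte_ip.h>
#include <rte_udp.h>
#include <rte_mbuf.h>

/* Illustrative: the OptiReduce header starts right after Ethernet | IPv4 | UDP. */
static inline struct rte_ult_hdr *
ult_hdr_from_mbuf(struct rte_mbuf *m)
{
    size_t offset = sizeof(struct rte_ether_hdr) +
                    sizeof(struct rte_ipv4_hdr) +
                    sizeof(struct rte_udp_hdr);
    return rte_pktmbuf_mtod_offset(m, struct rte_ult_hdr *, offset);
}
```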
Resource Requirements
CPU Resources
- 4 dedicated cores required for RX processing (see the sketch below):
    - Each core handles a separate RX ring
    - Dedicated threads for receive and reduce operations
- Two parallel allreduce operations from PyTorch DDP:
    - Separate rings prevent gradient mixing between concurrent operations
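The sketch below illustrates how four per-ring receive workers could be pinned to dedicated cores with DPDK's EAL launch API, starting at a configurable core offset (compare GLOO_DPDK_THREADS_OFFSET in the configuration section). The worker body and context struct are hypothetical; this is not OptiReduce's actual threading code.

```c
#include <rte_eal.h>
#include <rte_launch.h>
#include <rte_lcore.h>

#define NUM_RX_RINGS 4

struct rx_ctx { unsigned ring_id; };  /* hypothetical per-ring context */
static struct rx_ctx rx_ctxs[NUM_RX_RINGS] = { {0}, {1}, {2}, {3} };

/* Hypothetical worker: each core polls exactly one RX ring and feeds the
 * reduce thread, so concurrent allreduce operations never share a ring. */
static int rx_worker(void *arg)
{
    struct rx_ctx *ctx = arg;
    (void)ctx;  /* poll ring ctx->ring_id until shutdown ... */
    return 0;
}

/* Pin one worker per RX ring, starting at a configurable core offset. */
static int launch_rx_workers(unsigned core_offset)
{
    for (unsigned i = 0; i < NUM_RX_RINGS; i++) {
        int ret = rte_eal_remote_launch(rx_worker, &rx_ctxs[i], core_offset + i);
        if (ret != 0)
            return ret;
    }
    return 0;
}
```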
Configuration
Environment Variables
```bash
# Core Configuration
GLOO_ALGO="OptiReduce"                      # Enable OptiReduce
GLOO_SOCKET_IFNAME="ens17"                  # Network interface
GLOO_DPDK_TIMEOUT=10000                     # Operation timeout (ms)
GLOO_DPDK_THREADS_OFFSET=11                 # Core offset for threads
GLOO_DPDK_SEND_TIMER=true                   # Enable sender timeout
GLOO_DPDK_FILE_PREFIX="/path/to/file.log"   # Log file path
GLOO_DPDK_CONFIG="/path/to/dpdk.cfg"        # DPDK config file location
```
DPDK Configuration File
The configuration file maps IP addresses to MAC addresses:
```
192.168.100.10=AA:BB:CC:DD:EE:FF
192.168.100.11=AA:BB:CC:DD:EE:00
```
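As an illustration, each line can be split into an IP/MAC pair roughly as follows; the parser below is an assumption for the sketch (OptiReduce's actual loader may differ) and uses DPDK's rte_ether_unformat_addr for the MAC address.

```c
#include <stdio.h>
#include <arpa/inet.h>
#include <rte_ether.h>

/* Illustrative: parse one "ip=mac" line from the config file. */
static int parse_peer_line(const char *line, struct in_addr *ip,
                           struct rte_ether_addr *mac)
{
    char ip_str[64], mac_str[64];

    if (sscanf(line, "%63[^=]=%63s", ip_str, mac_str) != 2)
        return -1;
    if (inet_pton(AF_INET, ip_str, ip) != 1)
        return -1;
    return rte_ether_unformat_addr(mac_str, mac);  /* 0 on success */
}
```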
Current Limitations
- Maximum 2 concurrent buckets supported
- Must configure DDP with a large bucket size:

```python
model = DDP(model, bucket_cap_mb=1350)
```
Info
An experimental branch that supports an arbitrary number of buckets per allreduce call is available; see OptiReduce Setup. Feel free to check it out!
Network Requirements
- Open-source version requires rate control for >10 Gbps networks
- Requires DPDK-compatible network cards
- Optimized for Mellanox ConnectX NICs
Future Development
Planned enhancements include:
- Official support for an arbitrary number of buckets per allreduce call
- Open-sourcing the Timely-based rate control
Learn More
- Read our research paper for in-depth design details
- Review installation guide for setup instructions
- Explore benchmarking guide to evaluate performance in your environment