
Technical Details

This document describes the technical architecture and implementation details of OptiReduce.

Architecture Overview

OptiReduce integrates with PyTorch's DistributedDataParallel (DDP) through the Gloo backend, implementing a tail-optimized DPDK-based communication layer.

Component Stack

graph TD
    A[PyTorch DDP] --> B[Gloo Backend]
    B --> C[OptiReduce Layer]
    C --> D[DPDK Runtime]
    D --> E[Network Hardware]

Core Components

PyTorch Integration

OptiReduce seamlessly integrates with PyTorch through:

  • Standard DDP interface
  • Gloo backend integration
  • Simple activation via GLOO_ALGO=OptiReduce
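For context, activation from a training script can be sketched as follows. This is a minimal sketch: it assumes a Gloo build that includes OptiReduce and a standard torch.distributed launch environment; the interface name ens17 is the example used later on this page.

```python
# Sketch: enabling OptiReduce from a PyTorch DDP training script.
# Assumes a Gloo build with the OptiReduce algorithm compiled in.
import os

os.environ["GLOO_ALGO"] = "OptiReduce"      # select the OptiReduce allreduce
os.environ["GLOO_SOCKET_IFNAME"] = "ens17"  # example NIC name from this page

def make_ddp_model(model):
    """Wrap a model in DDP on the Gloo backend, which OptiReduce hooks into."""
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    dist.init_process_group(backend="gloo")
    # Large bucket size recommended under "Current Limitations" below.
    return DDP(model, bucket_cap_mb=1350)
```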

DPDK Communication Layer

Our DPDK layer provides:

  • Zero-copy packet processing
  • Direct NIC access with kernel bypass
  • Multiple dedicated rings:
    • 4 RX rings for receiving
    • 1 TX ring for transmission
  • Optimized ring buffers:
    • RX ring size: 8192 entries
    • TX ring size: 128 entries
  • Optimized memory pool management

Packet Structure

OptiReduce uses a custom packet format:

+----------------+-------------+------------+-------------+------------------+
| Ethernet HDR   | IP HDR      | UDP HDR    | OptiReduce  | Payload          |
|                |             |            | HDR         |                  |
+----------------+-------------+------------+-------------+------------------+

The OptiReduce header contains:

struct rte_ult_hdr {
    uint64_t offset;    // Offset in the buffer
    uint16_t counter;   // Message counter
    uint16_t timeout;   // Timeout value
    uint16_t length;    // Payload length
    size_t rank;        // Sender rank
    bool last;          // Last packet indicator
};
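Illustratively, this header can be packed and parsed in Python with the struct module. The sketch below assumes a packed little-endian encoding and a 64-bit size_t; the authoritative layout is the C struct above together with its compiler-determined padding.

```python
# Illustrative Python view of rte_ult_hdr, assuming a packed little-endian
# layout and 64-bit size_t. A parsing aid, not the definitive wire format.
import struct

# Q = uint64_t offset; H,H,H = uint16_t counter, timeout, length;
# Q = size_t rank (64-bit assumed); ? = bool last
ULT_HDR_FMT = "<QHHHQ?"
ULT_HDR_LEN = struct.calcsize(ULT_HDR_FMT)  # 23 bytes under these assumptions

def pack_ult_hdr(offset, counter, timeout, length, rank, last):
    """Serialize header fields in declaration order."""
    return struct.pack(ULT_HDR_FMT, offset, counter, timeout, length, rank, last)

def unpack_ult_hdr(buf):
    """Parse the header fields from the front of a packet buffer."""
    return struct.unpack(ULT_HDR_FMT, buf[:ULT_HDR_LEN])
```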

Resource Requirements

CPU Resources

  • 4 dedicated cores required for RX processing:
    • Each core services one of the four RX rings
    • Dedicated threads handle receive and reduce operations
    • PyTorch DDP issues up to two allreduce operations in parallel
    • Separate rings prevent gradient mixing between concurrent operations
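As an illustration, the relationship between the thread-offset setting (GLOO_DPDK_THREADS_OFFSET under Configuration below) and the four RX cores could be sketched as follows. rx_core_ids is a hypothetical helper for explanation only; the actual core assignment happens inside the DPDK layer.

```python
# Hypothetical sketch: deriving one dedicated core id per RX ring from the
# thread offset (11 is the example value shown on this page).
NUM_RX_RINGS = 4

def rx_core_ids(threads_offset, num_rings=NUM_RX_RINGS):
    """Return consecutive core ids, one per RX ring, starting at the offset."""
    return [threads_offset + i for i in range(num_rings)]
```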

Configuration

Environment Variables

# Core Configuration
GLOO_ALGO="OptiReduce"                 # Enable OptiReduce
GLOO_SOCKET_IFNAME="ens17"             # Network interface
GLOO_DPDK_TIMEOUT=10000                # Operation timeout (ms)
GLOO_DPDK_THREADS_OFFSET=11            # Core offset for threads
GLOO_DPDK_SEND_TIMER=true              # Enable sender timeout
GLOO_DPDK_FILE_PREFIX="/path/to/file.log"  # Log file path
GLOO_DPDK_CONFIG="/path/to/dpdk.cfg"       # DPDK config file location

DPDK Configuration File

The configuration file maps IP addresses to MAC addresses:

192.168.100.10=AA:BB:CC:DD:EE:FF
192.168.100.11=AA:BB:CC:DD:EE:00
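A minimal parser for this file format might look like the following sketch. The ip=mac line format is taken from the example above; the comment and whitespace handling are assumptions, not part of the documented format.

```python
# Sketch: parse an "ip=mac" mapping file into a dict.
def parse_dpdk_config(text):
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and (assumed) comments
        ip, mac = line.split("=", 1)
        mapping[ip.strip()] = mac.strip().upper()
    return mapping
```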

Current Limitations

  • Maximum 2 concurrent buckets supported
  • Must configure DDP with a large bucket size:
    model = DDP(model, bucket_cap_mb=1350)
    

Info

An experimental branch supporting an arbitrary number of buckets per allreduce call exists at OptiReduce Setup. Feel free to check it out!

Network Requirements

  • Open-source version requires rate control for >10 Gbps networks
  • Requires DPDK-compatible network cards
  • Optimized for Mellanox ConnectX NICs

Future Development

Planned enhancements include:

  • Official support for an arbitrary number of buckets per allreduce call
  • Open-sourcing the Timely-based rate control

Learn More