
Technical Details

This document describes the technical architecture and implementation details of OptiReduce.

Architecture Overview

OptiReduce integrates with PyTorch's DistributedDataParallel (DDP) through the Gloo backend, implementing a tail-optimized DPDK-based communication layer.

Component Stack

graph TD
    A[PyTorch DDP] --> B[Gloo Backend]
    B --> C[OptiReduce Layer]
    C --> D[DPDK Runtime]
    D --> E[Network Hardware]

Core Components

PyTorch Integration

OptiReduce seamlessly integrates with PyTorch through:

  • Standard DDP interface
  • Gloo backend integration
  • Simple activation via GLOO_ALGO=OptiReduce
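For context, activation from a training script can be sketched as follows. This is a minimal sketch: it assumes a Gloo build that includes OptiReduce and a standard torch.distributed launch environment; the interface name ens17 is the example used later on this page.

```python
# Sketch: enabling OptiReduce from a PyTorch DDP training script.
# Assumes a Gloo build with the OptiReduce algorithm compiled in.
import os

os.environ["GLOO_ALGO"] = "OptiReduce"      # select the OptiReduce allreduce
os.environ["GLOO_SOCKET_IFNAME"] = "ens17"  # example NIC name from this page

def make_ddp_model(model):
    """Wrap a model in DDP on the Gloo backend, which OptiReduce hooks into."""
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    dist.init_process_group(backend="gloo")
    # Large bucket size recommended under "Current Limitations" below.
    return DDP(model, bucket_cap_mb=1350)
```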

DPDK Communication Layer

Our DPDK layer provides:

  • Zero-copy packet processing
  • Direct NIC access with kernel bypass
  • Multiple dedicated rings:
    • 4 RX rings for receiving
    • 1 TX ring for transmission
  • Optimized ring buffers:
    • RX ring size: 8192 entries
    • TX ring size: 128 entries
  • Optimized memory pool management

Packet Structure

OptiReduce uses a custom packet format:

+----------------+-------------+------------+-------------+------------------+
| Ethernet HDR   | IP HDR      | UDP HDR    | OptiReduce  | Payload          |
|                |             |            | HDR         |                  |
+----------------+-------------+------------+-------------+------------------+

The OptiReduce header contains:

struct rte_ult_hdr {
    uint64_t offset;    // Offset in the buffer
    uint16_t counter;   // Message counter
    uint16_t timeout;   // Timeout value
    uint16_t length;    // Payload length
    size_t rank;        // Sender rank
    bool last;          // Last packet indicator
};
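Illustratively, this header can be packed and parsed in Python with the struct module. The sketch below assumes a packed little-endian encoding and a 64-bit size_t; the authoritative layout is the C struct above together with its compiler-determined padding.

```python
# Illustrative Python view of rte_ult_hdr, assuming a packed little-endian
# layout and 64-bit size_t. A parsing aid, not the definitive wire format.
import struct

# Q = uint64_t offset; H,H,H = uint16_t counter, timeout, length;
# Q = size_t rank (64-bit assumed); ? = bool last
ULT_HDR_FMT = "<QHHHQ?"
ULT_HDR_LEN = struct.calcsize(ULT_HDR_FMT)  # 23 bytes under these assumptions

def pack_ult_hdr(offset, counter, timeout, length, rank, last):
    """Serialize header fields in declaration order."""
    return struct.pack(ULT_HDR_FMT, offset, counter, timeout, length, rank, last)

def unpack_ult_hdr(buf):
    """Parse the header fields from the front of a packet buffer."""
    return struct.unpack(ULT_HDR_FMT, buf[:ULT_HDR_LEN])
```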

Resource Requirements

CPU Resources

  • 4 dedicated cores required for RX processing:
    • Each core services one of the four RX rings
    • Dedicated threads handle receive and reduce operations
    • PyTorch DDP issues up to two allreduce operations in parallel
    • Separate rings prevent gradient mixing between concurrent operations
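As an illustration, the relationship between the thread-offset setting (GLOO_DPDK_THREADS_OFFSET under Configuration below) and the four RX cores could be sketched as follows. rx_core_ids is a hypothetical helper for explanation only; the actual core assignment happens inside the DPDK layer.

```python
# Hypothetical sketch: deriving one dedicated core id per RX ring from the
# thread offset (11 is the example value shown on this page).
NUM_RX_RINGS = 4

def rx_core_ids(threads_offset, num_rings=NUM_RX_RINGS):
    """Return consecutive core ids, one per RX ring, starting at the offset."""
    return [threads_offset + i for i in range(num_rings)]
```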

Configuration

Environment Variables

# Core Configuration
GLOO_ALGO="OptiReduce"                 # Enable OptiReduce
GLOO_SOCKET_IFNAME="ens17"             # Network interface
GLOO_DPDK_TIMEOUT=10000                # Operation timeout (ms)
GLOO_DPDK_THREADS_OFFSET=11            # Core offset for threads
GLOO_DPDK_SEND_TIMER=true              # Enable sender timeout
GLOO_DPDK_FILE_PREFIX="/path/to/file.log"  # Log file path
GLOO_DPDK_CONFIG="/path/to/dpdk.cfg"       # DPDK config file location

DPDK Configuration File

The configuration file maps IP addresses to MAC addresses:

192.168.100.10=AA:BB:CC:DD:EE:FF
192.168.100.11=AA:BB:CC:DD:EE:00
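A minimal parser for this file format might look like the following sketch. The ip=mac line format is taken from the example above; the comment and whitespace handling are assumptions, not part of the documented format.

```python
# Sketch: parse an "ip=mac" mapping file into a dict.
def parse_dpdk_config(text):
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and (assumed) comments
        ip, mac = line.split("=", 1)
        mapping[ip.strip()] = mac.strip().upper()
    return mapping
```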

Current Limitations

  • Maximum 2 concurrent buckets supported
  • Must configure DDP with a large bucket size:
    model = DDP(model, bucket_cap_mb=1350)
    

Info

An experimental branch supporting an arbitrary number of buckets per allreduce call exists at OptiReduce Setup. Feel free to check it out!

Network Requirements

  • Open-source version requires rate control for >10 Gbps networks
  • Requires DPDK-compatible network cards
  • Optimized for Mellanox ConnectX NICs

Future Development

Planned enhancements include:

  • Official support for an arbitrary number of buckets per allreduce call
  • Open-sourcing the Timely-based rate control

Learn More