Using OptiReduce

This guide explains how to use OptiReduce with PyTorch Distributed Data Parallel (DDP) training. OptiReduce integrates seamlessly with PyTorch's DDP using the Gloo backend.

Prerequisites

Before using OptiReduce, ensure you have:

Network Setup

  • Mellanox ConnectX NIC (recommended)
  • Or two NICs: one for regular TCP traffic and one DPDK-compatible NIC for OptiReduce
  • DPDK v20.11 (installed automatically with OptiReduce)

System Configuration

  • Hugepages configuration (16GB total)
  • At least 4 dedicated CPU cores for OptiReduce

Configuration

1. Hugepages Setup

Configure 16GB of hugepages using one of these methods:

Using 1GB Hugepages

# Add to GRUB_CMDLINE_LINUX in /etc/default/grub
default_hugepagesz=1G hugepagesz=1G hugepages=16

# Update and reboot
sudo update-grub
sudo reboot

Using 2MB Hugepages

# Add to GRUB_CMDLINE_LINUX in /etc/default/grub
default_hugepagesz=2M hugepagesz=2M hugepages=8192

# Update and reboot
sudo update-grub
sudo reboot
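
If a reboot is inconvenient, 2MB hugepages can usually also be allocated at runtime via sysctl (1GB pages generally must be reserved at boot). A minimal sketch, assuming enough contiguous free memory is available:

# Allocate 8192 x 2MB hugepages without rebooting
sudo sysctl -w vm.nr_hugepages=8192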

Verify configuration:

cat /proc/meminfo | grep Huge
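
With 1GB pages, the output should look roughly like the following (HugePages_Total would be 8192 with 2MB pages; the exact fields vary by kernel):

HugePages_Total:      16
HugePages_Free:       16
Hugepagesize:    1048576 kB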

2. DPDK Configuration

Create a DPDK configuration file (dpdk.cfg) mapping IP addresses to MAC addresses for all nodes:

192.168.100.10=AA:BB:CC:DD:EE:FF
192.168.100.11=AA:BB:CC:DD:EE:00
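
To look up the MAC address of each node's DPDK NIC (here assuming the interface is named ens17, as in the examples below), you can run:

cat /sys/class/net/ens17/address
# or, equivalently
ip link show ens17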

3. Environment Variables

Set these required environment variables:

# Enable OptiReduce
export GLOO_ALGO=Optireduce

# Network interface to use
export GLOO_SOCKET_IFNAME="ens17"  # Use your DPDK-enabled NIC name

# Path to config file (default: ./dpdk.cfg)
export GLOO_DPDK_CONFIG="/path/to/dpdk.cfg"

# CPU core offset for the DPDK threads (4 dedicated cores are required)
export GLOO_DPDK_THREADS_OFFSET=11

# Timeout for allreduce operations (milliseconds)
export GLOO_DPDK_TIMEOUT=10000

# Enable sender-side timeouts (Optional: off by default)
export GLOO_DPDK_SEND_TIMER=true
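
Before launching training, you can confirm that the variables are set in your shell:

env | grep GLOO_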

Basic Usage

Here's how to use OptiReduce with PyTorch DDP:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_optireduce():
    # Set OptiReduce environment variables
    os.environ["GLOO_ALGO"] = "Optireduce"
    os.environ["GLOO_DPDK_CONFIG"] = "/path/to/dpdk.cfg"
    os.environ["GLOO_SOCKET_IFNAME"] = "ens17"

    # Initialize the process group; rank, world size, and master
    # address/port are read from the environment (e.g., set by torchrun)
    dist.init_process_group(backend="gloo")

def main():
    # Setup OptiReduce
    setup_optireduce()

    # Create model (YourModel is a placeholder for your own nn.Module)
    model = YourModel()

    # CRITICAL: OptiReduce currently supports only two concurrent gradient
    # buckets, so bucket_cap_mb must be large enough that the entire model
    # fits in at most two buckets
    model = DDP(model, bucket_cap_mb=1350)  # do not lower this value

    # Your training loop here
    ...

if __name__ == "__main__":
    main()
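
One common way to launch a Gloo-backed DDP script is torchrun. As an illustrative sketch (assuming the script above is saved as train.py and two single-process nodes whose IPs match the dpdk.cfg example; adjust ranks, addresses, and process counts for your cluster), the first node might run:

# The second node runs the same command with --node_rank=1
torchrun --nnodes=2 --nproc_per_node=1 --node_rank=0 \
    --master_addr=192.168.100.10 --master_port=29500 train.py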

Important

The current implementation supports only two concurrent buckets. You must set bucket_cap_mb to a value large enough that the entire model fits in at most two buckets (1350 in the example above) when creating your DDP model; failing to do so can lead to crashes.

Running Training and Performance Evaluation

To run training with OptiReduce, we provide ready-made scripts for several models (VGG19, BERT, BART, RoBERTa, GPT2) in our benchmark repository.

To evaluate performance:

  1. Follow our benchmarking guide
  2. Use provided scripts to simulate different network conditions
  3. Compare OptiReduce with other communication schemes

Next Steps