OptiReduce Benchmarking Guide

This guide explains how to benchmark OptiReduce under different network conditions, using controlled background traffic to simulate each environment.

Note

If you have already set up the environment using make optireduce-full from the ansible repository, you can skip directly to the Running Training section.

Installation Options

Option 1: Using Ansible (Recommended)

The easiest way to install the benchmark is using our Ansible playbooks:

git clone https://github.com/OptiReduce/ansible.git
cd ansible
make benchmark-only

For detailed instructions on using the Ansible deployment, visit our Ansible documentation.

Option 2: Using Benchmark Repository

We provide automated install scripts in our benchmark repository:

# Clone the benchmark repository
git clone https://github.com/OptiReduce/benchmark.git
cd benchmark

# Install benchmark
make install

Option 3: Manual Installation

If you prefer to install manually, follow these steps:

Install Redis server:

sudo apt update
sudo apt install redis-server

Clone and build Gloo benchmark:

# Clone specific version of Gloo
git clone https://github.com/facebookincubator/gloo.git
cd gloo
git checkout e6d509b527712a143996f2f59a10480efa804f8b

# Create build directory
mkdir build
cd build

# Configure and build
cmake ../ -DUSE_REDIS=1 -DBUILD_BENCHMARK=1
make -j$(nproc)

Background Traffic Setup

Start Redis Server

Choose any ONE node to run the Redis server:

# Start Redis server
redis-server --port 6199 --protected-mode no

# Clear Redis entries (run before each benchmark)
redis-cli -p 6199 FLUSHALL

Important

Always clear Redis entries before starting a new benchmark run.
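
Before flushing, it can help to confirm that the Redis server is actually reachable from the node you are on. The following is a minimal sketch, not part of the OptiReduce scripts: wait_for_redis, the REDIS_CLI variable, and the retry count are hypothetical names introduced for illustration. It relies only on the standard redis-cli PING command, which prints PONG when the server responds.

```shell
# Sketch: poll the Redis server with PING until it answers or retries run out.
# REDIS_CLI and wait_for_redis are illustrative names, not OptiReduce tooling.
REDIS_CLI="${REDIS_CLI:-redis-cli}"

wait_for_redis() (
    host="$1"; port="${2:-6199}"; tries="${3:-10}"
    while [ "$tries" -gt 0 ]; do
        # redis-cli prints PONG when the server is reachable
        if [ "$("$REDIS_CLI" -h "$host" -p "$port" PING 2>/dev/null)" = "PONG" ]; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    echo "Redis not reachable at $host:$port" >&2
    return 1
)

# Example use (replace <redis_host> with your Redis node):
#   wait_for_redis <redis_host> 6199 && redis-cli -h <redis_host> -p 6199 FLUSHALL
```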

Create Network Environment

You can simulate different network environments by varying the number of worker processes (the SIZE parameter) launched by the background traffic script (run_background.sh). More processes generate more background traffic, which raises the tail-to-median (p99/p50) latency ratio:

Usage: ./run_background.sh -s SIZE -r STARTING_RANK -t TIME [-H REDIS_HOST] [-p REDIS_PORT] [-d DEVICE]

Options:
  -s SIZE           Number of processes
  -r STARTING_RANK  Starting rank
  -t TIME           Iteration time in seconds
  -H REDIS_HOST     Redis host (default: 192.168.100.30)
  -p REDIS_PORT     Redis port (default: 6199)
  -d DEVICE         Network device (default: ens17)

Example Low-Tail Environment (p99/p50 = 1.5x)

Run these commands on any two nodes:

# On first node
./run_background.sh -s 4 -r 0 -t 240000 -H <redis_host> -d ens17

# On second node
./run_background.sh -s 4 -r 1 -t 240000 -H <redis_host> -d ens17

Example High-Tail Environment (p99/p50 = 3x)

For the high-tail environment, increase the SIZE parameter:

# On first node
./run_background.sh -s 16 -r 0 -t 240000 -H <redis_host> -d ens17

# On second node
./run_background.sh -s 16 -r 1 -t 240000 -H <redis_host> -d ens17

Parameter Explanation

-s SIZE: Number of processes to spawn. Higher values create more background traffic:
- 4 processes: Creates low-tail environment (p99/p50 ≈ 1.5x)
- 16 processes: Creates high-tail environment (p99/p50 ≈ 3x)
-r STARTING_RANK: Starting rank for processes (0 or 1 for two-node setup)
-t TIME: Duration of background traffic in seconds
-H REDIS_HOST: Redis server IP address (must be same for all nodes)
-p REDIS_PORT: Redis server port (default: 6199)
-d DEVICE: Network interface name (e.g., ens17)
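
To avoid retyping the flags on each node, the two example environments can be wrapped in a small helper that prints the matching command. This is only a sketch: background_cmd and the labels "low" and "high" are hypothetical names, and the SIZE values 4 and 16 and the duration 240000 are copied from the examples above and may need tuning for your network.

```shell
# Sketch: print the run_background.sh command for a named environment.
# background_cmd and the "low"/"high" labels are illustrative only.
background_cmd() (
    env_name="$1"; rank="$2"; redis_host="$3"; device="${4:-ens17}"
    case "$env_name" in
        low)  size=4 ;;    # p99/p50 of about 1.5x in the authors' environment
        high) size=16 ;;   # p99/p50 of about 3x in the authors' environment
        *)    echo "unknown environment: $env_name" >&2; return 1 ;;
    esac
    printf './run_background.sh -s %s -r %s -t 240000 -H %s -d %s\n' \
        "$size" "$rank" "$redis_host" "$device"
)

# Example: background_cmd high 1 192.168.100.30
```

The helper prints the command rather than running it, so you can inspect it before launching it on each node.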

Environment Configuration

The SIZE values (4 and 16) reflect our test environment. Measure the p99/p50 latency ratio on your own network and adjust SIZE until you reach a comparable ratio.
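
One way to check the ratio is to collect latency samples while the background traffic is running and compute the percentiles directly. The sketch below assumes one latency value per line on standard input and uses the nearest-rank percentile definition. tail_ratio is a hypothetical helper, and ping round-trip times are only a rough proxy for the latencies the benchmark itself experiences.

```shell
# Sketch: compute p50, p99 and their ratio from one latency sample per line.
# tail_ratio is an illustrative helper, not part of the benchmark repository.
tail_ratio() {
    sort -n | awk '
        # nearest-rank index: ceil(p/100 * n), at least 1
        function rank(n, p,    r) { r = int((n * p + 99) / 100); return (r < 1) ? 1 : r }
        NF { v[++n] = $1 }          # skip blank lines, keep sorted samples
        END {
            if (n == 0 || v[rank(n, 50)] <= 0) {
                print "no usable samples" > "/dev/stderr"; exit 1
            }
            p50 = v[rank(n, 50)]; p99 = v[rank(n, 99)]
            printf "p50=%s p99=%s ratio=%.2f\n", p50, p99, p99 / p50
        }'
}

# Example with ping round-trip times (replace <peer_ip> with another node):
#   ping -c 500 -i 0.2 <peer_ip> | sed -n 's/.*time=\([0-9.]*\) ms.*/\1/p' | tail_ratio
```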

Running Training

Create a DPDK configuration file (dpdk.cfg) mapping IP addresses to MAC addresses:

192.168.100.10=AA:BB:CC:DD:EE:FF
192.168.100.11=AA:BB:CC:DD:EE:00

Note

Ensure all nodes in your cluster are listed in the configuration file.
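
A malformed entry in dpdk.cfg can be hard to spot by eye, so a quick format check may be useful. The sketch below assumes the file contains nothing but IP=MAC lines in the form shown above (no comments or blank lines), which is an assumption about the format rather than something the scripts document. check_dpdk_cfg is a hypothetical helper name.

```shell
# Sketch: verify that every line of a dpdk.cfg file looks like IPv4=MAC.
# Assumes no comments or blank lines are allowed; adjust if your file differs.
check_dpdk_cfg() (
    cfg="$1"
    [ -s "$cfg" ] || { echo "$cfg is missing or empty" >&2; return 1; }
    pattern='^([0-9]{1,3}\.){3}[0-9]{1,3}=([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$'
    # grep -v lists the lines that do NOT match; -n adds line numbers
    bad=$(grep -vnE "$pattern" "$cfg")
    if [ -n "$bad" ]; then
        printf 'malformed entries (line:content):\n%s\n' "$bad" >&2
        return 1
    fi
    echo "$(grep -cE "$pattern" "$cfg") entries look well-formed"
)

# Example: check_dpdk_cfg dpdk.cfg
```

The check is purely syntactic: it does not confirm that each MAC address belongs to the listed IP or that every node in the cluster is present.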

Clear Redis entries:
```
redis-cli -p 6199 FLUSHALL
```
Start background traffic for the desired environment (low-tail or high-tail), as described in the Background Traffic Setup section.
Run the training script on each node. You have two options:

Option 1: Run OptiReduce Only (Default)

./run_training.sh <MASTER_ADDR> <RANK> <NODES> <DEV> <MODEL>

Option 2: Run All Communication Schemes

RUN_ALL=1 ./run_training.sh <MASTER_ADDR> <RANK> <NODES> <DEV> <MODEL>

This will run the following schemes in order:

NCCL with Ring algorithm
NCCL with Tree algorithm
Gloo with Ring algorithm
Gloo with BCube algorithm
Gloo with Transpose algorithm
OptiReduce

Available models:

vgg19
bert
bart
roberta
gpt2

Example for a 2-node setup with Mellanox NICs:

# Run only OptiReduce
# On master node (rank 0)
./run_training.sh 192.168.1.100 0 2 ens17 bert

# On worker node (rank 1)
./run_training.sh 192.168.1.100 1 2 ens17 bert

# Run all communication schemes
# On master node (rank 0)
RUN_ALL=1 ./run_training.sh 192.168.1.100 0 2 ens17 bert

# On worker node (rank 1)
RUN_ALL=1 ./run_training.sh 192.168.1.100 1 2 ens17 bert

Parameters:

MASTER_ADDR: IP address of the master node
RANK: Node rank (0 for master; 1, 2, ... for workers)
NODES: Total number of nodes
DEV: Network device name (e.g., mlx5_0 for Mellanox NICs)
MODEL: One of the available models listed above
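
Because the same command is launched by hand on every node, a mistyped rank or model name is easy to miss. The following sketch validates the five positional arguments before you call run_training.sh. check_training_args and the MODELS variable are hypothetical names, and the model list is copied from the "Available models" list above.

```shell
# Sketch: sanity-check the arguments intended for run_training.sh.
# check_training_args and MODELS are illustrative, not part of the script.
MODELS="vgg19 bert bart roberta gpt2"

check_training_args() (
    [ "$#" -eq 5 ] || { echo "expected 5 arguments, got $#" >&2; return 1; }
    rank="$2"; nodes="$3"; model="$5"
    # RANK and NODES must be plain non-negative integers
    for n in "$rank" "$nodes"; do
        case "$n" in
            ''|*[!0-9]*) echo "RANK and NODES must be integers" >&2; return 1 ;;
        esac
    done
    # Ranks run from 0 to NODES-1
    [ "$rank" -lt "$nodes" ] || { echo "RANK must be less than NODES" >&2; return 1; }
    case " $MODELS " in
        *" $model "*) ;;
        *) echo "unknown model: $model" >&2; return 1 ;;
    esac
)

# Example:
#   check_training_args 192.168.1.100 0 2 ens17 bert && \
#       ./run_training.sh 192.168.1.100 0 2 ens17 bert
```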

Troubleshooting

Core Allocation

OptiReduce requires at least 4 dedicated CPU cores
Ensure that taskset -c 1-8 in run_training.sh matches the cores available to PyTorch on your system
Set the --tr_threads_offset parameter so that OptiReduce threads do not share cores with the PyTorch application; it must not overlap with the taskset cores
Example: If using cores 1-8 for PyTorch, set --tr_threads_offset 11 to ensure thread IDs don't overlap
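
The overlap rule can be checked with a little interval arithmetic. The sketch below assumes that OptiReduce threads occupy a run of consecutive cores starting at --tr_threads_offset, with 4 threads by default. That layout is an assumption made for illustration and is not confirmed by the script, and cores_overlap is a hypothetical helper name.

```shell
# Sketch: test whether the OptiReduce thread range collides with the
# taskset range. Assumes threads use consecutive cores from the offset.
cores_overlap() (
    first="$1"; last="$2"      # taskset range, e.g. 1 and 8 for "-c 1-8"
    offset="$3"                # value passed as --tr_threads_offset
    threads="${4:-4}"          # assumed number of OptiReduce threads
    tr_last=$((offset + threads - 1))
    # Two closed intervals overlap when each starts before the other ends
    [ "$offset" -le "$last" ] && [ "$tr_last" -ge "$first" ]
)

# Example: cores 1-8 for PyTorch, offset 11
#   if cores_overlap 1 8 11; then echo "conflict"; else echo "ok"; fi
```

With the values from the example above (cores 1-8 and offset 11), the ranges 1-8 and 11-14 do not intersect, so the function reports no overlap. Also confirm that the highest core index exists on your machine, for example by comparing it with the output of nproc.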

Timeout Settings

The --tr_timeout parameter is crucial for proper operation
Default values in the script:
- vgg19: 135
- bert: 350
- bart: 370
- roberta: 370
- gpt2: 370
You may need to adjust these based on your model size and network conditions
For detailed explanation of timeout calculations, refer to our Technical Details page

Customizing Training Parameters

You might need to modify the following parameters in run_training.sh for your specific use case:

case $MODEL in
    vgg19)
        BATCH_SIZE=128    # Adjust based on your GPU memory
        EPOCHS=150        # Increase/decrease based on model convergence
        TR_TIMEOUT=135    # Adjust based on network conditions and number of nodes
        ;;
    bert)
        BATCH_SIZE=16
        EPOCHS=5
        TR_TIMEOUT=350
        ;;
    # ... other models
esac

Results

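The following table compares iteration time (seconds per iteration; lower is better) across communication strategies. The Env column is the p99/p50 latency ratio of the environment: 1.5 for low-tail and 3 for high-tail.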

Model	Env	NCCL-Ring	NCCL-Tree	Ring	BCube	TAR+TCP	OptiReduce
GPT-2	1.5	1.70 s	1.52 s	2.20 s	2.45 s	2.12 s	1.39 s
	3	2.26 s	1.91 s	2.66 s	2.99 s	2.36 s	1.41 s
GPT-2-large	1.5	7.76 s	6.46 s	8.96 s	10.45 s	7.92 s	6.01 s
	3	10.12 s	9.34 s	10.60 s	10.80 s	8.48 s	6.07 s
BERT-large	1.5	5.01 s	4.24 s	6.10 s	7.30 s	5.90 s	3.76 s
	3	6.53 s	5.21 s	8.11 s	8.19 s	6.46 s	3.85 s
BART-large	1.5	4.67 s	4.07 s	6.94 s	7.72 s	5.45 s	3.80 s
	3	6.90 s	5.74 s	7.70 s	8.11 s	5.88 s	3.90 s
RoBERTa-large	1.5	4.75 s	4.15 s	6.12 s	7.64 s	5.94 s	3.87 s
	3	7.30 s	5.51 s	8.09 s	8.99 s	6.71 s	3.92 s
Llama-3.2	1.5	12.92 s	10.28 s	15.15 s	16.54 s	11.25 s	9.73 s
	3	17.28 s	15.72 s	18.84 s	21.97 s	14.59 s	9.98 s

Analysis

OptiReduce consistently outperforms all other methods across different models and environments.
Performance gains are especially significant for larger models (GPT-2-large, Llama-3.2), where OptiReduce achieves about 40% lower iteration time than the slowest baseline (BCube) in the low-tail environment.
The benefits are more pronounced in the high-tail environment (Env = 3, i.e., p99/p50 = 3x), where communication bottlenecks become more severe and speedups over the slowest baselines reach around 2x.
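
The figures in this analysis can be reproduced from the table with simple arithmetic: the speedup is the baseline time divided by the OptiReduce time, and the relative reduction is the difference divided by the baseline time. The sketch below uses awk for the floating-point math; speedup is a hypothetical helper name, and the example inputs are taken from the Llama-3.2 and GPT-2-large rows.

```shell
# Sketch: compare a baseline iteration time with the OptiReduce time.
# Usage: speedup <baseline_seconds> <optireduce_seconds>
speedup() {
    awk -v base="$1" -v opt="$2" 'BEGIN {
        printf "%.2fx speedup, %.1f%% lower iteration time\n",
               base / opt, 100 * (base - opt) / base
    }'
}

# Llama-3.2, Env = 3, BCube vs OptiReduce:
#   speedup 21.97 9.98    # prints "2.20x speedup, 54.6% lower iteration time"
# GPT-2-large, Env = 1.5, BCube vs OptiReduce:
#   speedup 10.45 6.01    # prints "1.74x speedup, 42.5% lower iteration time"
```

The result depends heavily on which baseline you pick. Against the fastest baseline in the same Llama-3.2 high-tail row (NCCL-Tree at 15.72 s), the same calculation gives a smaller gain.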

Common Issues

Performance Degradation
- Check CPU core allocation
- Verify thread offset settings
- Monitor system for other processes using assigned cores
Training Failures
- Ensure adequate timeout values
- Verify network device names
- Check Redis server is running and accessible
Network Device Issues
- Confirm correct device name (e.g., ens17)
- Check DPDK binding status
- Verify hugepages configuration