Preface: Training a Trillion-Parameter model is not a compute problem; it is a communication problem. As we scale from 16 GPUs to 16,000 GPUs, the time spent in `AllReduce` synchronization begins to dominate the total training time. This guide explores the extreme networking optimizations required to keep H100 clusters fed.
1. Topologies: Fat Tree vs. Torus
The physical layout of cabling determines the bisection bandwidth—the maximum bandwidth available between any two halves of the network.
The Fat Tree (Clos) Architecture
In a non-blocking Fat Tree, bisection bandwidth scales linearly with the number of end hosts. For 1,000 nodes at 400 Gbps, we need a multi-tier spine-leaf architecture.
- Tier 1 (ToR/Leaf): Connects directly to servers.
- Tier 2 (Spine): Connects leafs.
- Tier 3 (Super-Spine): Connects spines (for massive clusters).
The goal is a 1:1 oversubscription ratio. If 32 servers connect to a Leaf at 400 Gbps each, the Leaf must have 32 uplinks at 400 Gbps to the Spine. Anything less creates a bottleneck during All-to-All operations.
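To make the sizing arithmetic concrete, here is a minimal sketch (using the illustrative numbers above, not a recommendation) that computes a leaf's oversubscription ratio:

```c
#include <stdio.h>

/* Illustrative sketch: oversubscription ratio of a single leaf switch.
 * ratio = total downlink bandwidth / total uplink bandwidth.
 * 1.0 means non-blocking (1:1); anything above 1.0 means the uplinks
 * become a bottleneck under All-to-All traffic. */
int main(void) {
    const double downlink_gbps = 400.0;   /* per-server link speed */
    const int    servers       = 32;      /* servers per leaf */
    const double uplink_gbps   = 400.0;   /* per-uplink speed */
    const int    uplinks       = 32;      /* uplinks to the spine */

    double ratio = (servers * downlink_gbps) / (uplinks * uplink_gbps);
    printf("Oversubscription ratio: %.2f:1\n", ratio);
    return 0;
}
```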
2. RDMA, Verbs, and Kernel Bypassing
Standard TCP/IP stacks incur microseconds of latency due to context switching and buffer copying. Remote Direct Memory Access (RDMA) allows the NIC to write directly into the application memory of a remote machine using the PCIe bus.
We interact with RDMA hardware using **libibverbs**. The CPU prepares a "Work Queue Element" (WQE) and rings a doorbell on the NIC. The NIC handles the rest.
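As a rough illustration of what that post path looks like, here is a minimal libibverbs sketch for a one-sided RDMA WRITE. It assumes the queue pair is already connected and the remote address and rkey were exchanged out of band; `post_rdma_write` and its arguments are placeholder names, and error handling and resource setup are omitted.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Sketch: post a one-sided RDMA WRITE work request to an already-connected
 * queue pair. Device/PD/MR/QP setup and the out-of-band exchange of
 * remote_addr/rkey are omitted. */
int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                    void *local_buf, uint32_t len,
                    uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,  /* local source buffer */
        .length = len,
        .lkey   = mr->lkey,              /* local protection key */
    };

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id      = 0x42;                /* app-defined cookie for completions */
    wr.opcode     = IBV_WR_RDMA_WRITE;   /* one-sided write, no remote CPU involved */
    wr.send_flags = IBV_SEND_SIGNALED;   /* generate a CQE when it completes */
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    /* ibv_post_send() builds the WQE and rings the NIC doorbell;
     * from here on, the CPU is no longer involved in moving the payload. */
    return ibv_post_send(qp, &wr, &bad_wr);
}
```

Link against `-libverbs`. In practice NCCL's InfiniBand transport issues these posts for you; the point is only to show where the CPU's involvement ends.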
3. Implementing RoCE v2 (Lossless Ethernet)
InfiniBand is the gold standard, but Ethernet is ubiquitous. RoCE v2 encapsulates IB transport headers inside UDP/IP packets.
The Packet Structure:
- Ethernet Header: MAC src/dst, EtherType (IPv4/IPv6).
- IP Header: ECN bits, DSCP marking.
- UDP Header: Dest Port 4791 (RoCE v2).
- IB BTH (Base Transport Header): Opcode, Partition Key.
- Payload: The actual gradient data.
- ICRC: Invariant CRC checksum.
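One practical consequence of this framing is the per-packet tax. A small sketch, assuming IPv4 with no VLAN tag, a 4096-byte payload, and ignoring preamble and inter-frame gap, adds it up:

```c
#include <stdio.h>

/* Sketch: per-packet overhead of RoCE v2 framing (IPv4, no VLAN tag),
 * and the resulting goodput fraction for a 4096-byte payload. */
int main(void) {
    const int eth_hdr  = 14;  /* dst MAC, src MAC, EtherType */
    const int ipv4_hdr = 20;  /* carries the DSCP and ECN bits */
    const int udp_hdr  = 8;   /* destination port 4791 */
    const int bth      = 12;  /* IB Base Transport Header */
    const int icrc     = 4;   /* invariant CRC */
    const int eth_fcs  = 4;   /* Ethernet frame check sequence */

    int overhead = eth_hdr + ipv4_hdr + udp_hdr + bth + icrc + eth_fcs;
    int payload  = 4096;      /* assumed per-packet payload */

    printf("Overhead per packet: %d bytes\n", overhead);
    printf("Goodput fraction:    %.1f%%\n",
           100.0 * payload / (payload + overhead));
    return 0;
}
```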
Priority Flow Control (PFC)
RoCE requires a lossless medium: a dropped packet typically forces the RDMA transport into a go-back-N retransmission, which collapses throughput. So when a switch buffer fills up, it must not drop the packet; instead, it sends a per-priority PAUSE frame upstream to the sender.
```
# Arista EOS Configuration for Lossless RoCE
# 1. Map DSCP 26 (AF31) to Traffic Class 3
qos map dscp 26 to traffic-class 3
# 2. Enable PFC on the interface
interface Ethernet1/1
   priority-flow-control mode on
   priority-flow-control watch 3
   # 3. Configure ECN (Explicit Congestion Notification)
   random-detect ecn
   random-detect cos 3 min-threshold 20% max-threshold 80% mark-prob 10
```
Warning: PFC Storms. If a NIC malfunctions and sends PAUSE frames continuously, it can freeze the entire network (Head-of-Line Blocking). We must configure PFC Watchdog to detect and drop traffic from "stuck" queues after a timeout (e.g., 500ms).
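Conceptually, the watchdog is just a per-queue timer. A sketch of that logic follows; the struct, helper name, and polling model are hypothetical, since real implementations live in switch firmware:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Sketch of per-queue PFC watchdog logic. If a queue has been paused and
 * unable to drain for longer than the timeout, its traffic is dropped so
 * the stuck queue cannot back-pressure the rest of the fabric. */

#define PFC_WATCHDOG_TIMEOUT_MS 500

struct queue_state {
    bool     paused;            /* currently receiving PAUSE for this priority */
    bool     drained_recently;  /* dequeued at least one packet since last poll */
    uint64_t stuck_since_ms;    /* timestamp when the queue stopped draining */
    bool     dropping;          /* watchdog tripped: drop instead of queueing */
};

void pfc_watchdog_poll(struct queue_state *q, uint64_t now_ms) {
    if (q->paused && !q->drained_recently) {
        if (q->stuck_since_ms == 0)
            q->stuck_since_ms = now_ms;                 /* start the timer */
        else if (now_ms - q->stuck_since_ms > PFC_WATCHDOG_TIMEOUT_MS)
            q->dropping = true;                         /* storm: stop honoring PAUSE */
    } else {
        q->stuck_since_ms = 0;                          /* queue is healthy again */
        q->dropping = false;
    }
    q->drained_recently = false;                        /* reset for next interval */
}

int main(void) {
    struct queue_state q = { .paused = true };
    pfc_watchdog_poll(&q, 100);    /* starts the stuck timer */
    pfc_watchdog_poll(&q, 700);    /* 600 ms later: watchdog trips */
    printf("dropping = %d\n", q.dropping);
    return 0;
}
```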
4. Congestion Control (DCQCN)
PFC is a coarse axe; it stops everything on a priority. DCQCN (Data Center Quantized Congestion Notification) is a scalpel: a rate-based congestion control scheme, borrowing ideas from QCN and DCTCP, implemented in the NIC hardware, with the sender NIC acting as the reaction point.
- ECN Marking: Switches mark the IP header (ECN=11) when buffers exceed a threshold.
- CNP Generation: The Receiver NIC sees the marking and sends a Congestion Notification Packet (CNP) back to the Sender.
- Rate Limiting: The Sender NIC reduces its transmission rate for that specific Queue Pair (QP).
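The reaction-point math on the sender NIC is small. Here is a simplified sketch following the scheme described in the DCQCN paper; real NICs expose the gain `g`, timers, and the full recovery state machine as firmware tunables, all of which are condensed here:

```c
#include <stdio.h>

/* Simplified sketch of the DCQCN reaction-point update on the sender NIC. */
struct dcqcn_state {
    double rate_gbps;    /* current sending rate (Rc) */
    double target_gbps;  /* target rate (Rt) used during recovery */
    double alpha;        /* estimate of the fraction of marked packets */
    double g;            /* EWMA gain for alpha */
};

/* Called when a CNP arrives for this queue pair: cut the rate. */
void on_cnp(struct dcqcn_state *s) {
    s->alpha       = (1.0 - s->g) * s->alpha + s->g;        /* congestion seen */
    s->target_gbps = s->rate_gbps;                          /* remember where we were */
    s->rate_gbps  *= (1.0 - s->alpha / 2.0);                /* multiplicative decrease */
}

/* Called periodically while no CNPs arrive: decay alpha and recover. */
void on_quiet_period(struct dcqcn_state *s) {
    s->alpha     = (1.0 - s->g) * s->alpha;                 /* congestion fading */
    s->rate_gbps = (s->rate_gbps + s->target_gbps) / 2.0;   /* fast-recovery step */
}

int main(void) {
    struct dcqcn_state s = { .rate_gbps = 400.0, .target_gbps = 400.0,
                             .alpha = 1.0, .g = 1.0 / 16.0 };
    on_cnp(&s);
    printf("Rate after one CNP: %.1f Gbps\n", s.rate_gbps);  /* 400 * 0.5 = 200 */
    return 0;
}
```

The quantized increase stages (fast recovery, additive increase, hyper increase) are omitted; the key point is that the rate cut is proportional to how much congestion the flow has recently seen.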
5. GPUDirect & NCCL Tuning
The NVIDIA Collective Communications Library (NCCL) orchestrates the collective operations (AllReduce, All-to-All, broadcast) between GPUs. Optimizing NCCL is often described as "black magic".
Environment Variable Tuning Guide
| Variable | Recommended | Explanation |
|---|---|---|
| `NCCL_IB_HCA` | `mlx5_0,mlx5_3,...` | Explicitly bind the NICs closest to each GPU via the PCIe root complex (NUMA awareness). |
| `NCCL_NET_GDR_LEVEL` | `2` | Force GPUDirect RDMA: data goes GPU memory -> NIC without staging in host memory. |
| `NCCL_ALGO` | `Ring` or `Tree` | Use Ring for bandwidth-bound collectives, Tree for latency-bound (small message) collectives. |
| `NCCL_BUFFSIZE` | `4194304` (4 MB) | Increase the buffer size to amortize per-message overhead on high-latency links. |
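NCCL reads these variables once at initialization, so they must be in the process environment before the first NCCL call, whether exported by the launcher (Slurm, mpirun, torchrun) or set programmatically. A minimal sketch, with placeholder device names that you would replace for your own topology:

```c
#include <stdlib.h>

/* Sketch: set NCCL tuning variables before NCCL initializes.
 * Device names are placeholders; verify with ibstat / nvidia-smi topo. */
int main(void) {
    setenv("NCCL_IB_HCA",        "mlx5_0,mlx5_3", 1);  /* NICs nearest the GPUs */
    setenv("NCCL_NET_GDR_LEVEL", "2",             1);  /* GPUDirect RDMA, per table above */
    setenv("NCCL_BUFFSIZE",      "4194304",       1);  /* 4 MB transport buffers */
    /* NCCL_ALGO is usually best left to NCCL's own tuner; override only
     * after measuring, e.g. setenv("NCCL_ALGO", "Tree", 1) for small messages. */

    /* ... initialize NCCL / launch training from here ... */
    return 0;
}
```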
6. Telemetry & Monitoring
You cannot optimize what you cannot measure. We use IPFIX and Streaming Telemetry to sample headers at line rate.
Key Metrics to Alert On:
- PFC Pause Frames Rx/Tx: Any non-zero rate implies congestion; a sustained stream climbing into the millions implies a real problem (see the rate check sketched after this list).
- CNP Packets Sent: Indicates DCQCN is active.
- ECN Marked Packets: Switch buffers are filling up.
- Link Flaps: Physical layer signal integrity issues (bad cables are common at 400Gbps).
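Because absolute counters only ever grow on a long-lived port, we alert on the rate of change rather than the raw value. A sketch of that delta check; the threshold is illustrative, and counter collection is whatever your telemetry pipeline already provides:

```c
#include <stdio.h>
#include <stdint.h>

/* Sketch: alert on the *rate* of PFC pause frames rather than the absolute
 * counter. The threshold below is illustrative, not a recommendation. */

#define PAUSE_FRAMES_PER_SEC_ALERT 10000.0

void check_pfc_rate(const char *port,
                    uint64_t pause_prev, uint64_t pause_now,
                    double interval_sec)
{
    double rate = (double)(pause_now - pause_prev) / interval_sec;
    if (rate > PAUSE_FRAMES_PER_SEC_ALERT)
        printf("ALERT %s: %.0f PFC pause frames/sec (sustained congestion?)\n",
               port, rate);
}

int main(void) {
    /* Two samples of the same counter, 30 seconds apart (made-up numbers). */
    check_pfc_rate("Ethernet1/1", 1200000, 1950000, 30.0);
    return 0;
}
```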
Conclusion: Building a network for LLMs requires abandoning the "Best Effort" principles of the internet. We build specialized, lossless, highly-tuned fabrics where every microsecond of tail latency is hunted down and eliminated.