
RDMA & RoCE: The Physics of Data

Preface: In standard TCP/IP networking, the CPU often becomes the bottleneck. Every packet must be copied from the NIC buffer to a kernel buffer, then to the application buffer. This "Buffer Copy Tax" adds latency and burns CPU cycles. RDMA (Remote Direct Memory Access) eliminates the taxman.

1. RDMA: Zero Copy Networking

Imagine sending a package to your neighbor.
TCP/IP: You hand it to a courier, who takes it to a sorting center, which passes it to a delivery driver, who hands it to your neighbor's receptionist.
RDMA: You teleport the package directly onto your neighbor's desk.

Figure 1: RDMA memory flow. Data moves directly from RAM to RAM; the CPU posts the request but stays off the data path.
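To make the difference concrete, here is a toy model of the two receive paths. It is only a sketch: the function names are made up for illustration, a bytearray stands in for registered application memory, and a plain slice assignment stands in for the NIC's DMA write.

```python
from __future__ import annotations


def tcp_style_receive(nic_buffer: bytes) -> tuple[bytearray, int]:
    """Model the kernel path: NIC buffer -> kernel buffer -> application buffer."""
    cpu_copies = 0
    kernel_buffer = bytearray(nic_buffer)   # copy 1: NIC buffer -> kernel buffer
    cpu_copies += 1
    app_buffer = bytearray(kernel_buffer)   # copy 2: kernel buffer -> application buffer
    cpu_copies += 1
    return app_buffer, cpu_copies


def rdma_style_receive(payload: bytes, registered_region: bytearray) -> int:
    """Model the RDMA path: data lands straight in application memory."""
    view = memoryview(registered_region)
    view[: len(payload)] = payload          # stands in for the NIC's single DMA placement
    return 0                                # no intermediate CPU copies in this model


if __name__ == "__main__":
    data, copies = tcp_style_receive(b"hello")
    print("TCP-style CPU copies:", copies)
    region = bytearray(16)
    print("RDMA-style CPU copies:", rdma_style_receive(b"hello", region))
```

The point is the count, not the mechanics: the kernel path touches the payload twice with the CPU, while the RDMA path places it once, in its final location.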

By bypassing the kernel (OS), we avoid the per-packet buffer copies described above, which cuts latency and frees up CPU cycles.

2. The Doorbell Mechanism

If the kernel isn't involved, how does the NIC know there is data to send? The application writes a "Work Queue Element" (WQE) describing the transfer into a send queue shared with the NIC, then "rings a doorbell" by writing to a memory-mapped NIC register. The NIC then fetches the WQE and the data via DMA (Direct Memory Access) and ships it.
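The doorbell flow can be sketched as a small simulation. Everything here is hypothetical and simplified: `ToyNic`, `WorkQueueElement`, and `post_send` are illustrative names, not a real verbs API, and a method call stands in for the memory-mapped doorbell write.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class WorkQueueElement:
    """Hypothetical WQE: describes where the payload lives, not the payload itself."""
    addr: int     # offset of the payload in registered host memory
    length: int   # number of bytes to send


@dataclass
class ToyNic:
    host_memory: bytearray                        # stands in for registered RAM
    send_queue: deque = field(default_factory=deque)
    wire: list = field(default_factory=list)      # payloads the NIC has "shipped"

    def ring_doorbell(self) -> None:
        """Doorbell: signals that new WQEs are waiting; the NIC does the rest."""
        while self.send_queue:
            wqe = self.send_queue.popleft()
            # "DMA": the NIC reads the payload from host memory by itself
            data = bytes(self.host_memory[wqe.addr : wqe.addr + wqe.length])
            self.wire.append(data)


def post_send(nic: ToyNic, addr: int, length: int) -> None:
    """CPU side: describe the work, then ring the doorbell. No payload copy here."""
    nic.send_queue.append(WorkQueueElement(addr, length))
    nic.ring_doorbell()


if __name__ == "__main__":
    nic = ToyNic(bytearray(b"hello rdma"))
    post_send(nic, addr=6, length=4)
    print(nic.wire)
```

Note that the CPU only ever writes a small descriptor; the payload bytes are read by the NIC side of the model.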

3. RoCE v2 Packet Anatomy

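InfiniBand (IB) was the original protocol for RDMA. But IB needs its own dedicated fabric (adapters, switches, and cables), which is expensive and niche. RoCE (RDMA over Converged Ethernet) allows us to run the InfiniBand transport protocol over standard Ethernet networks; RoCE v2 wraps it in UDP/IP so it can be routed like ordinary IP traffic.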

Figure 2: The RoCE v2 packet structure. The InfiniBand BTH travels inside a UDP envelope.

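The Layers, from the outside in: Ethernet header, IP header, UDP header (destination port 4791), InfiniBand Base Transport Header (BTH), payload, and the invariant CRC (ICRC).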
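As a rough illustration of the UDP envelope, the sketch below packs a 12-byte BTH and wraps it in a UDP header. The helper names are invented for this example. The Ethernet and IP headers are left out, the UDP checksum is left at zero, and the ICRC is a zero placeholder rather than a real CRC.

```python
import struct

ROCEV2_UDP_PORT = 4791  # IANA-assigned UDP destination port for RoCE v2


def build_bth(opcode: int, dest_qp: int, psn: int, pkey: int = 0xFFFF,
              ack_req: bool = False) -> bytes:
    """Pack a 12-byte InfiniBand Base Transport Header (BTH)."""
    flags = 0                                         # SE, M, PadCnt, TVer all zero here
    word1 = dest_qp & 0xFFFFFF                        # 8 reserved bits + 24-bit DestQP
    word2 = (int(ack_req) << 31) | (psn & 0xFFFFFF)   # AckReq + 7 reserved + 24-bit PSN
    return struct.pack("!BBHII", opcode, flags, pkey, word1, word2)


def build_udp_header(src_port: int, payload_len: int) -> bytes:
    """Pack an 8-byte UDP header; the length field covers header plus payload."""
    return struct.pack("!HHHH", src_port, ROCEV2_UDP_PORT, 8 + payload_len, 0)


def build_rocev2_udp_datagram(src_port: int, opcode: int, dest_qp: int,
                              psn: int, payload: bytes) -> bytes:
    """UDP header + BTH + payload + ICRC placeholder (no Ethernet/IP headers)."""
    icrc = b"\x00" * 4  # placeholder only; a real ICRC is a CRC over invariant fields
    body = build_bth(opcode, dest_qp, psn) + payload + icrc
    return build_udp_header(src_port, len(body)) + body


if __name__ == "__main__":
    # 0x04 is the Reliable Connection "SEND Only" opcode
    datagram = build_rocev2_udp_datagram(49152, 0x04, dest_qp=0x12, psn=7, payload=b"data")
    print(len(datagram), datagram.hex())
```

On top of the payload, each packet in this sketch carries 8 bytes of UDP header, 12 bytes of BTH, and 4 bytes of ICRC; the Ethernet and IP headers would add to that on the wire.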

4. Lossless Ethernet (PFC)

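RDMA assumes the underlying fabric is reliable. It does not have the sophisticated retransmission logic of TCP; loss recovery is typically much simpler (often go-back-N, where everything sent after the lost packet is resent). So if packets are dropped, performance falls off a cliff.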
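To see why a single drop hurts so much, here is a back-of-the-envelope comparison. It assumes go-back-N style recovery, which is an assumption of this sketch and not something the text above specifies; the function names are illustrative.

```python
def resent_go_back_n(in_flight: int, lost_index: int) -> int:
    """Packets resent under go-back-N when packet `lost_index` (0-based) is lost.

    The lost packet and everything sent after it are retransmitted.
    """
    if not 0 <= lost_index < in_flight:
        raise ValueError("lost_index must be within the in-flight window")
    return in_flight - lost_index


def resent_selective(in_flight: int, lost_index: int) -> int:
    """Packets resent when only the missing packet is retransmitted."""
    if not 0 <= lost_index < in_flight:
        raise ValueError("lost_index must be within the in-flight window")
    return 1


if __name__ == "__main__":
    window = 64
    print("go-back-N:", resent_go_back_n(window, 0))
    print("selective:", resent_selective(window, 0))
```

With 64 packets in flight and the first one lost, the go-back-N model resends all 64, while a selective scheme resends one. The wasted bandwidth grows with the amount of data in flight, which is largest on exactly the fast links where RDMA is used.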

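To prevent drops, we use Priority Flow Control (PFC, IEEE 802.1Qbb). When a switch's buffer for a given traffic priority crosses a configured threshold (say, 80% full), it screams "STOP!" (a Pause Frame) to the upstream sender, which stops transmitting that priority almost immediately. This creates a "Lossless" fabric, but introduces the risk of "Head-of-Line Blocking": paused traffic holds up flows that were never headed for the congested port, and the pauses can spread hop by hop if they propagate too far upstream.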
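The pause behavior can be modeled as a queue with two thresholds. This is a toy sketch: `PfcQueue` is a made-up class, the 80% pause threshold follows the example above, and the 50% resume threshold is an arbitrary illustrative value. Real switches signal resume with a pause frame carrying a zero timer, which the "RESUME" string stands in for.

```python
from typing import Optional


class PfcQueue:
    """Toy per-priority ingress queue with pause (XOFF) and resume (XON) thresholds."""

    def __init__(self, capacity: int, xoff_percent: int = 80, xon_percent: int = 50):
        self.capacity = capacity
        self.xoff = capacity * xoff_percent // 100   # depth at which we send PAUSE
        self.xon = capacity * xon_percent // 100     # depth at which we send RESUME
        self.depth = 0
        self.paused = False

    def enqueue(self) -> Optional[str]:
        """A packet arrives; return "PAUSE" if the threshold was just crossed."""
        if self.depth >= self.capacity:
            raise OverflowError("buffer overrun: the packet would be dropped")
        self.depth += 1
        if not self.paused and self.depth >= self.xoff:
            self.paused = True
            return "PAUSE"
        return None

    def dequeue(self) -> Optional[str]:
        """A packet drains; return "RESUME" once the queue falls to the XON level."""
        if self.depth > 0:
            self.depth -= 1
        if self.paused and self.depth <= self.xon:
            self.paused = False
            return "RESUME"
        return None


if __name__ == "__main__":
    q = PfcQueue(capacity=10)
    print([q.enqueue() for _ in range(8)])   # the 8th arrival triggers the pause
    print([q.dequeue() for _ in range(3)])   # draining to depth 5 triggers the resume
```

The gap between the two thresholds avoids flapping between pause and resume. The headroom above the pause threshold matters as well: packets already in flight keep arriving after the pause is sent, so the buffer needs room to absorb them.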

Design Your Fabric

We optimize RoCE fabrics for massive-scale AI clusters.

Network@networkprogrammable.com