GPU to GPU communication, leave that CPU alone!

Update 04/04/2012 – Are Intel planning to put on-chip Infiniband into a future MIC-based product to address issues similar to those detailed below with inter-node GPU to GPU communication? Where the network logic resides in the system hierarchy is an important design point for supercomputing on the road to Exascale, and it seems to be moving closer to the compute all the time. For example, IBM’s Blue Gene/Q, with design involvement from Dr. Peter Boyle at the University of Edinburgh, integrates the network interface and routing (5D torus) directly within the CPU die. BG/Q currently occupies all top 5 positions in the Green 500 list of the most power-efficient supercomputers worldwide.

The burgeoning trend for GPGPU accelerators in the most powerful supercomputers (see 2nd, 4th and 5th on the latest Top 500 list) looks set to continue in the medium term at least. Two new Cray XK6 CPU/GPU machines (Titan and Blue Waters) are due in the US this year; both are targeted to outperform the current No. 1 on the Top 500 list. Further strengthening the case for GPGPU in HPC, NVIDIA’s next-generation Kepler architecture (on a 28nm process shrink) was released in the consumer gaming market last week. Judging by reviews of the architecture, significant improvements in performance per watt for GPGPU in HPC seem only months away once Kepler is incorporated into NVIDIA’s Tesla product line.

In light of these developments, I found a talk by Professor Taisuke Boku of the University of Tsukuba, Japan, at the recent joint EPCC/Tsukuba Exascale Computing Symposium in Edinburgh very interesting.

Professor Boku is Deputy Director of the Tsukuba Centre for Computational Sciences and leader of the Project for Exascale Computing Systems Development. Tsukuba have just brought online a new 800 Teraflop heterogeneous Intel Xeon/NVIDIA Tesla supercomputer. They have some very interesting plans for the next phase of this machine, which will see its performance break the 1 Petaflop barrier. This next phase will see the initial machine (named HA-PACS) augmented by the addition of a Tightly Coupled Accelerator (TCA) GPU cluster. TCA is Tsukuba’s novel proposal for inter-node GPU to GPU communication within heterogeneous supercomputer clusters. It enables direct inter-node GPU communication by the addition of a second inter-node network within the cluster, separate from the existing Infiniband network.

The trend towards heterogeneous HPC architectures including GPUs poses many challenges for HPC Systemware and Software Developers in making efficient use of the resources.

One obvious area of concern for existing multi-petaflop heterogeneous CPU/GPU architectures is the communication latency between co-operating GPUs in different nodes. There is no way for two such GPUs to exchange data without the involvement of the CPU and its attached system memory. The only path available to ferry data between GPUs in different nodes is to first copy the data from GPU memory to CPU memory within the node, after which communication can take place as normal (e.g. via MPI) over the system interconnect.
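In code, the staged path described above looks roughly like the following sketch, using the standard CUDA runtime and MPI APIs (the API calls are real; the function names, buffer names and counts are illustrative assumptions on my part):

```c
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

/* Sender: GPU memory -> CPU (host) memory -> interconnect */
void send_from_gpu(const float *d_buf, int count, int dest)
{
    float *h_buf = malloc(count * sizeof(float));
    /* Step 1: stage the data out of GPU memory into CPU memory */
    cudaMemcpy(h_buf, d_buf, count * sizeof(float), cudaMemcpyDeviceToHost);
    /* Step 2: communicate as normal over the system interconnect */
    MPI_Send(h_buf, count, MPI_FLOAT, dest, 0, MPI_COMM_WORLD);
    free(h_buf);
}

/* Receiver: interconnect -> CPU (host) memory -> GPU memory */
void recv_to_gpu(float *d_buf, int count, int src)
{
    float *h_buf = malloc(count * sizeof(float));
    MPI_Recv(h_buf, count, MPI_FLOAT, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    cudaMemcpy(d_buf, h_buf, count * sizeof(float), cudaMemcpyHostToDevice);
    free(h_buf);
}
```

Each message thus crosses the PCI-Express bus and touches host memory at both ends, which is where the extra latency comes from.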

Figure 1. GPU-GPU data transfer via CPU

Figure 1. The model used by existing CPU-GPU Heterogeneous architectures for GPU-GPU communication. Data travels via CPU & Infiniband (IB) Host Channel Adapter (HCA) and Switch or other proprietary interconnect.

Recent attempts to address the GPU to GPU communication overhead in clusters have included GPUDirect, a joint design from Mellanox and NVIDIA to remove a redundant memory copy which is otherwise required within the CPU memory. Even after eliminating this redundant copy the data must still be copied from the GPU to the CPU memory (and vice-versa at the receiving node).
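As a sketch of where GPUDirect’s saving comes from (the buffer handling here is my assumption; the API calls themselves are real), the sender allocates a page-locked host buffer that both the CUDA driver and the Infiniband driver can register, so the HCA can DMA directly from the same buffer the GPU copy lands in:

```c
#include <mpi.h>
#include <cuda_runtime.h>

void send_from_gpu_gpudirect(const float *d_buf, int count, int dest)
{
    float *h_pinned;
    /* Page-locked host buffer, registrable by both CUDA and IB drivers */
    cudaHostAlloc((void **)&h_pinned, count * sizeof(float), cudaHostAllocDefault);
    /* The GPU-to-host copy is still required... */
    cudaMemcpy(h_pinned, d_buf, count * sizeof(float), cudaMemcpyDeviceToHost);
    /* ...but the HCA can now DMA straight from h_pinned; no second
       host-to-host copy into a separate IB send buffer is needed */
    MPI_Send(h_pinned, count, MPI_FLOAT, dest, 0, MPI_COMM_WORLD);
    cudaFreeHost(h_pinned);
}
```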

There is also recent work to provide an optimised MPI interface directly from within CUDA, which has shown some promising results in reducing GPU to GPU latency while still not approaching CPU to CPU MPI latency. This approach again optimises the software rather than fundamentally changing the underlying route for GPU-GPU communication.
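With such a CUDA-aware MPI build (MVAPICH2 was an early example), the application can pass device pointers straight to MPI and let the library hide the staging. A minimal sketch, assuming a CUDA-aware build (the function name is hypothetical):

```c
#include <mpi.h>

/* d_buf is a device pointer (from cudaMalloc); a CUDA-aware MPI library
   detects this and performs the host staging/pipelining internally */
void exchange_device_buffers(float *d_buf, int count, int peer)
{
    MPI_Sendrecv_replace(d_buf, count, MPI_FLOAT, peer, 0,
                         peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
```

The convenience is real, but the bytes still traverse CPU memory underneath.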

This circuitous route leads to what Tsukuba believe is unnecessary latency in the GPU to GPU pathway. As we look ahead to the Exascale era and beyond, we can foresee GPU clustering being taken to extreme scale. This will make strong scaling all the more important for certain GPU-enabled applications which will no longer be able to take advantage of weak scaling at Exascale and beyond. Such applications will be particularly latency-sensitive, making improvements in inter-node GPU to GPU communication vital.

Tsukuba’s TCA concept is enabled by a network design which they call PEARL (PCI-Express Adaptive and Reliable Link). The PEARL network is enabled by an FPGA communications chip being developed at Tsukuba called PEACH2 (PCI-Express Adaptive Communication Hub 2).

The PEACH2 chip sits on the PCI-Express bus connecting the GPU to the CPU and has power-adaptive technology designed in from the ground up. PCI-Express operates on a master/slave protocol, with a CPU usually at the master end and various peripherals (such as GPUs) at the slave end. The PEACH2 chip enables slave devices (in this case GPUs in different nodes) to be connected directly to each other, so the GPUs themselves require no modification for this slave-to-slave connection.

Figure 2. TCA Data Transfer GPU-GPU

Figure 2. Data transfer between cooperating GPUs in separate nodes in a TCA cluster enabled by the PEACH2 chip.

As can be seen below in Figure 3, the PEARL network constitutes a second inter-node network in the cluster, parallel to the existing Infiniband network.

Figure 3. Schematic of the PEARL network within a CPU/GPU cluster.

The first 800 Teraflop phase of the HA-PACS cluster, which Tsukuba launched last month, is a regular CPU/GPU architecture. Tsukuba hope to finish development and testing of the PEACH2 FPGA chip this year and to add the second (TCA) phase to HA-PACS in 2013, contributing 200–300 Teraflops of performance utilising the PEARL network.

Figure 4. Schematic of the HA-PACS supercomputer with detail on the phase 2 TCA cluster due 2013.

This is some of the most novel work happening on GPU clustering on the path to Exascale and Professor Boku mentioned that Tsukuba were in consultation with NVIDIA about their work (TrueGPUDirect anyone?). It will be very interesting to see what kind of performance advantages over traditional CPU/GPU cluster architectures can be gained when the TCA cluster comes online next year.