#GTC12 certainly not a FLOP, but does take a Byte.

NVIDIA’s GPU Technology Conference 2012 has been running for the past two days in San Jose, California and wraps up today. For those of us who didn’t make the trip, NVIDIA has done a very good job of making all the good stuff available online, both live and recorded, at the conference website, while Twitter has been ablaze with #GTC12.

From an HPC and GPGPU perspective, some of the most interesting news came on Tuesday during NVIDIA CEO Jen-Hsun Huang’s keynote address, in which he revealed the debut of the Kepler-based Tesla products. First, the Tesla K10 – targeted at applications requiring single precision (SP) performance (e.g. seismic, oil & gas, signal, image & video processing). Second, later this year the Tesla K20 will arrive for double precision (DP) applications. Developments in these two products mirror – and ultimately will further reinforce – trends which have recently become apparent in the high-level system characteristics of the top supercomputers in the world, as I show towards the end of this blog post.

K10 is a single board containing two GK104 Kepler chips, the same silicon used in the already released consumer gaming GeForce product line. K10 will provide users with three times the SP performance of the previous top-of-the-range Tesla GPU (the Fermi-based M2090), but actually has inferior double precision performance.

K20 is the product that will really interest most of the HPC community and will compete with Intel’s upcoming MIC architecture accelerator. K20 is likely to appear first in Blue Waters and Titan, the two 10 Petaflop+ heterogeneous machines currently being installed in the US, and to become generally available in Q4. While we didn’t get the nitty-gritty numbers on the K20 yet, we do know it will be based on a new chip not yet released in any other product line – the Kepler GK110. NVIDIA has not provided exact performance figures, but expectations are for a similar improvement in DP over the M2090 as K10 shows in SP, i.e. threefold, taking it to somewhere between 1.5 and 2 Teraflops of peak DP performance.
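As a quick back-of-the-envelope check on that expectation (the threefold factor is speculation, not an NVIDIA figure; the M2090 peak below is its published spec):

```python
# Rough projection of K20 peak DP performance, assuming the same
# generational speedup over M2090 that K10 shows in SP.
m2090_dp_gflops = 665   # published M2090 peak DP
speedup = 3             # speculated Kepler-over-Fermi improvement factor

projected_k20_dp_tflops = m2090_dp_gflops * speedup / 1000
print(projected_k20_dp_tflops)  # 1.995, i.e. the top of the 1.5-2 TFLOPS range
```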

While NVIDIA held back some of the more detailed specifications of K20, it did debut some very interesting new features the product will have which, while not related to absolute performance, are very significant in getting the best use out of the hardware. Of most interest to the HPC market will be Dynamic Parallelism, Hyper-Q and RDMA GPU Direct.

Before the announcement of “RDMA GPU Direct” this week, GPU Direct was already an overloaded term referring to two separate contexts:

  • Intra-node memory copies between multiple GPUs in a single node without the involvement of the CPU or system memory.
  • Inter-node communication between multiple GPUs in separate nodes in a heterogeneous cluster architecture. In this instance GPU Direct referred to a feature allowing removal of a redundant copy within the CPU (host) system memory when transferring data between GPUs on separate nodes, i.e. removal of the copy represented by the red arrow in Fig. 1 below.


Figure 1. Pre-existing internode GPU Direct (CUDA 4.0)

Note from Fig. 1 that even with this version of inter-node GPU Direct the data must still be copied from the GPU memory to the CPU memory before going out on the network.

The new RDMA GPU Direct feature announced for the upcoming GK110 and the CUDA 5.0 release will apparently remove this copy from GPU memory to CPU memory entirely, allowing the NIC/IB adapter to directly access data in the GPU memory without any involvement of the CPU or system memory, as shown in Fig. 2 below.

Figure 2. RDMA GPU Direct with GK110 (CUDA 5.0)

It will be very interesting to see what improvements in latency can be delivered to applications which rely on inter-node GPU communication – an issue which becomes crucial as we enter an era where strong scaling becomes more and more important (see below!).

The other really interesting features announced at the conference for HPC were Hyper-Q and Dynamic Parallelism.

Dynamic Parallelism gives the GPU a role in creating and launching work itself, rather than only the CPU being able to kick off kernels on the GPU in a request-response pattern. This opens up a whole cornucopia of previously unavailable algorithmic and design avenues and cuts down on the need for CPU involvement (and thus GPU-CPU traffic) in computations.
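The kind of algorithm this unlocks is one where the amount of work is only discovered during the computation itself, such as adaptive refinement, where each task decides for itself whether to spawn finer-grained subtasks. A minimal Python sketch of that pattern (this is an illustration of the algorithmic shape, not CUDA code; on GK110 the recursive "launches" would happen on the GPU without a CPU round-trip):

```python
# Adaptive quadrature: each task inspects its own error estimate and
# decides whether to spawn two finer subtasks -- the work-creation
# pattern that Dynamic Parallelism lets a running kernel perform on-GPU.
def integrate(f, a, b, tol):
    mid = (a + b) / 2
    coarse = (b - a) * (f(a) + f(b)) / 2            # one trapezoid
    fine = ((mid - a) * (f(a) + f(mid)) / 2 +        # two trapezoids
            (b - mid) * (f(mid) + f(b)) / 2)
    if abs(fine - coarse) < tol:
        return fine
    # "child launches": refinement decided by the running task itself,
    # with no round-trip to a coordinating host process
    return integrate(f, a, mid, tol / 2) + integrate(f, mid, b, tol / 2)

result = integrate(lambda x: x * x, 0.0, 1.0, 1e-6)
print(result)  # close to 1/3
```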

Hyper-Q provides multiple hardware queues on the GPU, enabling multiple tasks (for example, those instigated by multiple MPI processes running on multiple cores of the host CPU) to run concurrently on the GPU. This helps drive efficient use of the GPU by allowing tasks from separate processes to occupy fractions of the GPU execution hardware at the same time.

You can find more detail on Hyper-Q, Dynamic Parallelism and RDMA GPU Direct in the GK110 whitepaper released today.

A couple of high-level observations on how this fits into general HPC architecture trends. Firstly, the ratio of memory capacity and memory bandwidth to compute is likely to continue to decrease, signifying the increasing necessity for applications to exploit strong scaling rather than the previously rich seam of weak scaling. K10 represents a more than 60% fall in Bytes/FLOPs (memory capacity per FLOPs) compared to the M2090, and a reduction of roughly 50% in Bytes/sec/FLOPs (memory bandwidth per FLOPs) compared to the M2090 (both using SP FLOPs, as per K10’s target market). It will be interesting to see the corresponding numbers for the upcoming K20. These figures feed into general trends (led by heterogeneous supercomputer architectures) observed by Kogge & Dysart last year in their paper on Top 500 trends; see Fig. 3 below.
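Those percentages can be reproduced from the vendor-quoted board specs (the figures below are published peaks and should be treated as approximate):

```python
# Bytes/FLOPs (capacity) and Bytes/sec/FLOPs (bandwidth) ratios in SP,
# from the published board specs. K10 figures are totals for both GK104s.
m2090 = {"mem_gb": 6, "bw_gbs": 177, "sp_gflops": 1331}
k10   = {"mem_gb": 8, "bw_gbs": 320, "sp_gflops": 4577}

def ratios(card):
    return (card["mem_gb"] / card["sp_gflops"],   # Bytes per FLOP/s
            card["bw_gbs"] / card["sp_gflops"])   # Bytes/s per FLOP/s

cap_fall = 1 - ratios(k10)[0] / ratios(m2090)[0]
bw_fall  = 1 - ratios(k10)[1] / ratios(m2090)[1]
print(f"capacity/FLOPs fall:  {cap_fall:.0%}")   # ~61%
print(f"bandwidth/FLOPs fall: {bw_fall:.0%}")    # ~47%
```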

Fig. 3 Memory Capacity per FLOPs in the Top 10 Supercomputers. Reproduced from "Using the TOP500 to Trace and Project Technology and Architecture Trends", Kogge & Dysart, 2011.

Secondly, some very welcome news: the Hyper-Q and, to a lesser extent, Dynamic Parallelism features are likely to lead to an increase in the efficiency of GPUs within HPC, a problem area also flagged by Kogge & Dysart in the same paper, where they suggest that the rapid decrease in efficiency evident in the heterogeneous architectures in Fig. 4 below may be due to “…the difficulty in keeping large numbers of shaders busy at the same time.”

Fig. 4 Efficiency, i.e. Sustained LINPACK performance as a percentage of peak FLOPs or Rmax/Rpeak. Reproduced from "Using the TOP500 to Trace and Project Technology and Architecture Trends", Kogge & Dysart, 2011.

GPU to GPU communication, leave that CPU alone!

Update 04/04/2012 – Are Intel planning to put on-chip InfiniBand into a future MIC-based product to address issues similar to those detailed below with inter-node GPU to GPU communication? Where the network logic resides in the system hierarchy is an important design point for supercomputing on the road to Exascale, and it seems to be moving closer to the compute all the time. For example, IBM’s Blue Gene/Q, with design involvement from Dr. Peter Boyle at the University of Edinburgh, integrates the network interface and routing (5D torus) directly within the CPU die. BG/Q currently occupies all top 5 positions in the Green 500 list of the most power-efficient supercomputers worldwide.

The burgeoning trend for GPGPU accelerators in the most powerful supercomputers (see 2nd, 4th and 5th on the latest Top 500 list) looks set to continue in the medium term at least. Two new Cray XK6 CPU/GPU machines (Titan and Blue Waters) are due in the US this year; both are targeted to outperform the current No. 1 on the Top 500 list. Further strengthening the case for GPGPU in HPC, NVIDIA’s next-generation Kepler architecture (on a 28nm process shrink) was released in the consumer gaming market last week. Judging by reviews of the architecture, significant improvements in performance per watt for GPGPU in HPC seem only months away, once Kepler is incorporated into NVIDIA’s Tesla product line.

In light of these developments I found a talk from Professor Taisuke Boku of the University of Tsukuba, Japan at the recent joint EPCC/Tsukuba Exascale Computing Symposium in Edinburgh very interesting.

Professor Boku is Deputy Director of the Tsukuba Centre for Computational Sciences and leader of the Project for Exascale Computing Systems Development. Tsukuba have just brought online a new 800 Teraflop heterogeneous Intel Xeon/NVIDIA Tesla supercomputer, and they have some very interesting plans for its next phase, which will see its performance break the 1 Petaflop barrier. This next phase will see the initial machine (named HA-PACS) augmented with a Tightly Coupled Accelerator (TCA) GPU cluster. TCA is Tsukuba’s novel proposal for inter-node GPU to GPU communication within heterogeneous supercomputer clusters. It enables direct inter-node GPU communications by adding a second inter-node network (i.e. separate from the existing InfiniBand) within the cluster.

The trend towards heterogeneous HPC architectures including GPUs poses many challenges for HPC Systemware and Software Developers in making efficient use of the resources.

One obvious area of concern for existing multi-petaflop heterogeneous CPU/GPU architectures is the communication latency between co-operating GPUs in different nodes. There is currently no way for two GPUs in different nodes to exchange data without the involvement of the CPU and its attached system memory: the only available path is to first copy the data from GPU memory to CPU memory within the node, before communication takes place as normal (e.g. via MPI) over the system interconnect.

Figure 1. The model used by existing CPU-GPU heterogeneous architectures for GPU-GPU communication. Data travels via the CPU & InfiniBand (IB) Host Channel Adapter (HCA) and switch, or other proprietary interconnect.
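A toy latency model makes the cost of this staging path concrete. The hop costs below are illustrative placeholders, not measurements; the point is simply that the staged route pays for two extra PCI-Express copies per message:

```python
# Toy model of inter-node GPU-to-GPU transfer latency. Each hop is a
# (name, cost) pair; the costs are illustrative placeholders only.
STAGED = [
    ("cudaMemcpy D2H on sender",    10.0),  # GPU memory -> host memory
    ("MPI over the interconnect",    2.0),  # host -> host across nodes
    ("cudaMemcpy H2D on receiver",  10.0),  # host memory -> GPU memory
]
DIRECT = [
    ("NIC reads/writes GPU memory",  2.0),  # the RDMA-style shortcut
]

def total(path):
    return sum(cost for _, cost in path)

print(total(STAGED), total(DIRECT))  # 22.0 2.0 (notional microseconds)
```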

Recent attempts to address the GPU to GPU communication overhead in clusters include GPU Direct, a joint design from Mellanox and NVIDIA which removes a redundant memory copy otherwise required within the CPU memory. Even after eliminating this redundant copy, the data must still be copied from GPU memory to CPU memory (and vice-versa at the receiving node).

There is also recent work providing an optimised MPI interface callable directly from within CUDA, which has shown promising results in reducing GPU to GPU latency, though still without approaching CPU to CPU MPI latency. This approach again optimises the software rather than fundamentally changing the underlying route for GPU-GPU communication.

This circuitous route leads to what Tsukuba believe is unnecessary latency in the GPU to GPU pathway. As we look ahead to the Exascale era and beyond, we can foresee GPU clustering being taken to extreme scale. This will make strong scaling all the more important for certain GPU-enabled applications which will no longer be able to rely on weak scaling at Exascale and beyond. Such applications will be particularly latency sensitive, making improvements in inter-node GPU to GPU communication vital.

Tsukuba’s TCA concept is enabled by a network design which they call PEARL (PCI-Express Adaptive and Reliable Link). The PEARL network is enabled by an FPGA communications chip being developed at Tsukuba called PEACH2 (PCI-Express Adaptive Communication Hub 2).

The PEACH2 chip sits on the PCI-Express bus connecting the GPU to a CPU and has power adaptive technology designed in from the ground up. PCI-Express operates on a master/slave protocol with a CPU usually at the master end and various peripherals (such as GPUs) at the slave end. The PEACH2 chip enables connecting slave devices (in this case GPUs in different nodes) directly to each other. The GPUs themselves thus require no modification for this slave-slave connection.

Figure 2. Data transfer between cooperating GPUs in separate nodes in a TCA cluster, enabled by the PEACH2 chip.

As can be seen in Figure 3 below, the PEARL network constitutes a second inter-node network in the cluster, parallel to the existing InfiniBand network.

Figure 3. Schematic of the PEARL network within a CPU/GPU cluster.

The first, 800 Teraflop phase of the HA-PACS cluster, which Tsukuba launched last month, is a regular CPU/GPU architecture. Tsukuba hope to finish development and testing of the PEACH2 FPGA chip this year and to add the second (TCA) phase to HA-PACS in 2013, contributing 200-300 Teraflops of performance utilising the PEARL network.

Figure 4. Schematic of the HA-PACS supercomputer with detail on the phase 2 TCA cluster due 2013.

This is some of the most novel work happening on GPU clustering on the path to Exascale and Professor Boku mentioned that Tsukuba were in consultation with NVIDIA about their work (TrueGPUDirect anyone?). It will be very interesting to see what kind of performance advantages over traditional CPU/GPU cluster architectures can be gained when the TCA cluster comes online next year.