#GTC12 certainly not a FLOP, but does take a Byte.

NVIDIA’s GPU Technology Conference 2012 has been running for the past two days in San Jose, California, and wraps up today. For those of us who didn’t make the trip, NVIDIA has done a very good job of making all the good stuff available online, both live and recorded, at the conference website, while Twitter has been ablaze with #GTC12.

From an HPC and GPGPU perspective, some of the most interesting news came on Tuesday during NVIDIA CEO Jen-Hsun Huang’s keynote address, in which he revealed the debut of Kepler-based Tesla products. First, the Tesla K10 – targeted at applications requiring single precision (SP) performance (e.g. seismic, oil & gas, signal, image & video processing). Second, later this year, the Tesla K20 will arrive for double precision (DP) applications. Developments in these two products mirror, and ultimately will further reinforce, trends which have recently become apparent in the high-level system characteristics of the world’s top supercomputers, as I show towards the end of this blog post.

K10 is a single board containing two GK104 Kepler chips, like those used in the already-released consumer GeForce gaming product line. K10 will provide users with three times the SP performance of the previous top-of-the-range Tesla GPU (the Fermi-based M2090), but actually has inferior DP performance.

K20 is the product that will really interest most of the HPC community and compete with Intel’s upcoming MIC architecture accelerator. K20 is likely to appear first in Blue Waters and Titan, the two 10 Petaflop+ heterogeneous machines currently under installation in the US, and to become generally available in Q4. While NVIDIA has not yet provided the nitty-gritty performance figures for the K20, we do know it will be based on a new chip not released in any other product line so far – the Kepler GK110. Expectations are for a similar improvement in DP over the M2090 as the K10 delivers in SP, i.e. threefold, taking it to somewhere between 1.5 and 2 Teraflops of peak DP performance.

While NVIDIA held back some of the more detailed specifications of K20, it did debut some very interesting new features of the product which, while not related to absolute performance, are very significant in getting the best use out of the hardware. Of most interest to the HPC market will be Dynamic Parallelism, Hyper-Q and RDMA GPU Direct.

Before the announcement of “RDMA GPU Direct” this week, GPU Direct was already an overloaded term, referring to two separate contexts:

  • Intra-node memory copies between multiple GPUs in a single node without the involvement of the CPU or system memory (see the sketch just after this list).
  • Inter-node communication between multiple GPUs in separate nodes in a heterogeneous cluster architecture. In this instance GPU Direct referred to a feature allowing removal of a redundant copy within the CPU (host) system memory when transferring data between GPUs on separate nodes, i.e. removal of the copy represented by the red arrow in Fig. 1 below.
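
As a flavour of the first, intra-node, variant: a minimal sketch of a direct device-to-device copy using the peer-to-peer API available since CUDA 4.0 (buffer names, sizes and device indices are placeholders, and error checking is omitted for brevity).

```
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 1 << 20;          // arbitrary amount of data
    float *buf_gpu0 = 0, *buf_gpu1 = 0;

    cudaSetDevice(0);
    cudaMalloc(&buf_gpu0, bytes);
    cudaDeviceEnablePeerAccess(1, 0);      // let device 0 access device 1 directly

    cudaSetDevice(1);
    cudaMalloc(&buf_gpu1, bytes);

    // Direct GPU-to-GPU copy within the node: with peer access enabled the
    // transfer does not need to be staged through CPU/system memory.
    cudaMemcpyPeer(buf_gpu1, 1, buf_gpu0, 0, bytes);

    cudaFree(buf_gpu1);
    cudaSetDevice(0);
    cudaFree(buf_gpu0);
    return 0;
}
```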

 

Figure 1. Pre-existing inter-node GPU Direct (CUDA 4.0)

Note from Fig. 1 that even with this version of inter-node GPU Direct the data must still be copied from the GPU memory to the CPU memory before going out on the network.

The new RDMA GPU Direct feature announced for the upcoming GK110 and the CUDA 5.0 release will apparently remove this copy from GPU memory to CPU memory entirely, allowing the NIC/IB adapter to access data in the GPU memory directly, without any involvement of the CPU or system memory, as shown in Fig. 2 below.

Figure 2. RDMA GPU Direct with GK110 (CUDA 5.0)

It will be very interesting to see what improvements to latency can be delivered to applications which rely on inter-node GPU communication – an issue which becomes crucial as we enter an era where strong scaling becomes more and more important (see below!).
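
To give a feel for how this surfaces at the application level, below is a minimal sketch assuming a CUDA-aware MPI library that accepts device pointers directly (the ranks, buffer size and tag are arbitrary). Whether the transfer avoids host memory entirely depends on the MPI implementation and on hardware support such as RDMA GPU Direct.

```
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;                 // arbitrary message size
    float *d_buf;
    cudaMalloc(&d_buf, n * sizeof(float)); // buffer lives in GPU memory

    // A CUDA-aware MPI takes the device pointer directly; with RDMA GPU
    // Direct the NIC could then read/write GPU memory itself rather than
    // the data being staged through a host buffer first.
    if (rank == 0)
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```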

The other really interesting features announced at the conference for HPC were Hyper-Q and Dynamic Parallelism.

Dynamic Parallelism gives the GPU a role in creating and launching work itself, rather than only the CPU being able to kick off kernels on the GPU in a request-response pattern. This opens up a whole cornucopia of previously unavailable algorithmic and design avenues and cuts down on the need for CPU involvement (and thus GPU-CPU traffic) in computations.
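
As a rough sketch of the kind of code this should enable (exact syntax and limits will only be confirmed with GK110 and the CUDA 5.0 toolchain), a kernel running on the GPU could launch further kernels itself, with no round trip through the CPU:

```
#include <cuda_runtime.h>

// Child kernel: does the actual work on a chunk of data.
__global__ void child_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

// Parent kernel: decides on the device how much further work is needed and
// launches the child kernel itself. Device-side launches are expected to
// require GK110 (compute capability 3.5) and CUDA 5.0.
__global__ void parent_kernel(float *data, int n)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        child_kernel<<<blocks, threads>>>(data, n);   // launched from the GPU
    }
}
```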

Hyper-Q provides multiple hardware work queues on the GPU, enabling multiple tasks (for example, those instigated by multiple MPI processes running on separate cores of the host CPU) to run concurrently on the GPU. This helps drive efficient use of the GPU by allowing tasks from separate processes to occupy fractions of the GPU execution hardware at the same time.
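
To illustrate the kind of workload this targets, the sketch below issues small kernels from several independent CUDA streams (the stream count, kernel and sizes are placeholders). On Fermi these launches funnel through a single hardware work queue and largely serialise; with Hyper-Q on GK110 each stream should map to its own hardware queue, so the kernels can genuinely overlap on the device.

```
#include <cuda_runtime.h>

__global__ void small_kernel(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        buf[i] = buf[i] * buf[i] + 1.0f;
}

int main()
{
    const int n_streams = 8;               // e.g. one per MPI rank / host core
    const int n = 1 << 16;                 // deliberately small pieces of work
    cudaStream_t streams[n_streams];
    float *bufs[n_streams];

    for (int s = 0; s < n_streams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&bufs[s], n * sizeof(float));
    }

    // Independent work issued to independent streams: exactly the candidate
    // for concurrent execution that Hyper-Q is designed to serve.
    for (int s = 0; s < n_streams; ++s)
        small_kernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(bufs[s], n);

    cudaDeviceSynchronize();

    for (int s = 0; s < n_streams; ++s) {
        cudaFree(bufs[s]);
        cudaStreamDestroy(streams[s]);
    }
    return 0;
}
```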

You can find more detail on Hyper-Q, Dynamic Parallelism and RDMA GPU Direct in the GK110 whitepaper released today.

A couple of high-level observations on how this fits into general HPC architecture trends. Firstly, the ratio of memory capacity and memory bandwidth to compute is likely to continue to decrease, signifying the increasing necessity to make use of strong scaling in applications rather than the previously rich seam of weak scaling. K10 represents a fall of more than 60% in Bytes/FLOPs (memory capacity per FLOPs) compared to the M2090, and a reduction of around 50% in Bytes/sec/FLOPs (memory bandwidth per FLOPs), both using SP FLOPs as per K10's target market. It will be interesting to see what the corresponding numbers are for the upcoming K20. These figures feed into general trends (led by heterogeneous supercomputer architectures) observed by Kogge & Dysart last year in their paper on TOP500 trends; see Fig. 3 below.
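
For the record, the rough arithmetic behind those percentages, using the published figures of approximately 6 GB, 178 GB/s and 1.33 SP TFLOPS for the M2090 versus 8 GB, 320 GB/s and 4.58 SP TFLOPS for the K10 (figures quoted from memory, so treat them as approximate):

$$
\frac{8\ \mathrm{GB} / 4.58\ \mathrm{TFLOPS}}{6\ \mathrm{GB} / 1.33\ \mathrm{TFLOPS}} \approx 0.39
\quad (\text{a fall of roughly } 61\% \text{ in Bytes/FLOPs}),
\qquad
\frac{320\ \mathrm{GB/s} / 4.58\ \mathrm{TFLOPS}}{178\ \mathrm{GB/s} / 1.33\ \mathrm{TFLOPS}} \approx 0.52
\quad (\text{a reduction of roughly } 48\% \text{ in Bytes/sec/FLOPs}).
$$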

Fig. 3 Memory Capacity per FLOPs in the Top 10 Supercomputers. Reproduced from "Using the TOP500 to Trace and Project Technology and Architecture Trends", Kogge & Dysart, 2011.

Secondly, some very welcome news. The Hyper-Q and, to a lesser extent, Dynamic Parallelism features are likely to lead to an increase in the efficiency of GPUs within HPC, a problem area also flagged by Kogge & Dysart in the same paper, where they suggested that the rapid decrease in efficiency evident in the heterogeneous architectures in Fig. 4 below may be due to “…the difficulty in keeping large numbers of shaders busy at the same time.”

Fig. 4 Efficiency, i.e. sustained LINPACK performance as a percentage of peak FLOPs (Rmax/Rpeak). Reproduced from "Using the TOP500 to Trace and Project Technology and Architecture Trends", Kogge & Dysart, 2011.