#GTC12 certainly not a FLOP, but does take a Byte.

NVIDIA’s GPU Technology Conference 2012 has been running for the past two days in San Jose, California and wraps up today. For those of us who didn’t make the trip, NVIDIA has done a very good job of making all the good stuff available online, both live and recorded, at the conference website, while Twitter has been ablaze with #GTC12.

From an HPC and GPGPU perspective some of the most interesting news came on Tuesday during NVIDIA CEO Jen-Hsun Huang’s keynote address, in which he revealed the debut of Kepler based Tesla products. First, the Tesla K10 – targeted at applications requiring single precision (SP) performance (e.g. seismic, oil & gas, signal, image & video processing). Second, later this year the Tesla K20 will arrive for double precision (DP) applications. Developments in these two products mirror, and will ultimately reinforce, trends which have recently become apparent in the high-level system characteristics of the top supercomputers in the world, as I show towards the end of this blog post.

K10 is a single board containing two GK104 Kepler chips, like those used in the already released consumer gaming GeForce product line. K10 will provide users with three times the SP performance of the previous top-of-the-range Tesla GPU (the Fermi based M2090), but actually has inferior DP performance.

K20 is the product that will really interest most of the HPC community and compete with Intel’s upcoming MIC architecture accelerator. K20 is likely to first appear in Blue Waters and Titan, the two 10 Petaflop+ heterogeneous machines currently under installation in the US, and to become generally available in Q4. While we didn’t get the nitty-gritty numbers on the K20 yet, we do know it will be based on a new chip not released in any other product line so far – the Kepler GK110. Expectations are for a similar improvement in DP over the M2090 as the K10 delivers in SP, i.e. threefold, taking it to somewhere between 1.5 and 2 Teraflops of peak DP performance.

While NVIDIA held back some of the more detailed specifications of K20, it did debut some very interesting new features of the product which, while not related to absolute performance, will be very significant in getting the best use out of the hardware. Of most interest to the HPC market will be Dynamic Parallelism, Hyper-Q and RDMA GPU Direct.

Before the announcement of “RDMA GPU Direct” this week, GPU Direct was already an overloaded term referring to two separate capabilities:

  • Intra-node memory copies between multiple GPUs in a single node without the involvement of the CPU or system memory (a minimal sketch of this peer-to-peer style copy follows this list).
  • Inter-node communication between multiple GPUs in separate nodes in a heterogeneous cluster architecture. In this instance GPU Direct referred to a feature allowing removal of a redundant copy within the CPU (host) system memory when transferring data between GPUs on separate nodes, i.e. removal of the copy represented by the red arrow in Fig. 1 below.
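
As a rough illustration of the first, intra-node flavour, here is a minimal sketch of a peer-to-peer copy between two GPUs in the same node using the CUDA runtime API (the buffer size and device numbers are arbitrary and error checking is omitted):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t bytes = 1 << 20;   // 1 MB payload, arbitrary size
    int canAccess = 0;

    // Check whether device 0 can address device 1's memory directly
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);
    if (!canAccess) { printf("Peer access not supported\n"); return 1; }

    float *buf0, *buf1;
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);   // enable direct access to device 1
    cudaMalloc(&buf0, bytes);

    cudaSetDevice(1);
    cudaMalloc(&buf1, bytes);

    // Copy device 0 -> device 1 without staging through host memory
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);

    cudaFree(buf1);
    cudaSetDevice(0);
    cudaFree(buf0);
    return 0;
}
```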

 

Figure 1. Pre-existing inter-node GPU Direct (CUDA 4.0)

Note from Fig. 1 that even with this version of inter-node GPU Direct the data must still be copied from the GPU memory to the CPU memory before going out on the network.

The new RDMA GPU Direct feature announced for the upcoming GK110 and CUDA 5.0 release will apparently remove this copy from GPU memory to CPU memory entirely, allowing the NIC/IB adapter to directly access data in the GPU memory without any involvement of the CPU or system memory, as shown in Fig. 2 below.
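
To give a feel for what this means at the application level, here is a hedged sketch assuming a CUDA-aware MPI build (e.g. recent MVAPICH2 or Open MPI): device pointers are handed straight to MPI, and whether the transfer then takes the staged path of Fig. 1 or the direct RDMA path of Fig. 2 depends on the hardware and driver stack underneath rather than on the application code.

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    double *d_buf;
    cudaMalloc(&d_buf, n * sizeof(double));   // buffer lives in GPU memory

    if (rank == 0) {
        // A CUDA-aware MPI accepts the device pointer directly; with RDMA
        // GPU Direct the NIC can read it without a host-memory copy.
        MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```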

Figure 2. RDMA GPU Direct with GK110 (CUDA 5.0)

It will be very interesting to see what improvements to latency can be delivered to applications which rely on inter-node GPU communication – an issue which becomes crucial as we enter an era where strong scaling becomes more and more important (see below!).

The other really interesting features announced at the conference for HPC were Hyper-Q and Dynamic Parallelism.

Dynamic Parallelism gives the GPU a role in creating and launching work itself, rather than only the CPU being allowed to kick off kernels on the GPU in a request-response pattern. This opens up a whole cornucopia of previously unavailable algorithmic and design avenues and cuts down on the need for CPU involvement (and thus GPU-CPU traffic) in computations.
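
As an illustrative sketch (not NVIDIA’s own example), a GK110-class parent kernel could launch child kernels itself as follows; this requires compute capability 3.5 and relocatable device code (roughly, nvcc -arch=sm_35 -rdc=true linked with -lcudadevrt), and the refine() work-splitting shown is invented purely for illustration:

```cuda
#include <cuda_runtime.h>

__global__ void refine(float *data, int offset, int n) {
    int i = offset + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < offset + n) data[i] *= 2.0f;      // placeholder "refinement" work
}

// The parent kernel decides, per data segment, whether to spawn extra work -
// without ever returning control to the CPU.
__global__ void parent(float *data, int segSize, float threshold) {
    int seg = blockIdx.x * blockDim.x + threadIdx.x;
    int offset = seg * segSize;
    if (data[offset] > threshold) {
        // Device-side kernel launch: only possible with Dynamic Parallelism
        refine<<<(segSize + 255) / 256, 256>>>(data, offset, segSize);
    }
}

int main() {
    const int segs = 128, segSize = 1024;
    float *d_data;
    cudaMalloc(&d_data, segs * segSize * sizeof(float));
    cudaMemset(d_data, 0, segs * segSize * sizeof(float));
    // threshold of -1.0f chosen so every segment spawns a child kernel
    parent<<<1, segs>>>(d_data, segSize, -1.0f);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```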

Hyper-Q provides multiple hardware queues on the GPU, which enables multiple tasks (for example instigated by multiple MPI processes running on multiple cores of the host CPU) to run concurrently on the GPU. This helps drive efficient use of the GPU, allowing tasks from separate processes to occupy fractions of the GPU’s execution hardware at the same time.
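
A hedged sketch of the single-process case: several independent CUDA streams each launch a small kernel, and on a GK110 with Hyper-Q’s multiple hardware queues these can genuinely execute concurrently rather than being serialised behind a single work queue (the multi-process MPI-rank case additionally relies on a system-level proxy service so that ranks can share one GPU).

```cuda
#include <cuda_runtime.h>

__global__ void smallTask(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 1.0001f + 1.0f;   // trivial placeholder work
}

int main() {
    const int nStreams = 8, n = 1 << 16;
    cudaStream_t streams[nStreams];
    float *bufs[nStreams];

    for (int s = 0; s < nStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&bufs[s], n * sizeof(float));
        cudaMemset(bufs[s], 0, n * sizeof(float));
        // Each launch goes into its own stream; with Hyper-Q these map onto
        // separate hardware queues and can occupy the GPU concurrently.
        smallTask<<<(n + 255) / 256, 256, 0, streams[s]>>>(bufs[s], n);
    }

    cudaDeviceSynchronize();
    for (int s = 0; s < nStreams; ++s) {
        cudaStreamDestroy(streams[s]);
        cudaFree(bufs[s]);
    }
    return 0;
}
```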

You can find more detail on Hyper-Q, Dynamic Parallelism and RDMA GPU Direct in the GK110 whitepaper released today.

A couple of high-level observations on how this fits into general HPC architecture trends. Firstly, the ratio of memory capacity and memory bandwidth to compute is likely to continue to decrease, signifying the increasing necessity for applications to rely on strong scaling rather than the previously rich seam of weak scaling. K10 represents a fall of more than 60% in Bytes/FLOPs (memory capacity per FLOPs) and a reduction of 50% in Bytes/sec/FLOPs (memory bandwidth per FLOPs) compared to the M2090, both measured in SP FLOPs as befits K10’s target market (a rough check follows below). It will be interesting to see what the corresponding numbers are for the upcoming K20. These figures feed into general trends (led by heterogeneous supercomputer architectures) observed by Kogge & Dysart last year in their paper on Top 500 trends; see Fig. 3 below.
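
As a rough check on that first figure, using approximate published specifications (assumed here: ~6 GB and ~1.33 SP TFLOPS for the M2090 versus ~8 GB and ~4.6 SP TFLOPS aggregate for the K10):

$$\frac{8\ \mathrm{GB} / 4.6\ \mathrm{TFLOPS}}{6\ \mathrm{GB} / 1.33\ \mathrm{TFLOPS}} \approx \frac{1.7 \times 10^{-3}}{4.5 \times 10^{-3}} \approx 0.39,$$

i.e. a fall of just over 60% in memory capacity per FLOPs.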

Fig. 3 Memory Capacity per FLOPs in the Top 10 Supercomputers. Reproduced from "Using the TOP500 to Trace and Project Technology and Architecture Trends" Kogge & Dysart, 2011.

Secondly, some very welcome news. The Hyper-Q and, to a lesser extent, Dynamic Parallelism features are likely to lead to an increase in the efficiency of GPUs within HPC, a problem area also flagged by Kogge & Dysart in the same paper, where they suggest that the rapid decrease in efficiency evident for the heterogeneous architectures in Fig. 4 below may be due to “…the difficulty in keeping large numbers of shaders busy at the same time.”

Fig. 4 Efficiency, i.e. Sustained LINPACK performance as a percentage of peak FLOPs or Rmax/Rpeak. Reproduced from "Using the TOP500 to Trace and Project Technology and Architecture Trends" Kogge & Dysart, 2011.

GPU to GPU communication, leave that CPU alone!

Update 04/04/2012 – Are Intel planning to put on-chip Infiniband into a future MIC based product to address issues similar to those detailed below with inter-node GPU to GPU communication? Where the network logic resides in the system hierarchy is an important design point for supercomputing on the road to Exascale, and it seems to be moving closer to the compute all the time. For example, IBM’s Blue Gene/Q, with design involvement from Dr. Peter Boyle at the University of Edinburgh, integrates the network interface and routing (5D torus) directly within the CPU die. BG/Q currently occupies all top 5 positions in the Green 500 list of the most power efficient supercomputers worldwide.

The burgeoning trend for GPGPU accelerators in the most powerful supercomputers (see 2nd, 4th and 5th on the latest Top 500 list) looks set to continue in the medium term at least. Two new Cray XK6 CPU/GPU machines (Titan and Blue Waters) are due in the US this year, both targeted to outperform the current No. 1 on the Top 500 list. Further strengthening the case for GPGPU in HPC, NVIDIA’s next generation Kepler architecture (on a 28nm process shrink) was released in the consumer gaming market last week. Judging by reviews of the architecture, significant improvements in performance per watt for GPGPU in HPC seem only months away, once Kepler is incorporated into NVIDIA’s Tesla product line.

In light of these developments I found a talk from Professor Taisuke Boku of the University of Tsukuba, Japan at the recent joint EPCC/Tsukuba Exascale Computing Symposium in Edinburgh very interesting.

Professor Boku is Deputy Director of the Tsukuba Centre for Computational Sciences and leader of the Project for Exascale Computing Systems Development. Tsukuba have just brought online a new 800 Teraflop heterogeneous Intel Xeon/NVIDIA Tesla supercomputer. They have some very interesting plans for the next phase of this machine, which will see its performance break the 1 Petaflop barrier. This next phase will see the initial machine (named HA-PACS) augmented with the addition of a Tightly Coupled Accelerator (TCA) GPU cluster. TCA is Tsukuba’s novel proposal for inter-node GPU to GPU communication within heterogeneous supercomputer clusters. It enables direct inter-node GPU communication by adding a second inter-node network (separate from the existing Infiniband) within the cluster.

The trend towards heterogeneous HPC architectures including GPUs poses many challenges for HPC Systemware and Software Developers in making efficient use of the resources.

One obvious area of concern for existing multi-petaflop heterogeneous CPU/GPU architectures is the communication latency between co-operating GPUs in different nodes. There is no way for two GPUs to exchange data without the involvement of the CPU and its attached system memory. The only path available to ferry data between GPUs in different nodes is by first copying the data from the GPU memory to the CPU memory within the node before communication can take place as normal (e.g. via MPI) over the system interconnect.
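
In code, this staged path looks something like the following hedged sketch using plain (non-CUDA-aware) MPI; the function and buffer names are invented for illustration.

```cuda
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdlib>

// Send the contents of a GPU buffer on this node to a peer rank's GPU buffer
// by staging through host memory on both sides.
void exchange(double *d_send, double *d_recv, int n, int peer, int tag) {
    double *h_send = (double*)malloc(n * sizeof(double));
    double *h_recv = (double*)malloc(n * sizeof(double));

    // 1. GPU -> CPU copy on the sending side
    cudaMemcpy(h_send, d_send, n * sizeof(double), cudaMemcpyDeviceToHost);

    // 2. Ordinary host-to-host MPI over the interconnect
    MPI_Sendrecv(h_send, n, MPI_DOUBLE, peer, tag,
                 h_recv, n, MPI_DOUBLE, peer, tag,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    // 3. CPU -> GPU copy on the receiving side
    cudaMemcpy(d_recv, h_recv, n * sizeof(double), cudaMemcpyHostToDevice);

    free(h_send);
    free(h_recv);
}
```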

Figure 1. The model used by existing CPU-GPU Heterogeneous architectures for GPU-GPU communication. Data travels via CPU & Infiniband (IB) Host Channel Adapter (HCA) and Switch or other proprietary interconnect.

Recent attempts to address the GPU to GPU communication overhead in clusters have included GPUDirect, a joint design from Mellanox and NVIDIA to remove a redundant memory copy which is otherwise required within the CPU memory. Even after eliminating this redundant copy the data must still be copied from the GPU to the CPU memory (and vice-versa at the receiving node).

There is also recent work to provide an optimised MPI interface that works directly with CUDA device memory, which has shown promising results in reducing GPU to GPU latency, though it still does not approach CPU to CPU MPI latency. This approach again optimises the software rather than fundamentally changing the underlying route for GPU-GPU communication.

This circuitous route leads to what Tsukuba believe is unnecessary latency in the GPU to GPU pathway. As we look ahead to the Exascale era and beyond we can foresee GPU clustering being taken to extreme scale. This will make strong scaling all the more important for certain GPU-enabled applications which will no longer be able to rely on weak scaling at Exascale and beyond. Such applications will be particularly latency sensitive, making improvements in inter-node GPU to GPU communication vital.

Tsukuba’s TCA concept is enabled by a network design which they call PEARL (PCI-Express Adaptive and Reliable Link). The PEARL network is enabled by an FPGA communications chip being developed at Tsukuba called PEACH2 (PCI-Express Adaptive Communication Hub 2).

The PEACH2 chip sits on the PCI-Express bus connecting the GPU to a CPU and has power adaptive technology designed in from the ground up. PCI-Express operates on a master/slave protocol with a CPU usually at the master end and various peripherals (such as GPUs) at the slave end. The PEACH2 chip enables connecting slave devices (in this case GPUs in different nodes) directly to each other. The GPUs themselves thus require no modification for this slave-slave connection.

Figure 2. Data transfer between cooperating GPUs in separate nodes in a TCA cluster enabled by the PEACH2 chip.

As can be seen in Figure 3 below, the PEARL network constitutes a second inter-node network in the cluster, parallel to the existing Infiniband network.

Figure 3. Schematic of the PEARL network within a CPU/GPU cluster.

The first 800 Teraflop phase of the HA-PACS cluster which Tsukuba launched last month is a regular CPU/GPU architecture. Tsukuba hope to have development and testing of the PEACH2 FPGA chip finished this year and to add the second (TCA) phase to HA-PACS in 2013 with 200 – 300 Teraflops of performance utilising the PEARL network.

Figure 4. Schematic of the HA-PACS supercomputer with detail on the phase 2 TCA cluster due 2013.

This is some of the most novel work happening on GPU clustering on the path to Exascale and Professor Boku mentioned that Tsukuba were in consultation with NVIDIA about their work (TrueGPUDirect anyone?). It will be very interesting to see what kind of performance advantages over traditional CPU/GPU cluster architectures can be gained when the TCA cluster comes online next year.

Anatomy of a Supercomputer, Part 1: Meet HECToR

The department I study in at the University of Edinburgh manages the largest supercomputer in the UK on behalf of academic and industry users throughout Europe. The computer in question, HECToR (High End Computing Terascale Resource), was recently upgraded to its “Phase 3” incarnation. The upgraded machine is a Cray XE6 and currently sits at number 19 (660 Teraflops of LINPACK performance) on the Top 500 list of the most powerful computers in the world.

I’m going to give a whistlestop tour of the machine’s anatomy divided over two posts. In this post, Part 1, I will cover the structure of the individual processors from within a single core up to the physical package which fits into a socket on a system board. In Part 2 I will then continue from the level of the single socket up to the system as a whole which ultimately consists of 90,112 cores.

HECToR, like all Cray XE6s, is a homogeneous system, i.e. all the processing cores are of the same type. Cray also sell a sister system called the XK6 which is heterogeneous, swapping half the CPUs in the XE6 design for NVIDIA GPUs. Both systems use Cray’s award winning proprietary Gemini Interconnect for high-bandwidth, low-latency system wide communication.

In HECToR all the processors are AMD’s recently released Interlagos variant of the Bulldozer x86 CPU architecture running at 2.3GHz. This is quite a distinctive design for an x86 CPU as it follows a Clustered Multithreading (CMT) philosophy to increase hardware thread concurrency, compared to the Chip Multi-Processor (CMP) and Simultaneous Multithreading (SMT) approaches common in Intel and previous AMD x86 processors. The Interlagos processor has the highest core count of any commercially available processor at 16 CMT cores, providing 16 hardware threads (Intel’s Xeon E7 provides a higher hardware thread count of 20 with 10 SMT cores). Interlagos is aimed specifically at data centre server type workloads, with a nod to the comparatively niche (i.e. small revenue) High Performance Computing (HPC) market.

Starting with the worm’s eye view, we have a single core in the multi-core Interlagos processor. The cores in the Bulldozer architecture are not standalone units but come only in indivisible pairs, called modules. While the two cores in a module still run two hardware threads simultaneously, they share some physical components which would traditionally be replicated in every core. This partly shared two-core design presents interesting flexibility to software in how to use the resources available across the two cores.

Figure 1. Bulldozer module (2 cores)

In Fig. 1 we see the two cores with their own integer schedulers, integer arithmetic pipelines and L1 data caches. The elements highlighted in red (including 2MB of L2 cache and the fetch/decode hardware) are shared between the two cores in the module. In a more traditional design these elements would be split straight down the middle to give two independent standalone cores. The main motivation for this partly shared design is that instead of each core only having access to its own 128-bit floating point Fused Multiply-Accumulate (FMAC) unit (centre of Fig. 1), either core can requisition both 128-bit FMAC units to execute a single 256-bit floating point instruction, as shown in Fig. 2 below. The opportunities this flexibility presents for better performing software are another day’s discussion.

Figure 2. The floating point unit in a Bulldozer module (also visible at centre of Fig. 1)
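
To connect this shared floating point hardware back to the system-wide numbers quoted earlier: each module’s two 128-bit FMAC units can together retire 8 double precision FLOPs per cycle (4 per core), so a rough peak figure for the Phase 3 machine, under that assumption, is

$$90{,}112\ \mathrm{cores} \times 4\ \mathrm{FLOPs/cycle} \times 2.3 \times 10^{9}\ \mathrm{cycles/s} \approx 829\ \mathrm{TFLOPS},$$

of which the 660 Teraflop LINPACK figure quoted above represents roughly 80%.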

There is quite a way to go up the system hierarchy once we have covered this smallest unit. In the Interlagos processor four modules (i.e. 4 x Fig. 1) are first aggregated into what AMD, from a hardware perspective, call an Orochi die. From a programmer’s logical perspective the physical Orochi die is referred to as a Non-Uniform Memory Access (NUMA) region.

Figure 3. Orochi die/NUMA region (4 modules, 8 cores)

As can be seen from Fig. 3, each NUMA region contains four Bulldozer modules, plus 2MB of Level 3 cache per module and a single DRAM interface to connect the NUMA region to its main memory. The DRAM interface is what makes this collection of components a distinct logical entity from a software perspective when multiple NUMA regions are aggregated into a single package. In HECToR there is 8GB of DRAM connected to each NUMA region (giving a ratio of 1GB of main memory per core).

At the next step up the system hierarchy AMD takes two of these NUMA regions (i.e. 2 x Fig. 3) and puts them side by side in a single physical package, as shown in Fig. 4 below. I refer to this physical package as a socket (well, it fits into one, and the term lets me avoid “processor”, which is too vague in the current context). With each of the two NUMA regions in a socket connected to its own 8GB of DRAM, the purpose of the NUMA region distinction becomes clear: each region has faster access to its own 8GB of DRAM than it does to its twin NUMA region’s DRAM – with associated performance implications for software running across multiple cores in multiple NUMA regions.

Figure 4. AMD Opteron package for socket G34 (2 NUMA regions, 8 modules, 16 cores)

This is the point where we leave the world of AMD’s commodity products behind and enter the system level, where specialist HPC vendors like Cray take over. Their expertise lies in turning thousands of these commodity parts into a single coherent massively parallel High Performance Computer – a product of which they may sell only a few dozen each year worldwide.

In Part 2 I will continue the tour from the socket level (Fig. 4) up to the system as a whole. Along the way we’ll see how Cray combine sockets to form nodes and use their proprietary Gemini Interconnect to build out the system to thousands of nodes. Nodes are packaged into blades, blades into chassis and chassis into cabinets.

Then there is only one question left to answer; how many cabinets can you afford?

Figure 5. "Cielo", a 96 cabinet Cray XE6 belonging to the National Nuclear Security Administration at Los Alamos National Laboratory, USA

Figure 5. "Cielo", a 96 cabinet Cray XE6 belonging to the National Nuclear Security Administration at Los Alamos National Laboratory, USA

 

Seeing Galaxies in Parallel

Following my first week of lectures in the MSc in High Performance Computing at the University of Edinburgh (which went great, still some module choices to nail down this week), I went to visit the Royal Observatory of Edinburgh.

The Observatory happens to be on a hill right behind where I live in Blackford, which was a nice surprise. I hope to go back soon for one of their star-gazing Friday nights, where members of the public get the chance to peer at the sky through a 10″ Meade reflecting telescope, among others.

The highlight of my first visit yesterday was getting up close and personal with the K-band Multi Object Spectrograph (KMOS) detector which has been assembled at the observatory and will shortly be shipped to the European Southern Observatory’s Very Large Telescope (VLT) array in Chile. The KMOS detector is a large piece of optical hardware the size of a small bus which will be fitted to one of the telescopes at the VLT and is designed to observe up to 24 individual objects simultaneously. Because it detects objects in the near-infrared, the entire KMOS package is effectively a large deep freezer, or cryostat (maintained at an internal temperature below minus 130 degrees Celsius), containing 24 robotic arms, each fitted with a tiny mirror the size of a thumb-tack head. The intense cold prevents any infrared radiation from the detector itself interfering with or masking the near-infrared light collected by the telescope.

The 24 robotic arms position their tiny mirrors within the focal plane of the telescope and pick out the 24 objects in the field of view to be observed simultaneously. The main scientific goal of the detector is to study and better understand the formation of highly red-shifted galaxies and galaxy clusters, i.e. galaxies which formed very early in the history of our universe.

The KMOS detector is expected to ship to the VLT in Chile next March, so a very exciting time for all involved I’m sure. Check out this stunning time lapse film of the VLT array in Chile in action under some beautiful night skies…

About to start the MSc. in High Performance Computing at Edinburgh University

I’m really excited in advance of moving to Edinburgh this weekend to start the 12 month MSc. in High Performance Computing at EPCC within the School of Physics and Astronomy at the University of Edinburgh.

I’m really looking forward to returning to full-time education, having taken a bit of a detour since I completed my Bachelor Degree in Computer Applications at DCU in 2002. EPCC is one of Europe’s leading HPC centres and is host to the UK’s most powerful supercomputer, HECToR, which is a Cray XE6 with 44,544 cores. So far I’ve been very impressed to learn about the wealth of experience the people at EPCC have in HPC research, in collaborating with and providing HPC resources to industry, and of course in teaching the next generation of HPC practitioners via the MSc.

I hope to write soon about some of the activities which are going on at EPCC that have caught my eye already.