The department I study in at the University of Edinburgh manages the largest supercomputer in the UK on behalf of academic and industry users throughout Europe. The computer in question, HECToR (High End Computing Terascale Resource), was recently upgraded to its “Phase 3” incarnation. The upgraded machine is a Cray XE6 and currently sits at number 19 (660 teraflops of LINPACK performance) on the Top 500 list of the most powerful computers in the world.
I’m going to give a whistle-stop tour of the machine’s anatomy, divided over two posts. In this post, Part 1, I will cover the structure of the individual processors, from within a single core up to the physical package which fits into a socket on a system board. In Part 2 I will continue from the level of the single socket up to the system as a whole, which ultimately consists of 90,112 cores.
HECToR, like all Cray XE6s, is a homogeneous system, i.e. all the processing cores are of the same type. Cray also sell a sister system called the XK6 which is heterogeneous, swapping half the CPUs in the XE6 design for Nvidia GPUs. Both systems use Cray’s award-winning proprietary Gemini Interconnect for high-bandwidth, low-latency, system-wide communication.
In HECToR all the processors are AMD’s recently released Interlagos variant of the Bulldozer x86 CPU architecture, running at 2.3 GHz. This processor architecture is a distinctive design for an x86 CPU, as it follows a Cluster Multithreading (CMT) philosophy to increase hardware thread concurrency, in contrast to the Chip Multi-Processor (CMP) and Simultaneous Multithreading (SMT) approaches common in Intel and previous AMD x86 processors. The Interlagos processor has the highest core count of any commercially available processor at 16 CMT cores, providing 16 hardware threads (Intel’s Xeon E7 provides a higher hardware thread count of 20 with 10 SMT cores). Interlagos is aimed specifically at data-centre server workloads, with a nod to the comparatively niche (i.e. small revenue) High Performance Computing (HPC) market.
Starting at the worm’s-eye view, we have a single core in the multi-core Interlagos processor. The cores in the Bulldozer architecture are not standalone units but come only in indivisible pairs, called modules. While the two cores in a module still run two hardware threads simultaneously, they share some physical components which traditionally would be replicated in every core. This partly shared two-core design gives software interesting flexibility in how it uses the resources available across the two cores.
In Fig. 1 we see the two cores with their own integer schedulers, integer arithmetic pipelines and L1 data caches. The elements highlighted in red (including 2 MB of L2 cache and the fetch/decode hardware) are shared between the two cores in the module. In a more traditional design these elements would be split straight down the middle to give two independent standalone cores. The main motivation for this partly shared design is that, instead of each core only having access to its own 128-bit floating point Fused Multiply-Accumulate (FMAC) unit (centre of Fig. 1), either core can requisition both 128-bit FMAC units to execute a single 256-bit floating point instruction, as shown in Fig. 2 below. The opportunities this flexibility presents for better performing software are another day’s discussion.
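To put numbers on the FMAC sharing, here is a minimal sketch of the arithmetic. It assumes the standard IEEE 754 element sizes (64-bit double precision, 32-bit single precision); the function and variable names are mine, purely for illustration, and this says nothing about how instructions are actually scheduled in hardware.

```python
# Illustrative arithmetic only: how many floating point values fit
# into the FMAC widths described in the text.
FMAC_WIDTH_BITS = 128     # width of one FMAC unit (from the text)
FMACS_PER_MODULE = 2      # two units shared by the module's two cores

DOUBLE_BITS = 64          # assumption: IEEE 754 double precision
SINGLE_BITS = 32          # assumption: IEEE 754 single precision


def lanes(width_bits: int, element_bits: int) -> int:
    """Number of elements of element_bits that fit in width_bits."""
    return width_bits // element_bits


# A core using just one 128-bit FMAC unit.
doubles_per_unit = lanes(FMAC_WIDTH_BITS, DOUBLE_BITS)
singles_per_unit = lanes(FMAC_WIDTH_BITS, SINGLE_BITS)

# A core requisitioning both units for one 256-bit instruction.
fused_width_bits = FMAC_WIDTH_BITS * FMACS_PER_MODULE
doubles_fused = lanes(fused_width_bits, DOUBLE_BITS)
singles_fused = lanes(fused_width_bits, SINGLE_BITS)

print(f"one unit : {doubles_per_unit} doubles or {singles_per_unit} singles")
print(f"both units ({fused_width_bits}-bit): "
      f"{doubles_fused} doubles or {singles_fused} singles")
```

So a single 256-bit instruction covers twice as many values as a 128-bit one, but while one core holds both units the other core has no FMAC unit of its own to use.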
There is quite a way to go up the system hierarchy once we have covered this smallest unit. In the Interlagos processor, four modules (i.e. 4 x Fig. 1) are first aggregated into what AMD, from a hardware perspective, call an Orochi die. From a programmer’s logical perspective, the physical Orochi die is referred to as a Non-Uniform Memory Access (NUMA) region.
As can be seen from Fig. 3, each NUMA region contains four Bulldozer modules, plus 2 MB of Level 3 cache for each Bulldozer module and a single DRAM interface to connect the NUMA region to some main memory. The DRAM interface is what makes this collection of components a distinct logical entity from a software perspective when multiple NUMA regions are aggregated into a single package. In HECToR there is 8 GB of DRAM connected to each NUMA region (resulting in a ratio of 1 GB of main memory per core).
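The per-region totals follow directly from the figures above. A small sketch of the tally, using only numbers quoted in this post (the constant names are my own):

```python
# Tally of one NUMA region (Orochi die) from the figures in the text.
CORES_PER_MODULE = 2           # Bulldozer cores come in pairs
MODULES_PER_NUMA_REGION = 4    # four modules per Orochi die
L3_MB_PER_MODULE = 2           # 2 MB of L3 cache per module
DRAM_GB_PER_NUMA_REGION = 8    # HECToR's DRAM per NUMA region

cores_per_region = CORES_PER_MODULE * MODULES_PER_NUMA_REGION
l3_mb_per_region = L3_MB_PER_MODULE * MODULES_PER_NUMA_REGION
dram_gb_per_core = DRAM_GB_PER_NUMA_REGION / cores_per_region

print(f"cores per NUMA region : {cores_per_region}")
print(f"L3 per NUMA region    : {l3_mb_per_region} MB")
print(f"DRAM per core         : {dram_gb_per_core} GB")
```

Eight cores sharing 8 GB is where the 1 GB per core ratio comes from.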
At the next step up the system hierarchy, AMD takes two of these NUMA regions (i.e. 2 x Fig. 3) and puts them side by side in a single physical package, as shown in Fig. 4 below. This physical package I refer to as a socket (well, it fits into one, and it allows me to avoid the term processor, which is too vague in the current context). With each of the two NUMA regions in a socket connected to its own 8 GB of DRAM, the purpose of the NUMA region distinction becomes clear. Each region has faster access to its own 8 GB of DRAM than it does to its twin NUMA region’s DRAM, with associated performance implications for software running across multiple cores in multiple NUMA regions.
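To make the socket layout concrete, here is a rough sketch that maps a core number within a socket to its NUMA region and module. Note the big assumption: cores are numbered linearly 0 to 15, region by region and module by module. That numbering is hypothetical and chosen for illustration; the real numbering is whatever the operating system reports. The `locate` and `same_numa_region` helpers are my own names, not part of any tool.

```python
from typing import NamedTuple

CORES_PER_MODULE = 2
MODULES_PER_NUMA_REGION = 4
NUMA_REGIONS_PER_SOCKET = 2
TOTAL_CORES = 90_112           # whole-system core count from the text

CORES_PER_REGION = CORES_PER_MODULE * MODULES_PER_NUMA_REGION
CORES_PER_SOCKET = CORES_PER_REGION * NUMA_REGIONS_PER_SOCKET


class CoreLocation(NamedTuple):
    numa_region: int      # 0 or 1 within the socket
    module: int           # 0..3 within the NUMA region
    core_in_module: int   # 0 or 1 within the module


def locate(core_id: int) -> CoreLocation:
    """Place a core in the hierarchy, ASSUMING linear numbering."""
    if not 0 <= core_id < CORES_PER_SOCKET:
        raise ValueError(f"core_id must be in 0..{CORES_PER_SOCKET - 1}")
    region, rest = divmod(core_id, CORES_PER_REGION)
    module, core = divmod(rest, CORES_PER_MODULE)
    return CoreLocation(region, module, core)


def same_numa_region(a: int, b: int) -> bool:
    """True if both cores sit next to the same 8 GB of DRAM."""
    return locate(a).numa_region == locate(b).numa_region


sockets_in_system = TOTAL_CORES // CORES_PER_SOCKET

print(locate(5))
print(same_numa_region(3, 7), same_numa_region(7, 8))
print(f"{CORES_PER_SOCKET} cores per socket, {sockets_in_system} sockets")
```

Under this numbering, cores 3 and 7 share a region while cores 7 and 8 do not, so data placed by core 7 would be a slower, remote access for core 8.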
This is the point where we leave the world of AMD’s commodity products behind and enter the system level, where specialist HPC vendors like Cray take over. Their expertise lies in turning thousands of these commodity parts into a single, coherent, massively parallel High Performance Computer. It is a product of which they may sell only a low double-digit number each year worldwide.
In Part 2 I will continue the tour from the socket level (Fig. 4) up to the system as a whole. Along the way we’ll see how Cray combine sockets to form nodes and use their proprietary Gemini Interconnect to build out the system to thousands of nodes. Nodes are packaged into blades, blades into chassis and chassis into cabinets.
Then there is only one question left to answer: how many cabinets can you afford?