Summit (OLCF-4) – Supercomputers – WikiChip
Summit (OLCF-4) is Titan's successor, a 200-petaFLOP supercomputer operated by the DOE's Oak Ridge National Laboratory. Summit was officially unveiled on June 8, 2018 as the fastest supercomputer in the world, surpassing the Sunway TaihuLight. Summit is expected to be succeeded by Frontier in 2021.
History
Summit is one of three systems procured as part of the Collaboration of Oak Ridge, Argonne, and Lawrence Livermore national laboratories (CORAL) program. Research and planning began in 2012, with initial system delivery arriving in late 2017. The full system arrived in early 2018 and was officially unveiled on June 8, 2018. Summit is estimated to have cost around $200 million as part of the CORAL procurement program.
Overview
Summit was designed to deliver a 5-10x improvement in performance over Titan for real large-scale science workloads. Compared to Titan, which had 18,688 nodes (AMD Opteron + Nvidia Kepler) and a 9 MW power consumption, Summit slightly increases the power consumption to 13 MW and reduces the node count to only 4,608, but raises the peak theoretical performance from 27 petaFLOPS to approximately 225 PF. Summit has over 200 petaFLOPS of theoretical compute power and over 3 exaFLOPS for AI workloads.
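The headline comparison with Titan can be sanity-checked with simple arithmetic. A minimal sketch, using only the figures quoted above:

```python
# Back-of-the-envelope check of the Titan -> Summit comparison
# (all inputs are the figures quoted in this article).

titan_nodes, titan_peak_pf, titan_power_mw = 18_688, 27, 9
summit_nodes, summit_peak_pf, summit_power_mw = 4_608, 200, 13

# Summit reaches ~200 PF with roughly a quarter of Titan's node count.
node_reduction = titan_nodes / summit_nodes        # ~4.06x fewer nodes
peak_improvement = summit_peak_pf / titan_peak_pf  # ~7.4x more peak FLOPS

# Performance per megawatt improves even though total power rises.
titan_pf_per_mw = titan_peak_pf / titan_power_mw     # 3.0 PF/MW
summit_pf_per_mw = summit_peak_pf / summit_power_mw  # ~15.4 PF/MW

print(round(node_reduction, 2), round(peak_improvement, 1))
```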
| Components | | |
|---|---|---|
| Processor | CPU | GPU |
| Type | POWER9 | V100 |
| Count | 9,216 (2 × 18 × 256) | 27,648 (6 × 18 × 256) |
| Peak FLOPS | 9.96 PF | 215.7 PF |
| Peak AI FLOPS | – | 3.456 EF |

| System | | | |
|---|---|---|---|
| Rack | Compute Racks | Storage Racks | Switch Racks |
| Type | AC922 | SSC (4 ESS GL4) | Mellanox IB EDR |
| Count | 256 racks × 18 nodes | 40 racks × 8 servers | 18 racks |
| Power | 59 kW | 38 kW | – |

Total system power: 13 MW.
Summit has over 10 petabytes of memory.
| Summit Total Memory | | | |
|---|---|---|---|
| Type | DDR4 | HBM2 | NVMe |
| Node | 512 GiB | 96 GiB | 1.6 TB |
| Summit | 2.53 PB | 475 TB | 7.37 PB |
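The system-wide row is simply the per-node row multiplied by 4,608 nodes; the published totals appear to be expressed in decimal petabytes/terabytes derived from the binary per-node sizes. A quick sketch (the 1.6 TB per-node NVMe figure comes from the compute-node section below):

```python
NODES = 4608  # 256 racks x 18 nodes

# Per-node capacities (DDR4 and HBM2 are binary GiB; NVMe is decimal TB).
ddr4_bytes = 512 * 2**30  # 512 GiB
hbm2_bytes = 96 * 2**30   # 96 GiB
nvme_bytes = 1.6e12       # 1.6 TB

# System-wide totals in decimal units, matching the table above.
ddr4_total_pb = NODES * ddr4_bytes / 1e15  # ~2.53 PB
hbm2_total_tb = NODES * hbm2_bytes / 1e12  # ~475 TB
nvme_total_pb = NODES * nvme_bytes / 1e15  # ~7.37 PB

print(round(ddr4_total_pb, 2), round(hbm2_total_tb), round(nvme_total_pb, 2))
```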
Architecture
System
Weighing over 340 tons, Summit takes up 5,600 sq. ft. of floor space at Oak Ridge National Laboratory. Summit consists of 256 compute racks, 40 storage racks, 18 switch director racks, and 4 infrastructure racks. Servers are connected via a Mellanox IB EDR interconnect in a three-level non-blocking fat-tree topology.
Compute rack
Each of Summit's 256 compute racks consists of 18 compute nodes along with a Mellanox IB EDR switch for a non-blocking fat-tree interconnect topology (in practice it appears to be a pruned 3-level fat-tree). With 18 nodes, each rack has 9 TiB of DDR4 memory and another 1.7 TiB of HBM2 memory, for a total of 10.7 TiB of memory. A rack has a 59 kW max power draw and a total of 864 TF/s of peak compute power (ORNL reports 775 TF/s).
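The rack-level aggregates follow from the per-node figures given in the compute-node section below. A sketch, assuming the rack peak is 18 times the ~47.9 TF double-precision node peak (CPUs plus GPUs):

```python
NODES_PER_RACK = 18

# Per-node figures from the compute-node section.
ddr4_gib = 512       # DDR4 per node
hbm2_gib = 96        # HBM2 per node
node_peak_tf = 47.9  # ~1.1 TF (CPUs, DP) + 46.8 TF (GPUs, DP)

rack_ddr4_tib = NODES_PER_RACK * ddr4_gib / 1024  # 9 TiB
rack_hbm2_tib = NODES_PER_RACK * hbm2_gib / 1024  # ~1.7 TiB
rack_peak_tf = NODES_PER_RACK * node_peak_tf      # ~862 TF/s (quoted as 864)

print(rack_ddr4_tib, round(rack_hbm2_tib, 2), round(rack_peak_tf))
```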
Compute node
The basic compute node is the Power Systems AC922 (Accelerated Computing), formerly codenamed Witherspoon. The AC922 comes in a 19-inch 2U rack-mount case.
Each node has two 2,200 W power supplies, four PCIe Gen 4 slots, and a BMC card. There are two 22-core POWER9 processors per node, each with eight DIMMs. For the Summit supercomputer, these are 32 GiB DDR4-2666 DIMMs, for a total of 256 GiB and 170.7 GB/s of aggregate memory bandwidth per socket. There are three V100 GPUs per POWER9 socket. These use the SXM2 form factor and come with 16 GiB of HBM2 memory each, for a total of 48 GiB of HBM2 and 2.7 TB/s of aggregate bandwidth per socket.
Socket
Since the IBM POWER9 processor has native on-die NVLink connectivity, the GPUs are connected directly to the CPU. The POWER9 processor has six NVLink 2.0 bricks, which are divided into three groups of two bricks. Since NVLink 2.0 increased the signaling rate to 25 GT/s, two bricks allow for 100 GB/s of bandwidth between the CPU and GPU. In addition to everything else, there are 48 PCIe Gen 4 lanes for I/O. The Volta GPUs also have six NVLink 2.0 bricks divided into three groups: one group is used for the CPU, while the other two groups interconnect every GPU to every other GPU. As with the GPU-CPU links, the aggregate bandwidth between two GPUs is also 100 GB/s.
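The 100 GB/s figure follows from the NVLink 2.0 signaling rate. A minimal sketch, assuming a brick carries 8 lanes per direction at 25 GT/s with one bit per transfer per lane:

```python
# NVLink 2.0 bandwidth arithmetic for the CPU<->GPU links described above.
# Assumption: one brick = 8 differential lanes per direction at 25 GT/s.

LANES_PER_BRICK = 8
SIGNALING_GTS = 25  # GT/s, one bit per transfer per lane

# One brick moves 25 GB/s in each direction.
brick_gbs_per_dir = LANES_PER_BRICK * SIGNALING_GTS / 8  # 25.0 GB/s

# Each CPU<->GPU (and GPU<->GPU) link is a group of two bricks,
# giving 100 GB/s of aggregate bidirectional bandwidth.
link_gbs_bidir = 2 * brick_gbs_per_dir * 2  # 100.0 GB/s

print(brick_gbs_per_dir, link_gbs_bidir)
```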
| Single-socket Capabilities | | |
|---|---|---|
| Processor | POWER9 | V100 |
| Count | 1 | 3 |
| FLOPS (SP) | 1.081 TFLOPS (22 × 49.12 GFLOPS) | 47.1 TFLOPS (3 × 15.7 TFLOPS) |
| FLOPS (DP) | 540.3 GFLOPS (22 × 24.56 GFLOPS) | 23.4 TFLOPS (3 × 7.8 TFLOPS) |
| AI FLOPS | – | 375 TFLOPS (3 × 125 TFLOPS) |
| Memory | 256 GiB DDR4 (8 × 32 GiB) | 48 GiB HBM2 (3 × 16 GiB) |
| Bandwidth | 170.7 GB/s (8 × 21.33 GB/s) | 900 GB/s per GPU |
There are two sockets per node. Communication between the two POWER9 CPUs is done over IBM's X bus, a 4-byte, 16 GT/s link providing 64 GB/s of bidirectional bandwidth. A node has four PCIe Gen 4 slots: two x16 (with CAPI support), one x8 (also with CAPI support), and a single x4 slot. One of the x16 slots comes from one CPU, the other from the second. The x8 is configurable from either of the CPUs, and the last x4 slot comes from the second CPU only. The rest of the PCIe lanes are used for various I/O functions (PEX, USB, BMC, and 1 Gbps Ethernet).
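The X-bus and lane budget can be checked with simple arithmetic. A sketch under the figures above (the 48-lanes-per-CPU count comes from the socket section; attributing every leftover lane to miscellaneous I/O is an assumption):

```python
# X-bus and PCIe Gen 4 lane arithmetic for one AC922 node,
# following the description above.

# IBM's X bus: 4 bytes wide at 16 GT/s.
xbus_gbs = 4 * 16  # 64 GB/s of bidirectional bandwidth

# Each POWER9 provides 48 PCIe Gen 4 lanes; the four slots consume
# two x16, one x8, and one x4.
total_lanes = 2 * 48
slot_lanes = 16 + 16 + 8 + 4  # 44 lanes in slots
io_lanes = total_lanes - slot_lanes  # remainder: NIC, PEX, USB, BMC, 1 GbE

print(xbus_gbs, slot_lanes, io_lanes)
```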
The node has a Mellanox InfiniBand ConnectX-5 (IB EDR) NIC installed, which supports 100 Gbps of bi-directional traffic. This card sits in a shared PCIe Gen 4 x8 slot that directly connects x8 lanes to each of the two processors. With 12.5 GB/s per port (25 GB/s peak bandwidth), there is higher bandwidth of 16 GB/s per x8 link (32 GB/s peak aggregate bandwidth) to the CPUs. This gives each CPU direct access to the InfiniBand card, reducing bottlenecks while providing high bandwidth.
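As a rough check on the NIC figures, a sketch using raw signaling rates (encoding overheads such as PCIe 128b/130b are ignored, as in the text):

```python
# Bandwidth arithmetic for the ConnectX-5 IB EDR NIC described above
# (raw signaling rates; encoding overhead ignored).

# EDR InfiniBand: 100 Gbps per direction per port.
edr_gbs = 100 / 8           # 12.5 GB/s per port
edr_peak_gbs = 2 * edr_gbs  # 25 GB/s peak bidirectional

# The shared slot wires a PCIe Gen 4 x8 link (16 GT/s per lane) to each CPU.
pcie_x8_gbs = 8 * 16 / 8            # 16 GB/s per direction
pcie_x8_peak_gbs = 2 * pcie_x8_gbs  # 32 GB/s peak aggregate

# Each CPU's 16 GB/s path exceeds the 12.5 GB/s a port can push,
# so neither CPU's route to the NIC is a bottleneck.
print(edr_gbs, pcie_x8_gbs)
```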
Each POWER9 processor operates at 3.07 GHz and supports concurrent execution of two vector single-precision operations; in other words, each core can execute 16 single-precision floating-point operations per cycle. At 3.07 GHz, this works out to 49.12 gigaFLOPS of peak theoretical performance per core. A full node has a little under 1.1 teraFLOPS (DP) of peak performance from the CPUs and about 47 teraFLOPS (DP) from the GPUs. Note that there is a slight discrepancy between our numbers and ORNL's. Buddy Bland, OLCF Program Director, informed us that their peak performance figure for Summit includes only the GPUs' peak numbers, because that is what most of the FP-intensive code will use to achieve the high performance. In theory, if we were to include everything, Summit actually has a higher peak performance of roughly ~220 petaFLOPS. There are 1.6 TB of NVMe flash storage attached to each node along with the Mellanox InfiniBand EDR NIC.
| Full Node Capabilities | | |
|---|---|---|
| Processor | POWER9 | V100 |
| Count | 2 | 6 |
| FLOPS (SP) | 2.161 TFLOPS (2 × 22 × 49.12 GFLOPS) | 94.2 TFLOPS (6 × 15.7 TFLOPS) |
| FLOPS (DP) | 1.081 TFLOPS (2 × 22 × 24.56 GFLOPS) | 46.8 TFLOPS (6 × 7.8 TFLOPS) |
| AI FLOPS | – | 750 TFLOPS (6 × 125 TFLOPS) |
| Memory | 512 GiB DDR4 (16 × 32 GiB) | 96 GiB HBM2 (6 × 16 GiB) |
| Bandwidth | 341.33 GB/s (16 × 21.33 GB/s) | 900 GB/s per GPU |
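The per-core arithmetic and the node- and system-level peaks above can be reproduced directly. A sketch (the system-wide figure assumes 4,608 nodes and double-precision peaks for both CPUs and GPUs, per the "include everything" discussion above):

```python
# Peak-FLOPS arithmetic for Summit, following the reasoning above.

FREQ_GHZ = 3.07
SP_FLOPS_PER_CYCLE = 16  # 16 single-precision FLOPs per cycle per core

core_sp_gflops = FREQ_GHZ * SP_FLOPS_PER_CYCLE  # 49.12 GFLOPS per core
socket_sp_tflops = 22 * core_sp_gflops / 1000   # ~1.081 TFLOPS (SP) per CPU

# GPU side: 7.8 TFLOPS (DP) per V100, six per node.
node_gpu_dp_tflops = 6 * 7.8  # 46.8 TFLOPS

# Node DP peak including the CPUs (DP per core is half the SP rate),
# then scaled to the full 4,608-node machine.
node_dp_tflops = 2 * 22 * 24.56 / 1000 + node_gpu_dp_tflops
system_pf = 4608 * node_dp_tflops / 1000  # ~220 PF

print(round(core_sp_gflops, 2), round(socket_sp_tflops, 3), round(system_pf, 1))
```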