Fugaku, in Japan, was the previous supercomputing champion, more than doubling the compute capacity of Summit at Oak Ridge, Tennessee.
What makes Fugaku different? First of all, its amazing simplicity, I think. It uses very low-power 48-core ARM chips running at a paltry 2.2 GHz, and there are no GPU accelerators. Consequently the supercomputer is very efficient and relatively inexpensive, which lets the budget stretch to a staggering 158,976 processors!
The Frontier system at Oak Ridge National Laboratory (ORNL) in Tennessee, US, is now the fastest computer in the world, and the only exascale machine reported, with a High-Performance Linpack (HPL) score exceeding one exaflop per second (1 EFLOP/s).
The system is based on the HPE Cray EX235a architecture and is equipped with AMD EPYC 64C 2 GHz processors, for 8,699,904 total cores.
Frontier increased its HPL score from 1.102 EFLOP/s in November 2022 to 1.194 EFLOP/s in June 2023, a roughly 8% increase. Exascale was seen as only an aspirational goal just a few years ago, which shows the rapid pace of technology development.
A Staggering 7.6 Million CPU Cores
Ironically, Fugaku and Summit are tied in efficiency at about 14.7 GFLOPS per watt despite their vastly different architectures: Summit uses Volta V100 GPU acceleration alongside 22-core IBM POWER9 server CPUs.
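That tie can be checked from the published Rmax and power figures. The values below are the approximate TOP500/Green500 numbers, so treat them as illustrative:

```python
# Rough efficiency check from approximate published figures (assumed values).
def gflops_per_watt(rmax_pflops: float, power_kw: float) -> float:
    """Convert Rmax (PFLOP/s) and power draw (kW) to GFLOPS per watt."""
    return (rmax_pflops * 1e6) / (power_kw * 1e3)

fugaku = gflops_per_watt(rmax_pflops=442.0, power_kw=29_899)
summit = gflops_per_watt(rmax_pflops=148.6, power_kw=10_096)

print(f"Fugaku: {fugaku:.1f} GFLOPS/W")  # ~14.8
print(f"Summit: {summit:.1f} GFLOPS/W")  # ~14.7
```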
In a sense, Fugaku mimics GPU acceleration with its vast array of small ARM processors. Yes, it may take on the order of 100 ARM cores to match the performance of a single V100, but the ARM cores can most likely exceed the V100's GPU cores in per-core throughput (IPC).
Okay, well…crunching the numbers: Summit has 27,648 Nvidia GPUs with 5,120 CUDA cores each. So the 7.6 million ARM cores in Fugaku outperform roughly 141 million GPU cores on Summit. This is a significant difference, and it leads me to believe that the per-core throughput of the ARM cores is more than an order of magnitude greater than that of the GPU cores.
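A back-of-the-envelope version of this calculation looks like the following. The Rmax figures are the published TOP500 numbers taken as assumed inputs, and the ratio ignores clock-speed differences, so it measures per-core HPL throughput rather than true IPC:

```python
# Back-of-the-envelope per-core comparison (published figures, assumed).
summit_gpu_cores = 27_648 * 5_120   # ~141.6 million CUDA cores
fugaku_arm_cores = 158_976 * 48     # ~7.6 million ARM cores

summit_rmax = 148.6e15  # FLOP/s (HPL Rmax)
fugaku_rmax = 442.0e15  # FLOP/s (HPL Rmax)

per_gpu_core = summit_rmax / summit_gpu_cores  # ~1.05 GFLOP/s per CUDA core
per_arm_core = fugaku_rmax / fugaku_arm_cores  # ~58 GFLOP/s per ARM core

ratio = per_arm_core / per_gpu_core
print(f"roughly {ratio:.0f}x per-core advantage")  # roughly 55x
```

Dividing out the clock difference (2.2 GHz ARM vs. roughly 1.3 GHz GPU boost clock) shrinks the gap, but it remains well over an order of magnitude.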
I have a feeling that these new wafer-scale computers from Tesla and Cerebras are going to be some brutally fast monsters. The Cerebras packs 850,000 AI processor cores onto a single wafer-scale chip about the size of a sheet of notebook paper.
They are then crammed into a monstrous cooling box that combines air cooling with refrigerant lines. Cool is an appropriate word: the Cerebras chips go for a cool $2M each. For that price you could build about two hundred 64-core Threadripper workstations, each with an RTX 3090.
Every year, the situation changes.
The Frontier system at Oak Ridge National Laboratory, Tennessee, USA, remains the No. 1 system on the June 2023 TOP500 and is still the only system reported with an HPL performance exceeding one EFlop/s.
Frontier brought the pole position back to the USA one year ago, on the June 2022 listing, with an HPL score of 1.102 EFlop/s; it has since improved to 1.194 EFlop/s.
Frontier is based on the latest HPE Cray EX235a architecture and equipped with AMD EPYC 64C 2GHz processors. The system has 8,699,904 total cores, a power efficiency rating of 52.59 gigaflops/watt, and relies on Slingshot-11 interconnect for data transfer.
The top position was previously held from June 2020 until November 2021 by the Fugaku system at the RIKEN Center for Computational Science (R-CCS) in Kobe, Japan. With its HPL benchmark score of 442 PFlop/s, Fugaku is now listed as No. 2.
The LUMI system at EuroHPC/CSC in Finland entered the list at No. 3 in June 2022. After an upgrade last November it keeps the No. 3 spot with an HPL score of 309.1 PFlop/s, and it remains the largest system in Europe.
The Leonardo system at EuroHPC/CINECA in Italy was first listed six months ago at No. 4. It was upgraded as well and remains No. 4 with an improved HPL score of 239 PFlop/s. It is based on the Atos BullSequana XH2000 architecture.
What is the fastest a computer can possibly be?
Answer – The major limitation is the speed of light
There have been computers made with an instruction cycle time of 2 nanoseconds (and that depends on the instruction; some can be much faster). That limit came from the need for the signals in the CPU to travel a total distance of about 29 centimeters, counting all the changes of direction, extra gates and such, if the path were stretched out straight.
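The 29 cm figure is consistent with real signal speeds: light in vacuum covers about 60 cm in 2 ns, and signals in wires and gates propagate at a sizable fraction of that. The 0.5c propagation factor below is an assumed illustrative value, not a measured one:

```python
# How far can a signal travel in one 2 ns clock cycle?
C = 299_792_458   # speed of light in vacuum, m/s
CYCLE_S = 2e-9    # 2 ns instruction cycle

vacuum_cm = C * CYCLE_S * 100  # distance light covers in vacuum: ~60 cm
signal_cm = vacuum_cm * 0.5    # at an assumed 0.5c in-circuit speed: ~30 cm

print(f"vacuum: {vacuum_cm:.1f} cm, in-circuit: {signal_cm:.1f} cm")
```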
With overlapped (pipelined) instruction execution the result is much faster than the base number would suggest, since two or more instructions may be processed at the same time. But then the problem is still getting data from memory and getting results back into memory… and that cycle is about 10 nanoseconds. Multi-ported memory helps here too, but it requires the system to support two or more simultaneous bus accesses.
This was the problem Seymour Cray attacked. His system used four ports to main memory (and all of main memory was equivalent to today's cache) for ONE CPU.
One port was dedicated to instructions, two ports were used for data input, and the last port was for data output. With overlapped memory operations the setup time could be rather long… but after that, the CPU had one unit of data available on each clock tick (of 2 nanoseconds).
Now when he added more processors, he added more parallel memory ports: four for each CPU. This maintained the throughput, but multiplied the cost…
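The scheme described above can be sketched as a toy model: after an initial pipeline-fill period, each port delivers or accepts one word per clock tick, so steady-state throughput depends only on the port count, not the memory latency. All numbers here are illustrative, not the real Cray timings:

```python
# Toy model of pipelined, multi-ported memory: latency hidden after setup.
def words_delivered(ticks: int, latency_ticks: int, ports: int) -> int:
    """Words moved in `ticks` clock ticks: nothing during the pipeline
    fill, then one word per port on every subsequent tick."""
    steady_ticks = max(0, ticks - latency_ticks)
    return steady_ticks * ports

# With a 5-tick setup and 4 ports, throughput approaches 4 words/tick:
print(words_delivered(ticks=100, latency_ticks=5, ports=4))  # 380
```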
Current processors can be faster – but only when the cache (L3 and below) contains the data needed. Any cache miss becomes a speed penalty… and throughput drops.
There is also the problem that most memory is no more than dual-ported, so eight busy cores would overload it… and introduce additional delays. So manufacturers try adding more cache (some Intel parts now carry 8 MB or more).
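The cache-miss penalty can be quantified with the standard average-memory-access-time formula. The latencies below are assumed round numbers, not measurements of any particular chip:

```python
# Average memory access time (AMAT): even a small miss rate dominates.
def amat(hit_ns: float, miss_penalty_ns: float, miss_rate: float) -> float:
    """Expected access time given cache hit time, miss penalty, miss rate."""
    return hit_ns + miss_rate * miss_penalty_ns

print(amat(hit_ns=1.0, miss_penalty_ns=100.0, miss_rate=0.0))   # 1.0 ns
print(amat(hit_ns=1.0, miss_penalty_ns=100.0, miss_rate=0.05))  # 6.0 ns
```

So a mere 5% miss rate makes the average access six times slower, which is why vendors keep piling on cache.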
So “how fast can a processor be” turns out to be problematic. It all depends on the system as a whole, not just the processor.
Right now, the Intel architecture has a number of built-in inefficiencies kept for backward compatibility. The CPU is faster than the system can supply data from peripherals… and if the peripherals speed up, then the CPU's access to memory has to slow down.
The Fastest Computer In The World – Frontier (Current)
Built on the HPE Cray EX235a architecture, Frontier is powered by AMD EPYC 64C 2 GHz processors, totaling 8,699,904 cores. Between November 2022 and June 2023 the system boosted its HPL performance from 1.102 EFLOP/s to 1.194 EFLOP/s, a roughly 8% improvement.
Achieving exascale computing, once considered an ambitious goal, now highlights the rapid advancements in technology.
Before Frontier, the Fugaku system at the RIKEN Center for Computational Science (R-CCS) in Kobe, Japan, held the title of the fastest supercomputer from June 2020 to June 2022, with a performance of 442 PFLOP/s. Fugaku, powered by Fujitsu's 48-core A64FX system-on-chip (SoC), was the first top-ranking system to use ARM processors.
Each node contains one optimised third-generation AMD EPYC processor and four AMD Instinct MI250X accelerators for a system-wide total of 9,472 CPUs and 37,888 GPUs.
These nodes provide developers with ease of programming for their applications owing to the coherency enabled by the EPYC processors and Instinct accelerators. HPE’s Slingshot interconnect is the world’s only high-performance Ethernet fabric designed for HPC and AI solutions.
By connecting several core components for improved performance (e.g., CPUs, GPUs, high-performance storage), Slingshot enables larger data-intensive workloads that would otherwise be bandwidth limited and provides higher speed and congestion control to ensure applications run smoothly.
Owing to this unique configuration and expanded performance, teams have taken a thoughtful approach to scaling the interconnect to a massive supercomputer such as Frontier, made up of 74 HPE Cray EX cabinets, to ensure reliable performance across applications.
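The published node totals line up with the cabinet count. The figure of 128 nodes per HPE Cray EX cabinet is the assumed packing here; the CPU and GPU counts per node come from the description above:

```python
# Frontier's totals derived from its cabinet layout (128 nodes/cabinet assumed).
cabinets = 74
nodes = cabinets * 128  # 9,472 nodes
cpus = nodes * 1        # one EPYC per node       -> 9,472 CPUs
gpus = nodes * 4        # four MI250X per node    -> 37,888 GPUs

print(f"{nodes} nodes, {cpus} CPUs, {gpus} GPUs")
```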