Fugaku, the Japanese computing cluster that won the Top500 race of the world’s fastest supercomputers earlier this year, has taken the gold medal in the twice-yearly contest again. This time around, it has extended its lead over the rest of the field.
The system, built by Fujitsu for Japan’s RIKEN Center for Computational Science, posted a maximum sustained performance of 442,010 teraflops on the Linpack benchmark. (A “flop” is a floating-point operation per second, so “teraflops per second” would be redundant.) The non-profit Top500 organization released its November 2020 results this week.
That’s a 6.4 percent speed improvement over Fugaku’s score on the same test, posted last June. During the official announcement of the results Monday, RIKEN’s director, Satoshi Matsuoka, attributed the improvement to finally being able to use the entire machine rather than just a good chunk of it. The November 2020 Top500 list shows that the machine’s core count has increased by about 330,000 – the equivalent of about 6,912 additional Fujitsu Arm A64FX processors.
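The core-count figure checks out against the chip’s published specification. The A64FX exposes 48 compute cores per processor, so the reported increase works out as follows (an illustrative sanity check, not from the article):

```python
# Sanity check on the reported core-count increase (illustrative only).
CORES_PER_A64FX = 48        # compute cores per Fujitsu A64FX processor
added_processors = 6_912    # additional chips implied by the Nov 2020 list

added_cores = added_processors * CORES_PER_A64FX
print(added_cores)          # 331776, i.e. "about 330,000"
```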
Looks Like a Butterfly…
RIKEN couldn’t use the system’s full power for the June competition because the team simply didn’t have enough time, Matsuoka explained. “We only had two weeks to the deadline from the time we brought up the final nodes of the machine. . . We had very little time, and a lot of benchmarks had been compromised.”
Since then, his team had ample opportunity not only to bring up the remaining nodes in the full cluster but also to fine-tune the code for maximum performance.
“I don’t think we can improve much anymore,” he said.
Fugaku scored what the supercomputing industry now calls a “triple,” taking first place not only on Linpack but also on the accompanying HPCG and HPL-AI benchmarks. HPCG provides what long-time Top500 co-maintainer Martin Meuer called “another angle into the hardware,” giving more favorable scores to systems that are tuned for efficiency and control of memory bandwidth, for instance. HPL-AI is a much faster-running test, trading high-precision floating-point operations for the lower-precision math more commonly used in machine learning — for example, for training convolutional neural networks.
Fugaku is one of only two systems in the Top500 with custom processing architectures. The other is Sunway TaihuLight, in fourth place this time around but previously the champion for two consecutive years, built for China’s National Supercomputing Center in Jiangsu province.
…Stings Like a Bee
Fugaku’s extended performance lead represents even more success for Arm architectures, which have been receiving much greater scrutiny now that the physical barriers of Moore’s Law appear to have been reached. We asked Matsuoka whether there are any lessons to be learned from Fujitsu’s development of its own supercomputer-exclusive processors that can explain why Fugaku continues to enjoy such a performance advantage over x86-based systems on the list.
Lesson number one, he told DCK, was the importance of co-design. “When, globally, we embarked on this endeavor of trying to reach exascale [1 exaflop per second and beyond], the emphasis was not just placed on achieving exascale, but really to excel in application performance.
Analyzing what it would take to reach that goal with the technology available in 2020, his team concluded that it wouldn’t be possible with off-the-shelf chips. “We had to rethink what we would do,” he said.
At that time, the eventual co-designers of the A64FX determined that, on its then-current growth trajectory, the memory bandwidth available to x86 server-class CPUs would touch 200 GB/s (gigabytes per second). For the applications RIKEN was planning, they would need 1 TB/s. Soon, AMD, Samsung, and SK Hynix started collaborating on High-Bandwidth Memory (HBM), in which bandwidth is increased by stacking DRAM modules atop one another. That might score some points with cloud or enterprise server manufacturers, but he said HBM would cause issues with respect to cache hierarchies and chip interconnects, factors that would have been detrimental to the deterministic performance that HPC components require.
“It was very important to think about what would be your objective, what would be the target, and really start from a clean slate,” RIKEN’s leader told us. “Think about whether your goal would be met by off-the-shelf processors, or you’d really have to invent your own. In our case, it was the latter.”
The Moore’s Law Pitfall
Up until a few years ago, supercomputer operators looked to Moore’s Law (the doubling of compute capacity in new systems roughly every 18 months) for an estimate of how long they should expect to operate the machines they have before taking the plunge and investing in new ones. Specialists know how much power they’ll need from their machines to solve particular problems, and Moore’s Law projected when machines would become available that could deliver such power, if they didn’t have it at the time.
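That planning estimate amounts to simple exponential growth. A minimal sketch, assuming the classic 18-month doubling period, of the capacity multiple an operator could expect after a given number of years:

```python
# Rough Moore's-Law planning estimate (illustrative sketch): the capacity
# multiple available after `years` if compute doubles every 18 months.
def moores_law_multiple(years, doubling_period=1.5):
    return 2 ** (years / doubling_period)

print(moores_law_multiple(3))   # 4.0  -- two doublings in three years
print(moores_law_multiple(6))   # 16.0 -- four doublings in six years
```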
Earlier this week, during the Supercomputing 2020 conference (this year held online), Lawrence Berkeley National Laboratory’s Erich Strohmaier presented a chart clearly depicting the pothole left by the decline of Moore’s Law. From the list’s inception in 1993 until 2013, according to this chart, performance among all machines in the Top500 grew in fits and starts, averaging out to 180 percent annually, or 1,000x over an 11-year period. The saw-toothed line during this period could be attributed to the famous “tick-tock” cadence of Intel’s innovation agenda. During this phase, Intel tended to represent the lion’s share of CPUs in supercomputers, although there were years when AMD was competitive.
In 2013, said Strohmaier, there was a significant downturn in the global economy, which prompted research institutions to extend their investments in existing systems, postponing their planned replacements by at least two years. At this point, growth stumbled to an average of about 140 percent annually, even though the economy was in recovery up until the pandemic.
“After June 2013, all segments in the Top500 basically slowed down to the new growth rates,” said Strohmaier. “Those rates are substantially lower than before. . . If you multiply that out, it now takes about 20 years for an increase of 1000x.”
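Strohmaier’s two growth regimes can be checked with a few lines of arithmetic. Reading “180 percent” and “140 percent” as year-over-year multipliers of 1.8x and 1.4x (an interpretation on our part, not stated in his talk), the 11-year and roughly 20-year horizons for a 1,000x gain fall out directly:

```python
import math

def years_to_factor(annual_multiplier, target=1000.0):
    """Years needed for cumulative growth to reach `target`
    at a constant year-over-year multiplier."""
    return math.log(target) / math.log(annual_multiplier)

print(round(years_to_factor(1.8), 1))   # 11.8 years at 1.8x/year
print(round(years_to_factor(1.4), 1))   # 20.5 years at 1.4x/year
```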
The pace at which commercial HPC owners replace their systems, already slowed by the previous economic slowdown and the erosion of Moore’s Law, slowed further still during the pandemic, Strohmaier reported. Since 2013, the semi-annual replacement count for commercial systems had stayed roughly the same but plunged below the number of academic systems replaced.
“If anything, in 2020, the number of new systems in research centers has actually slightly increased,” he said. “Business as usual. The research centers have not delayed any purchases or installations in large numbers due to COVID. That has not happened.”
The pandemic may only have served to dampen what was already cooling enthusiasm for high-performance computing among commercial operators, including interest in GPU accelerators and other new architectures. Among the top 50 systems on the list, 17 still do not use accelerators at all. Another one-third use Nvidia-brand GPU accelerators, while the remainder use other types, such as FPGA or Intel Xeon Phi.
Among the top 100 systems on the list, about half of academic and half of commercial systems are powered by Intel’s x86 processors (not AMD’s) without any accelerators at all. It’s astonishing, Strohmaier said, that accelerators haven’t permeated the academic market to anywhere near the extent predicted. Of the commercial systems in the top 100, about 80 percent use Nvidia-brand accelerators.
This is resulting not just in a mix of systems across operator classes, but a kind of settling of those classes into their own peculiar profiles.
“Research institutions are much more willing to adopt new technologies, to try out new things, and to go with architectures that are not as common in the market,” Strohmaier said. For example, IBM Power AC922 systems coupled with Nvidia GPUs (eight of the current Top500 systems), AMD-based systems (24 of 500), and Arm-based systems (three of 500).