Image: The Fugaku supercomputer (credit: RIKEN Center for Computational Science)
Arm-powered Fugaku, in Kobe, Japan, is the world's fastest supercomputer as of November 2020, according to Top500.org.

Machine Learning Is Helping High-Performance Computing Go Mainstream

The proliferation of AI, combined with cloud platforms that make it easier to test the waters, has led more IT organizations to turn to HPC-style infrastructure.

Once confined to specialized workloads with high compute requirements, such as academic and medical research, financial modeling, and energy exploration, high-performance computing, or HPC, has in recent years been finding its way into IT of all stripes.

This has partly been brought about by the mainstreaming of machine learning (a subset of artificial intelligence), which generally operates at a snail's pace on conventional servers and needs the added oomph that HPC brings to the table.

So what is HPC? Like much in life, its boundaries aren't clearly defined, and it's part of a continuum.

Although the term is often used interchangeably with supercomputers (behemoth systems such as Fugaku, which employs close to 160,000 processors to produce 415.53 petaflops), HPC systems actually range from clusters of garden-variety racked x86 servers and storage devices all the way up to machines like Fugaku.

They're much faster than typical servers, generally employing far more silicon than conventional systems, both CPUs and GPUs, with the GPUs used to "accelerate" the system by offloading some of the number crunching from the CPUs. Increasingly, they also take advantage of specialized silicon, such as RISC-V chips built into the network hardware to handle some compute while data is in transit.
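To make the CPU/GPU division of labor concrete, here is a minimal Python sketch that offloads a large matrix multiplication to a GPU and copies the result back to the host. The article names no particular tools; the CuPy library and a CUDA-capable GPU are assumptions here, and the same pattern applies to other accelerator frameworks.

    # Minimal sketch of GPU "acceleration": the CPU (host) hands the heavy
    # number crunching to the GPU (device), then copies the result back.
    # Assumes a CUDA-capable GPU and the CuPy library -- neither is named
    # in the article; this is purely illustrative.
    import numpy as np
    import cupy as cp

    a_host = np.random.rand(4096, 4096).astype(np.float32)
    b_host = np.random.rand(4096, 4096).astype(np.float32)

    # Copy the inputs into GPU memory (the "offload" step).
    a_dev = cp.asarray(a_host)
    b_dev = cp.asarray(b_host)

    # The matrix multiply runs on the GPU, leaving the CPU cores free.
    c_dev = a_dev @ b_dev

    # Bring the result back to host memory for the rest of the pipeline.
    c_host = cp.asnumpy(c_dev)
    print(c_host.shape)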

They're also costly, both in terms of CapEx and OpEx, which is why you're much more likely to find HPC systems sitting in on-premises data centers than on colocation floors.

"If you were to drop an HPC system in a colo center, generally speaking, the power density in an HPC system is going to drive the cost higher than would a normal series of web servers," John Leidel, founder and chief scientist at HPC research and development company Tactical Computing Laboratories, told DCK. "Web servers run fairly low VID cores, the memory density is not very high, they generally don't have GPUs, and things like that.

"The other thing is your typical web server is not going to run with all the cores utilized all the time, whereas on an HPC system that's exactly the situation that we want," he added. "We want to run the thing full-tilt until the wheels come off, as long as we can, as much as we can, because that's how we amortize that capital expense."

Cloud Offers a Way to Test HPC Waters

Until fairly recently, procurement and operating expenses were a big stumbling block to deploying HPC infrastructure at many organizations: companies couldn't determine whether the benefits of an HPC deployment would justify the expense without first spending the money to install the system. That's changed now that clouds offer HPC and HPC-like services that can be used as a relatively inexpensive on-ramp to test the waters.

Chris Porter, project manager for converged HPC and AI for IBM Cognitive Systems, told us that the pay-as-you-go services offered by the major clouds give organizations the chance to perform a benefit analysis of deploying HPC workloads without the need to make major capital investments first. Companies deciding to continue using HPC, however, often find it financially advantageous to eventually move a large part of their HPC operations on-prem, both to have more control over workloads and because high-compute workloads in the cloud can be expensive.
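As a rough illustration of the kind of benefit analysis Porter describes, the sketch below compares cumulative pay-as-you-go cloud spend against an up-front on-prem purchase to find an approximate break-even month. Every figure is hypothetical and exists only to show the arithmetic, not to reflect real pricing from any provider.

    # Hypothetical break-even sketch: renting HPC capacity in the cloud vs.
    # buying a system outright. All figures are made up for illustration.
    CLOUD_COST_PER_MONTH = 40_000     # assumed pay-as-you-go spend
    ONPREM_CAPEX = 600_000            # assumed purchase price of the cluster
    ONPREM_OPEX_PER_MONTH = 12_000    # assumed power, cooling, and staff

    def cumulative_cost(months: int) -> tuple[float, float]:
        cloud = CLOUD_COST_PER_MONTH * months
        onprem = ONPREM_CAPEX + ONPREM_OPEX_PER_MONTH * months
        return cloud, onprem

    # Find the first month where owning becomes cheaper than renting.
    for month in range(1, 61):
        cloud, onprem = cumulative_cost(month)
        if onprem < cloud:
            print(f"Break-even around month {month}: "
                  f"cloud ${cloud:,.0f} vs. on-prem ${onprem:,.0f}")
            break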

From Hadoop to AI

These days, most of those workloads center on AI/ML, a continuation of the Big Data workloads that brought many large companies to HPC 15 or so years ago.

"I would contend that trend is just a continuation of the Big Data trend," Porter said. "Hadoop really became the the buzzword, and then Spark followed Hadoop, to a degree, when Hadoop performance was kind of running out of gas. I think now AI is the next step, with many enterprise corporations realizing, 'Wow, we can actually get more predictive power rather than having to ask the questions and then query data. We can actually have AI start learning and actively advising us rather than forcing us to ask the right question.'"

The predictive analytics that AI/ML offers is also pushing HPC into edge deployments, putting compute where it's needed to overcome latency issues, he said.

"We're seeing HPC not only being deployed in on-prem data centers, where you maybe have very large HPC appliances or some rack-mounted HPC units," he said. "We're also seeing it in certain edge locations, like retail or the factory floor, where all that high-performance computing can be done right there on the premises, because anything you're going to need back at the corporate headquarters doesn't really require low latency."

He said that HPC is being deployed in manufacturing facilities for things like product inspection to maintain quality control, as well as for supply chain management. Retail establishments are increasingly getting on the AI/ML bandwagon to make predictions based on customer behavior.

Pankaj Goyal, a VP of product management at HPE, is also seeing more HPC deployments at edge locations.

"I would say that many HPC techniques and AI techniques are merging together," he said. "For example, we are seeing demand with our retailers who want to use AI in their stores to improve the customer experience, and the use case might be video and text advertising in their own store, or it might be predictive modeling of customer behavior based on their history. Those techniques are typical AI training techniques, and they use a lot of underlying infrastructure which [is] common to HPC."

"If you look at it, essentially, these workloads are very data-intensive," he added. "They need to ingest a lot of data, they need to process a lot of data, and they need to output a lot of data. As a result, from a technology angle, what they need is an ability to store large amounts of data, to process large amounts of data, and to be able to throughput, exit, or output large amounts of data, and all of this done at large scale and at tremendous speed."
