In concert with federal agencies, Cray is building a storage system designed to handle exabyte-level data-intensive workloads. The system will use Cray’s ClusterStor file storage solution on Oak Ridge National Laboratory’s Frontier exascale supercomputer, which uses the Cray Slingshot high-speed interconnect.
The new system will consist of more than 1 exabyte of hybrid flash and high-capacity storage running the open source Lustre parallel file system. When Frontier is ready for deployment in 2021, it is expected to be the world’s most powerful supercomputer, with performance of more than 1.5 exaflops. The storage system will have more than four times the capacity and throughput of ORNL’s existing Spectrum Scale-based storage systems in the Summit supercomputer.
“We use only a thin layer of flash to provide the performance/IOPS for the small/random I/O requirements together with a deep HDD storage layer,” said Uli Plechschmidt, a senior director at Cray. “We tie those two tiers together with intelligent software so that all the data is in one single filesystem/in one single namespace.”
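The two-tier idea Plechschmidt describes can be illustrated with a toy model. This is a minimal sketch and not Cray's software: the `TieredNamespace` class, the `SMALL_IO_BYTES` cutoff and the placement rule are all hypothetical, chosen only to show how small or random I/O could land on flash while large sequential data lands on HDD, with callers seeing a single namespace.

```python
from dataclasses import dataclass, field

# Hypothetical cutoff: writes below this size count as small I/O.
# The value is an assumption for illustration, not a ClusterStor setting.
SMALL_IO_BYTES = 1 << 20  # 1 MiB


@dataclass
class TieredNamespace:
    """Toy model of one path namespace backed by a flash tier and an HDD tier."""

    placement: dict = field(default_factory=dict)  # path -> "flash" or "hdd"

    def write(self, path: str, size_bytes: int, sequential: bool) -> str:
        # Large sequential writes go to the deep HDD tier;
        # everything small or random goes to the thin flash tier.
        if sequential and size_bytes >= SMALL_IO_BYTES:
            tier = "hdd"
        else:
            tier = "flash"
        self.placement[path] = tier
        return tier

    def locate(self, path: str) -> str:
        # Callers address data by path only; the tier is an internal detail.
        return self.placement[path]


ns = TieredNamespace()
ns.write("/proj/index.db", 4096, sequential=False)          # small, random
ns.write("/proj/checkpoint.bin", 8 << 30, sequential=True)  # large, streamed
```

In this sketch both files are reached through the same path namespace, even though they sit on different media.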
With this type of power, researchers working on scientific, medical and even engineering projects will be able to more quickly create and process detailed global weather simulations, seismic and earth studies, genomic analysis, and nuclear and fusion reaction modeling, said Thomas Coughlin, a data storage consultant.
The demand for very large data sets grows year by year, Coughlin noted. Within the next few years, the areas of science, engineering, entertainment, astronomy, weather modeling, medicine and video production will be generating hundreds of petabytes of data per project, approaching or exceeding the exascale level. This will require enough storage, memory, networking and processing capabilities to analyze data fast enough to be of use, something this solution could achieve.
“They point in particular to doing various AI modeling, which could include the use of machine learning to, for instance, match genomic markers and protein modeling with disease studies and help to create customized medical treatments,” he said. Another possible example, noted Coughlin, is the use of deep neural networks to create more accurate weather predictions using historical total global weather data.
AI analysis can use both randomly accessed data and streamed data. Randomly accessed data can be stored on flash and other sorts of high-performance solid-state memory, while streamed data can be stored on HDD systems. This optimizes the trade-offs in performance and cost, Coughlin said.
“With the large quantity of data for these models, getting this mix of performance and cost right allows building a bigger, more powerful system for less money and thus being able to do more sophisticated analysis work,” he said.
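The cost side of that trade-off comes down to simple arithmetic. The sketch below is illustrative only: the per-terabyte prices and the 10% flash share are invented placeholders, not figures from Cray, ORNL or Coughlin, and the function name `blended_cost_per_tb` is hypothetical.

```python
def blended_cost_per_tb(flash_fraction: float,
                        flash_cost: float,
                        hdd_cost: float) -> float:
    """Capacity-weighted cost per TB for a mix of flash and HDD."""
    if not 0.0 <= flash_fraction <= 1.0:
        raise ValueError("flash_fraction must be between 0 and 1")
    return flash_fraction * flash_cost + (1.0 - flash_fraction) * hdd_cost


# Hypothetical prices in arbitrary units per TB (assumptions, not market data).
FLASH_COST = 10.0
HDD_COST = 1.0

all_flash = blended_cost_per_tb(1.0, FLASH_COST, HDD_COST)
hybrid = blended_cost_per_tb(0.1, FLASH_COST, HDD_COST)  # thin flash layer

print(f"all-flash: {all_flash:.2f} per TB, hybrid: {hybrid:.2f} per TB")
```

With these made-up inputs, the hybrid mix costs roughly a fifth as much per terabyte as all-flash. The real ratio depends on actual prices and on how thin the flash layer can be while still absorbing the random I/O.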
The new supercomputer will process machine learning workflows much differently than the way they are processed today, Plechschmidt said. Most AI and machine learning storage systems today use an all-flash approach. These systems work well, he said, but they are very expensive.
“It eats up more and more of the overall budget that could be better spent on GPUs and additional data scientists,” Plechschmidt added. “It’s the same I/O performance at half the cost, or, put the other way around, double the I/O performance at the same cost.”