As enterprises embrace artificial intelligence, deep learning and machine learning to glean more value from data and make better business decisions, many are running into storage roadblocks. Often, traditional scale-out file and object storage platforms simply don’t have the horsepower to run these processes, especially if they must be run in real time.
To address these issues, VAST Data teamed with NVIDIA to build a reference architecture that can handle petabyte-scale AI workloads without performance lags. The joint reference architecture, which combines VAST Data’s LightSpeed universal storage platform with NVIDIA’s DGX A100 universal AI system, delivers more than 170GB/s of throughput for both GPU-intensive and storage-intensive AI workloads. The solution draws on several of VAST’s capabilities: NFS-over-RDMA (Remote Direct Memory Access), which allows the Network File System protocol to run natively at full line speed; NFS multipath, which aggregates bandwidth across all available network paths into a single NFS mount; and NVIDIA GPUDirect Storage, which moves data directly between storage and GPU memory. A converged fabric design ties these together.
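On a Linux client, an NFS-over-RDMA mount with multiple connections might look something like the following. This is only a sketch: the server name and export path are placeholders, and VAST’s multipath implementation may use vendor-specific mount options rather than the upstream `nconnect` option shown here.

```shell
# Mount an NFS export over RDMA (proto=rdma) on the standard
# NFS/RDMA port 20049. The nconnect option opens multiple
# transport connections so I/O can spread across them.
# Server name and export path below are hypothetical.
sudo mount -t nfs \
    -o vers=3,proto=rdma,port=20049,nconnect=8 \
    vast-vip.example.com:/exports/training /mnt/training
```

The point of options like these is exactly what the architecture exploits: RDMA removes the TCP/IP overhead from the NFS data path, and multiple connections let a single mount use more than one network port’s worth of bandwidth.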
This approach to building AI clusters opens up the high-bandwidth pipes that were previously reserved for machine-to-machine communication, making them available for storage I/O as well, said Jeff Denworth, co-founder and CMO at VAST Data.
The new reference architecture provides eight times more available bandwidth per NVIDIA DGX A100 server. It also enables storage to scale to handle up to four NVIDIA DGX A100 servers, delivering more than 140GB/s of I/O bandwidth.
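As a quick sanity check on those figures (simple arithmetic on the numbers quoted above, nothing vendor-specific):

```python
# Back-of-the-envelope math on the bandwidth figures quoted above.
total_bandwidth_gbs = 140   # GB/s of I/O bandwidth across the cluster
num_dgx_servers = 4         # NVIDIA DGX A100 systems

per_server_gbs = total_bandwidth_gbs / num_dgx_servers
print(f"{per_server_gbs:.0f} GB/s per DGX A100")  # prints "35 GB/s per DGX A100"
```

That works out to roughly 35GB/s of storage bandwidth available to each DGX A100 system.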
The reference architecture can help quickly stand up processes that require low latency, high availability, high throughput, and the ability to handle both random, small-file, metadata-intensive I/O and sequential, large-file I/O. Such workloads include big data analytics; AI/ML/DL training and inferencing; and AI-driven video and geospatial imagery analysis.
“The need for accelerated compute is really what led to the tremendous success of NVIDIA in these markets with their GPUs,” said Eric Burgener, a research vice president in IDC’s Platforms and Technologies Group. “The accelerated performance of GPUs was welcomed, but traditional storage platforms couldn’t really keep them fed with data, which meant that there were inefficiencies; GPUs were being underutilized because of storage limitations.”
Burgener explained that existing unstructured storage platforms are optimized either for random, small-file, low latency, high availability, metadata-intensive I/O or sequential large-file high throughput and bandwidth I/O. This leads to siloed environments where customers decide what workload they have and buy the appropriate unstructured storage platform.
“The NGAs [next-generation applications] in large part require a single storage platform that can handle all of these I/O profiles at the same time, which is very different from traditional batch-oriented analytics,” he added. “By combining newer technologies like storage-class memory, NVMe and NVMe over Fabrics, software-defined storage and scale-out architectures in an entirely new design, VAST Data has created a single platform that can simultaneously handle both sets of requirements cost-effectively.”
While it’s certainly possible to assemble the compute, storage, networking and software needed to run these types of scale-out analytic workloads piece by piece, a reference architecture simplifies that process, along with purchasing, day-to-day use and support.