The potential for cost savings in switching from air to liquid as the more efficient cooling medium can be huge when you operate as many data centers around the world as Microsoft Azure does. But that’s not the primary reason Microsoft engineers have recently been experimenting with so many different approaches to liquid cooling. On their current trajectory, server chips will reach power densities well beyond what an air-based cooling system can handle within just a few years.
The power requirements of CPUs are now routinely over 200W, and they’re only expected to increase. Accelerators like GPUs, for machine learning and other types of applications, are already in the 250W to 400W range. Google has been retrofitting many of its data centers to support direct-to-chip liquid cooling for its custom AI accelerators since at least last year; Alibaba is using full immersion cooling in one of its data centers; Facebook’s new accelerators are so power-dense that the heat pipes and heat sink required to cool one give it dimensions closer to those of a cement block than a circuit board. Eight GPUs can add up to a 4kW chassis, Husam Alissa, a senior engineer on Microsoft’s advanced data center team, said in a presentation at the recent OCP summit, and 10kW and 16kW systems are on the way.
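To put the eight-GPU figure in context, here is a rough back-of-the-envelope sketch in Python. It uses only the per-accelerator range and the 4kW chassis figure quoted above; the idea that the remainder of the budget goes to host CPUs, memory, fans, and power-conversion losses is an assumption for illustration, not a breakdown given in the presentation.

```python
# Figures quoted in the article
GPU_POWER_RANGE_W = (250, 400)  # per-accelerator power range
GPU_COUNT = 8                   # accelerators in one chassis
CHASSIS_BUDGET_KW = 4.0         # chassis power cited by Alissa


def gpu_power_kw(count: int, watts_each: float) -> float:
    """Total accelerator power in kilowatts."""
    return count * watts_each / 1000


low_kw = gpu_power_kw(GPU_COUNT, GPU_POWER_RANGE_W[0])
high_kw = gpu_power_kw(GPU_COUNT, GPU_POWER_RANGE_W[1])

# Assumption: whatever is left covers host CPUs, memory, fans, and losses.
remainder_kw = CHASSIS_BUDGET_KW - high_kw

print(f"GPUs alone: {low_kw:.1f} kW to {high_kw:.1f} kW")
print(f"Left for everything else at the high end: {remainder_kw:.1f} kW")
```

Even before counting anything else in the box, the accelerators alone account for half or more of a 4kW chassis.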
“Eventually, we'll get to the point where some of these chips or GPU solutions will drive us to require liquid cooling,” Brandon Rubenstein, principal platform engineer and manager of the thermal team working on Microsoft server development, said in the same presentation.
Vijay Rao, Facebook director of technology and strategy, holding up an OCP accelerator module during a keynote at the 2019 OCP Summit
At the summit, Microsoft shared some results of the experimentation with liquid cooling technologies it’s been doing for its Project Olympus OCP servers. The Azure team also called for more standardization for liquid cooling across the industry, suggesting that the proprietary nature of many existing solutions is holding back adoption of new cooling systems that will be so urgently needed in the next two to three years.
The Olympus chassis is already extremely efficient: the combination of fans, heat pipes, and remote heat sinks means a 40kW Olympus rack can still be air cooled. But liquid cooling has more advantages than enabling higher rack densities and lowering facility fit-out and operating costs.
It evens out temperature swings, keeping the components at a more constant temperature, and copes better with failures: if a fan fails or loses power, the CPU must shut down in a matter of seconds to avoid overheating, while the thermal inertia of liquid can keep an immersion-cooled chassis functional for up to half an hour. Humidity, dust, and vibration also become non-issues with liquid. Liquid cooling makes it easier to reuse server heat (for comfort heating, for example), or even to sell the heat energy. Finally, it reduces water use, a factor that can limit where hyperscale data centers can be located.
Cold Plates to Full Baths
“If the data center infrastructure is not set up for [immersion cooling], a rack level solution makes the most sense because it's something you can just drop into the space you're already doing air cooling in,” Rubenstein said.
The least disruptive approach he and his team tried attached CoolIT’s microchannel cold plates to the twin 205W Skylake CPUs and a direct-contact memory cooling array to the DIMMs (allowing them to be serviced without removing the cooling plate). The assembly plugged into an in-rack manifold with dripless, blind-mate connectors. Two server fans were removed, and the speed of the remaining fans (which are still needed to cool other components) was reduced. Including the power needed to pump coolant through the chassis, the system used up to 4 percent less energy than a traditional air-cooled one, with better savings if the system is used for an entire row of racks.
Single-phase immersion baths use hydrocarbon fluids with a heat exchanger (and indium foil to protect some components), but the viscous liquid does complicate serviceability, Alissa noted. Passive two-phase immersion baths use a vapor condenser to remove the heat and return the (expensive) low-boiling-point fluorocarbon liquid to the tank. Testing the latter on the Olympus chassis dropped the temperature at the CPU junction by some 15 degrees Celsius compared to air cooling for processors running at 70 percent to 100 percent utilization, giving system designers more headroom to handle components with increasingly high power requirements.
All these liquid cooling systems involve extra costs for the equipment and the fluid (although they’re balanced by reduced opex and savings on fans and air-handling equipment). There are also fluid handling and serviceability implications to consider. But what’s also needed are more uniform specifications and approaches that make it possible to mix and match vendors within the data center. To that end, the recently formed Advanced Cooling Solutions group within OCP is looking at rear-door heat exchangers, cold-plate, and immersion cooling systems.
“We need open specifications on DIMM and FPGA modules, and dripless connectors, and so on,” Alissa said. “We will need certification of components like motherboards, and fiber, and network equipment, rather than [running] a new experiment every time we want to try something out. We need redundancy of sources for all these technologies, and they need to be less proprietary and more commoditized.”
That’s the same problem that Dale Sartor, an engineer at Lawrence Berkeley National Laboratory who oversees the Federal Energy Management Program’s Center of Expertise for Data Centers, described to us last year. “One thing that has been holding back widespread adoption of liquid cooling are standards or specifications that allow for a multi-vendor solution,” he said.
Microsoft Not Ready to Dive In
Microsoft isn’t ready to pick a liquid cooling technology and run with it yet. The company has not started deploying any of these options in its Azure data centers, but cooling could become a major issue in two to three years. Rack-level liquid cooling will be standardized and commoditized enough to adopt in one or two years, Rubenstein predicted, but he suggested that more advanced whole-data center solutions could take five to ten years to mature. That should be “enough time to figure out what the most effective solution is and then the data center can adapt,” he said.
Not all the hyperscalers are planning to wait that long, though. To reduce power consumption and environmental impact, Alibaba has already shifted one of its data centers to hydrocarbon immersion bath cooling, with 2,000 servers in 60 tanks, cutting PUE by more than a quarter, from 1.5 to 1.07, according to another presentation at the OCP Summit. Alibaba’s move to liquid cooling brought some extra benefits too: the noise level went down from 95dB to 50dB, and, protected by the liquid, hard drives failed half as frequently.
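How big that PUE improvement looks depends on what you measure. PUE is total facility energy divided by IT equipment energy, so a drop from 1.5 to 1.07 can be read either as a reduction in total energy or as a reduction in the non-IT overhead (cooling, power distribution, and so on). The short Python sketch below works through both readings; it assumes the IT load is the same before and after, which the presentation did not state, and the function name is purely illustrative.

```python
def pue_summary(pue_before: float, pue_after: float) -> dict:
    """Compare two PUE values, assuming an identical IT load in both cases.

    PUE = total facility energy / IT equipment energy, so (PUE - 1) is the
    overhead spent on everything other than the IT equipment itself.
    """
    overhead_before = pue_before - 1.0
    overhead_after = pue_after - 1.0
    return {
        "total_energy_reduction": (pue_before - pue_after) / pue_before,
        "overhead_reduction": (overhead_before - overhead_after) / overhead_before,
    }


# PUE figures quoted for Alibaba's immersion-cooled data center
summary = pue_summary(1.5, 1.07)
print(f"Total facility energy: {summary['total_energy_reduction']:.1%} lower")
print(f"Non-IT overhead: {summary['overhead_reduction']:.1%} lower")
```

Under that assumption, the facility as a whole draws a little under 29 percent less energy, while the overhead on top of the IT load shrinks by 86 percent.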
Seeing the problem more clearly might spur some data center teams on. The combination of modularity and efficiency makes Microsoft’s Olympus servers an attractive option for cooling vendors who want to benchmark the improvements they can offer against what CoolIT refers to as “the best of air cooling.” The various demonstration systems on the summit’s show floor included liquid immersion systems using low-boiling point fluorocarbons. These showed the heat boiling away in a stream of bubbles from heat sources like CPUs, GPUs, voltage converters, and even voltage regulators on the back of a board, making the usually invisible problem of heat immediately visible as something you need to prepare for.