Semiconductors whose features are measured in nanometers sit at the heart of the modern data centers that power artificial-intelligence (AI) systems. Yet keeping these advanced chips from overheating still depends on decidedly conventional equipment: fans pushing a continuous flow of cold air. As the cost of running enough fans and air-conditioning units climbs, chipmakers and data-center operators are exploring entirely new approaches to the cooling problem.
On November 15th, Microsoft announced its first major foray into designing its own AI chips. The new chip, Maia 100, is built to compete with Nvidia’s top products and connects to what can be described as a ‘cold plate’: a metal plate kept cool by liquid pumped beneath its surface.
This approach could be a step towards full immersion cooling, in which entire server racks operate inside tanks filled with a special cooling liquid.
For years, those who need to cool computer servers have recognized the appeal of liquid cooling: kilogram for kilogram, water absorbs roughly four times as much heat as air.
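A quick back-of-the-envelope calculation bears the figure out. The sketch below uses standard reference values for specific heat capacity; none of the numbers beyond the rough four-to-one ratio come from the article:

```python
# Compare how much heat 1 kg of water vs. 1 kg of air absorbs
# for the same temperature rise, via Q = m * c * dT.
# Specific heat capacities are standard reference values, J/(kg*K).
C_WATER = 4186.0  # liquid water near room temperature
C_AIR = 1005.0    # dry air at constant pressure

def heat_absorbed(mass_kg: float, specific_heat: float, delta_t_k: float) -> float:
    """Heat absorbed in joules: Q = m * c * dT."""
    return mass_kg * specific_heat * delta_t_k

q_water = heat_absorbed(1.0, C_WATER, 10.0)  # 1 kg warming by 10 K
q_air = heat_absorbed(1.0, C_AIR, 10.0)

print(f"water: {q_water:,.0f} J  air: {q_air:,.0f} J  ratio: {q_water / q_air:.1f}x")
# ratio comes out to about 4.2, matching the "about four times" claim
```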
The technology has been tried by some cryptocurrency miners and adopted in data centers that bolt cold plates onto chips originally designed for standard air cooling.
Avid video gamers chasing maximum performance and less noise from powerful fans have likewise shown off custom water-cooled systems with illuminated coolant tubes.
Liquid cooling has its drawbacks, however. Water conducts electricity, so a leak risks ruining expensive equipment; any liquid meant to touch computer hardware directly must instead be a non-conductive alternative. And for many large data centers, adopting an entirely new cooling strategy is a massive infrastructure project.
Operators must, for example, ensure that floors will not collapse under the weight of the liquid needed to fully immerse computer racks, which can stand some seven feet (about two meters) tall. Such hurdles have led major data-center operators to keep using fans, leaving liquid cooling to experimental enthusiasts.
The enormous computational demands of AI systems have changed that calculus. Each generation of chips has added computing capacity but has also roughly doubled power draw, and more power consumed means more heat generated.
Each Nvidia H100, the AI accelerator that has become the benchmark for building AI systems, draws at least 300 watts, about three times as much as a 65-inch flat-screen TV. A data center can run hundreds or even thousands of these processors, each costing more than a family car.
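A rough sketch of the resulting heat load, using the article’s 300-watt floor per chip; the deployment size is a hypothetical assumption, not a figure from the article:

```python
# Estimate the continuous power draw (all of which ends up as heat
# the cooling system must remove) for a fleet of AI accelerators.
WATTS_PER_CHIP = 300       # article's lower bound for an H100-class chip
NUM_CHIPS = 10_000         # hypothetical large deployment
HOURS_PER_YEAR = 24 * 365

power_mw = NUM_CHIPS * WATTS_PER_CHIP / 1e6
energy_mwh = power_mw * HOURS_PER_YEAR

print(f"continuous draw: {power_mw:.1f} MW")                  # 3.0 MW
print(f"annual energy, chips alone: {energy_mwh:,.0f} MWh")   # 26,280 MWh
# Every one of those watts must ultimately be removed as heat.
```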
Cooling is the fastest-growing infrastructure cost for data centers, with an annual compound growth rate of 16%, according to a November 2023 report by Omdia Research. Up to 40% of a data center’s total electricity consumption goes towards cooling, says Jennifer Hafner, Intel’s executive in charge of product sustainability.
“Electricity is the primary barrier restricting data centers,” Hafner says. Cooling challenges have forced some operators to restrict certain components, leave gaps between racks, or slow down expensive chips to keep them from overheating.
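For a sense of what the 16% compound rate cited above implies, a short sketch; the baseline dollar figure is purely hypothetical:

```python
# Project cooling spend growing at the 16% annual rate cited from
# the Omdia report. The $1.0M starting figure is hypothetical.
BASELINE_SPEND_M = 1.0  # hypothetical annual cooling cost, $ millions
CAGR = 0.16             # compound annual growth rate from the report

for year in range(6):
    spend = BASELINE_SPEND_M * (1 + CAGR) ** year
    print(f"year {year}: ${spend:.2f}M")
# At 16% a year the cost doubles roughly every 4.7 years
# (ln 2 / ln 1.16 ~= 4.67).
```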
Microsoft’s Maia chips are designed to work alongside large cooling units that circulate liquid through cold plates attached directly to the chips, which lets them run in standard data centers. Microsoft says it will begin installing them in 2024.
Mark Russinovich, chief technology officer of Microsoft’s Azure cloud division, sees a broader role for liquid cooling across the company’s data-center operations.
“This is a proven technology now in production,” he says, adding that it has been in the works for a long time, including on the water-cooled gaming PC under his own desk.
Microsoft also plans to build data centers capable of full immersion cooling within the next few years. Though more effective than cold plates, the method requires verifying equipment at every level.