Cooling Artificial Intelligence

Last summer we published a post on the development of dedicated systems for cooling artificial intelligence hardware and how the future of AI relies on thermal management more than ever. This spring has only confirmed that the pioneers of artificial intelligence need ever higher cooling capacities.

The Next Generation: Google’s TPU 3.0

Google recently announced its next generation of cloud AI hardware, the third generation of the Tensor Processing Unit (TPU 3.0). As with any cutting-edge product announcement, the specs of the previous model pale in comparison to the latest and greatest version of the hardware built to accelerate TensorFlow deep learning workloads. The CEO of Google, Sundar Pichai, rattled off some impressive claims: TPU 3.0 pods operate at over 100 petaflops, putting the computational performance of a 3.0 pod at over 8 times that of the previous generation.

Liquid Cooling Artificial Intelligence - Google TPU 3.0

Feeling the Heat

The higher processing power that drives these impressive learning networks comes at a cost, though. The silicon that performs these computational operations runs much hotter than its predecessors. In the last generation, Google pushed forced-convection air cooling to the limit by augmenting zipper fin heat sinks with the rapid two-phase heat transfer that heat pipes provide.

Zipper fins are one of the higher performing heat sink types because they cram a large amount of surface area into a tight volume. Heat pipes extend that capability by spreading the heat more evenly across larger fin stacks, so the design can make use of more volume.

Taking the Leap into Liquid Systems

You hear that Mr. Anderson?… That is the sound of inevitability.

It was inevitable that Google would need to make the jump to liquid for cooling the artificial intelligence systems in its cloud platform. Higher performance chips with more processing power require higher cooling capacity to keep the silicon operating at peak performance. Google had to make the leap to a fluid with a higher heat capacity than air, and that almost always means a liquid.
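To put a rough number on the heat capacity argument, here is a minimal sketch based on the energy balance Q = ρ · V̇ · cp · ΔT. The fluid properties are approximate room-temperature textbook values, and the 200 W heat load and 10 K coolant temperature rise are hypothetical figures chosen for illustration, not Google's specifications.

```python
# Approximate room-temperature properties (textbook values, not Google data).
FLUIDS = {
    "air":   {"density": 1.184, "cp": 1005.0},   # kg/m^3, J/(kg*K)
    "water": {"density": 997.0, "cp": 4180.0},   # kg/m^3, J/(kg*K)
}


def volumetric_flow(heat_load_w, delta_t_k, fluid):
    """Volume flow (m^3/s) needed to absorb heat_load_w with a delta_t_k rise."""
    props = FLUIDS[fluid]
    return heat_load_w / (props["density"] * props["cp"] * delta_t_k)


heat_load_w = 200.0   # hypothetical per-chip heat load, W
delta_t_k = 10.0      # hypothetical allowed coolant temperature rise, K

air_flow = volumetric_flow(heat_load_w, delta_t_k, "air")
water_flow = volumetric_flow(heat_load_w, delta_t_k, "water")

print(f"air:   {air_flow * 1000:.2f} L/s")
print(f"water: {water_flow * 1000 * 60:.3f} L/min")
print(f"air needs about {air_flow / water_flow:.0f}x the volume flow of water")
```

Under these assumptions, air needs a few thousand times the volume flow of water to carry away the same heat, which is why air runs out of headroom as chip power climbs.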

The TPU 3.0 board utilizes 4 main chips for the heavy lifting and, not surprisingly, those are the key pieces of hardware that need extensive cooling. Each of these chips has its own liquid cold plate, and the cold plates are plumbed in series between a single inlet and a single outlet on each TPU board. For each chip, an aluminum mounting chassis houses a copper cooling core. We can assume these copper cores have additional interior surface area to further increase the heat transfer from the cold plate to the liquid flowing through it.
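One consequence of plumbing cold plates in series is that each chip sees coolant that has already been warmed by the chips upstream. The sketch below walks a simple energy balance down the loop. The supply temperature, per-chip heat load, and flow rate are hypothetical values, and the function name is ours, not anything from Google.

```python
def series_inlet_temps(inlet_c, chip_loads_w, mass_flow_kg_s, cp=4180.0):
    """Return (coolant temperature entering each cold plate, loop outlet temperature).

    Assumes all chip heat goes into the coolant and cp is constant (water).
    """
    temps = []
    coolant_c = inlet_c
    for load_w in chip_loads_w:
        temps.append(coolant_c)
        # Energy balance for one plate: dT = Q / (m_dot * cp)
        coolant_c += load_w / (mass_flow_kg_s * cp)
    return temps, coolant_c


inlet_temps, outlet_c = series_inlet_temps(
    inlet_c=30.0,               # hypothetical supply temperature, deg C
    chip_loads_w=[200.0] * 4,   # hypothetical load for each of the 4 chips, W
    mass_flow_kg_s=0.02,        # hypothetical flow, roughly 1.2 L/min of water
)

for position, temp_c in enumerate(inlet_temps, start=1):
    print(f"chip {position} sees coolant at {temp_c:.1f} C")
print(f"loop outlet: {outlet_c:.1f} C")
```

The last chip in the chain always gets the warmest coolant, so a series loop has to be sized around that chip rather than the first one.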

Based on the port locations, it also appears that Google is utilizing impinging flow to further increase the amount of heat transfer into the liquid. Balancing impinging flow and pressure drop in a complex system like this takes a lot of design work and careful planning. That’s to be expected from an industry pioneer like Google.

Balancing the Use of Aluminum and Copper

Copper and Aluminum Raw Material

Copper is an obvious choice for the liquid cold plate cores since its thermal conductivity is around twice that of aluminum (roughly 400 W/m·K versus about 170 to 240 W/m·K, depending on the alloy). The design choice to use both copper and aluminum in the cold plate is most likely a cost and weight saving one. Since copper is heavier and more expensive than aluminum, the design engineers chose to keep aluminum everywhere they didn’t absolutely need the added heat transfer.
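To see what the conductivity gap means at the chip, here is a minimal sketch using the one-dimensional slab formula R = t / (k · A). The conductivities are approximate handbook values (about 400 W/m·K for copper and about 167 W/m·K for 6061 aluminum, one common alloy), and the base thickness, contact area, and heat load are hypothetical.

```python
def conduction_resistance(thickness_m, conductivity_w_mk, area_m2):
    """1-D conduction resistance of a flat slab, R = t / (k * A), in K/W."""
    return thickness_m / (conductivity_w_mk * area_m2)


K_COPPER = 400.0     # W/(m*K), approximate handbook value
K_ALUMINUM = 167.0   # W/(m*K), approximate value for 6061 alloy

thickness_m = 0.003      # hypothetical 3 mm cold plate base
area_m2 = 0.04 * 0.04    # hypothetical 40 mm x 40 mm contact patch
heat_load_w = 200.0      # hypothetical chip heat load, W

for name, k in (("copper", K_COPPER), ("aluminum", K_ALUMINUM)):
    r = conduction_resistance(thickness_m, k, area_m2)
    print(f"{name}: {r:.4f} K/W, {r * heat_load_w:.2f} K drop at {heat_load_w:.0f} W")
```

The exact ratio depends on the alloy, but the copper core always shaves temperature off the hottest part of the path, which is where it matters most.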

Weighty Decisions

Liquid Cooling Artificial Intelligence - Google TPU 3.0 Pod

From a casual standpoint, the choice might seem mostly driven by cost due to the high quantity of cold plates, considering there are 4 cold plates per board, 32 boards per rack, and 8 racks in a pod. But each of these cold plates is filled with liquid, and the network of hoses tying the cold plates together and to the rest of the system is filled with liquid as well. Typically, engineers who utilize racks like these don’t need to consider the support required for added weight, but with so much liquid running over every cold plate and through every tube, it all adds up. The Google TPU 3.0 obviously benefits from the reduced weight a hybrid copper and aluminum cold plate offers.
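The counts above multiply out quickly. The sketch below tallies cold plates per rack and per pod from the figures in this post, then shows how small per-plate numbers scale. The coolant volume and the two plate masses are hypothetical placeholders, not measured values.

```python
# Counts taken from the post
COLD_PLATES_PER_BOARD = 4
BOARDS_PER_RACK = 32
RACKS_PER_POD = 8

plates_per_rack = COLD_PLATES_PER_BOARD * BOARDS_PER_RACK
plates_per_pod = plates_per_rack * RACKS_PER_POD

# Hypothetical per-plate figures, purely for illustration
coolant_ml_per_plate = 50.0    # liquid held in one plate and its share of hose
plate_mass_copper_g = 400.0    # an all-copper cold plate
plate_mass_hybrid_g = 250.0    # a copper core in an aluminum chassis

# Treat the coolant as water at about 1 g/mL
coolant_kg_per_rack = plates_per_rack * coolant_ml_per_plate / 1000.0
savings_kg_per_pod = plates_per_pod * (plate_mass_copper_g - plate_mass_hybrid_g) / 1000.0

print(f"{plates_per_rack} cold plates per rack, {plates_per_pod} per pod")
print(f"coolant carried per rack: {coolant_kg_per_rack:.1f} kg")
print(f"hybrid plate weight savings per pod: {savings_kg_per_pod:.1f} kg")
```

With over a thousand cold plates in a pod, even a modest weight saving per plate becomes significant at the rack and pod level.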

Scratching the Surface

Another interesting detail is that each board carries 10 extruded aluminum heat sinks in addition to the liquid cold plates. It’s only natural that as the power of the main processors increases, the surrounding electrical components also increase in power to handle the extra processing. With the inevitable demand for more computational power and stronger AI systems on platforms like the Google Cloud Platform, it’s only a matter of time before these components are liquid cooled too.

The Future of Artificial Intelligence is Wet

It won’t be surprising to see Google unveil even more complex cooling systems as it continues to develop its AI systems. If more components require liquid cooling, the 4 cold plates might end up being combined into a single piece with more interface regions. Chips might require direct case contact, forgoing the layer of thermal grease and the layer of copper to dump heat directly into the fluid. Direct contact comes with a whole array of risks and design challenges, but to pursue more advanced AI systems, it may be required. Google may even look to full liquid immersion racks in the future as processors are pushed to their computational and thermal limits. Either way, if we keep demanding more power out of our electronics, we’ll need to use liquid for cooling the artificial intelligence systems of the future.
