Novel high-efficiency two-phase cooling techniques for heterogeneous high-performance computing servers ‒ ESL ‐ EPFL

Contact persons: Dr. Marina Zapater ([email protected]), Prof. David Atienza ([email protected])

Partners:

Heat and Mass Transfer Laboratory (LTCM), EPFL, Switzerland

Presentation:

In the context of the energy challenge, the next generation of High-Performance Computing (HPC) systems must deliver superior computing capacity while managing power consumption smartly and reducing operating costs. One of the main contributors to power consumption is the cooling system, which is still air-based in around 80% of all data centers worldwide. The most energy-efficient data centers in the world, such as the ones at Google, reach a ratio of total facility power to IT equipment power, a.k.a. Power Usage Effectiveness (PUE), of around 1.11, while the world average lags behind with a PUE of 1.5. This is because the weak heat transfer coefficient of air implies spending a large amount of fan power to cool server processors effectively. In addition, the heat must then be extracted from the air, which requires power-hungry cooling equipment. Furthermore, because of the weak heat transfer coefficient and low heat capacity of air, reusing this heat to warm neighboring buildings is economically unfeasible.
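As a rough illustration of what these PUE figures mean, the sketch below uses the standard definition of PUE (total facility power divided by IT equipment power) to compute the share of facility power that never reaches the IT equipment. The function names and the example loads are illustrative assumptions, not figures from the project.

```python
def pue(it_power_kw: float, overhead_power_kw: float) -> float:
    """Power Usage Effectiveness: total facility power / IT equipment power.

    overhead_power_kw covers everything that is not IT load
    (cooling, power distribution losses, lighting, ...).
    """
    if it_power_kw <= 0:
        raise ValueError("IT power must be positive")
    return (it_power_kw + overhead_power_kw) / it_power_kw


def overhead_share(pue_value: float) -> float:
    """Fraction of total facility power spent on non-IT overhead."""
    if pue_value < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return (pue_value - 1.0) / pue_value


if __name__ == "__main__":
    # PUE values quoted in the text; the labels are only for display.
    for label, value in (("best in class", 1.11), ("world average", 1.5)):
        share = overhead_share(value)
        print(f"{label}: PUE {value} -> {share:.1%} of facility power is overhead")
```

Note that the overhead in a PUE figure includes more than cooling, so the computed share is an upper bound on what a better cooling system alone could save.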

Although single-phase water-cooled systems already demonstrate efficient heat transfer characteristics, drastically reducing power consumption, they still present potential risks for the electrical components, due to the presence of water on top of electrical equipment. Moreover, they often require server redesign. An emerging and promising solution is to use two-phase refrigerant-based cooling systems. On the one hand, the two-phase flow considerably increases heat removal, keeping the chip at safe temperatures; on the other hand, the dielectric nature of the coolant minimizes risks.

The ultimate solution is to couple passive two-phase systems, which rely on gravity to drive the flow and carry heat to a larger heat sink, with micro-channel heat exchangers that increase the heat transfer area. The former, called a thermosyphon, works on a simple principle: inside a closed loop, the heat source (i.e., the chip) evaporates the liquid and, at a higher elevation, a heat sink condenses the vapor. Since the column of pure liquid is heavier than the two-phase column, the fluid circulates by gravity alone.
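The gravity-driven circulation can be sketched with a simple static pressure balance: the available driving pressure is the difference in weight between the liquid column and the two-phase column. The snippet below is a minimal sketch, assuming a homogeneous two-phase mixture at a single average vapor quality and ignoring friction and acceleration losses. The densities, vapor quality, and function names are hypothetical illustration values, not data from the project.

```python
G = 9.81  # gravitational acceleration, m/s^2


def homogeneous_density(rho_liquid: float, rho_vapor: float, quality: float) -> float:
    """Density of a liquid/vapor mixture under the homogeneous-flow assumption.

    quality is the vapor mass fraction (0 = all liquid, 1 = all vapor).
    """
    if not 0.0 <= quality <= 1.0:
        raise ValueError("quality must be within [0, 1]")
    return 1.0 / (quality / rho_vapor + (1.0 - quality) / rho_liquid)


def driving_pressure(rho_liquid: float, rho_vapor: float,
                     quality: float, height_m: float) -> float:
    """Static pressure difference (Pa) between liquid and two-phase columns."""
    rho_two_phase = homogeneous_density(rho_liquid, rho_vapor, quality)
    return (rho_liquid - rho_two_phase) * G * height_m


if __name__ == "__main__":
    # Assumed, illustrative fluid properties; 0.10 m is the height limit in the text.
    dp = driving_pressure(rho_liquid=1200.0, rho_vapor=30.0,
                          quality=0.3, height_m=0.10)
    print(f"available driving pressure: {dp:.0f} Pa")
```

In a real loop, the steady-state flow rate settles where this driving pressure balances the frictional and other pressure losses around the circuit, which is why a short device leaves little margin.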

The water used to cool the heat sink needs smaller pumps and, since the refrigerant-to-water heat transfer coefficient is high, the whole system can work at high temperature, which allows efficient heat recovery on the water side. Moreover, the passive nature of the thermosyphon automatically accommodates small temperature variations, reducing pumping effort and thus power consumption. Finally, the two-phase flow keeps the chip temperature more uniform while naturally removing hot spots.

However, to fully exploit the characteristics of the thermosyphon, there is a need to develop thermal-aware workload management strategies that drive the thermosyphon to its highest efficiency working conditions.

Goal:

Within the framework of the MANGO H2020 European project, the goal is to design an ultra-compact thermosyphon (at most 10 cm in height) to cool an HPC-ready heterogeneous system and, in particular, its Virtex 7 FPGA. The manufactured device must fit the preexisting architecture and dissipate up to 60 W of power while the system is running real workloads.
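To give a feel for the 60 W target, the following sketch estimates the refrigerant mass flow needed to carry that heat away purely as latent heat, using Q = m * h_fg * dx. The latent heat and the vapor quality change across the evaporator are assumed order-of-magnitude placeholders (no refrigerant is named in this description), and the function name is hypothetical.

```python
def required_mass_flow(heat_load_w: float,
                       latent_heat_j_per_kg: float,
                       quality_change: float) -> float:
    """Mass flow (kg/s) needed to absorb heat_load_w by evaporation alone.

    quality_change is the increase in vapor mass fraction across the
    evaporator; sensible heating of the liquid is neglected.
    """
    if latent_heat_j_per_kg <= 0:
        raise ValueError("latent heat must be positive")
    if not 0.0 < quality_change <= 1.0:
        raise ValueError("quality_change must be within (0, 1]")
    return heat_load_w / (latent_heat_j_per_kg * quality_change)


if __name__ == "__main__":
    # 60 W comes from the text; the other two values are assumptions.
    m_dot = required_mass_flow(heat_load_w=60.0,
                               latent_heat_j_per_kg=200e3,
                               quality_change=0.3)
    print(f"required mass flow: {m_dot * 1000:.2f} g/s")
```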

To achieve this goal, an in-house steady-state numerical model based on state-of-the-art correlations has been created. This model has been used to confirm the feasibility of the early design, which was drafted under the geometrical constraints imposed by the architecture. Next, a static finite-element analysis will be performed to ensure that the device withstands the internal pressure. Finally, a prototype will be manufactured using newer assembly techniques such as brazing paste, which is better suited to such dimensions than traditional brazing.

The final goal is to combine the thermosyphon cooling device with our thermal-aware workload allocation strategies in order to achieve a Power Usage Effectiveness (PUE) of 1.02, outperforming the best state-of-the-art solutions.

Developed software tools:

The STEAM thermal simulator, which represented the first step towards the development of the prototype.
A highly accurate numerical model of the thermosyphon.
A plug-in for the 3D-ICE thermal simulator that allows simulating single- and two-phase flow in microchannels (to be released early 2018).

Related projects:

YINS NanoTera Project: Developing a radically new thermal-aware design approach for next generation energy-efficient datacenters
MANGO H2020 European project: exploring Manycore Architectures for Next-GeneratiOn HPC Systems

Publications:

A. Iranfar, F. Terraneo, W. A. Simon, L. Dragic, I. Pilji, M. Zapater, W. Fornaciari, M. Kovac, and D. Atienza, “Thermal Characterization of Next-Generation Workloads on Heterogeneous MPSoCs,” in International Conference on Embedded Computer Systems: Architectures, MOdeling and Simulation (SAMOS), 2017.
J. Flich et al., “MANGO: Exploring Manycore Architectures for Next-GeneratiOn HPC Systems,” 2017 Euromicro Conference on Digital System Design (DSD), Vienna, Austria, 2017, pp. 478-485.