Schloss Dagstuhl Seminar: “Rack-scale Computing”

Schloss Dagstuhl is a venue in the Saarland region of southwest Germany that specializes in week-long seminars in computer science. The seminars typically start on a Sunday evening and last until the following Friday, with an audience of around 40 participants from academia and industry around the world. Even though the activities are subsidized by the German government, which keeps expenses low, anyone can propose a seminar on a specific topic; timeslots are usually filled around 1.5 years in advance. A typical seminar program includes lectures and small-group discussions with plenty of opportunities for interaction. One feature of the organization worth mentioning is that during meals the participants sit at randomly assigned places around the tables, which promotes unplanned interactions among participants.

Recently, I had the opportunity to attend a Dagstuhl seminar on “Rack-scale Computing” (http://www.dagstuhl.de/en/program/calendar/semhp/?semnr=15421) that featured a very mixed crowd of hardware and software people and inspired many wide-ranging discussions. The seminar grew out of the discussions at the first rack-scale computing workshop, co-located with the EuroSys conference in 2014. Rack-scale computing is the emerging research area focusing on how to design and program the machines used in data centers. Interestingly, after a week of discussions there was no consensus on what rack-scale systems will look like, but there was a shared sense that they will inspire many exciting research questions.

On the hardware side, the majority of discussions focused on processors, memory and networking. We can expect to see processors with many lean out-of-order cores connected by crossbars. Power limitations are also inspiring renewed interest in accelerators, both fixed-function and programmable (FPGA), and we can expect further integration in System-on-Chip designs. With compute scaling faster than memory bandwidth and capacity, memory is becoming a bottleneck; in enterprise environments, customers sometimes buy additional CPUs just to gain access to more memory. Memory bandwidth bottlenecks can be avoided by 3D stacking, while capacity can be significantly increased with emerging non-volatile memory technologies, which are expected to bring very high density without converging with traditional storage. On the storage side, 3D-stacked flash and novel magnetic disk technologies will enable continued growth in storage capacity. How non-volatile memory will be accessed by processors remains an open question. On the networking side, 100 Gb/s links are already faster than current software can utilize, so silicon photonic interconnects are not expected to bring direct latency improvements, but they will bring higher bandwidth. Networks will also move to distributed switching fabrics.

High-bandwidth networks, large main memories and dynamic application requirements are motivating disaggregation of resources, where compute, memory and storage can be combined on demand to best fit application requirements. HP’s “The Machine” and UC Berkeley’s FireBox are two early proposals for rack-scale (or datacenter-scale) designs with thousands of cores, large non-volatile memory pools and photonic interconnects.
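To make the idea of on-demand resource composition a bit more concrete, here is a minimal sketch of how a rack-level resource manager might carve application-specific “logical machines” out of disaggregated pools. This is purely my own illustration under assumed interfaces; neither HP’s nor Berkeley’s design exposes these particular classes, and all names and numbers are hypothetical.

    # Hypothetical sketch of composing a "logical machine" from disaggregated
    # rack-level pools of compute, memory and storage; illustrative only.
    from dataclasses import dataclass

    @dataclass
    class ResourcePool:
        cores: int        # unallocated cores in the compute pool
        memory_gb: int    # unallocated capacity in the memory pool
        storage_tb: int   # unallocated capacity in the storage pool

    @dataclass
    class LogicalMachine:
        cores: int
        memory_gb: int
        storage_tb: int

    def compose(pool, cores, memory_gb, storage_tb):
        """Carve an application-specific slice out of the shared pools."""
        if cores > pool.cores or memory_gb > pool.memory_gb or storage_tb > pool.storage_tb:
            raise RuntimeError("insufficient resources in the rack")
        pool.cores -= cores
        pool.memory_gb -= memory_gb
        pool.storage_tb -= storage_tb
        return LogicalMachine(cores, memory_gb, storage_tb)

    # A memory-hungry analytics job and a compute-heavy batch job can request
    # very different mixes of resources from the same rack.
    rack = ResourcePool(cores=1024, memory_gb=65536, storage_tb=512)
    analytics = compose(rack, cores=64, memory_gb=16384, storage_tb=4)
    batch = compose(rack, cores=512, memory_gb=2048, storage_tb=32)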

One of the most interesting topics for me personally was applications. While all participants agreed that we still cannot identify a single “killer app” for rack-scale hardware platforms, we heard of a diverse range of applications that can benefit from such platforms: data analytics, graph analytics, traditional high-performance computing (HPC) applications, as well as applications with elastic resource requirements. The operating system in this environment will be decentralized and should support diverse services, including fault tolerance and resource isolation.

Programming models also remain an open question, as rack-scale platforms are likely to combine the worst properties of multicores and distributed systems, making them challenging to program. One of the main issues is whether to ship data or functions in order to exploit locality in the application. Transactions will be a very useful abstraction in the rack-scale context, and they can benefit a great deal from hardware support.
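To make the data-versus-function shipping trade-off concrete, here is a minimal sketch of the two options on two nodes of a rack. This is my own illustration, not something presented at the seminar, and the Node class and its methods are hypothetical: shipping data copies the remote partition to the caller, while shipping a function moves the computation to the node that already holds the data and returns only the result.

    # Data shipping vs. function shipping between two nodes of a rack;
    # the Node class and its methods are hypothetical.
    class Node:
        def __init__(self, name, data):
            self.name = name
            self.data = data    # locally resident partition of a dataset

        def fetch(self):
            # Data shipping: the whole partition crosses the interconnect
            # so the caller can compute on it locally.
            return list(self.data)

        def run(self, func):
            # Function shipping: the computation is sent to the data, and
            # only the (usually much smaller) result crosses the interconnect.
            return func(self.data)

    local = Node("node0", [1, 2, 3])
    remote = Node("node1", [4, 5, 6])

    # Data shipping: pull node1's partition and aggregate locally.
    total_data = sum(local.data) + sum(remote.fetch())

    # Function shipping: push the aggregation to node1 and combine partial sums.
    total_func = sum(local.data) + remote.run(sum)

    assert total_data == total_func == 21

Which option is cheaper depends on the relative sizes of the data and the result, and on how much locality the application exposes.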

Finally, power efficiency is one of the main goals of rack-scale designs, as it is currently a major problem for datacenter operators. Even though many people have expected ARM64 processors to become a standard in datacenters for the last few years, experience has shown that the transition away from current datacenter architectures will be slow. One of the main reasons for slow adoption of any new technology is the amount of time it takes to rewrite software; large cloud-computing providers, however, are continuously experimenting with new hardware platforms.

Overall, the seminar was a success and everyone agreed that we learned a lot from each other during the week. The format was well received, although many people wanted a bit more time for discussions in smaller groups. I really enjoyed this event and can strongly recommend it to any fellow computer scientist.

                                                                                                                     by Danica Porobic