RCP GPU Cluster Service Description

The RCP cluster offers a scalable service made up of more than 700 GPUs. The servers are physically hosted in the DC2020 datacenter.

These GPU computing services are available to all researchers and research units at EPFL, including laboratories, centers and platforms.

The service runs on GPU servers equipped with high-end NVIDIA GPUs (H200 141 GB, H100 80 GB, A100 80 GB, A100 40 GB) and high-performance network connectivity (100 GbE).

Users access the service by first building a Docker image and then launching their job through a custom scheduler that runs on the Kubernetes orchestrator.
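To make the GPU request concrete, here is a minimal sketch of what such a request looks like at the Kubernetes level. It is not the cluster's submission interface: the custom scheduler has its own workflow, which is not described here. The function name, job name, image reference and command are hypothetical placeholders; only the manifest fields (`batch/v1` Job, `nvidia.com/gpu` resource limit) are standard Kubernetes.

```python
import json


def build_gpu_job_manifest(name, image, command, gpus=1):
    """Build a Kubernetes batch/v1 Job manifest that requests NVIDIA GPUs.

    Hypothetical helper for illustration only; the RCP scheduler may
    expect a different submission format.
    """
    if gpus < 1:
        raise ValueError("gpus must be at least 1")
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 0,  # do not retry the pod on failure
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "command": list(command),
                            # GPUs are requested as an extended resource limit.
                            "resources": {"limits": {"nvidia.com/gpu": gpus}},
                        }
                    ],
                }
            },
        },
    }


manifest = build_gpu_job_manifest(
    name="demo-train",  # hypothetical job name
    image="registry.example.com/my-lab/my-image:latest",  # hypothetical image
    command=["python", "train.py"],  # hypothetical entry point
    gpus=1,
)
print(json.dumps(manifest, indent=2))
```

The point of the sketch is that the Docker image and the GPU count are the two pieces of information every job carries, whichever tool submits it.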

Types of workloads

There are two types of workloads:

Interactive jobs

  • Purpose: testing and development.
  • Maximum GPUs: per user, 1x A100 or 1x V100; per lab, maximum allowed quota of 8x A100 and 4x V100.
  • Maximum duration: 12 hours.
  • Preemption: no.
  • Job priority: high.
  • Distributed workloads: no.

Train jobs

  • Purpose: training and compute.
  • Maximum GPUs: available GPUs.
  • Maximum duration: until the job finishes.
  • Preemption: yes (jobs are automatically restarted).
  • Job priority: low.
  • Distributed workloads: yes.

Considerations

  • Use job checkpoints so that a job can resume after it is killed.
  • Avoid keeping a job alive with sleep infinity, which holds its GPUs while doing no work.
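The checkpointing advice above can be sketched as follows. This is a minimal illustration using only the Python standard library, not a prescribed method: the function names, the JSON state layout and the `stop_after` parameter (which simulates a job being killed) are all assumptions made for the example. A real training job would save model and optimizer state with its framework's own tools.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def save_checkpoint(path, state):
    """Write the state atomically so a kill mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # The temp file lives in the same directory, so this rename is atomic.
    os.replace(tmp_path, path)


def run(path, num_steps, checkpoint_every=10, stop_after=None):
    """Run a toy loop, resuming from the last checkpoint if there is one.

    stop_after (hypothetical) simulates preemption: the call returns early
    after that many steps, without saving any extra checkpoint.
    """
    state = load_checkpoint(path)
    done_this_run = 0
    while state["step"] < num_steps:
        state["total"] += state["step"]  # stand-in for one training step
        state["step"] += 1
        done_this_run += 1
        if state["step"] % checkpoint_every == 0 or state["step"] == num_steps:
            save_checkpoint(path, state)
        if stop_after is not None and done_this_run >= stop_after:
            return state  # simulated kill
    return state
```

Calling `run("checkpoint.json", 100)` again after an interruption picks up from the last saved step instead of step 0. The checkpoint path here is a placeholder and should point to storage that survives a job restart.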

Data journey

The NAS collaborative storage (NAS1) can be accessed from this environment, subject to a QoS policy, but it must not be used for simulation input/output. The high-throughput storage subsystem (NAS3) is dedicated to that purpose.

 

Onboarding

We hold regular onboarding sessions, once a month, at the AI center lounge in ELE 117. Check out the events page.

During these sessions, we cover the main concepts, how our GPU cluster works, the data journey, and how to submit jobs, so that you can easily get started with our CaaS.

You are welcome to bring your laptop and we will help each person individually.

We also run dedicated sessions for individual laboratories when a group of people would like to onboard, and we can do the same for associated campuses. Contact us to arrange one.

Some useful links