AI Inference As A Service

Service Description – AI Inference Platform

Service Overview

The AI Inference Platform went live on August 15, 2025, and has seen strong adoption since launch. It provides a managed, scalable, and secure environment to deploy and operate AI models for real-time and batch inference.

The service enables users to consume AI capabilities through standardized APIs while benefiting from shared, cost-efficient GPU infrastructure and enterprise-grade operations.

The platform is designed for research, education, and administrative use cases, with a strong focus on sustainability, performance, availability, data protection, and operational reliability.


Key Capabilities

  • Managed on-premise deployment of AI inference services (REST / OpenAI-compatible APIs)

  • Dynamic scaling based on demand

  • Secure multi-tenant operation on shared GPU infrastructure

  • Monitoring, audit trail (no user-related content), and usage reporting


Examples of Current Usage

  • Research:

    • Large language model inference for experimentation and evaluation

    • Serving fine-tuned models to research teams via API; examples include clinical studies (currently without sensitive data) and one national program

  • Education:

    • AI-powered assistants for students and teaching staff

    • Hands-on AI courses using shared inference resources

  • Administration & Operations:

    • Internal chatbots for knowledge access and document analysis

    • Automation, coding, and decision-support tools using AI models


Service Availability

  • Service availability target: 99.9% monthly (excluding planned maintenance)

  • Operating hours: 24/7 for production inference endpoints

  • Maintenance windows:

    • Planned maintenance communicated in advance

    • Preferably scheduled outside business hours


Service Level Agreement (SLA)

RCP provides the following:

  • Hosting of models – We currently offer over 70 models spanning LLMs, vision, embedding, reranking, and speech-to-text, from popular model families including Apertus, Mistral, Qwen, Llama, and more.

A list of all models is available in the RCP Portal (accessible only from the EPFL network or VPN).

  • API – We provide an OpenAI-compatible RESTful API for model inference: the model runs on the RCP infrastructure, and clients send HTTP requests to it and receive responses, following a standard client-server model. A minimal usage sketch is shown below.
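As a quick illustration, the sketch below uses the official openai Python client against an OpenAI-compatible endpoint. The base URL and model ID are hypothetical placeholders, not the actual RCP values; take the real endpoint, your API key, and the current model IDs from the RCP Portal.

    # Minimal sketch of calling an OpenAI-compatible inference endpoint.
    # The base URL and model ID below are hypothetical placeholders --
    # look up the real values in the RCP Portal.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://inference.example.epfl.ch/v1",  # hypothetical endpoint
        api_key="YOUR_RCP_API_KEY",                       # requested via the RCP Portal
    )

    # List the models currently served (standard OpenAI-style /v1/models endpoint).
    for model in client.models.list():
        print(model.id)

    # Send a chat completion request to one of the hosted models.
    response = client.chat.completions.create(
        model="some-hosted-llm",  # hypothetical model ID; pick one from the list above
        messages=[{"role": "user", "content": "Hello from EPFL!"}],
    )
    print(response.choices[0].message.content)

Because the API is OpenAI-compatible, any tool or library that speaks that protocol (curl, the OpenAI SDKs, LangChain, and so on) can in principle be pointed at the RCP endpoint by changing only the base URL and API key.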

Prerequisites to use the service

This service is open to every EPFL employee. To use it, request an API key through the RCP Portal (accessible only from the EPFL network or VPN).

Please note that models are loaded dynamically. Some models are always available (24/7 label); others are loaded only when a request is made, so that resources are used as efficiently as possible. This means the first time a model is used, it may take a few minutes to respond while it loads; once loaded, subsequent responses are much faster. Model availability is therefore not an issue, but a slight delay on the first request should be expected.

If such a delay is not acceptable for your application, for example if you need an immediate response at all times, contact us and we will be happy to discuss and work towards an acceptable solution. If the delay is merely inconvenient, a client can also handle it gracefully, as in the sketch below.
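One way to tolerate the cold-start delay client-side is a generous timeout plus retries. This is only a sketch under the same assumptions as above: the endpoint, key, and model ID are hypothetical placeholders.

    # Sketch: tolerate the cold-start of a dynamically loaded model by
    # allowing a long timeout and retrying with a growing back-off.
    import time

    from openai import OpenAI, APIConnectionError, APITimeoutError

    client = OpenAI(
        base_url="https://inference.example.epfl.ch/v1",  # hypothetical endpoint
        api_key="YOUR_RCP_API_KEY",
        timeout=300.0,  # allow several minutes for the first (loading) request
    )

    def ask_with_retries(model: str, prompt: str, attempts: int = 3) -> str:
        for attempt in range(attempts):
            try:
                response = client.chat.completions.create(
                    model=model,
                    messages=[{"role": "user", "content": prompt}],
                )
                return response.choices[0].message.content or ""
            except (APITimeoutError, APIConnectionError):
                # The model is likely still loading; wait and try again.
                time.sleep(30 * (attempt + 1))
        raise RuntimeError(f"{model} did not respond after {attempts} attempts")

Once the first call succeeds, the model stays loaded and subsequent calls return at normal latency, so the retry path is only exercised on the initial request.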