Outline
This project aims to fine-tune robotic manipulation policies via reinforcement learning from human preferences (RLHF).
Motivation
In robotic manipulation, policies are typically learned via (a) RL in simulation or (b) behavioral cloning (BC) from demonstrations. However, both approaches struggle to generalize: RL policies face a sim-to-real gap, and BC suffers from compounding errors that drive trajectories away from the expert data. Analogous to the fine-tuning of large language models, this project aims to improve the pre-trained policy by leveraging expert preferences via RLHF [1-3]. Since online RL on the real-world system can be expensive and potentially harmful to the system, the goal is to perform this fine-tuning offline, relying only on a pre-collected exploratory dataset from the robotic system.
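To make the preference-learning step concrete, the sketch below shows how a reward model could be fit to pairwise trajectory preferences via the Bradley-Terry model used in [1]. It is a minimal, hypothetical example: the names (RewardNet, preference_loss), network architecture, and dimensions are illustrative assumptions rather than part of the existing codebase, and the subsequent offline policy-improvement step is omitted.

```python
# Minimal sketch: reward learning from pairwise trajectory preferences
# (Bradley-Terry model as in preference-based RL [1]). All names and
# dimensions here are illustrative assumptions, not the project's code.
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Per-step reward r(s, a); a segment's return is the sum over steps."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        # obs: (B, T, obs_dim), act: (B, T, act_dim) -> segment returns (B,)
        x = torch.cat([obs, act], dim=-1)
        return self.net(x).squeeze(-1).sum(dim=-1)

def preference_loss(reward_net, seg0, seg1, prefs):
    """Bradley-Terry cross-entropy; prefs[i] = 1 if segment 1 is preferred."""
    r0 = reward_net(*seg0)   # (B,) return of segment 0
    r1 = reward_net(*seg1)   # (B,) return of segment 1
    logits = r1 - r0         # log-odds that segment 1 wins
    return nn.functional.binary_cross_entropy_with_logits(logits, prefs)

if __name__ == "__main__":
    B, T, obs_dim, act_dim = 32, 50, 10, 4   # toy sizes (assumed)
    reward_net = RewardNet(obs_dim, act_dim)
    opt = torch.optim.Adam(reward_net.parameters(), lr=3e-4)
    # Placeholder batch of preference-labelled segment pairs.
    seg0 = (torch.randn(B, T, obs_dim), torch.randn(B, T, act_dim))
    seg1 = (torch.randn(B, T, obs_dim), torch.randn(B, T, act_dim))
    prefs = torch.randint(0, 2, (B,)).float()
    loss = preference_loss(reward_net, seg0, seg1, prefs)
    opt.zero_grad(); loss.backward(); opt.step()
    print(f"preference loss: {loss.item():.3f}")
```

The learned reward would then serve as the objective for an offline RL or preference-optimization step on the pre-collected dataset, e.g. along the lines of [2, 3].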
Milestones
- M1 (Weeks 1–2): Literature review; familiarize yourself with the existing code.
- M2 (Weeks 2–8): Implement the RLHF algorithm and evaluate its performance in simulation.
- M3 (Weeks 8–13): Validation on the real-world system.
- M4 (Weeks 14–16): Evaluation of results and write-up.
Requirements
We are looking for motivated students with a strong background in machine learning and coding. We have concrete ideas on how to tackle the above challenges, but we are always open to alternative suggestions. If you are interested, please send an email containing (1) one paragraph on your background and fit for the project and (2) your BS and MS transcripts to [email protected], [email protected], and [email protected].
References:
[1] Christiano, Paul F., et al. “Deep reinforcement learning from human preferences.” Advances in Neural Information Processing Systems 30 (2017).
[2] Cen, Shicong, et al. “Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF.” CoRR (2024).
[3] Schlaginhaufen, Andreas, Reda Ouhamma, and Maryam Kamgarpour. “Efficient Preference-Based Reinforcement Learning: Randomized Exploration Meets Experimental Design.” arXiv preprint arXiv:2506.09508 (2025).