Towards Generalizable Embodied AI via Diffusion-Based Robotic Policies and GPT-Powered Scene Generation

The demonstration images above are reference examples illustrating the vision of general-purpose embodied AI. The featured robot system is developed by Figure AI (Credit: Figure AI, https://figure.ai).

Imagine a robot that not only follows instructions, but also autonomously decomposes complex goals, perceives the world, and takes actions like a human. This would mark a significant step toward general-purpose embodied intelligence, with applications ranging from autonomous space exploration to personal assistance and industrial automation. This project offers the opportunity to work on frontier topics at the intersection of generative modeling, robotics, and embodied intelligence, contributing to the foundation of scalable, general-purpose robotic systems.

In this project, we aim to develop an advanced learning algorithm that can, in a human-like manner, perceive and understand the physical world and generate appropriate embodied decisions to accomplish complex tasks. To this end, we focus on cutting-edge generative techniques, including diffusion models, large language models (LLMs), and vision-language models (VLMs), and explore how to integrate them into embodied AI systems. The project centers on two core challenges: designing efficient diffusion-based robotic policies, and developing GPT-powered scene generation methods.

This will involve several subprojects in the following areas:

Efficient diffusion-based robotic policies

Problem

Recent work such as Diffusion Policy[1] and Consistency Policy[2] has demonstrated the potential of diffusion models[3] as powerful motion planners for robotic tasks. However, because each action sequence is generated through many iterative denoising steps with large backbone networks, current implementations are too slow and too large for real-time, on-device applications. This makes them unsuitable for deployment in latency-sensitive or power-constrained environments such as drones, home assistants, or mobile manipulators, where fast and lightweight control is essential.
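
To make the latency problem concrete, the sketch below implements a minimal DDPM-style action sampler in PyTorch: every control cycle pays for one network evaluation per denoising step (here 100), which is the core obstacle to real-time, on-device control. The NoisePredNet stub, dimensions, and noise schedule are illustrative assumptions for this sketch, not the actual Diffusion Policy implementation.

```python
import torch
import torch.nn as nn

# Stand-in noise-prediction network; real diffusion policies use much larger
# U-Net or transformer backbones conditioned on visual observations.
class NoisePredNet(nn.Module):
    def __init__(self, action_dim=7, obs_dim=64, horizon=16):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        self.net = nn.Sequential(
            nn.Linear(horizon * action_dim + obs_dim + 1, 256),
            nn.ReLU(),
            nn.Linear(256, horizon * action_dim),
        )

    def forward(self, noisy_actions, obs, t):
        # noisy_actions: (B, horizon, action_dim), obs: (B, obs_dim), t: (B, 1)
        b = noisy_actions.shape[0]
        x = torch.cat([noisy_actions.reshape(b, -1), obs, t], dim=-1)
        return self.net(x).reshape(b, self.horizon, self.action_dim)

@torch.no_grad()
def sample_action_sequence(model, obs, n_steps=100, horizon=16, action_dim=7):
    """DDPM-style reverse process: n_steps network evaluations per plan,
    which dominates inference latency on embedded hardware."""
    betas = torch.linspace(1e-4, 0.02, n_steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    actions = torch.randn(obs.shape[0], horizon, action_dim)  # start from pure noise
    for t in reversed(range(n_steps)):
        t_in = torch.full((obs.shape[0], 1), t / n_steps)
        eps = model(actions, obs, t_in)
        # Posterior mean of the reverse step (standard DDPM update rule).
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        actions = (actions - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:
            actions = actions + torch.sqrt(betas[t]) * torch.randn_like(actions)
    return actions

model = NoisePredNet()
obs = torch.randn(1, 64)          # placeholder for a visual/state embedding
plan = sample_action_sequence(model, obs)
print(plan.shape)                 # (1, 16, 7): one predicted action sequence
```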

Significance

Improving the efficiency of diffusion-based robotic policies could unlock fully self-contained, real-time intelligent control, allowing robots to operate exclusively on their onboard hardware, without reliance on external servers or cloud computation. This would be a crucial step toward truly autonomous, deployable robotic systems.

Challenges

Tackling these limitations requires innovation in model architecture, conditioning mechanisms, and sampling acceleration techniques, while balancing the trade-off between representational capacity and computational efficiency.
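
One concrete direction for sampling acceleration is to cut the number of denoising steps at inference time. The sketch below is a deterministic DDIM-style sampler that evaluates the network on a coarse subsequence of timesteps (10 instead of 100) and can be plugged into the NoisePredNet stub from the previous sketch; it illustrates the general idea under assumed schedules and is not the specific distillation method of Consistency Policy.

```python
import torch

@torch.no_grad()
def ddim_sample(model, obs, n_train_steps=100, n_infer_steps=10,
                horizon=16, action_dim=7):
    """Deterministic DDIM-style sampler: only n_infer_steps network
    evaluations, trading a little sample quality for ~10x lower latency."""
    betas = torch.linspace(1e-4, 0.02, n_train_steps)
    alpha_bars = torch.cumprod(1.0 - betas, dim=0)
    # Coarse timestep schedule, e.g. [99, 88, ..., 0] for 10 steps.
    taus = torch.linspace(n_train_steps - 1, 0, n_infer_steps).long()

    x = torch.randn(obs.shape[0], horizon, action_dim)
    for i, t in enumerate(taus):
        t_in = torch.full((obs.shape[0], 1), t.item() / n_train_steps)
        eps = model(x, obs, t_in)
        # Predict the clean action sequence from the current noisy one,
        # then jump directly to the next (much earlier) timestep.
        x0 = (x - torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alpha_bars[t])
        if i + 1 < len(taus):
            a_prev = alpha_bars[taus[i + 1]]
            x = torch.sqrt(a_prev) * x0 + torch.sqrt(1.0 - a_prev) * eps
        else:
            x = x0
    return x
```

Consistency-style distillation[2] pushes this further by training a student model that maps noise to actions in one or a few network evaluations.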

Student Contribution

Students will contribute to the design and development of such efficient diffusion policy models—bringing diffusion-based visuomotor algorithms from research prototypes to real-world deployment.

GPT-powered scene generation

Problem

Just as GPT models rely on massive, high-quality training data, embodied AI systems require structured, diverse, and well-annotated data to generalize across tasks and environments. However, current embodied datasets are fragmented, inconsistently formatted, and expensive to collect—posing a major bottleneck to scalable training.

Significance

Synthetic data is an appealing solution due to its scalability and precision. Automating its generation can dramatically reduce human effort and significantly accelerate data production. If successful, this approach could transform how we train robots by enabling GPT-style pretraining for embodied tasks: training policies on large-scale synthetic interaction data with self-supervised objectives so that they generalize across diverse environments and tasks.

Challenges

Current pipelines for synthetic scene generation depend heavily on manual design, which is time-consuming and difficult to scale. Addressing this requires leveraging LLMs, VLMs, and generative models to build systems that can create diverse, realistic, and semantically coherent 3D environments from text or image prompts.
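
As a minimal illustration of that direction, the sketch below asks an LLM for a structured JSON room layout and validates the reply before it would be handed to a simulator. The call_llm callable, the prompt template, and the JSON schema are hypothetical placeholders for whatever LLM backend and scene format the project adopts.

```python
import json

LAYOUT_PROMPT = """You are a 3D scene designer. Given the room description below,
return ONLY a JSON object of the form:
{{"objects": [{{"name": str, "position": [x, y, z], "yaw_deg": float}}, ...]}}
Positions are in meters inside a {w} m x {d} m room; objects must not overlap.

Room description: {description}
"""

def generate_scene_layout(call_llm, description, width=4.0, depth=5.0):
    """Ask an LLM for a structured scene layout and validate the reply.
    `call_llm` is a hypothetical callable (prompt str -> completion str)
    wrapping whatever LLM backend is available."""
    prompt = LAYOUT_PROMPT.format(w=width, d=depth, description=description)
    reply = call_llm(prompt)
    layout = json.loads(reply)  # fails loudly on malformed output

    # Minimal sanity checks before the layout reaches a simulator.
    for obj in layout["objects"]:
        x, y, _z = obj["position"]
        assert 0.0 <= x <= width and 0.0 <= y <= depth, f"{obj['name']} out of bounds"
    return layout

# Example with a canned reply standing in for a real LLM call.
fake_reply = '{"objects": [{"name": "table", "position": [2.0, 2.5, 0.0], "yaw_deg": 0}]}'
layout = generate_scene_layout(lambda prompt: fake_reply, "a small dining room")
print(layout["objects"][0]["name"])  # -> table
```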

Student Contribution

Students will contribute to the development of GPT-powered scene generation algorithms that can produce training scenes automatically and at scale. This includes exploring key components such as scene layout generation[4] and object placement policies[5], leveraging LLMs, VLMs, and generative models.
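
For the object placement component, the sketch below shows the simplest possible baseline: rejection sampling of 2D footprints with an axis-aligned overlap test. The room size, footprints, and sampling scheme are illustrative assumptions; the cited layout and placement methods additionally reason about semantics, support surfaces, and physical plausibility.

```python
import random

def overlaps(a, b):
    """Axis-aligned 2D footprint overlap test; boxes given as (x, y, w, d)."""
    ax, ay, aw, ad = a
    bx, by, bw, bd = b
    return ax < bx + bw and bx < ax + aw and ay < by + bd and by < ay + ad

def place_objects(footprints, room_w=4.0, room_d=5.0, max_tries=1000, seed=0):
    """Rejection-sampling placement policy: propose random positions and keep
    the first collision-free one for each object."""
    rng = random.Random(seed)
    placed = []
    for name, (w, d) in footprints.items():
        for _ in range(max_tries):
            box = (rng.uniform(0, room_w - w), rng.uniform(0, room_d - d), w, d)
            if not any(overlaps(box, other) for _, other in placed):
                placed.append((name, box))
                break
        else:
            raise RuntimeError(f"could not place {name} without collisions")
    return placed

layout = place_objects({"table": (1.2, 0.8), "chair": (0.5, 0.5), "shelf": (0.9, 0.4)})
for name, (x, y, w, d) in layout:
    print(f"{name}: origin=({x:.2f}, {y:.2f}), footprint={w} x {d} m")
```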

Expected Outcomes

Students will gain hands-on experience with cutting-edge techniques in generative modeling, diffusion-based policy learning, GPT-driven data pipelines, and large-scale data generation with state-of-the-art rendering tools. Depending on progress and interest, the project may lead to a research publication and contribute to a long-term embodied AI framework.

Prerequisites

  1. Proficiency in Python and PyTorch is expected, including prior experience developing non-trivial PyTorch projects. Note: completing a coursework assignment in PyTorch on Colab alone is not considered sufficient.
  2. A solid foundation in deep learning, diffusion models, 3D computer vision, LLMs, VLMs, and computer graphics is strongly recommended. A strong mathematical background is considered an asset.

Candidates with strong motivation and interest in the topic are encouraged to apply, even if not all prerequisites are fully met. 

Contact

[email protected]
[email protected]

References

[1] Diffusion Policy: Visuomotor Policy Learning via Action Diffusion, RSS 2023.

[2] Consistency Policy: Accelerated Visuomotor Policies via Consistency Distillation, RSS 2024.

[3] Denoising Diffusion Probabilistic Models, NeurIPS 2020.

[4] LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models, CVPR 2025.

[5] CAST: Component-Aligned 3D Scene Reconstruction from an RGB Image, SIGGRAPH 2025.