
Imagine a robot that not only follows instructions, but also autonomously decomposes complex goals, perceives the world, and takes actions like a human. This would mark a significant step toward general-purpose embodied intelligence, with applications in autonomous exploration in space, personal assistance, industrial automation, and beyond. This project offers the opportunity to work on frontier topics at the intersection of generative modeling, robotics, and embodied intelligence, contributing to the foundation of scalable, general-purpose robotic systems.
In this project, we aim to develop an advanced learning algorithm that can, in a human-like manner, perceive and understand the physical world, and generate appropriate embodied decisions to accomplish complex tasks. To this end, we focus on cutting-edge techniques such as Diffusion Models, LLMs, VLMs, and Generative Modeling, exploring how to integrate them into Embodied AI systems. This project centers on two core challenges: designing efficient diffusion-based robotic policies, and developing GPT-powered scene generation methods.
This will involve several subprojects in the following areas:
Efficient diffusion-based robotic policies
Problem
Recent work such as Diffusion Policy[1] and Consistency Policy[2] has demonstrated the potential of diffusion models[3] as powerful motion planners for robotic tasks. However, current implementations are too slow and too large for real-time, on-device applications. This makes them unsuitable for deployment in latency-sensitive or power-constrained environments, such as drones, home assistants, or mobile manipulators, where fast and lightweight control is essential.
Significance
Improving the efficiency of diffusion-based robotic policies could unlock fully self-contained, real-time intelligent control, allowing robots to operate exclusively on their onboard hardware, without reliance on external servers or cloud computation. This would be a crucial step toward truly autonomous, deployable robotic systems.
Challenges
Tackling these limitations requires innovation in model architecture, conditioning mechanisms, and sampling acceleration techniques, balancing the trade-off between representation capacity and computational efficiency.
Student Contribution
Students will contribute to the design and development of such efficient diffusion policy models, bringing diffusion-based visuomotor algorithms from research prototypes to real-world deployment.
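To make the intended direction concrete, below is a minimal sketch of a conditional diffusion policy with a few-step DDIM-style sampler in PyTorch, illustrating the latency/quality trade-off this subproject targets. All module names, dimensions, and schedules here are illustrative assumptions and are not taken from the Diffusion Policy[1] or Consistency Policy[2] codebases.

```python
# Minimal sketch of a conditional diffusion policy with a few-step DDIM-style
# sampler. Module names, dimensions, and the schedule are illustrative
# assumptions, not taken from the Diffusion Policy or Consistency Policy code.
import torch
import torch.nn as nn

class NoiseNet(nn.Module):
    """Predicts the noise added to an action sequence, conditioned on an
    observation embedding and the diffusion timestep."""
    def __init__(self, act_dim=7, horizon=16, obs_dim=64, hidden=256):
        super().__init__()
        self.act_dim, self.horizon = act_dim, horizon
        self.net = nn.Sequential(
            nn.Linear(act_dim * horizon + obs_dim + 1, hidden), nn.Mish(),
            nn.Linear(hidden, hidden), nn.Mish(),
            nn.Linear(hidden, act_dim * horizon),
        )

    def forward(self, noisy_actions, obs_emb, t):
        x = torch.cat([noisy_actions.flatten(1), obs_emb, t.float().unsqueeze(1)], dim=1)
        return self.net(x).view(-1, self.horizon, self.act_dim)

@torch.no_grad()
def ddim_sample(model, obs_emb, alphas_cumprod, num_steps=5):
    """Deterministic DDIM-style sampling with only a few denoising steps;
    fewer steps trade sample quality for lower inference latency."""
    B, T = obs_emb.shape[0], len(alphas_cumprod)
    actions = torch.randn(B, model.horizon, model.act_dim)   # start from pure noise
    step_ids = torch.linspace(T - 1, 0, num_steps).long()    # coarse timestep grid
    for i, t in enumerate(step_ids):
        a_t = alphas_cumprod[t]
        eps = model(actions, obs_emb, t.repeat(B))            # predicted noise
        x0 = (actions - (1 - a_t).sqrt() * eps) / a_t.sqrt()  # estimate of clean actions
        a_prev = alphas_cumprod[step_ids[i + 1]] if i + 1 < num_steps else torch.tensor(1.0)
        actions = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps
    return actions

# Usage: linear beta schedule and a stand-in observation embedding.
betas = torch.linspace(1e-4, 0.02, 100)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
model = NoiseNet()
obs_emb = torch.randn(4, 64)            # would come from an image/state encoder
action_seq = ddim_sample(model, obs_emb, alphas_cumprod, num_steps=5)
print(action_seq.shape)                 # torch.Size([4, 16, 7])
```

Reducing the number of sampling steps, or distilling the sampler as in Consistency Policy[2], is one avenue for the acceleration work described above.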
GPT-powered scene generation
Problem
Just as GPT models rely on massive, high-quality training data, embodied AI systems require structured, diverse, and well-annotated data to generalize across tasks and environments. However, current embodied datasets are fragmented, inconsistently formatted, and expensive to collect, posing a major bottleneck to scalable training.
Significance
Synthetic data is an appealing solution due to its scalability and precision. Automating its generation can dramatically reduce human effort and significantly accelerate data production. If successful, this approach could transform how we train robots by enabling GPT-style pretraining for embodied tasks, that is, training generalizable policies on large-scale synthetic interaction data using self-supervised objectives, ultimately enabling generalization across diverse environments and tasks.
Challenges
Current pipelines for synthetic scene generation depend heavily on manual design, which is time-consuming and difficult to scale. Addressing this requires leveraging LLMs, VLMs, and generative models to build systems that can create diverse, realistic, and semantically coherent 3D environments from text or image prompts.
Student Contribution
Students will contribute to the development of GPT-powered scene generation algorithms capable of generating training scenes automatically and at scale. This includes exploring key components such as scene layout generation[4] and object placement policies[5], leveraging LLMs, VLMs, and generative models.
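As a rough illustration, the sketch below prompts a language model for a JSON room layout and applies simple placement checks before the scene would be handed to a simulator or renderer. The prompt schema, the query_llm stub, and the validation rules are hypothetical placeholders rather than an existing pipeline.

```python
# Minimal sketch of an LLM-driven scene layout pipeline: prompt a language model
# for a JSON room layout, parse it, and run simple placement checks before
# handing the scene to a simulator or renderer. query_llm is a hypothetical
# stand-in for any chat-completion API; the schema and checks are assumptions.
import json

PROMPT_TEMPLATE = (
    "Generate a JSON list of objects for a {room}. Each entry must have "
    '"name", "position" [x, y] in meters, and "size" [w, d] in meters. '
    "Keep all objects inside a {w}x{d} meter floor plan. Return only JSON."
)

def query_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real chat-completion client.
    Here it returns a canned response so the sketch runs offline."""
    return json.dumps([
        {"name": "table", "position": [2.0, 1.5], "size": [1.2, 0.8]},
        {"name": "chair", "position": [2.0, 2.4], "size": [0.5, 0.5]},
    ])

def inside_floor(obj, w, d):
    """Check that an object's footprint stays within the floor plan."""
    (x, y), (ow, od) = obj["position"], obj["size"]
    return 0 <= x - ow / 2 and x + ow / 2 <= w and 0 <= y - od / 2 and y + od / 2 <= d

def overlaps(a, b):
    """Axis-aligned footprint overlap test between two objects."""
    (ax, ay), (aw, ad) = a["position"], a["size"]
    (bx, by), (bw, bd) = b["position"], b["size"]
    return abs(ax - bx) < (aw + bw) / 2 and abs(ay - by) < (ad + bd) / 2

def generate_scene(room="small office", w=4.0, d=3.0):
    """Query the (stubbed) LLM, keep in-bounds objects, and drop collisions."""
    layout = json.loads(query_llm(PROMPT_TEMPLATE.format(room=room, w=w, d=d)))
    placed = []
    for obj in (o for o in layout if inside_floor(o, w, d)):
        if not any(overlaps(obj, p) for p in placed):
            placed.append(obj)
    return placed

if __name__ == "__main__":
    for obj in generate_scene():
        print(obj["name"], obj["position"], obj["size"])
```

In a real system, the stub would be replaced by an actual LLM or VLM call, and the heuristic checks by richer layout and placement constraints of the kind explored in [4] and [5].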
Expected Outcomes
Students will gain hands-on experience with cutting-edge techniques in generative modeling, diffusion-based policy learning, GPT-driven data pipelines, and large-scale data generation using state-of-the-art rendering tools. Depending on progress and interest, the project may lead to a research publication and contribute to a long-term embodied AI framework.
Prerequisites
- Proficiency in PyTorch and Python is expected, with prior experience in developing non-trivial projects using PyTorch. Note: completing a coursework assignment using PyTorch and Colab alone is not considered sufficient.
- A solid foundation in deep learning, diffusion models, 3D computer vision, LLMs, VLMs, and computer graphics is strongly recommended. A strong mathematical background is considered an asset.
Candidates with strong motivation and interest in the topic are encouraged to apply, even if not all prerequisites are fully met.
Contact
[email protected]
[email protected]
References
[1] Diffusion Policy: Visuomotor Policy Learning via Action Diffusion, RSS 2023
[2] Consistency Policy: Accelerated Visuomotor Policies via Consistency Distillation, RSS 2024
[3] Denoising Diffusion Probabilistic Models, NeurIPS 2020
[4] LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models, CVPR 2025
[5] CAST: Component-Aligned 3D Scene Reconstruction from an RGB Image, SIGGRAPH 2025