Recovering Thin Structures in 3D Foundation Models ‒ CVLAB ‐ EPFL

3D reconstruction has recently been transformed by 3D foundation models [1, 2], which predict scene geometry and camera parameters using only a transformer-based feed-forward network.

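However, transformers are known to be less effective on thin structures, primarily because of coarse patchification: the image is split into fixed-size patches before being processed, so a structure much narrower than a patch contributes only a small fraction of the pixels in any token. This failure mode is visible even with the state-of-the-art model DepthAnything3 [2] (https://depth-anything-3.github.io/).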
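To make the patchification argument concrete, the following minimal sketch measures how much of each patch a thin structure actually occupies. It is only an illustration: the image size (224), the patch size (14), the wire width (2 pixels), and the helper `patchify` are assumptions chosen for this example, not values or code taken from any particular model.

```python
import numpy as np


def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W) image into non-overlapping (patch, patch) tiles.

    Returns an array of shape (H // patch, W // patch, patch, patch).
    """
    h, w = image.shape
    if h % patch != 0 or w % patch != 0:
        raise ValueError("image size must be divisible by the patch size")
    return image.reshape(h // patch, patch, w // patch, patch).transpose(0, 2, 1, 3)


# Illustrative values only (assumptions, not taken from any specific model).
H = W = 224
PATCH = 14
WIRE_WIDTH = 2

# Synthetic binary mask: a thin vertical wire on an empty background.
mask = np.zeros((H, W), dtype=np.float32)
mask[:, 100:100 + WIRE_WIDTH] = 1.0

tiles = patchify(mask, PATCH)
coverage = tiles.mean(axis=(2, 3))  # fraction of wire pixels in each patch
touched = coverage > 0

print("patches touched by the wire:", int(touched.sum()), "of", coverage.size)
print("largest wire fraction in any patch:", float(coverage.max()))
```

With these assumed values, the wire never fills more than 2/14 (about 14%) of any patch it crosses, so the remaining pixels in every such token belong to the background.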

Objectives

In this project, we will investigate how to improve the model’s performance on thin structures.

Prerequisites

Proficiency in the Python programming language
Familiarity with deep learning and PyTorch
Knowledge of the basics of 3D computer vision

Contact

Interested students can email [email protected]. Please also include your CV and transcript.

References

[1] VGGT: Visual Geometry Grounded Transformer. CVPR 2025. https://arxiv.org/pdf/2503.11651

[2] Depth Anything 3: Recovering the Visual Space from Any Views. ICLR 2026. https://arxiv.org/abs/2511.10647