Self-supervision signals for video-based human mesh recovery ‒ CVLAB ‐ EPFL

Description

Precise human mesh recovery (HMR) is a long-standing research topic because of its crucial role in understanding human motion. Recent works have proposed very effective solutions both for single-frame inputs, which predict the body from a body-centered image (such as the original HMR [1] or the more data-efficient approach of [2]), and for body-centered videos (e.g., TCMR [3]).

While single-frame methods aim to predict an accurate body aligned with the person in the image, video-based methods must also recover plausible body motion.

The main drawback of existing state-of-the-art methods is that they are usually trained end-to-end and require a large amount of full-body supervision, in which every data sample comes with an accurate parametric representation of the body. Such data is very expensive to acquire, which prevents the models from being adapted to in-the-wild scenarios.

We propose to explore self-supervised motion-based signals, such as:

temporal smoothing, e.g. 1€ filter [4],
optical flow, e.g. RAFT [5],
texture consistency, e.g. TexturePose [6],
human body motion priors, e.g. HuMoR [7]

These signals usually do not rely explicitly on body-related supervision and are often trained on synthetic data, yet they provide a solid motion-related signal that can be used for our purposes. They are also cheap to obtain and use; in some cases, pretrained off-the-shelf models are sufficient.
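To give a concrete idea of the simplest signal in the list, here is a minimal sketch of a scalar 1€ filter [4] in plain Python. It is an adaptive low-pass filter: the cutoff frequency rises with the estimated speed of the signal, so slow motion is smoothed strongly while fast motion is tracked with little lag. The class name, the `smooth_sequence` helper, and the default parameter values are illustrative assumptions and are not part of the project; in practice the filter would be applied per coordinate of the predicted joints or pose parameters.

```python
import math


class OneEuroFilter:
    """Scalar 1€ filter (illustrative sketch, not the project's implementation)."""

    def __init__(self, rate, min_cutoff=1.0, beta=0.0, d_cutoff=1.0):
        self.rate = rate              # sampling rate in Hz, e.g. the video frame rate
        self.min_cutoff = min_cutoff  # cutoff frequency (Hz) used when the signal is still
        self.beta = beta              # how fast the cutoff grows with signal speed
        self.d_cutoff = d_cutoff      # cutoff frequency (Hz) for the derivative estimate
        self.x_prev = None            # previous filtered value
        self.dx_prev = 0.0            # previous filtered derivative

    def _alpha(self, cutoff):
        # Smoothing factor of an exponential filter with the given cutoff.
        tau = 1.0 / (2.0 * math.pi * cutoff)
        te = 1.0 / self.rate
        return 1.0 / (1.0 + tau / te)

    def __call__(self, x):
        if self.x_prev is None:
            # First sample: nothing to smooth against yet.
            self.x_prev = x
            return x
        # Low-pass filtered estimate of the signal speed.
        dx = (x - self.x_prev) * self.rate
        a_d = self._alpha(self.d_cutoff)
        dx_hat = a_d * dx + (1.0 - a_d) * self.dx_prev
        # Speed-dependent cutoff: faster motion means less smoothing.
        cutoff = self.min_cutoff + self.beta * abs(dx_hat)
        a = self._alpha(cutoff)
        x_hat = a * x + (1.0 - a) * self.x_prev
        self.x_prev = x_hat
        self.dx_prev = dx_hat
        return x_hat


def smooth_sequence(values, rate, min_cutoff=1.0, beta=0.0):
    """Filter a whole 1D trajectory (hypothetical helper for this example)."""
    f = OneEuroFilter(rate, min_cutoff=min_cutoff, beta=beta)
    return [f(v) for v in values]


# A jittery trajectory sampled at 30 FPS is pulled towards its mean.
jittery = [0.0, 1.0] * 10
smoothed = smooth_sequence(jittery, rate=30.0)
```

With `beta=0` the filter reduces to a fixed exponential low-pass filter; increasing `beta` trades some smoothing for lower lag during fast motion.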

We will explore how helpful these signals are for making single-frame models more motion-robust, and develop our own solution that uses them as a source of supervision.
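As one illustration of what a motion-based signal without body annotations might look like, the sketch below penalizes the acceleration (second finite difference) of predicted 3D joint trajectories. This is only an assumed example of a temporal smoothness term, not the method this project will use; the function name and the input layout (a list of frames, each a list of `(x, y, z)` tuples) are hypothetical. In a PyTorch training loop the same quantity would be computed with tensor operations so that gradients can flow back to the model.

```python
def acceleration_penalty(joints):
    """Mean squared second difference of joint positions over time.

    joints: list of frames; each frame is a list of (x, y, z) tuples,
    one per joint. Returns 0.0 for sequences shorter than three frames.
    """
    if len(joints) < 3:
        return 0.0
    total, count = 0.0, 0
    # Slide a window of three consecutive frames over the sequence.
    for prev, cur, nxt in zip(joints, joints[1:], joints[2:]):
        for p, c, n in zip(prev, cur, nxt):
            for a, b, d in zip(p, c, n):
                acc = a - 2.0 * b + d  # discrete acceleration of one coordinate
                total += acc * acc
                count += 1
    return total / count


# One joint moving at constant velocity: no acceleration, so no penalty.
linear = [[(float(t), 0.0, 0.0)] for t in range(5)]
# The same joint jumping back and forth between frames: large penalty.
jitter = [[(float(t % 2), 0.0, 0.0)] for t in range(5)]
```

A term like this needs no ground-truth meshes, but on its own it also rewards a frozen pose, so it would have to be combined with an image-alignment objective.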

Prerequisites

The candidate should have programming experience in Python and PyTorch (as the neural network framework).

The main requirements are curiosity to learn new things and a willingness to overcome difficulties.

Contact

If you are interested in this project, please send an email to Andrey Davydov. More details and explanations about the project can be provided in person or via Zoom.

References

[1] End-to-end Recovery of Human Shape and Pose, A. Kanazawa et al., CVPR, 2018.

[2] Exemplar Fine-Tuning for 3D Human Pose Fitting Towards In-the-Wild 3D Human Pose Estimation, H. Joo et al., 3DV, 2020.

[3] TCMR: Beyond Static Features for Temporally Consistent 3D Human Pose and Shape from a Video, H. Choi et al., CVPR, 2021.

[4] 1€ low-pass filter.

[5] RAFT: Recurrent All Pairs Field Transforms for Optical Flow, Z. Teed and J. Deng, ECCV, 2020.

[6] TexturePose: Supervising Human Mesh Estimation with Texture Consistency, G. Pavlakos et al., ICCV, 2019.

[7] HuMoR: 3D Human Motion Model for Robust Pose Estimation, D. Rempe et al., ICCV, 2021.