Available Projects – Fall 2023

Research Projects are open to EPFL students.

Some types of moiré patterns rely on grayscale images. These moiré patterns can be used for the prevention of counterfeits. The present project aims at creating a grayscale image editor. Designers should be able to shape their grayscale image by various means (interpolation between spatially defined grayscale values, geometric transformations, image warping, etc.).

Deliverables: Report and running prototype (Matlab).

Prerequisites:

– knowledge of image processing / computer vision
– basic coding skills in Matlab

Level: BS or MS semester project

Supervisors:

Dr Romain Rossier, Innoview Sàrl, [email protected], tel 078 664 36 44
Prof. Roger D. Hersch, BC110, [email protected], cell: 077 406 27 09

Startup company Innoview Sàrl has developed software to recover a message hidden in patterns. Appropriate parameter settings enable the detection of counterfeits. The goal of the project is to define optimal parameters for different sets of printing conditions (resolution, type of paper, type of printing device, complexity of the hidden watermark, etc.). The project involves tests on a large data set and appropriate statistics.

Deliverables: Report and running prototype (Android, Matlab).

Prerequisites:

– knowledge of image processing / computer vision
– basic coding skills in Matlab and/or Java Android

Level: BS or MS semester project

Supervisors:

Dr Romain Rossier, Innoview Sàrl, [email protected], tel 078 664 36 44
Prof. Roger D. Hersch, BC110, [email protected], cell: 077 406 27 09


Description:

Recent research shows that existing large language models (trained on text data only) seem to be good at recognizing patterns from general sequences, extending beyond textual data to include arbitrary tokens [1, 2]. Images can also be represented by tokens (e.g. pixel values), and it seems that LLMs can perform some tasks like image denoising and completion [1, Figure 8], despite not being trained on images but on text data only.

Key Question:

The central question driving this project is whether LLMs, which have proven effective at denoising simple patterns [1, Figure 8], can be used to generate images from scratch through an iterative denoising process, akin to diffusion models [3].

Can we use the sampling algorithm employed by diffusion models [3] without the need for UNet training, replacing it with queries to an LLM?
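To make this question concrete, the sketch below shows a heavily simplified, DDPM-style sampling loop in which the learned denoiser is replaced by a call to an LLM. The function query_llm_for_cleaner_image and the noise schedule are placeholders invented for illustration; they are not part of any existing API, and a real implementation would follow the schedule of [3] and prompt an actual LLM with tokenized pixel values as in [1].

```python
import numpy as np

def query_llm_for_cleaner_image(tokens, step):
    """Placeholder for an LLM query: the prompt would contain the noisy pixel tokens and
    ask the model to return a slightly cleaner token sequence, as suggested by [1, Figure 8].
    Here it simply returns its input so that the sketch runs end to end."""
    return tokens

def llm_driven_sampling(shape=(8, 8), T=50, seed=0):
    """Very simplified, DDPM-like ancestral sampling loop [3] with the denoiser replaced
    by (hypothetical) LLM queries instead of a trained UNet."""
    rng = np.random.default_rng(seed)
    x = rng.integers(0, 256, size=shape).astype(float)       # start from pure "noise" tokens
    for t in reversed(range(1, T + 1)):
        tokens = x.round().astype(int).flatten().tolist()
        x0_hat = np.array(query_llm_for_cleaner_image(tokens, t), dtype=float).reshape(shape)
        sigma = 80.0 * (t - 1) / T                           # crude noise schedule, illustration only
        x = x0_hat + rng.normal(0.0, sigma, size=shape)      # re-noise to the previous level
    return np.clip(x, 0, 255).round().astype(int)

image = llm_driven_sampling()
```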

 

References:

[1] Mirchandani, Suvir, et al. “Large language models as general pattern machines.” arXiv preprint arXiv:2307.04721 (2023).

[2] https://slideslive.com/39006507/interactive-learning-in-the-era-of-large-models 19:35

[3] Ho, Jonathan, Ajay Jain, and Pieter Abbeel. “Denoising diffusion probabilistic models.” Advances in neural information processing systems 33 (2020): 6840-6851.

 

Deliverables: Deliverables should include code, well cleaned up and easily reproducible, as well as a written report, explaining the models, the steps taken for the project and the performances of the models.

Prerequisites: Python and PyTorch.

Level:  MS research project

Number of students: 1

Supervisor: Martin Nicolas Everaert (martin.everaert [at] epfl.ch)

Description:

The project aims to explore the potential advantages of employing 2D Gaussian Splatting as a technique for image representation and compression.

3D Gaussian Splatting [1] is a recent technique used to represent a scene by 3D Gaussians from a set of photos taken from different angles.

2D Gaussian Splatting [2], akin to its 3D counterpart, aims to represent images as a collection of 2D Gaussians, by optimizing their shapes, transparencies, positions, and colors. In comparison, other techniques such as DiffVG [3] or LIVE [4] focus on optimizing coordinates and colors of vector objects such as polygons and Bezier closed shapes, which is more delicate due to the non-differentiability of rasterized pixel values.
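As a rough illustration of the idea (not the implementation of [2]), the following sketch renders an image as an additive blend of K anisotropic 2D Gaussians and optimizes their positions, scales, orientations, colors, and opacities by gradient descent in PyTorch; the random target image and the additive blending are simplifications chosen to keep the snippet self-contained.

```python
import torch

def render_gaussians(means, log_scales, thetas, colors, opacities, H, W):
    """Splat K anisotropic 2D Gaussians onto an H x W canvas (naive dense rendering)."""
    ys, xs = torch.meshgrid(torch.linspace(0, 1, H), torch.linspace(0, 1, W), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).reshape(-1, 2)            # (HW, 2) pixel coordinates
    d = grid[None] - means[:, None]                                 # (K, HW, 2) offsets
    cos, sin = torch.cos(thetas), torch.sin(thetas)
    R = torch.stack([torch.stack([cos, -sin], -1),
                     torch.stack([sin,  cos], -1)], -2)             # (K, 2, 2) rotations
    scales = torch.exp(log_scales)                                  # (K, 2) per-axis std devs
    d_local = torch.einsum("kij,knj->kni", R, d) / scales[:, None]  # offsets in Gaussian frames
    w = torch.exp(-0.5 * (d_local ** 2).sum(-1)) * torch.sigmoid(opacities)[:, None]  # (K, HW)
    img = (w[..., None] * torch.sigmoid(colors)[:, None]).sum(0)    # additive blending, (HW, 3)
    return img.reshape(H, W, 3).clamp(0, 1)

# Fit K Gaussians to a target image by gradient descent (the target here is random noise
# purely to make the snippet self-contained; in the project it would be a real image).
K, H, W = 256, 64, 64
target = torch.rand(H, W, 3)
params = {name: torch.randn(shape, requires_grad=True) for name, shape in
          [("means", (K, 2)), ("log_scales", (K, 2)), ("thetas", (K,)),
           ("colors", (K, 3)), ("opacities", (K,))]}
params["means"].data.uniform_(0, 1)
params["log_scales"].data.fill_(-3.0)
opt = torch.optim.Adam(params.values(), lr=1e-2)
for step in range(200):
    opt.zero_grad()
    loss = ((render_gaussians(**params, H=H, W=W) - target) ** 2).mean()
    loss.backward()
    opt.step()
```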

Key Questions:

Optimization Efficiency: Is 2D Gaussian Splatting [2] a faster and more stable approach to represent an image than DiffVG [3] and LIVE [4], especially considering the complexities of optimizing vector objects?

Image Quality: Given a fixed number of parameters (2D Gaussian parameters for instance) or a specific compression ratio, which method yields superior image quality – 2D Gaussian Splatting [2] or vector object-based techniques like DiffVG [3] and LIVE [4]?

Text-to-2DGaussians: Recent work like VectorFusion [5] leverages Diffusion models and score distillation sampling (SDS, [6]) to generate SVG images (vector objects) from text. Can we adapt them to use 2D Gaussians in place of vector objects?

 

Tasks:

Comparative Analysis: Evaluate and compare 2D Gaussian Splatting with DiffVG and LIVE in terms of optimization time, reconstruction quality, training stability, and other relevant performance metrics.

Compression Algorithm Design: Design an algorithm that uses such methods for image compression. Compare the different methods.

Text-to-2DGaussians: Implement an algorithm to perform Text-to-2DGaussians using SDS loss (or similar) as optimization objective. Assess the feasibility and quality of the technique.

 

References:

[1] Kerbl, Bernhard, et al. “3d gaussian splatting for real-time radiance field rendering.” ACM Transactions on Graphics (ToG) 42.4 (2023): 1-14.

[2] https://github.com/OutofAi/2D-Gaussian-Splatting

[3] Li, Tzu-Mao, et al. “Differentiable vector graphics rasterization for editing and learning.” ACM Transactions on Graphics (TOG) 39.6 (2020): 1-15.

[4] LIVE: Ma, Xu, et al. “Towards layer-wise image vectorization.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.

[5] Jain, Ajay, Amber Xie, and Pieter Abbeel. “Vectorfusion: Text-to-svg by abstracting pixel-based diffusion models.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023

[6] Poole, Ben, et al. “Dreamfusion: Text-to-3d using 2d diffusion.” arXiv preprint arXiv:2209.14988 (2022).

 

Deliverables: Deliverables should include code, well cleaned up and easily reproducible, as well as a written report, explaining the models, the steps taken for the project and the performances of the models.

Prerequisites: Python and PyTorch.

Level: BS or MS research project

Number of students: 1

Supervisor: Martin Nicolas Everaert (martin.everaert [at] epfl.ch)

Introduction

Computed Tomography (CT) images are particularly helpful for the medical diagnosis of multiple diseases, especially for brain lesion analysis. While high-dose CT images are easy to interpret thanks to their sharp contrast, the contrast agent and the X-ray radiation used to enhance the contrast are invasive and toxic for the patient. For that reason, most CT imaging is performed at low dose. This implies that the resulting image is less contrasted and that the noise level is much higher [1].

In that regard, CT image denoising is a crucial step to visualize and analyze such images. Most classic image processing algorithms have been transferred to the medical image domain, producing satisfactory results [1]. The results have been boosted by a large margin by the introduction of neural networks for image restoration [2, 3].

However, one key limitation for the interpretability of these methods is that neural networks hallucinate patterns and textures they have seen in the training set. Therefore, these methods are not really trustworthy for radiologists.

In order to leverage the expressive power and denoising capacity of deep neural networks without recreating patterns seen during training, the idea is to train on synthetic abstract images which do not directly contain the patterns observed in real CT images. That way, our network won't be biased towards reproducing expected patterns, while maintaining good performance.
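As a hedged illustration of such synthetic training data, the sketch below generates a toy dead-leaves image (opaque disks with power-law distributed radii, cf. [4, 5]); the parameter values are arbitrary and would need tuning to match the statistics of real CT images.

```python
import numpy as np

def dead_leaves(size=256, n_disks=2000, rmin=4, rmax=100, alpha=3.0, rng=np.random.default_rng(0)):
    """Toy scale-invariant dead-leaves image: opaque disks with power-law radii are stacked
    front to back, each painted with a random gray level (cf. [4, 5])."""
    img = np.zeros((size, size), dtype=np.float32)
    filled = np.zeros((size, size), dtype=bool)
    ys, xs = np.mgrid[0:size, 0:size]
    for _ in range(n_disks):
        # Inverse-CDF sampling of p(r) ~ r^{-alpha} gives approximate scale invariance.
        u = rng.uniform(rmax ** (1 - alpha), rmin ** (1 - alpha))
        r = u ** (1.0 / (1 - alpha))
        cx, cy = rng.uniform(0, size, 2)
        disk = (xs - cx) ** 2 + (ys - cy) ** 2 <= r ** 2
        new = disk & ~filled               # only paint pixels not yet covered by a closer leaf
        img[new] = rng.uniform(0, 1)
        filled |= disk
        if filled.all():
            break
    return img

leaves = dead_leaves()                     # (256, 256) grayscale image in [0, 1]
```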

Tasks

In this semester project, we will try to:

  • create a database of abstract dead leaves images mimicking the statistics of real images [4,5]
  • study the noise distribution of real CT images by using a real dataset of noisy images
  • train a denoising network with the simulated ground truth and noisy images
  • quantify the hallucinations made by the network trained on real images vs simulated images [6]
  • establish a test protocol for our network

Deliverables

Deliverables should include code, well cleaned up and easily reproducible, as well as a written report, explaining the models, the steps taken for the project and the performances of the models.

Prerequisites: Python and PyTorch, basics of image processing

Level: MS research project

Number of students: 1

References :

  1. A review on CT image noise and its denoising, 2018, Manoj Diwakara, Manoj Kumarb
  2. Low-Dose CT Image Denoising Using a Generative Adversarial Network With a Hybrid Loss Function for Noise Learning, 2020, YINJIN MA et al
  3. Investigation of Low-Dose CT Image Denoising Using Unpaired Deep Learning Methods, 2021, Zeheng Li, Shiwei Zhou, Junzhou Huang, Lifeng Yu, and Mingwu Jin
  4. Occlusion Models for Natural Images: A Statistical Study of a Scale-Invariant Dead Leaves Model, 2001, Ann B. Lee, David Mumford, Jinggang Huang
  5. Synthetic images as a regularity prior for image restoration neural networks, 2021, Raphael Achddou, Yann Gousseau, Said Ladjal
  6. Image Denoising with Control over Deep Network Hallucination, 2022, Qiyuan Liang, Florian Cassayre, Haley Owsianko, Majed El Helou, Sabine Süsstrunk

Description:

IVRL spin-off Largo.ai has developed a platform for film producers to predict the success of a film based on the script and relevant production and distribution information. In this research project, we will investigate the possibilities of improving a movie screenplay by using generative AI. The current system is able to run a genre analysis of the script. The developed model should alter the original screenplay according to the feedback from the predictive genre models.
 
Deliverables:
 
Report and running prototypes.
 
Prerequisites:
  • Coding skills in Python
  • Experience and Interest in Natural Language Processing
  • Experience or Interest in Generative AI
Level: BS or MS semester project
 
IVRL spin-off Largo.ai provides screenplay tools for the movie industry. In this project, we will research text-to-speech methods that can reflect the provided emotions in the scenes of movie screenplays. A model for emotion detection will be provided. The project aims at creating a system that is able to generate audio from the dialogues of the script. In particular, the following features are required:
  1. Use of different voices for different characters
  2. Automated understanding of the emotions suggested by a given part of the dialogues
  3. Adaptation of the generated audio to the emotions found (e.g. emphasis, speaking rate, pitch, volume)
We expect research on the existing open-source methods and adaptation of the state-of-the-art method for the purposes of screenplay dialogues.
 
Deliverables:
 
Report and running prototypes.
 
Prerequisites:
  • Coding skills in Python
  • Experience or Interest in Text-to-Speech Models
Level: BS or MS semester project

Supervisors: [email protected]; [email protected]

Description

Single-image super-resolution is the task of increasing an image's resolution by inferring its missing high frequencies. For this task, neural networks are commonly trained on large natural-image datasets [1, 2]. While these models produce impressive results on natural images, they fail to generalize to unseen domains, such as drone images. To tackle this issue, [3] proposed DSR, a drone super-resolution dataset along with a baseline network. Each scene is acquired at multiple heights, on two different cameras with two focal lengths (the tele camera provides the ground-truth HR image, the normal camera the LR image). In this project, we will work on improving the proposed approach. Possible directions of improvement are: (1) the domain gap between the two cameras, for which we may train a camera-to-camera mapping [4, 5]; (2) exploiting images taken at different heights to improve the current results; (3) proposing new architectures [6]; (4) working on a lightweight distillation of the model that could run in real time.
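As a point of reference for the evaluation, the sketch below computes the PSNR of a plain bicubic-upsampling baseline in PyTorch; the LR/HR tensors are random placeholders standing in for a DSR image pair, and any learned drone-SR model should beat this score.

```python
import torch
import torch.nn.functional as F

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio between two images with values in [0, 1]."""
    mse = F.mse_loss(pred, target)
    return 10.0 * torch.log10(max_val ** 2 / mse)

# Bicubic upsampling of the LR frame gives the reference score for learned models.
lr = torch.rand(1, 3, 128, 128)     # placeholder for the normal-camera LR image
hr = torch.rand(1, 3, 512, 512)     # placeholder for the tele-camera HR ground truth
baseline = F.interpolate(lr, size=hr.shape[-2:], mode="bicubic", align_corners=False).clamp(0, 1)
print(f"Bicubic baseline PSNR: {psnr(baseline, hr).item():.2f} dB")
```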

References

[1] SwinIR: Image Restoration Using Swin Transformer, 2021

[2] Zoom To Learn, Learn To Zoom, 2019

[3] DSR Towards Drone Image Super-Resolution, 2022

[4] Cross-camera convolutional color constancy 2021

[5] Semi-Supervised Raw-to-Raw Mapping, 2021

[6] Image Super-Resolution via Iterative Refinement, 2021

Deliverables

Project report and reproducible code.

Prerequisites

Experience with Python, Pytorch, preferably some expertise in Image processing and machine learning.

Level

MS semester project

Type of Work

60 % research 40 % implementation

Supervisors

Raphael Achddou ([email protected])

Description

Low-light image enhancement is a critical image restoration task for camera and smartphone manufacturers. Classic approaches exploit bursts of photos and apply collaborative filtering to the acquired stack of images; however, this is time-consuming and not very efficient. In 2018, [1] proposed to learn the full image processing pipeline for RAW low-light images and released a large RAW image dataset. While this work was impactful, acquiring such a dataset is a tedious and time-consuming task. To tackle this issue, [2] proposes to synthesize the distortion with a complex but physically sound distortion model. However, this paper assumes that the noise distribution is the same at every pixel, while noise is very likely spatially varying for a specific sensor. Another approach consists in learning the noise distribution by means of a generative network. While this has been done successfully for standard RAW noise [3, 4, 5], no particular effort has been dedicated to extreme low-light settings. In this project, we will apply these techniques to model extreme low-light noise and try out diffusion models for noise generation [6]. We will validate our noise model by training a low-light image enhancement model as well as by computing statistical scores.
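To illustrate the kind of signal-dependent noise involved, here is a minimal, physics-inspired sketch (Poisson shot noise plus Gaussian read noise) applied to a clean normalized RAW frame; the gain and read-noise values are arbitrary, and real sensor models such as [2] include additional components (row noise, quantization, spatial variation).

```python
import numpy as np

def add_low_light_noise(clean, photon_gain=0.01, read_std=0.002, rng=np.random.default_rng(0)):
    """Simplified RAW noise model: Poisson shot noise plus Gaussian read noise.
    `photon_gain` scales normalized intensities into photo-electrons (a smaller gain
    corresponds to a darker scene); values here are illustrative only."""
    electrons = clean / photon_gain                       # expected photo-electron counts
    shot = rng.poisson(np.clip(electrons, 0, None))       # signal-dependent shot noise
    read = rng.normal(0.0, read_std / photon_gain, size=clean.shape)
    noisy = (shot + read) * photon_gain                   # back to normalized RAW intensities
    return np.clip(noisy, 0.0, 1.0).astype(np.float32)

# Simulate a very dark capture from a clean normalized RAW frame (random placeholder).
clean = np.random.default_rng(1).uniform(0, 0.05, size=(128, 128)).astype(np.float32)
noisy = add_low_light_noise(clean)
```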

References

[1] Learning to see in the dark, 2018

[2] A physics-based noise formation model for extreme low-light raw denoising 2020

[3] Modeling sRGB Camera Noise with Normalizing Flows, 2022

[4] C2n: Practical generative noise modeling for real-world denoising, 2021

[5] Noise Modeling with Conditional Normalizing Flows, 2019

[6] Realistic Noise Synthesis with Diffusion Models, 2023

Deliverables

Project report and reproducible code.

Prerequisites

Experience with Python, Pytorch, preferably some expertise in Image processing and machine learning.

Level

MS semester project

Type of Work

60 % research 40 % implementation

Supervisors

Raphael Achddou ([email protected])

Description:

Startup company Innoview Sàrl has developed software to recover, with a smartphone, a watermark hidden in a grayscale image. The present project aims at improving the registration precision between an image captured with a smartphone and a reference image. The robustness to changes in the acquisition conditions should also be improved. Registrations carried out with different types of registration symbols or patterns are to be characterized and compared.

Deliverables: Report and running prototypes (Android and possibly PC).

Prerequisites:

– knowledge of image processing / computer vision

– Coding skills in Java Android and/or Matlab

Level: BS or MS semester project

Supervisors:

Dr Romain Rossier, Innoview Sàrl, [email protected], tel 078 664 36 44

Prof. Roger D. Hersch, BC110, [email protected], cell: 077 406 27 09

Description:

Supervised image segmentation requires strict class definitions and costly annotations from humans. Since annotation is costly, supervision is usually limited to small datasets. Methods that are trained on small datasets do not generalize to unseen classes and unseen image domains. Recent advances in vision foundation models [1, 2] and self-supervised learning [3] provide semantically rich features learned on large datasets. Even though class definitions and annotations are limited or missing, these methods [1, 2, 3] can be combined with unsupervised learning algorithms to solve the segmentation task without requiring segmentation annotations. Our aim in this project is to develop and use unsupervised learning methods to segment images from various domains (e.g., synthetic and real image domains).
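A minimal sketch of such an unsupervised pipeline is shown below: per-pixel features from a frozen backbone are clustered into pseudo-segments with k-means. The torchvision ResNet-50 (with weights=None) only stands in for a foundation model to keep the snippet self-contained; in the project, features from CLIP [1], SAM [2], or deep spectral methods [3] would replace it, and the clustering step could likewise be replaced by spectral clustering.

```python
import torch
import torchvision
from sklearn.cluster import KMeans

# A frozen backbone stands in for a foundation model; weights=None keeps the snippet
# offline, but pretrained or self-supervised weights would be loaded in practice.
backbone = torchvision.models.resnet50(weights=None)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2]).eval()

@torch.no_grad()
def unsupervised_segmentation(image, n_segments=5):
    """Cluster per-pixel features of one (3, H, W) image into pseudo-segments."""
    feats = feature_extractor(image.unsqueeze(0))              # (1, 2048, h, w) coarse features
    _, c, h, w = feats.shape
    flat = feats.squeeze(0).permute(1, 2, 0).reshape(-1, c)    # (h*w, c)
    labels = KMeans(n_clusters=n_segments, n_init=10).fit_predict(flat.numpy())
    label_map = torch.from_numpy(labels).float().reshape(1, 1, h, w)
    # Nearest-neighbor upsampling keeps the labels discrete at full resolution.
    full = torch.nn.functional.interpolate(label_map, size=image.shape[1:], mode="nearest")
    return full.squeeze().long()

segments = unsupervised_segmentation(torch.rand(3, 224, 224))  # (224, 224) integer label map
```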


References:
[1] Radford, Alec, et al. “Learning transferable visual models from natural language supervision.” International conference on machine learning. PMLR, 2021
[2] Kirillov, Alexander, et al. “Segment anything.” arXiv preprint arXiv:2304.02643 (2023).
[3] Melas-Kyriazi, Luke, et al. “Deep spectral methods: A surprisingly strong baseline for unsupervised semantic segmentation and localization.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.

Deliverables: Report and reproducible implementations
Prerequisites: Experience with Deep Learning, Computer Vision, PyTorch
Level: MS semester project
Type of work: 40% research, 60% implementation
Supervisors: Baran Ozaydin ([email protected])

Description:

Startup company Innoview Sàrl has developed software for creating compact custom codes that enable distinguishing an original document from a counterfeit. In order to detect counterfeits, the code's parameters need to be adapted to the printing environment: paper type, printer type, printer resolution and ink type. The goal of the project is to find the best parameters for different kinds of printing environments.

Deliverables: Report and running prototype (Matlab).

Prerequisites:

– knowledge of image processing / computer vision

– basic coding skills in Matlab

Level: BS or MS semester project

Supervisor:

Prof. Roger D. Hersch, BC 110, [email protected], cell: 077 406 27 09

Creative images generated by DALL-E 2.

3D objects created with DreamFusion from text prompts (DreamFusion, Ben Poole et al.). This project will be in this line of work and aims to improve the quality of generated 3D content.

Description

In recent years, a novel class of generative models known as Diffusion Models (DMs) has emerged. These models define a forward process that introduces a small amount of Gaussian noise into data samples, and a learnable reverse process (generation process) that gradually removes the noise. When applied to natural images, this modeling approach has been shown to outperform state-of-the-art Generative Adversarial Nets (GANs). 
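For reference, the closed-form forward (noising) process of a DDPM can be written in a few lines; this is a standard sketch of the formulation in Ho et al. with a linear beta schedule, and the image batch is a random placeholder.

```python
import torch

def q_sample(x0, t, alphas_cumprod):
    """Closed-form DDPM forward process: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps."""
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, *([1] * (x0.dim() - 1)))
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise, noise

T = 1000
betas = torch.linspace(1e-4, 0.02, T)              # linear schedule of per-step noise variances
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
x0 = torch.rand(4, 3, 32, 32) * 2 - 1              # toy image batch scaled to [-1, 1]
x_t, eps = q_sample(x0, torch.randint(0, T, (4,)), alphas_cumprod)
# The learnable reverse (generation) process trains a network to predict eps from (x_t, t)
# and then gradually removes the noise, starting from pure Gaussian noise.
```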

Though 2D images are natural for screens and printing, it would be nice if we could generate 3D content that can be seen from different views. However, collecting 3D data is tedious, unlike 2D photos that can be captured daily using smartphones, and naively scaling 2D models to 3D is usually computationally infeasible.

In this project, we will use 2D image supervision for 3D content creation. We will guide 3D content creation with 2D supervision through several methods, such as rasterization and volume rendering.

Deliverables

  • Code, well cleaned up and easily reproducible.
  • Written Report, explaining the literature and steps taken for the project.

Prerequisites

  • Python and PyTorch. 
  • Experience with Deep Learning methods and Convolutional Networks

Level: Master Student

Motivated students are welcome to join our team for the next DeepFake detection competition (August-September) at ICCV'23. (Photo: ceremony of the Trusted Media DeepFake detection challenge in 2022.) We also hosted the DeepFakes stand at EPFL's Open Day.

Description:

DeepFakes are subsets of fake images (especially face images) synthesized with Machine Learning algorithms. First appearing only five years ago, DeepFake generation technology has been evolving in several aspects: quality, speed, and data efficiency. As facial expression and identity are the central parts of social interaction and trust in media, there is a persistent interest in developing robust and efficient DeepFake detection algorithms.

DeepFake videos are the video counterparts of DeepFake images. Moving from images to videos opens up several new possibilities (as we witnessed in the transition from image understanding to video understanding). For instance, video as time-series data offers consistency between frames, but faces in DeepFake videos might not move as naturally as in real videos.

In this project, you will first briefly review the state-of-the-art DeepFake video detection methods. Then, based on your review, you will reproduce/re-implement a representative subset of them and test them on the FaceForensics++ dataset. Finally, your results should come with quantitative and qualitative comparisons.

For inquiries, feel free to drop an email : )

Level of work

  • MS Level, semester project / master project

Prerequisite:

  • Knowledge in deep learning frameworks (e.g., PyTorch or Tensorflow), image processing and Computer Vision

Supervisor:

Type of work:

  • 50% research, 50% development

Deliverables:

  • Code, well-cleaned up and easily reproducible

  • Written Report, explaining the literature and steps taken for the project

References:

[1] FaceForensics++: Learning to Detect Manipulated Facial Images

[2] Deepfake Video Detection through Optical Flow Based CNN

[3] Countering Malicious DeepFakes: Survey, Battleground, and Horizon

Introduction:

Recent work on text-to-2D models such as Stable Diffusion, trained on billions of text-image pairs, has achieved stunning image synthesis quality. However, such models require large-scale datasets and industrial-scale training, a process that is hard to carry over to 3D scene generation.

To deal with this problem, DreamFusion, the first work taking diffusion models to 3D generation with the help of NeRF, is able to accelerate this process and create view-consistent scene objects with mesh models for a generalized forward rendering pipeline. See https://dreamfusion3d.github.io/ for a collection of generated results.

This project will focus on text-driven 3D content generation, exploiting pretrained 2D diffusion models or similar architectures with 3D vision priors to accomplish the following tasks: 1. create larger-scale scenes with realistic environment surroundings; 2. generate specific materials with particular viewing effects that require a 3D understanding of scene intrinsics, such as transparent objects; 3. enable editable texture/stylization for scene meshes controlled via text input. We will be looking at CLIP- and diffusion-based models. A work close to this project is Text2Mesh (see the reference section).

Type of Work:

  • MS Level: semester project / master project
  • 60% research, 40% development
  • We encourage collaboration / teamwork so this project can take 2 students working on different aspects of the research problem.

Supervisor:

Prerequisite:

  • Have taken a Machine Learning course and a Computer Vision course.
  • Have sufficient Pytorch knowledge.
  • Students will be working extensively with 3D scene generation and possibly editing, therefore prior experience with 3D vision and / or computer graphics knowledge will be a plus.
  • Experience with diffusion models and / or CLIP will be a plus.

Reference Literature:

  • Poole, Ben, et al. “Dreamfusion: Text-to-3d using 2d diffusion.” arXiv preprint arXiv:2209.14988 (2022).
  • Chen, Yongwei, et al. “TANGO: Text-driven Photorealistic and Robust 3D Stylization via Lighting Decomposition.” arXiv preprint arXiv:2210.11277 (2022).
  • Rombach, Robin, et al. “High-resolution image synthesis with latent diffusion models.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
  • Radford, Alec, et al. “Learning transferable visual models from natural language supervision.” International Conference on Machine Learning. PMLR, 2021.
  • Michel, Oscar, et al. “Text2mesh: Text-driven neural stylization for meshes.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
  • Haque, Ayaan, et al. “Instruct-NeRF2NeRF: Editing 3D Scenes with Instructions.”arXiv preprint arXiv:2303.12789(2023).
  • https://github.com/ashawkey/stable-dreamfusion
  • https://github.com/CompVis/stable-diffusion

A learned Neural Radiance Field as scene representation shows good geometry reconstruction but requires more than 30 images (NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis, Ben Mildenhall et al.).

Generalizable neural representations work with as few as three input views using generalizable priors (SparseNeuS: Fast Generalizable Neural Surface Reconstruction from Sparse Views, Long et al.). This project is in this research line and aims to improve it further.

Our previous work in this line of research aims at achieving finer details and will be presented at CVPR'23.

Description :

Neural rendering is a family of rendering techniques that replaces one or more parts of the rendering pipeline with neural networks. The inductive bias of neural networks has been proven helpful in many neural rendering tasks, such as novel view synthesis (NeRF), surface reconstruction (VolSDF), and material acquisition (NeRD).

In this project, we are interested in the generalization ability of Neural Rendering, e.g., given a set of images, infer novel views directly. There are several benefits of using generalizable features. Firstly, we skip the long training procedure with a fast-forward inference. Secondly, learnable features are beneficial in sparse input cases. Thirdly, the framework enables further optimization to improve quality.

We will build a generalizable rendering model based on existing network designs such as IBRNet and MVSNet, and train it using pixel loss, depth loss, and patch warping loss.
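For context, the sketch below shows the standard volume-rendering quadrature that NeRF-style renderers (including IBRNet-like generalizable models) use to composite per-sample densities and colors into a pixel color and an expected depth; the inputs here are random placeholders.

```python
import torch

def volume_render(densities, colors, deltas):
    """Composite per-sample densities (N_rays, N_samples) and colors (N_rays, N_samples, 3)
    along each ray; `deltas` are the distances between consecutive samples."""
    alphas = 1.0 - torch.exp(-densities * deltas)                      # per-sample opacity
    trans = torch.cumprod(torch.cat([torch.ones_like(alphas[:, :1]),
                                     1.0 - alphas + 1e-10], dim=1), dim=1)[:, :-1]
    weights = alphas * trans                                           # contribution per sample
    rgb = (weights.unsqueeze(-1) * colors).sum(dim=1)                  # (N_rays, 3) pixel colors
    depth = (weights * torch.cumsum(deltas, dim=1)).sum(dim=1)         # approximate expected depth
    return rgb, depth, weights

rgb, depth, w = volume_render(torch.rand(1024, 64),
                              torch.rand(1024, 64, 3),
                              torch.full((1024, 64), 0.02))
```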

Level

MS Level: semester project/master project

Prerequisite:

Knowledge in deep learning frameworks (e.g., PyTorch), image processing, and Computer Vision.

Supervisors:

Type of work:

50% research, 50% development

References:

[1] NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis

[2] IBRNet: Learning Multi-View Image-Based Rendering

[3] Stable View Synthesis (https://arxiv.org/abs/2011.07233)

Description:

Startup company Innoview Sàrl has developed software to recover, with a smartphone, a watermark hidden in a grayscale image. The software needs to be adapted to run in C++ on diverse platforms, such as Android and PC. Performance tests need to be performed for different printing devices and parameter settings.

Deliverables: Report and running prototypes (Android and PC).

Prerequisites:

– knowledge of image processing / computer vision

– Coding skills in C++ and Java Android

Level: BS or MS semester project

Supervisors:

Dr Romain Rossier, Innoview Sàrl, [email protected], tel 078 664 36 44

Prof. Roger D. Hersch, BC110, [email protected], cell: 077 406 27 09

Description:

Startup company Innoview Sàrl has developed software to recover watermarks hidden in grayscale images. Appropriate parameter settings enable the detection of counterfeits. The goal of the project is to define optimal parameters for different sets of printing conditions (resolution, type of paper, type of printing device, complexity of the hidden watermark, etc.). The project involves tests on a large data set and appropriate statistics.

Deliverables: Report and running prototype (Android, Matlab).

Prerequisites:

– knowledge of image processing / computer vision
– basic coding skills in Matlab and Java Android

Level: BS or MS semester project

Supervisors:

Dr Romain Rossier, Innoview Sàrl, [email protected], tel 078 664 36 44
Prof. Roger D. Hersch, BC110, [email protected], cell: 077 406 27 09

Description:

Learning joint representations for language and vision [1], generating images from text [2], and describing an image via text [3] have led to interesting results and research applications. However, these methods operate on a global level, i.e., a generated image matches a sentence or a generated text matches an image. Our goal in this project is to describe smaller regions of an image with words. We will use state-of-the-art diffusion models and language-pretrained vision models to represent an image as words.


References:
[1] Radford, Alec, et al. “Learning transferable visual models from natural language supervision.” International conference on machine learning. PMLR, 2021.

[2] Rombach, Robin, et al. “High-resolution image synthesis with latent diffusion models.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.

[3] Gal, Rinon, et al. “An image is worth one word: Personalizing text-to-image generation using textual inversion.” arXiv preprint arXiv:2208.01618 (2022).

Deliverables: Report and reproducible implementations
Prerequisites: Experience with Deep Learning, Computer Vision, PyTorch
Level: MS semester project
Type of work: 50% research, 50% implementation
Supervisors: Baran Ozaydin ([email protected])

Description: Neural Cellular Automata (NCA) models are a type of computational model that extends Conway’s Game of Life, a classic example of a cellular automaton. While the Game of Life operates on a grid of cells with discrete states (either “alive” or “dead”), NCA models operate on a multi-dimensional grid with a continuous range of states. NCA models combine the strengths of cellular automata and neural networks to create a powerful tool for simulating and understanding complex systems.

In an NCA model, the update rules for each cell’s state are determined by a neural network. The neural network takes as input the current states of the cell and its neighbors and produces an output that determines the new state of the cell. By incorporating neural networks into cellular automata models, NCA models have the potential to learn and simulate complex systems, such as biological or physical systems.
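As a minimal sketch of such an update rule (illustrative, not the exact models from the references below), the following PyTorch module gathers each cell's neighborhood with a depthwise convolution and applies a small per-cell network, with the stochastic per-cell update used in Growing NCA.

```python
import torch
import torch.nn as nn

class MinimalNCA(nn.Module):
    """Toy Neural Cellular Automaton: perceive the neighborhood, then update each cell."""

    def __init__(self, channels=16, hidden=128):
        super().__init__()
        # Depthwise 3x3 convolution gathers each cell's neighborhood (its "perception").
        self.perceive = nn.Conv2d(channels, channels * 3, kernel_size=3,
                                  padding=1, groups=channels, bias=False)
        # Per-cell MLP (1x1 convolutions) maps the perception to a state update.
        self.update = nn.Sequential(
            nn.Conv2d(channels * 3, hidden, kernel_size=1),
            nn.ReLU(),
            nn.Conv2d(hidden, channels, kernel_size=1),
        )

    def forward(self, state, fire_rate=0.5):
        dx = self.update(self.perceive(state))
        # Stochastic update: each cell fires independently, as in Growing NCA.
        mask = (torch.rand_like(state[:, :1]) < fire_rate).float()
        return state + dx * mask

# One step on a 32x32 grid with 16-channel continuous cell states, grown from a single seed.
nca = MinimalNCA()
state = torch.zeros(1, 16, 32, 32)
state[:, :, 16, 16] = 1.0
state = nca(state)
```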

If you’re interested in learning more about what Neural Cellular Automata (NCA) models can do, there are several examples you can check out. Some of the tasks that NCA models have been used for include generating images and textures, synthesizing videos, and even classifying images.

Here are a few references to explore:

Growing Neural Cellular Automata: https://distill.pub/2020/growing-ca/

Self-Organizing Textures: https://distill.pub/selforg/2021/textures/

DyNCA: Real-Time Dynamic Texture Synthesis Using Neural Cellular Automata: https://dynca.github.io/

Self-classifying MNIST Digits: https://distill.pub/2020/selforg/mnist/

In this project, we will continue exploring NCA models and their potential for synthesizing images, videos, audio, and 3D objects. Together, we will brainstorm and select a specific topic for your project that aligns with your interests and goals. This project involves a lot of exploration and is an exciting opportunity to develop your research and critical thinking skills.

Deliverables:

  • Code, well cleaned up and easily reproducible.
  • Written Report, explaining the literature and steps taken for the project.

Prerequisites:

  • Python and PyTorch. 
  • Experience with Deep Learning methods and Convolutional Networks

Level: M.Sc. and B.Sc.

Supervisor: Ehsan Pajouheshgar (ehsan.pajouheshgar [at] epfl [dot] ch)

Description:

Diffusion models recently became state-of-the-art for generating images. Yet, some attributes are more challenging to generate than others, e.g., hands might be deformed with a wrong number of fingers, objects may have weird ends, objects can appear merged or duplicated, etc. [1, 2]. In this project, you will research the existing literature on reducing such artifacts and thereby improving the quality of images generated by diffusion models.

Tasks:

–       Is this image generated with a diffusion model? Recently, several works have tried to detect images generated by diffusion models [3, 4, 5]. A first step in this project is to train a model to detect diffusion-generated images (a minimal sketch follows after these tasks). You should ideally use saliency maps / class-activation maps to obtain results with visual explanations, highlighting areas of the image that contain diffusion artifacts.

–       Improving generated images. The main goal of this project is to improve the images generated with diffusion models. A possible solution is to use image-to-image translation methods [6, 7], translating images from the “generated images” domain to the “real images” domain. To further improve the results, you will propose a method that exploits diffusion-specific architectural details, i.e., the fact that images are created gradually from noise.
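As a concrete starting point for the detection task above, here is a minimal sketch of a binary real-vs-generated classifier; the data/real and data/generated folder layout is a hypothetical placeholder, and CAM-based visual explanations would be added on top of the trained model.

```python
import torch
from torch import nn
from torchvision import datasets, models, transforms

# Hypothetical layout: data/real/*.png and data/generated/*.png (two-class image folders).
tfm = transforms.Compose([transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor()])
dataset = datasets.ImageFolder("data", transform=tfm)
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

model = models.resnet18(weights=None)           # in practice, start from pretrained weights
model.fc = nn.Linear(model.fc.in_features, 2)   # real vs. diffusion-generated
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(5):
    for images, labels in loader:
        opt.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        opt.step()
```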

References:

[1] See the channel “failed-diffusion” on the Stable Diffusion Discord, where people share their failures.

[2] Tangermann, V. (2022, October 24). We’re cry-laughing at these “spectacular failures” of ai generated art. Futurism. From https://futurism.com/the-byte/ai-generated-art-failures   

[3] Coccomini, D. A., Esuli, A., Falchi, F., Gennaro, C., & Amato, G. (2023). Detecting Images Generated by Diffusers. arXiv preprint arXiv:2303.05275.

[4] Corvi, R., Cozzolino, D., Zingarini, G., Poggi, G., Nagano, K., & Verdoliva, L. (2022). On the detection of synthetic images generated by diffusion models. arXiv preprint arXiv:2211.00680.

[5] Bird, J. J., & Lotfi, A. (2023). CIFAKE: Image Classification and Explainable Identification of AI-Generated Synthetic Images. arXiv preprint arXiv:2303.14126.

[6] Zhu, J. Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision (pp. 2223-2232).

[7] Alami Mejjati, Y., Richardt, C., Tompkin, J., Cosker, D., & Kim, K. I. (2018). Unsupervised attention-guided image-to-image translation. Advances in neural information processing systems, 31.

Deliverables: At the end of the semester, the student should provide a framework for detecting diffusion-generated images and for improving the quality of diffusion-generated images. Deliverables should include code, well cleaned up and easily reproducible, as well as a written report, explaining the models, the steps taken for the project and the performances of the models.

Prerequisites: Python and PyTorch.

Level: MS research project

Number of students: 1

Supervisor: Martin Nicolas Everaert (martin.everaert [at] epfl.ch)

Image-based rendering can date back to the 1990s. Unlike traditional Computer Graphics rendering, which requires explicit scene geometry and scene texture, image-based rendering renders a scene based on observations of the scene, i.e., photographs taken in the real/synthesised scene.

Given an image collection of a scene under different viewing directions, the method of NeRF can faithfully synthesise novel views that are 3D-consistent. However, it is still an open question of how to render the scene under a novel lighting condition.

There are a few works about relighting a NeRF-like implicit scene representation. There are two steps for this process: scene decomposition and 3D scene extraction from the decomposition. Afterwards, we can relight the 3D representation easily. A common material model assumed for all objects of interests is the Disney BSDF model. However, there are a variety of other materials such as glass or glints that cannot be adequately represented by this assumption.  

In this project, we work with a dedicated type of surface appearance and explore various 3D scene representations, either implicit or explicit, and develop corresponding scene decomposition strategies which would then enable scene novel view inquires and relighting.

Type of work

  • MS Level: semester project / master project ** can be adapted to bachelor project in case of a capable bachelor student.
  • 60% research, 40% development

Prerequisite:

  • Knowledge in deep learning frameworks (e.g., PyTorch or Tensorflow), image processing and Computer Vision.
  • Experience with 3D vision is required.
  • Knowledge with Computer Graphics is recommended.

Supervisor:

 Reference Literature:

Comics Projects

Visual computing on comics is a very exciting yet very challenging problem because of the lack of annotated data, domain shifts, and intra-domain variability. To tackle this problem, we aim to use segmentation, saliency detection, and depth estimation along with style transfer. In the projects that follow, you will broaden your skills by learning how to use state-of-the-art computer vision algorithms on a customized comics dataset. We aim to publish the results at upcoming venues.

Lack of annotated data

The lack of annotated data in the comics domain makes it difficult to train machine learning models for vision tasks such as image classification, segmentation, depth prediction, and saliency prediction. Without annotated data, it is challenging for the model to learn the essential features and patterns to perform the task accurately.

Domain gap

The comics domain has unique properties that are not present in the training data (natural images), resulting in a domain gap between the training data and the comics data. This leads to poor performance and difficulty generalizing when the model is applied to comics data.

Variability within the domain

Different artists and publishers often use unique visual styles and techniques while producing comics. For instance, one artist may represent a chair substantially differently than another, resulting in a variance in the object’s appearance. This type of discrepancy is uncommon in natural images. Therefore, it is challenging for a model to accurately perform tasks such as image classification or segmentation by grouping objects that seem to belong to distinct categories.

Description:

Salient regions are the areas that stand out compared to their surroundings. Saliency prediction aims to predict which areas attract the most attention. However, unlike image classification, the saliency prediction task lacks large-scale annotated datasets. Current approaches for saliency estimation use eye tracking data on natural images for constructing ground truth. In our project, however, we will perform saliency detection on comics pages instead of natural images.

In this project, you will explore the existing saliency prediction methods on comics stimuli. You will test and improve a method proposed for saliency prediction using inter-object relationships. Then, you will evaluate your model by comparing it to state-of-the-art saliency detection models.
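For the evaluation part, two standard saliency metrics can be computed in a few lines; the sketch below (with random placeholder maps) shows Pearson CC against a fixation density map and NSS at discrete fixation locations.

```python
import numpy as np

def saliency_metrics(pred, fixation_density, fixation_points, eps=1e-7):
    """Pearson CC between a predicted saliency map and a fixation density map,
    and NSS of the prediction at discrete fixation locations (list of (row, col) pairs)."""
    p = (pred - pred.mean()) / (pred.std() + eps)
    g = (fixation_density - fixation_density.mean()) / (fixation_density.std() + eps)
    cc = float(np.mean(p * g))                                  # correlation of standardized maps
    nss = float(np.mean([p[r, c] for r, c in fixation_points])) # normalized scanpath saliency
    return {"CC": cc, "NSS": nss}

pred = np.random.rand(100, 150)          # placeholder predicted saliency map
density = np.random.rand(100, 150)       # placeholder ground-truth fixation density
scores = saliency_metrics(pred, density, fixation_points=[(10, 20), (50, 75)])
```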

Tasks:

– Understand the literature and the state of the art

– Test existing methods for saliency prediction on comics

– Improve a proposed saliency prediction method that uses object-character interactions

– Evaluate state-of-the-art saliency detection models on comics data

Prerequisites:

Experience in machine learning and computer vision, experience in Python, experience in deep learning frameworks

Deliverables:

Reproducible code and a written report

Level:

MS semester or thesis project

Type of work:

60% research, 40% development and testing

References:

“DeepComics: Saliency estimation for comics”, Kevin Bannier, Eakta Jain, Olivier Le Meur

K. Khetarpal and E. Jain, “A preliminary benchmark of four saliency algorithms on comic art,” 2016 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Seattle, WA.

Daniel V. Ruiz and Bruno A. Krinski and Eduardo Todt, IDA: Improved Data Augmentation Applied to Salient Object Detection, 2020 SIBGRAPI

Supervisor:

Bahar Aydemir ([email protected])

Description: 

Visual saliency refers to the parts of a scene that capture our attention. Current approaches for saliency estimation use eye tracking data on natural images for constructing ground truth. However, in our project we will perform an analysis of eye tracking data on comics pages instead of natural images. Later, we will use the collected data to estimate saliency in the comics domain.

Tasks:
– Understand the key points of an eye tracking experiment and our setup.

– Conduct an analysis of eye tracking data according to given instructions. 

Deliverables: At the end of the semester, the student should provide the code and a report of the work.

Type of work: 20% research, 80% development and testing

References:

 [1] A. Borji and L. Itti, “Cat2000: A large scale fixation dataset for boosting saliency research,” CVPR 2015 workshop on ”Future of Datasets”, 2015.

 [2] Kai Kunze, Yuzuko Utsumi, Yuki Shiga, Koichi Kise, Andreas Bulling, I know what you are reading: recognition of document types using mobile eye tracking, Proceedings of the 2013 International Symposium on Wearable Computers, September 08-12, 2013, Zurich, Switzerland.

 [3] K. Khetarpal and E. Jain, “A preliminary benchmark of four saliency algorithms on comic art,” 2016 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Seattle, WA.

Level: BS semester project

Supervisor: Bahar Aydemir ([email protected])

It is well known that deep learning models need to be trained with a huge amount of data. To circumvent this limitation when only a small dataset is available, models are pre-trained on large datasets and then fine-tuned on the few available labels from the target domain. The models are usually pre-trained on ImageNet or even larger datasets, either with labels or in a self-supervised manner. Although those models provide strong baselines, the performance of different training schemes varies across scenarios, and there is no guideline for choosing which model to use.

In our project, we will investigate this problem with a novel comics dataset. We will explore a large array of differently pre-trained models, such as supervised and diverse self-supervised models trained on different datasets, and then fine-tune them with our own labeled comics data. Through a thorough analysis of the results, we will provide insights into finding the best model for transfer learning. At the end of the semester, you should therefore be able to give an overview of the existing transfer-learning methods, explain how they work when applied to our comics dataset, and theorize about their applicability beyond a single domain.
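As a minimal sketch of one point in this design space, the snippet below does linear probing on comics data: a pretrained backbone is frozen and only a new classification head is trained (full fine-tuning would instead leave all layers trainable). The backbone choice, the class count, and the random batch are placeholders.

```python
import torch
from torch import nn
from torchvision import models

# Linear probing: freeze a pretrained backbone and train only a new head on the few
# labeled comics images; any pre-training scheme (supervised or self-supervised) could be swapped in.
backbone = models.resnet50(weights="IMAGENET1K_V2")
for p in backbone.parameters():
    p.requires_grad = False
backbone.fc = nn.Linear(backbone.fc.in_features, 10)   # e.g. 10 comics classes (placeholder)

optimizer = torch.optim.AdamW(backbone.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def finetune_step(images, labels):
    optimizer.zero_grad()
    loss = criterion(backbone(images), labels)
    loss.backward()          # gradients flow only into the new head
    optimizer.step()
    return loss.item()

# One toy step with random tensors standing in for a labeled comics batch.
loss = finetune_step(torch.rand(8, 3, 224, 224), torch.randint(0, 10, (8,)))
```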

Task:

–      Evaluate differently pre-trained models on comic data.

–      Pre-train the model in a self-supervised learning manner on comic data directly.

–      Establish the possibilities and limitations of transfer learning on comic data and other domains.

Prerequisites:

Knowledge of Python, a deep learning framework (TensorFlow or PyTorch), and linear algebra is required.

Level:

MS project

Type of work:

20% literature review, 40% research, 40% development and test.

References:

[1] RankMe: Assessing the downstream performance of pretrained self-supervised representations by their rank, Quentin Garrido, Randall Balestriero, Laurent Najman, Yann LeCun

[2] Dive into Self-Supervised Learning for Medical Image Analysis: Data, Models and Tasks, Chuyan Zhang, Yun Gu

[3] Rethinking ImageNet Pre-training, Kaiming He, Ross Girshick, Piotr Dollár

[4] Masked Autoencoders Are Scalable Vision Learners, Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick

Supervisors: Peter Grönquist ([email protected]), Tong Zhang ([email protected])

Description (Master Semester Project open to EPFL students)


In this project, you will do a comparative analysis of the existing literature on monocular depth estimation by applying it to comics images. The state-of-the-art monocular depth estimation methods [1][2][3] are applied to natural photographs, but they severely underperform when applied to comics images. In this project you would (a) apply all the existing depth methods directly to the comics images and (b) use an existing translation method to translate the comics images into real ones, and then apply all the existing depth baselines [4]. Following this, you will analyse the results qualitatively and quantitatively.
You may contact the supervisor at any time should you want to discuss the idea further.
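For the quantitative analysis, the commonly used monocular-depth metrics can be computed as in the sketch below (with random placeholder maps); median scaling is one common way to resolve the scale ambiguity of monocular predictions, and the exact protocol should follow the papers being compared.

```python
import numpy as np

def depth_metrics(pred, gt, eps=1e-6):
    """Absolute relative error, RMSE, and delta < 1.25 accuracy between a predicted
    and a ground-truth depth map, after median scaling of the prediction."""
    pred = pred * (np.median(gt) / (np.median(pred) + eps))
    abs_rel = np.mean(np.abs(pred - gt) / (gt + eps))
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    ratio = np.maximum(pred / (gt + eps), gt / (pred + eps))
    delta1 = np.mean(ratio < 1.25)
    return {"abs_rel": abs_rel, "rmse": rmse, "delta1": delta1}

# Random placeholders standing in for a predicted depth map and its ground truth.
metrics = depth_metrics(np.random.rand(192, 256), np.random.rand(192, 256) + 0.5)
```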

References
[1] Monocular Depth Estimation: A Survey https://arxiv.org/abs/1901.09402
[2] Monocular Depth Estimation Based On Deep Learning: An Overview https://arxiv.org/abs/2003.06620
[3] Monocular Depth Estimation Using Deep Learning: A Review https://www.mdpi.com/1424-8220/22/14/5353/pdf
[4] Estimating Image Depth in the Comics Domain https://arxiv.org/abs/2110.03575

Type of Work (e.g., theory, programming)
30% research, 70% implementation and testing

Prerequisites
Good experience in deep learning, experience in Python and PyTorch. Experience in statistical analysis (e.g., computing absolute errors or RMSE between the prediction and the ground truth) to report the performance evaluations of the models.
Models will run on RunAI. (We will guide you on how to use RunAI; no prior knowledge required.)

Supervisor
Deblina BHATTACHARJEE ([email protected])

Description (Master Semester Project or Master Thesis Project open to EPFL students)

In this project, you will research the existing literature on multi-task learning and self-supervised approaches [refer to 1]. Further, you will read literature about how to use these two in conjunction, and then implement the method [refer to 2]. You will have to think of a novel idea to improve on this method, for which you will receive guidance from your supervisor. Thereafter, you will evaluate your method against the existing baselines. In doing so, you will learn about the various approaches to improving joint training on dense vision tasks.
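As a minimal illustration of joint training on dense tasks (not the method of [1, 2]), the sketch below uses one shared encoder with two task heads and a weighted sum of per-task losses; the architecture, the task choices, and the loss weights are placeholders.

```python
import torch
from torch import nn

class TwoHeadModel(nn.Module):
    """Toy multi-task network: one shared encoder, one head per dense task
    (here segmentation and depth, purely as illustrative task choices)."""

    def __init__(self, n_classes=21):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        self.seg_head = nn.Conv2d(64, n_classes, 1)
        self.depth_head = nn.Conv2d(64, 1, 1)

    def forward(self, x):
        shared = self.encoder(x)
        return self.seg_head(shared), self.depth_head(shared)

model = TwoHeadModel()
images = torch.rand(2, 3, 128, 128)
seg_logits, depth = model(images)
seg_gt = torch.randint(0, 21, (2, 128, 128))
depth_gt = torch.rand(2, 1, 128, 128)
# Joint objective: a weighted sum of per-task losses (the weights are a key design choice).
loss = nn.CrossEntropyLoss()(seg_logits, seg_gt) + 0.5 * nn.L1Loss()(depth, depth_gt)
loss.backward()
```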

Bonus (applicable to Master thesis students only): We will work with transformer frameworks, and working on the explainability/interpretability of transformers will fetch you extra points.

You may contact the supervisor at any time should you want to discuss the idea further.

Reference

[1] Multi-Task Learning for Dense Prediction Tasks: A Survey, Simon Vandenhende et al.

[2] Multi-task Self-Supervised Visual Learning, Doersch and Zisserman.

Type of Work (e.g., theory, programming)

50% research, 50% development and testing

Prerequisites

Experience in deep learning, experience in Python and PyTorch. Experience in statistical analysis to report the performance evaluations of the models.

Models will run on RunAI. (We will guide you on how to use RunAI; no prior knowledge required.)

Supervisor(s)

Deblina BHATTACHARJEE ([email protected])

Description: Diffusion models recently became state-of-the-art for generating images. These models are trained as follows: images from the training set are deteriorated with Gaussian noise at different levels, and the model is tasked to gradually denoise these images. At inference time, to generate an image, one starts with noise and gradually denoises it. In the particular case of generating images with text-to-image diffusion models, one commonly relies on classifier-free guidance [1] to improve the alignment of the generated image with the textual prompt. Another possible way to improve this text-image alignment is to use CLIP guidance [2, “classifier guidance” in 3]. The first solution, classifier-free guidance, requires two forward passes in each inference step, one with the textual prompt and one without, which doubles the time needed to generate an image. The second solution, CLIP guidance, is rarely used in practice with text-to-image diffusion models, because it requires additional forward and backward passes through the CLIP model, which makes image generation significantly slower, e.g., 40 seconds instead of 5 seconds. In this project, you will use these guidance terms at training time instead of inference time. You will therefore fine-tune diffusion models with additional terms in the loss.
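For reference, classifier-free guidance at inference time can be sketched as follows; eps_model is a placeholder for the noise-prediction network of a text-to-image diffusion model, and the toy lambda at the end only makes the snippet runnable.

```python
import torch

def classifier_free_guidance(eps_model, x_t, t, text_emb, null_emb, guidance_scale=7.5):
    """Classifier-free guidance at inference time [1]: two forward passes per step
    (conditional and unconditional), combined with the guidance scale."""
    eps_cond = eps_model(x_t, t, text_emb)       # pass with the textual prompt
    eps_uncond = eps_model(x_t, t, null_emb)     # pass with the empty / null prompt
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy stand-in for the noise-prediction UNet so the snippet runs end to end.
eps_model = lambda x, t, c: torch.tanh(x + c.mean())
guided_eps = classifier_free_guidance(eps_model,
                                      x_t=torch.randn(1, 4, 64, 64),
                                      t=torch.tensor([500]),
                                      text_emb=torch.randn(77, 768),
                                      null_emb=torch.zeros(77, 768))
```

One way to read the first task below is that the fine-tuned model should learn to predict this combined output with a single conditional forward pass.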

Tasks:

–       Distillation of classifier-free guidance. You will propose modifications to the training loop of diffusion models to distill the classifier-free guidance at training time instead of inference time. You will then fine-tune a pretrained diffusion model with these modifications and evaluate the results along several axes: generation time, image quality, and text-image alignment.

–       Feed-forward CLIP-guided diffusion. You will propose modifications to the training loop of diffusion models to account for CLIP-guidance at training time. You will then fine-tune a pretrained diffusion model with these modifications and evaluate the results along several axes: generation time, image quality, and text-image alignment.

References:

[1] Ho, J., & Salimans, T. (2022). Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598.

[2] Suraj Patil (2022). CLIP guidance with Stable Diffusion. From https://github.com/huggingface/diffusers/blob/main/examples/community/clip_guided_stable_diffusion.py

[3] Dhariwal, P., & Nichol, A. (2021). Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34, 8780-8794.

Deliverables: At the end of the semester, the student should provide a framework to fine-tune and evaluate text-to-image models with feed-forward guidance. Deliverables should include code, well cleaned up and easily reproducible, as well as a written report, explaining the models, the steps taken for the project and the performances of the models.

Prerequisites: Python and PyTorch.

Level: BS or MS research project

Number of students: 1

Supervisor: Martin Nicolas Everaert (martin.everaert [at] epfl.ch)