Research Projects are open to EPFL students.
Description:
- Use of different voices for different characters
- Automated understanding of the emotions suggested by a given part of the dialogue
- Adaptation of the generated audio to the detected emotions (e.g., emphasis, speaking rate, pitch, volume)
Prerequisites:
- Coding skills in Python
- Experience or interest in Natural Language Processing
- Experience or interest in Generative AI
- Experience or interest in Text-to-Speech models
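As a rough illustration of the emotion-to-prosody adaptation described above, here is a minimal Python sketch; the emotion labels, the prosody values, and the use of pyttsx3 as the synthesis backend are placeholders only (a neural TTS model would be needed for pitch and emphasis control).

```python
import pyttsx3

# Illustrative mapping from a detected emotion to prosody settings; the values are made up.
EMOTION_PROSODY = {
    "neutral": {"rate": 175, "volume": 0.8},
    "excited": {"rate": 210, "volume": 1.0},
    "sad":     {"rate": 140, "volume": 0.6},
}

def speak(text, emotion="neutral"):
    engine = pyttsx3.init()
    prosody = EMOTION_PROSODY.get(emotion, EMOTION_PROSODY["neutral"])
    engine.setProperty("rate", prosody["rate"])       # speaking rate (words per minute)
    engine.setProperty("volume", prosody["volume"])   # volume in [0.0, 1.0]
    engine.say(text)
    engine.runAndWait()

speak("I can't believe we made it!", emotion="excited")
```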
Supervisors: [email protected] ; [email protected]
Description
Single-image super-resolution is the task of increasing an image's resolution by inferring its missing high frequencies. For this task, neural networks are commonly trained on large datasets of natural images [1, 2]. While these models produce impressive results on natural images, they fail to generalize to unseen domains such as drone images. To tackle this issue, [3] proposed DSR, a drone super-resolution dataset along with a baseline network. Each scene is acquired at multiple heights with two focal lengths on two different cameras (tele camera: the ground-truth HR image / normal camera: the LR image). In this project, we will work on improving the proposed approach. Possible directions of improvement are: (1) the domain gap between the two cameras, for which we may train a camera-to-camera mapping [4, 5]; (2) exploiting images at different heights to improve the current results; (3) proposing new architectures [6]; (4) working on a lightweight distillation of the model that could run in real time.
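For orientation, a paired training step on DSR-style data might look as follows; this is a minimal sketch in which the model, optimizer, and data loading are left abstract, and the L1 pixel loss is just one common choice.

```python
import torch.nn as nn

def train_step(model, optimizer, lr_img, hr_img, loss_fn=nn.L1Loss()):
    """One paired training step (sketch): `lr_img` is the normal-camera crop,
    `hr_img` the aligned tele-camera ground truth from DSR."""
    optimizer.zero_grad()
    sr = model(lr_img)            # super-resolved prediction
    loss = loss_fn(sr, hr_img)    # pixel loss against the tele-camera ground truth
    loss.backward()
    optimizer.step()
    return loss.item()
```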
References
[1] SwinIR: Image Restoration Using Swin Transformer, 2021
[2] Zoom To Learn, Learn To Zoom, 2019
[3] DSR Towards Drone Image Super-Resolution, 2022
[4] Cross-Camera Convolutional Color Constancy, 2021
[5] Semi-Supervised Raw-to-Raw Mapping, 2021
[6] Image Super-Resolution via Iterative Refinement, 2021
Deliverables
Project report and reproducible code.
Prerequisites
Experience with Python and PyTorch; preferably some expertise in image processing and machine learning.
Level
MS semester project
Type of Work
60 % research 40 % implementation
Supervisors
Raphael Achddou ([email protected])
Description
Low-light image enhancement is a critical image restoration task for camera and smartphone manufacturers. Classic approaches exploit bursts of photos and apply collaborative filtering to the acquired stack of images; however, this is time-consuming and not very efficient. In 2018, [1] proposed to learn the full image processing pipeline for RAW low-light images and released a large RAW image dataset. While this work was impactful, acquiring such a dataset is a tedious and time-consuming task. To tackle this issue, [2] proposes to synthesize the distortion with a complex but physically sound distortion model. However, this paper assumes that the noise distribution is the same in every pixel, while noise is very likely spatially varying for a specific sensor. Another approach consists in learning the noise distribution by means of a generative network. While this has been done successfully for standard RAW noise [3, 4, 5], no particular effort has been dedicated to extreme low-light settings. In this project, we will apply these techniques to model extreme low-light noise and try out diffusion models for noise generation [6]. We will validate our noise model by training a low-light image enhancement model as well as by computing statistical scores.
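To make the noise-modeling part concrete, here is a minimal sketch of a heteroscedastic (Poisson + Gaussian + row) noise synthesizer on a linear RAW plane; the parameter values are illustrative, [2] calibrates them per sensor, and this project would additionally let them vary spatially or learn them with a generative model.

```python
import numpy as np

def synthesize_low_light_raw(clean_raw, gain=16.0, read_std=2.0, row_std=0.5, black_level=0.0):
    """Sketch of low-light noise synthesis on a single linear RAW (Bayer) plane.
    clean_raw: 2D array of clean linear intensities; all parameters are illustrative."""
    photons = np.random.poisson(np.clip(clean_raw, 0, None) / gain) * gain   # signal-dependent shot noise
    read = np.random.normal(0.0, read_std, clean_raw.shape)                  # signal-independent read noise
    row = np.random.normal(0.0, row_std, (clean_raw.shape[0], 1))            # row / banding noise
    return photons + read + row + black_level
```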
References
[1] Learning to See in the Dark, 2018
[2] A Physics-Based Noise Formation Model for Extreme Low-Light Raw Denoising, 2020
[3] Modeling sRGB Camera Noise with Normalizing Flows, 2022
[4] C2n: Practical generative noise modeling for real-world denoising, 2021
[5] Noise Modeling with Conditional Normalizing Flows, 2019
[6] Realistic Noise Synthesis with Diffusion Models, 2023
Deliverables
Project report and reproducible code.
Prerequisites
Experience with Python and PyTorch; preferably some expertise in image processing and machine learning.
Level
MS semester project
Type of Work
60 % research 40 % implementation
Supervisors
Raphael Achddou ([email protected])
Description:
Startup company Innoview Sàrl has developed software to recover, with a smartphone, a watermark hidden in a grayscale image. The present project aims at improving the registration precision between an image captured with a smartphone and a reference image. The robustness to changes in the acquisition conditions should also be improved. Registrations carried out with different types of registration symbols or patterns are to be characterized and compared.
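For context, feature-based registration between a captured photo and the reference image typically follows the pattern below; this is a Python/OpenCV sketch only (the actual prototypes are in Java Android / Matlab), and ORB matching with a RANSAC homography is one standard choice among many.

```python
import cv2
import numpy as np

def register_to_reference(captured_path, reference_path):
    """Estimate the homography mapping the captured photo onto the reference image (sketch)."""
    captured = cv2.imread(captured_path, cv2.IMREAD_GRAYSCALE)
    reference = cv2.imread(reference_path, cv2.IMREAD_GRAYSCALE)
    orb = cv2.ORB_create(2000)
    k1, d1 = orb.detectAndCompute(captured, None)
    k2, d2 = orb.detectAndCompute(reference, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(d1, d2), key=lambda m: m.distance)[:200]
    src = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)        # robust homography captured -> reference
    warped = cv2.warpPerspective(captured, H, reference.shape[::-1])  # resample into the reference frame
    return H, warped
```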
Deliverables: Report and running prototypes (Android and possibly PC).
Prerequisites:
– knowledge of image processing / computer vision
– Coding skills in Java Android and/or Matlab
Level: BS or MS semester project
Supervisors:
Dr Romain Rossier, Innoview Sàrl, [email protected], tel 078 664 36 44
Prof. Roger D. Hersch, BC110, [email protected], cell: 077 406 27 09
Description:
Supervised image segmentation requires strict class definitions and costly annotations from humans. Since annotation is costly, supervision is usually limited to small datasets. Methods trained on small datasets do not generalize to unseen classes and unseen image domains. Recent advances in vision foundation models [1, 2] and self-supervised learning [3] provide semantically rich features learned on large datasets. Even though class definitions and annotations are limited or missing, these methods [1, 2, 3] can be combined with unsupervised learning algorithms to solve the segmentation task without requiring segmentation annotations. Our aim in this project is to develop and use unsupervised learning methods to segment images from various domains (e.g., synthetic and real image domains).
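A minimal sketch of the unsupervised step, assuming per-patch features of shape [h*w, C] have already been extracted from a frozen foundation model such as [1] or [3] (feature extraction itself is left out):

```python
import numpy as np
from sklearn.cluster import KMeans

def segment_from_features(features, h, w, n_segments=8):
    """Cluster per-patch descriptors into pseudo-segments without any annotations.
    `features`: [h*w, C] array; how it is obtained depends on the chosen backbone."""
    labels = KMeans(n_clusters=n_segments, n_init=10).fit_predict(features)
    return labels.reshape(h, w)   # coarse segmentation map, one label per patch
```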
References:
[1] Radford, Alec, et al. “Learning transferable visual models from natural language supervision.” International conference on machine learning. PMLR, 2021
[2] Kirillov, Alexander, et al. “Segment anything.” arXiv preprint arXiv:2304.02643 (2023).
[3] Melas-Kyriazi, Luke, et al. “Deep spectral methods: A surprisingly strong baseline for unsupervised semantic segmentation and localization.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
Deliverables: Report and reproducible implementations
Prerequisites: Experience with Deep Learning, Computer Vision, PyTorch
Level: MS semester project
Type of work: 40% research, 60% implementation
Supervisors: Baran Ozaydin ([email protected])
Description:
Startup company Innoview Sàrl has developed software for creating compact custom codes that enable distinguishing an original from a counterfeited document. To detect counterfeits, the code's parameters need to be adapted to the printing environment: paper type, printer type, printer resolution, and ink type. The goal of the project is to find the best parameters for different kinds of printing environments.
Deliverables: Report and running prototype (Matlab).
Prerequisites:
– knowledge of image processing / computer vision
– basic coding skills in Matlab
Level: BS or MS semester project
Supervisor:
Prof. Roger D. Hersch, BC 110, [email protected], cell: 077 406 27 09

Creative images generated by DALL·E 2.
3D objects created with DreamFusion from text prompts (DreamFusion, Ben Poole et al.). This project is in this line of work and aims to improve the quality of generated 3D content.
Description
In recent years, a novel class of generative models known as Diffusion Models (DMs) has emerged. These models define a forward process that gradually adds small amounts of Gaussian noise to data samples, and a learnable reverse process (the generation process) that gradually removes the noise. When applied to natural images, this modeling approach has been shown to outperform state-of-the-art Generative Adversarial Networks (GANs).
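The forward (noising) process has a simple closed form; the sketch below assumes a precomputed tensor `alphas_cumprod` holding the cumulative products of the noise schedule.

```python
import torch

def forward_diffusion(x0, t, alphas_cumprod):
    """Closed-form DDPM forward process (sketch):
    x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps, with eps ~ N(0, I)."""
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)          # per-sample cumulative schedule value
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    return x_t, noise                                    # the model learns to predict `noise` from x_t
```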
Though 2D images are natural for screens and printing, it would be desirable to generate 3D content that can be viewed from different angles. However, collecting 3D data is tedious, unlike 2D photos that can be captured daily with smartphones, and naively scaling 2D models to 3D is usually computationally infeasible.
In this project, we will use 2D image supervision for 3D content creation. We will guide 3D content creation with 2D supervision through several methods, such as rasterization and volume rendering.
- Stable diffusion: https://github.com/Stability-AI/stablediffusion
- 3D content: https://dreamfusion3d.github.io/
Deliverables
- Code, well cleaned up and easily reproducible.
- Written Report, explaining the literature and steps taken for the project.
Prerequisites
- Python and PyTorch.
- Experience with Deep Learning methods and Convolutional Networks
Level: Master Student
- Supervisor: Yufan Ren (website), Email: [email protected]

Motivated students are welcome to join our team for the next DeepFakes detection competition (August-September) at ICCV'23. Photo: the ceremony of the Trusted Media DeepFakes detection challenge in 2022. We also hosted the DeepFakes stand at EPFL's Open Day.
Description:
DeepFakes are fake images (especially face images) synthesized with machine learning algorithms. Since first appearing only five years ago, DeepFake generation technology has been evolving in several aspects: quality, speed, and data efficiency. As facial expression and identity are central to social interaction and to trust in media, there is a persistent interest in developing robust and efficient DeepFake detection algorithms.
DeepFake videos are the video counterparts of DeepFake images. Moving from images to videos opens up several new possibilities (much as we witnessed going from image understanding to video understanding). For instance, video is time-series data that offers consistency between frames, but faces in DeepFake videos might not move as naturally as in real videos.
For inquiries, feel free to drop an email : )
Level of work:
- MS Level, semester project / master project
Prerequisite:
- Knowledge in deep learning frameworks (e.g., PyTorch or TensorFlow), image processing, and Computer Vision
Supervisor:
- Yufan Ren, website
Type of work:
- 50% research, 50% development
Deliverables:
- Code, well cleaned up and easily reproducible
- Written report, explaining the literature and steps taken for the project
Introduction:
Recent text-to-2D models such as Stable Diffusion, trained on billions of text-image pairs, have achieved stunning image synthesis quality. However, such models require large-scale datasets and industrial-scale training, a process that is hard to carry over to 3D scene generation.
To deal with this problem, DreamFusion, the first work to bring diffusion models to 3D generation with the help of NeRF, accelerates this process and creates view-consistent objects with mesh models suited to a generalized forward rendering pipeline. See https://dreamfusion3d.github.io/ for a collection of generated results.
This project focuses on text-driven 3D content generation, looking at possibilities of exploiting pretrained 2D diffusion models or similar architectures with 3D vision priors to accomplish the following tasks: 1. create larger-scale scenes with realistic environment surroundings; 2. generate specific materials with particular viewing effects that require a 3D understanding of scene intrinsics, such as transparent objects; 3. editable texture / stylization for scene meshes controlled via text input. We will be looking at CLIP- and diffusion-based models. A work close to this project can be found in Text2Mesh in the reference section.
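At the core of DreamFusion-style generation is Score Distillation Sampling (SDS), which back-propagates a frozen 2D diffusion prior into a differentiable render. Below is a rough sketch under stated assumptions: `eps_model` is a placeholder for whichever pretrained noise-prediction network is used, a 1000-step schedule is assumed, and the weighting is one common choice.

```python
import torch
import torch.nn.functional as F

def sds_loss(rendered, text_emb, eps_model, alphas_cumprod):
    """One SDS step (sketch). `rendered` is a differentiable render of the 3D scene, [B, C, H, W]."""
    B = rendered.shape[0]
    t = torch.randint(50, 950, (B,), device=rendered.device)        # random timestep (1000-step schedule assumed)
    noise = torch.randn_like(rendered)
    a_bar = alphas_cumprod[t].view(B, 1, 1, 1)
    x_t = a_bar.sqrt() * rendered + (1 - a_bar).sqrt() * noise      # forward-noise the render
    with torch.no_grad():
        eps_pred = eps_model(x_t, t, text_emb)                      # frozen 2D diffusion prior
    grad = (1 - a_bar) * (eps_pred - noise)                         # SDS gradient w.r.t. the render
    # Trick: this MSE has exactly `grad` as its gradient with respect to `rendered`.
    target = (rendered - grad).detach()
    return 0.5 * F.mse_loss(rendered, target, reduction="sum") / B
```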
Type of Work:
- MS Level: semester project / master project
- 60% research, 40% development
- We encourage collaboration / teamwork so this project can take 2 students working on different aspects of the research problem.
Supervisor:
- Dongqing Wang, [email protected]
Prerequisite:
- Have taken a Machine Learning course and a Computer Vision course.
- Have sufficient Pytorch knowledge.
- Students will be working extensively with 3D scene generation and possibly editing, therefore prior experience with 3D vision and / or computer graphics knowledge will be a plus.
- Experience with diffusion models and / or CLIP will be a plus.
Reference Literature:
- Poole, Ben, et al. “Dreamfusion: Text-to-3d using 2d diffusion.” arXiv preprint arXiv:2209.14988 (2022).
- Chen, Yongwei, et al. “TANGO: Text-driven Photorealistic and Robust 3D Stylization via Lighting Decomposition.” arXiv preprint arXiv:2210.11277 (2022).
- Rombach, Robin, et al. “High-resolution image synthesis with latent diffusion models.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
- Radford, Alec, et al. “Learning transferable visual models from natural language supervision.” International Conference on Machine Learning. PMLR, 2021.
- Michel, Oscar, et al. “Text2mesh: Text-driven neural stylization for meshes.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
- Haque, Ayaan, et al. “Instruct-NeRF2NeRF: Editing 3D Scenes with Instructions.” arXiv preprint arXiv:2303.12789 (2023).
- https://github.com/ashawkey/stable-dreamfusion
- https://github.com/CompVis/stable-diffusion
Learned Neural Radiance Fields as scene representations show nice geometry reconstruction but require more than 30 input images. (NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis, Ben Mildenhall et al.)

Generalizable neural representations work with as few as three input views using generalizable priors. (SparseNeuS: Fast Generalizable Neural Surface Reconstruction from Sparse Views, Long et al.) This project is in this research line and aims to improve it further.

Our previous work in this line of research, which aims at achieving finer details, will be presented at CVPR'23.
Description:
Neural Rendering is a branch of rendering technique that replaces one or more parts of the rendering pipeline with neural networks. The inductive bias of neural networks has been proven helpful in many Neural Rendering tasks, such as novel view synthesis (NeRF), surface reconstruction (volSDF), and material acquisition (NeRD).
In this project, we are interested in the generalization ability of Neural Rendering, e.g., given a set of images, infer novel views directly. There are several benefits of using generalizable features. Firstly, we skip the long training procedure with a fast-forward inference. Secondly, learnable features are beneficial in sparse input cases. Thirdly, the framework enables further optimization to improve quality.
We will build a generalizable rendering model based on existing network designs such as IBRNet and MVSNet, and train it using a pixel loss, a depth loss, and a patch warping loss.
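For reference, NeRF-style models composite per-sample densities and colors along each ray via volume rendering; a minimal sketch of that compositing step:

```python
import torch

def composite(sigmas, colors, deltas):
    """Volume rendering along rays (sketch).
    sigmas: [N_rays, N_samples] densities, colors: [N_rays, N_samples, 3], deltas: inter-sample distances."""
    alpha = 1.0 - torch.exp(-sigmas * deltas)                      # opacity of each segment
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=-1), dim=-1
    )[:, :-1]                                                      # transmittance up to each sample
    weights = alpha * trans
    rgb = (weights.unsqueeze(-1) * colors).sum(dim=1)              # expected color per ray
    return rgb, weights
```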
Level
MS Level: semester project/master project
Prerequisite:
Knowledge in deep learning frameworks (e.g., PyTorch), image processing, and Computer Vision.
Supervisors:
- Yufan Ren (website), Email: [email protected]
Type of work:
50% research, 50% development
References:
[1] NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
Description:
Startup company Innoview Sàrl has developed software to recover, with a smartphone, a watermark hidden in a grayscale image. The software needs to be adapted to run in C++ on diverse platforms, such as Android and PC. Performance tests need to be carried out for different printing devices and parameter settings.
Deliverables: Report and running prototypes (Android and PC).
Prerequisites:
– knowledge of image processing / computer vision
– Coding skills in C++ and Java Android
Level: BS or MS semester project
Supervisors:
Dr Romain Rossier, Innoview Sàrl, [email protected], tel 078 664 36 44
Prof. Roger D. Hersch, BC110, [email protected], cell: 077 406 27 09
Description:
Startup company Innoview Sàrl has developed software to recover watermarks hidden in grayscale images. Appropriate parameter settings enable the detection of counterfeits. The goal of the project is to define optimal parameters for different sets of printing conditions (resolution, type of paper, type of printing device, complexity of the hidden watermark, etc.). The project involves tests on a large data set and appropriate statistics.
Deliverables: Report and running prototype (Android, Matlab).
Prerequisites:
– knowledge of image processing / computer vision
– basic coding skills in Matlab and Java Android
Level: BS or MS semester project
Supervisors:
Dr Romain Rossier, Innoview Sàrl, [email protected], tel 078 664 36 44
Prof. Roger D. Hersch, BC110, [email protected], cell: 077 406 27 09
Description:
Learning joint representations for language and vision [1], generating images from text [2], and describing an image via text [3] have led to interesting results and research applications. However, these methods operate on a global level, i.e., a generated image matches a sentence or a generated text matches an image. Our goal in this project is to describe smaller regions of an image with words. We will use state-of-the-art diffusion models and language-pretrained vision models to represent an image as words.
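One naive baseline for assigning words to regions is to score crops of the image against a candidate vocabulary with a pretrained CLIP model [1]; a rough sketch is below, where the vocabulary, the image path, and the grid of crops are all placeholders.

```python
import torch
import clip                      # OpenAI CLIP package
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

words = ["a dog", "a tree", "the sky", "a car"]                    # illustrative candidate vocabulary
image = Image.open("example.jpg")                                  # hypothetical input image
regions = [image.crop((0, 0, 224, 224)), image.crop((224, 0, 448, 224))]   # naive grid of regions

with torch.no_grad():
    text_feat = model.encode_text(clip.tokenize(words).to(device))
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    for i, crop in enumerate(regions):
        img_feat = model.encode_image(preprocess(crop).unsqueeze(0).to(device))
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        best = (img_feat @ text_feat.T).argmax().item()            # most similar word for this region
        print(f"region {i}: {words[best]}")
```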
References:
[1] Radford, Alec, et al. “Learning transferable visual models from natural language supervision.” International conference on machine learning. PMLR, 2021.
[2] Rombach, Robin, et al. “High-resolution image synthesis with latent diffusion models.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
[3] Gal, Rinon, et al. “An image is worth one word: Personalizing text-to-image generation using textual inversion.” arXiv preprint arXiv:2208.01618 (2022).
Deliverables: Report and reproducible implementations
Prerequisites: Experience with Deep Learning, Computer Vision, PyTorch
Level: MS semester project
Type of work: 50% research, 50% implementation
Supervisors: Baran Ozaydin ([email protected])
Description: Neural Cellular Automata (NCA) models are a type of computational model that extends Conway’s Game of Life, a classic example of a cellular automaton. While the Game of Life operates on a grid of cells with discrete states (either “alive” or “dead”), NCA models operate on a multi-dimensional grid with a continuous range of states. NCA models combine the strengths of cellular automata and neural networks to create a powerful tool for simulating and understanding complex systems.
In an NCA model, the update rules for each cell’s state are determined by a neural network. The neural network takes as input the current states of the cell and its neighbors and produces an output that determines the new state of the cell. By incorporating neural networks into cellular automata models, NCA models have the potential to learn and simulate complex systems, such as biological or physical systems.
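A minimal NCA update step, in the spirit of the Growing NCA work referenced below, typically combines a fixed perception stage (identity + Sobel filters) with a small learned per-cell update rule and a stochastic update mask. The sketch below is one such variant; the channel counts and fire rate are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MinimalNCA(nn.Module):
    """One NCA step (sketch): fixed Sobel perception + 1x1-conv update rule + stochastic updates."""
    def __init__(self, channels=16, hidden=128):
        super().__init__()
        sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]) / 8.0
        ident = torch.zeros(3, 3); ident[1, 1] = 1.0
        kernels = torch.stack([ident, sobel_x, sobel_x.t()])              # identity + x/y gradients
        self.register_buffer("kernels", kernels.repeat(channels, 1, 1).unsqueeze(1))
        self.rule = nn.Sequential(nn.Conv2d(channels * 3, hidden, 1), nn.ReLU(),
                                  nn.Conv2d(hidden, channels, 1))
        nn.init.zeros_(self.rule[-1].weight)                              # start as a "do nothing" rule

    def forward(self, state, fire_rate=0.5):
        perception = F.conv2d(state, self.kernels, padding=1, groups=state.shape[1])
        update = self.rule(perception)                                    # learned per-cell update
        mask = (torch.rand_like(state[:, :1]) < fire_rate).float()        # asynchronous, stochastic updates
        return state + update * mask
```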
If you’re interested in learning more about what Neural Cellular Automata (NCA) models can do, there are several examples you can check out. Some of the tasks that NCA models have been used for include generating images and textures, synthesizing videos, and even classifying images.
Here are a few references to explore:
Growing Neural Cellular Automata: https://distill.pub/2020/growing-ca/
Self-Organizing Textures: https://distill.pub/selforg/2021/textures/
DyNCA: Real-Time Dynamic Texture Synthesis Using Neural Cellular Automata
Self-classifying MNIST Digits: https://distill.pub/2020/selforg/mnist/
In this project, we will continue exploring NCA models and their potential for synthesizing images, videos, audio, and 3D objects. Together, we will brainstorm and select a specific topic for your project that aligns with your interests and goals. This project involves a lot of exploration and is an exciting opportunity to develop your research and critical thinking skills.
Deliverables:
- Code, well cleaned up and easily reproducible.
- Written Report, explaining the literature and steps taken for the project.
Prerequisites:
- Python and PyTorch.
- Experience with Deep Learning methods and Convolutional Networks
Level: M.Sc. and B.Sc.
Supervisor: Ehsan Pajouheshgar (ehsan.pajouheshgar [at] epfl [dot] ch)
Description:
Diffusion models recently became state-of-the-art for generating images. Yet, some attributes are more challenging to generate than others: hands might be deformed with a wrong number of fingers, objects may have weird ends, objects can appear merged or duplicated, etc. [1, 2]. In this project, you will research the existing literature on reducing such artifacts and thereby improving the quality of images generated by diffusion models.
Tasks:
– Is this image generated with a diffusion model? Recently, several works have tried to detect images generated by diffusion models [3, 4, 5]. A first step of this project is to train a model to detect diffusion-generated images. You should ideally use the concepts of saliency maps / class-activation maps to produce results with visual explanations, highlighting areas of the image that contain diffusion artifacts (see the sketch after this list).
– Improving generated images. The main goal of this project is to improve the images generated with diffusion models. A possible solution is to use image-to-image translation methods [6, 7], translating images from the “generated images” domain to the “real images” domain. To further improve the results, you will propose a method that includes diffusion-specific architecture details, i.e., knowing that images are created gradually from noise.
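As a starting point for the detection task above, one could fine-tune a standard classifier and attach a Grad-CAM-style explanation map. The backbone choice and the two-class head below are assumptions for this sketch, not methods taken from [3-5].

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Hypothetical real-vs-generated detector: ResNet-50 with a 2-way head (0 = real, 1 = generated).
model = models.resnet50(weights="DEFAULT")          # torchvision >= 0.13 assumed
model.fc = torch.nn.Linear(model.fc.in_features, 2)

feats = {}
def hook(module, inp, out):
    feats["act"] = out
    out.register_hook(lambda g: feats.update(grad=g))   # keep gradient of the last conv features
model.layer4.register_forward_hook(hook)

def gradcam(x, target_class=1):
    """Grad-CAM style map highlighting regions that drive the 'generated' prediction."""
    model.zero_grad()
    logits = model(x)                                   # gradients must be enabled here
    logits[:, target_class].sum().backward()
    weights = feats["grad"].mean(dim=(2, 3), keepdim=True)            # channel importance
    cam = F.relu((weights * feats["act"]).sum(dim=1, keepdim=True))   # weighted activation map
    return F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
```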
References:
[1] See the channel “failed-diffusion” on the Stable Diffusion Discord, where people share their failures.
[2] Tangermann, V. (2022, October 24). We’re cry-laughing at these “spectacular failures” of ai generated art. Futurism. From https://futurism.com/the-byte/ai-generated-art-failures
[3] Coccomini, D. A., Esuli, A., Falchi, F., Gennaro, C., & Amato, G. (2023). Detecting Images Generated by Diffusers. arXiv preprint arXiv:2303.05275.
[4] Corvi, R., Cozzolino, D., Zingarini, G., Poggi, G., Nagano, K., & Verdoliva, L. (2022). On the detection of synthetic images generated by diffusion models. arXiv preprint arXiv:2211.00680.
[5] Bird, J. J., & Lotfi, A. (2023). CIFAKE: Image Classification and Explainable Identification of AI-Generated Synthetic Images. arXiv preprint arXiv:2303.14126.
[6] Zhu, J. Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision (pp. 2223-2232).
[7] Alami Mejjati, Y., Richardt, C., Tompkin, J., Cosker, D., & Kim, K. I. (2018). Unsupervised attention-guided image-to-image translation. Advances in neural information processing systems, 31.
Deliverables: At the end of the semester, the student should provide a framework for detecting diffusion-generated images and for improving the quality of diffusion-generated images. Deliverables should include code, well cleaned up and easily reproducible, as well as a written report, explaining the models, the steps taken for the project and the performances of the models.
Prerequisites: Python and PyTorch.
Level: MS research project
Number of students: 1
Supervisor: Martin Nicolas Everaert (martin.everaert [at] epfl.ch)
Image-based rendering dates back to the 1990s. Unlike traditional computer graphics rendering, which requires explicit scene geometry and textures, image-based rendering renders a scene from observations of it, i.e., photographs taken in the real or synthesized scene.
Given an image collection of a scene under different viewing directions, NeRF can faithfully synthesize novel views that are 3D-consistent. However, it remains an open question how to render the scene under a novel lighting condition.
There are a few works on relighting a NeRF-like implicit scene representation. The process has two steps: scene decomposition and 3D scene extraction from the decomposition. Afterwards, we can relight the 3D representation easily. A common material model assumed for all objects of interest is the Disney BSDF model. However, there is a variety of other materials, such as glass or glints, that cannot be adequately represented under this assumption.
In this project, we will work with a dedicated type of surface appearance, explore various 3D scene representations, either implicit or explicit, and develop corresponding scene decomposition strategies that would then enable novel view synthesis and relighting.
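Once a scene is decomposed, relighting under the simplest (Lambertian) assumption reduces to re-evaluating a shading model with a new light; the sketch below shows exactly the baseline that breaks down for the glass and glint materials this project targets.

```python
import torch

def relight_lambertian(albedo, normals, light_dir, light_color):
    """Relight a decomposed scene under a Lambertian assumption (sketch).
    albedo: [..., 3], normals: [..., 3] unit vectors, light_dir: [3] unit vector, light_color: [3]."""
    cos = torch.clamp((normals * light_dir).sum(dim=-1, keepdim=True), min=0.0)   # n . l, clamped
    return albedo * light_color * cos                                             # diffuse shading only
```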
Type of work
- MS Level: semester project / master project (can be adapted to a bachelor project for a capable bachelor student)
- 60% research, 40% development
Prerequisite:
- Knowledge in deep learning frameworks (e.g., PyTorch or Tensorflow), image processing and Computer Vision.
- Experience with 3D vision is required.
- Knowledge with Computer Graphics is recommended.
Supervisor:
- Dongqing Wang, [email protected]
Reference Literature:
- NeRF: Neural Radiance Field https://www.matthewtancik.com/nerf
- NeRV: Neural Reflectance and Visibility Fields for Relighting and View Synthesis https://pratulsrinivasan.github.io/nerv/
- Neural Radiance Fields for Outdoor Scene Relighting https://4dqv.mpi-inf.mpg.de/NeRF-OSR/
- NeRFactor: Neural Factorization of Shape and Reflectance Under an Unknown Illumination https://arxiv.org/abs/2106.01970
Comics Projects
Visual computing on comics is a very exciting yet very challenging problem because of the lack of annotated data, domain shifts, and intra-domain variability. To tackle this problem, we aim to use segmentation, saliency detection, and depth estimation, along with style transfer. In the projects that follow, you will broaden your skills by learning how to use state-of-the-art computer vision algorithms on a customized comics dataset. We aim to publish the results in upcoming venues.
Lack of annotated data
The lack of annotated data in the comics domain makes it difficult to train machine learning models for vision tasks such as image classification, segmentation, depth prediction, and saliency prediction. Without annotated data, it is challenging for the model to learn the essential features and patterns to perform the task accurately.
Domain gap
The comics domain has unique properties that are not present in the training data (natural images), resulting in a domain gap between the training data and the comics data. This leads to poor performance and difficulty generalizing when a model is applied to comics data.
Variability within the domain
Different artists and publishers often use unique visual styles and techniques when producing comics. For instance, one artist may draw a chair substantially differently from another, resulting in a large variance in the object's appearance. This type of discrepancy is uncommon in natural images. It is therefore challenging for a model to accurately perform tasks such as image classification or segmentation, since it must group objects that seem to belong to distinct categories.
Description:
Salient regions are the areas that stand out compared to their surroundings. Saliency prediction aims to predict which areas attract the most attention. However, unlike image classification, the saliency prediction task lacks large-scale annotated datasets. Current approaches for saliency estimation construct ground truth from eye-tracking data on natural images. In our project, however, we will perform saliency detection on comics pages instead of natural images.
In this project, you will explore the existing saliency prediction methods on comics stimuli. You will test and improve a method proposed for saliency prediction using inter-object relationships. Then, you will evaluate your model by comparing it to state-of-the-art saliency detection models.
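Evaluation against ground-truth saliency maps typically relies on distribution-based metrics; two common ones are sketched below (inputs are 2D arrays holding the predicted and ground-truth maps).

```python
import numpy as np

def kl_divergence(pred, gt, eps=1e-8):
    """KL divergence between predicted and ground-truth saliency maps, treated as distributions."""
    pred = pred / (pred.sum() + eps)
    gt = gt / (gt.sum() + eps)
    return float((gt * np.log(gt / (pred + eps) + eps)).sum())

def correlation_coefficient(pred, gt):
    """Pearson correlation (CC) between the two saliency maps."""
    p = (pred - pred.mean()) / (pred.std() + 1e-8)
    g = (gt - gt.mean()) / (gt.std() + 1e-8)
    return float((p * g).mean())
```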
Tasks:
– Understand the literature and state-of-art
– Test existing methods for saliency prediction on comics
– Improve a saliency prediction method proposed by using object-character interactions
– Evaluate state-of-the-art saliency detection models on comics data
Prerequisites:
Experience in machine learning and computer vision, experience in Python, experience in deep learning frameworks
Deliverables:
Reproducible code and a written report
Level:
MS semester or thesis project
Type of work:
60% research, 40% development and testing
References:
[1] K. Bannier, E. Jain, and O. Le Meur, “DeepComics: Saliency estimation for comics.”
[2] K. Khetarpal and E. Jain, “A preliminary benchmark of four saliency algorithms on comic art,” 2016 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Seattle, WA.
[3] D. V. Ruiz, B. A. Krinski, and E. Todt, “IDA: Improved Data Augmentation Applied to Salient Object Detection,” 2020 SIBGRAPI.
Supervisor:
Bahar Aydemir ([email protected])
Description:
Visual saliency refers to the parts of a scene that capture our attention. Current approaches for saliency estimation construct ground truth from eye-tracking data on natural images. In our project, however, we will analyze eye-tracking data collected on comics pages instead of natural images. Later, we will use the collected data to estimate saliency in the comics domain.
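Ground-truth saliency maps are usually built by accumulating fixation points and blurring them with a Gaussian; a minimal sketch follows (the sigma value is an assumption that should be matched to the viewing setup).

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def fixation_map(fixations, height, width, sigma=25):
    """Turn raw eye-tracking fixations into a continuous saliency map (sketch).
    `fixations`: list of (x, y) coordinates on the comics page."""
    fmap = np.zeros((height, width), dtype=np.float32)
    for x, y in fixations:
        if 0 <= int(y) < height and 0 <= int(x) < width:
            fmap[int(y), int(x)] += 1.0                 # accumulate fixation counts
    blurred = gaussian_filter(fmap, sigma=sigma)        # spread each fixation over its neighborhood
    return blurred / (blurred.max() + 1e-8)             # normalized ground-truth saliency map
```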
Tasks:
– Understand the key points of an eye tracking experiment and our setup.
– Conduct an analysis of eye tracking data according to given instructions.
Deliverables: At the end of the semester, the student should provide the code and a report of the work.
Type of work: 20% research, 80% development and testing
References:
[1] A. Borji and L. Itti, “Cat2000: A large scale fixation dataset for boosting saliency research,” CVPR 2015 workshop on ”Future of Datasets”, 2015.
[2] K. Kunze, Y. Utsumi, Y. Shiga, K. Kise, and A. Bulling, “I know what you are reading: recognition of document types using mobile eye tracking,” Proceedings of the 2013 International Symposium on Wearable Computers, September 8-12, 2013, Zurich, Switzerland.
[3] K. Khetarpal and E. Jain, “A preliminary benchmark of four saliency algorithms on comic art,” 2016 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Seattle, WA.
Level: BS semester project
Supervisor: Bahar Aydemir ([email protected])
It is well known that deep learning models need to be trained with huge amounts of data. To circumvent this limitation when only a small dataset is available, models are pre-trained on large datasets and then fine-tuned on the few available labels from the target domain. The models are usually trained on ImageNet or even larger datasets, either with labels or in a self-supervised manner. Although these models provide strong baselines, the performance of different training schemes varies across scenarios, and there is no guideline for choosing which model to use.
In our project, we will investigate this problem with a novel comic dataset. We will explore a large array of models pre-trained on different datasets and with different objectives, both supervised and diverse self-supervised models, and then fine-tune them with our own labeled comic data. Through a thorough analysis of the results, we will provide insights into finding the best model for transfer learning. At the end of the semester, you should therefore be able to give an overview of the existing transfer-learning methods, show how they perform on our comic dataset, and theorize on their applicability beyond a single domain.
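A minimal sketch of the fine-tuning setup to be compared across pre-training schemes; the ResNet-50 backbone, the number of comic classes, and the linear-probe flag are placeholders (a self-supervised checkpoint would be loaded in place of the ImageNet weights for comparison).

```python
import torch
from torchvision import models

num_comic_classes = 10                                   # placeholder for our labeled comic data
model = models.resnet50(weights="DEFAULT")               # supervised ImageNet weights (torchvision >= 0.13)
model.fc = torch.nn.Linear(model.fc.in_features, num_comic_classes)   # new task-specific head

linear_probe = True
if linear_probe:
    for name, p in model.named_parameters():
        p.requires_grad = name.startswith("fc.")         # train only the new head; else fine-tune everything
optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-4)
```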
Task:
– Evaluate differently pre-trained models on comic data.
– Pre-train the model in a self-supervised learning manner on comic data directly.
– Establish the possibilities and limitations of transfer learning on comic data and other domains.
Prerequisites:
Knowledge of Python, a deep learning framework (TensorFlow or PyTorch), and linear algebra is required.
Level:
MS project
Type of work:
20% literature review, 40% research, 40% development and test.
References:
[1] RankMe: Assessing the Downstream Performance of Pretrained Self-Supervised Representations by Their Rank, Quentin Garrido, Randall Balestriero, Laurent Najman, Yann LeCun
[2] Dive into Self-Supervised Learning for Medical Image Analysis: Data, Models and Tasks, Chuyan Zhang, Yun Gu
[3] Rethinking ImageNet Pre-training, Kaiming He, Ross Girshick, Piotr Dollár
[4] Masked Autoencoders Are Scalable Vision Learners, Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick
Supervisors: Peter Grönquist ([email protected]) , Tong Zhang([email protected])
Description (Master Semester Project open to EPFL students)
In this project, you will do a comparative analysis of the existing literature on monocular depth estimation while applying it to comics images. The state-of-the-art monocular depth estimation methods [1] [2] [3] are designed for natural photographs, and when applied to comics images they severely underperform. In this project you would (a) apply the existing depth methods directly to the comics images, and (b) use an existing translation method to translate the comics images to real ones and then apply the existing depth baselines [4]. Following this, you will analyse the results qualitatively and quantitatively.
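The quantitative part of the analysis relies on standard depth error metrics; a minimal sketch of two of them (absolute relative error and RMSE) computed on valid ground-truth pixels:

```python
import numpy as np

def depth_metrics(pred, gt, eps=1e-6):
    """Standard monocular-depth error metrics (sketch)."""
    mask = gt > eps                                    # ignore invalid / missing ground truth
    pred, gt = pred[mask], gt[mask]
    abs_rel = np.mean(np.abs(pred - gt) / gt)          # absolute relative error
    rmse = np.sqrt(np.mean((pred - gt) ** 2))          # root mean squared error
    return {"abs_rel": float(abs_rel), "rmse": float(rmse)}
```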
You may contact the supervisor at any time should you want to discuss the idea further.
References
[1] Monocular Depth Estimation: A Survey https://arxiv.org/abs/1901.09402
[2] Monocular Depth Estimation Based On Deep Learning: An Overview https://arxiv.org/abs/2003.06620
[3] Monocular Depth Estimation Using Deep Learning: A Review https://www.mdpi.com/1424-8220/22/14/5353/pdf
[4] Estimating Image Depth in the Comics Domain https://arxiv.org/abs/2110.03575
Type of Work (e.g., theory, programming)
30% research, 70% implementation and testing
Prerequisites
Good experience in deep learning, experience in Python, Pytorch. Experience in statistical analysis (finding absolute errors or RMSE between the prediction and the ground-truth) to report the performance evaluations of the models.
Models will run on RunAI (we will guide you on how to use RunAI; no prior knowledge required).
Supervisor
Deblina BHATTACHARJEE ([email protected])
Description (Master Semester Project or Master Thesis Project open to EPFL students)
In this project, you will research the existing literature on multi-task learning and self-supervised approaches [see 1]. You will then read the literature on how to use these two in conjunction and implement the method [see 2]. You will have to think of a novel idea to improve on this method, for which you will receive guidance from your supervisor. Thereafter, you will evaluate your method against the existing baselines. In doing so, you will learn about the various approaches to improve joint training on dense vision tasks.
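As a rough illustration of joint training on dense tasks, a shared backbone with per-task heads and a weighted sum of losses might look as follows; the specific tasks, backbone, and weights are placeholders (see [1] for the actual design space).

```python
import torch
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    """Shared backbone with per-task dense heads (sketch)."""
    def __init__(self, backbone, feat_dim, n_classes):
        super().__init__()
        self.backbone = backbone                               # e.g., a ViT or ResNet encoder
        self.seg_head = nn.Conv2d(feat_dim, n_classes, 1)      # dense task 1: segmentation
        self.depth_head = nn.Conv2d(feat_dim, 1, 1)            # dense task 2: depth

    def forward(self, x):
        feats = self.backbone(x)                               # [B, feat_dim, H', W'] assumed
        return self.seg_head(feats), self.depth_head(feats)

def multitask_loss(seg_logits, seg_gt, depth_pred, depth_gt, w_seg=1.0, w_depth=1.0):
    seg_loss = nn.functional.cross_entropy(seg_logits, seg_gt)
    depth_loss = nn.functional.l1_loss(depth_pred, depth_gt)
    return w_seg * seg_loss + w_depth * depth_loss             # weighted joint-training objective
```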
Bonus (applicable to Master thesis students only): We will work with transformer frameworks, and working on the explainability/interpretability of transformers will fetch you extra points.
You may contact the supervisor at any time should you want to discuss the idea further.
Reference
[1] Multi-Task Learning for Dense Prediction Tasks: A Survey, Simon Vandenhende et.al.
[2] Multi-task Self-Supervised Visual Learning, Doersch and Zisserman.
Type of Work (e.g., theory, programming)
50% research, 50% development and testing
Prerequisites
Experience in deep learning, experience in Python, Pytorch. Experience in statistical analysis to report the performance evaluations of the models.
Models will run on RunAI (we will guide you on how to use RunAI; no prior knowledge required).
Supervisor(s)
Deblina BHATTACHARJEE ([email protected])
Description: Diffusion models recently became state-of-the-art for generating images. These models are trained as follows: images from the training set are deteriorated with Gaussian noise at different levels, and the model is tasked to gradually denoise them. At inference time, to generate an image, one starts from noise and gradually denoises it. In the particular case of generating images with text-to-image diffusion models, one commonly relies on classifier-free guidance [1] to improve the alignment of the generated image with the textual prompt. Another possible way to improve this text-image alignment is to use CLIP guidance [2, “classifier guidance” in 3]. The first solution, classifier-free guidance, requires two forward passes at each inference step, one with the textual prompt and one without, which doubles the time needed to generate an image. The second solution, CLIP guidance, is rarely used in practice with text-to-image diffusion models, because it requires additional forward and backward passes through the CLIP model, which makes image generation significantly slower, e.g., 40 seconds instead of 5 seconds. In this project, you will use these guidance terms at training time instead of inference time. You will therefore fine-tune diffusion models with additional terms in the loss.
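For reference, classifier-free guidance at inference time combines a conditional and an unconditional noise prediction, which is exactly the double forward pass this project aims to move into training; `eps_model` and the embeddings below are placeholders for the chosen text-to-image model.

```python
def classifier_free_guidance(eps_model, x_t, t, text_emb, null_emb, w=7.5):
    """Classifier-free guidance at inference time (sketch): two forward passes per denoising step."""
    eps_cond = eps_model(x_t, t, text_emb)      # pass conditioned on the textual prompt
    eps_uncond = eps_model(x_t, t, null_emb)    # pass conditioned on the empty / null prompt
    return eps_uncond + w * (eps_cond - eps_uncond)   # guided noise estimate used by the sampler
```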
Tasks:
– Distillation of classifier-free guidance. You will propose modifications to the training loop of diffusion models to distill the classifier-free guidance at training time instead of inference time. You will then fine-tune a pretrained diffusion model with these modifications and evaluate the results along several axes: generation time, image quality, and text-image alignment.
– Feed-forward CLIP-guided diffusion. You will propose modifications to the training loop of diffusion models to account for CLIP-guidance at training time. You will then fine-tune a pretrained diffusion model with these modifications and evaluate the results along several axes: generation time, image quality, and text-image alignment.
References:
[1] Ho, J., & Salimans, T. (2022). Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598.
[2] Suraj Patil (2022). CLIP guidance with Stable Diffusion. From https://github.com/huggingface/diffusers/blob/main/examples/community/clip_guided_stable_diffusion.py
[3] Dhariwal, P., & Nichol, A. (2021). Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34, 8780-8794.
Deliverables: At the end of the semester, the student should provide a framework to fine-tune and evaluate text-to-image models with feed-forward guidance. Deliverables should include code, well cleaned up and easily reproducible, as well as a written report, explaining the models, the steps taken for the project and the performances of the models.
Prerequisites: Python and PyTorch.
Level: BS or MS research project
Number of students: 1
Supervisor: Martin Nicolas Everaert (martin.everaert [at] epfl.ch)