Here is a list of master and semester projects currently proposed at the DHLAB. For most projects, descriptions are initial seeds and the work can be adjusted depending on the skills and the interests of the students. For a list of already completed projects (with code and reports), see this GitHub page.
- Are you interested in a project listed below that is marked as available? Write an email to the contact person(s) mentioned in the project description, indicating your section and year of study, and possibly including a record of your recent grades.
- Would you like to propose a project, or are you interested in the work done at the DHLAB? Write an email to Frédéric Kaplan and Maud Ehrmann, explaining what you would like to do.
Spring 2024
Project descriptions coming soon …
Type: MSc Semester project
Sections: Architecture, Digital Humanities, Data Science, Computer Science
Supervisor: Beatrice Vaienti, Hamest Tamrazyan
Keywords: 3D modelling
Number of students: 1–2 (min. 12 ECTS in total)
Context: The Armenian cultural heritage in Artsakh [1] is in danger after explicit threats of irreversible destruction coordinated by the Azerbaijani authorities [2]. As part of a series of actions coordinated by EPFL, the Digital Humanities Institute is currently prototyping methods for the rapid deployment of DH technology in situations of crisis. As part of these actions, over 200 Armenian inscriptions of Artsakh are being systematized and digitized together with essential information such as language data (including diplomatic and interpretive transcriptions), English translations, the location of the inscription on the monument (if applicable), geographical and chronological data, the type of monument, and the type of inscription. This digitized data will not only help preserve these invaluable inscriptions but can also be used for further investigation and research. The aim of this project is to create an accurate 3D model of the church, precisely locate the inscriptions within it, and thereby contribute to preserving, studying, and promoting Armenian cultural heritage in Nagorno-Karabakh.

Figure: By Julian Nyča – Own work, CC BY-SA 3.0
Research questions:
- How can advanced imaging and 3D modelling technologies be utilized to accurately capture and represent the intricate details of the church and the inscriptions?
- What methods can be employed to ensure the precise alignment and placement of the digitized inscriptions within the 3D model of the church?
- How can the digital representation of the church and its inscriptions be effectively integrated with the database of Armenian inscriptions in Nagorno-Karabakh?
- What insights can be gained from analysing the spatial distribution and arrangement of the inscriptions within the church, shedding light on the historical and cultural context in which they were created?
- How can the integration of the 3D model and the digitized inscriptions contribute to the preservation, documentation, and study of Armenian epigraphic heritage in Nagorno-Karabakh?
Objectives: This project aims to deepen the knowledge of the architectural and epigraphic significance of the church, explore innovative techniques for digitizing and visualizing cultural heritage, and contribute to the preservation and accessibility of Armenian inscriptions in Nagorno-Karabakh.
Main steps:
- Research and Planning: Conduct thorough research on the church and its inscriptions, architectural features, and existing documentation, and develop a detailed plan for creating a 3D model and locating the inscriptions within the model.
- Study of the plan, section, and elevation views from a survey: Proportional and compositional study aimed at the 3D reconstruction.
- 3D reconstruction: A low level of detail mesh already exists, but with this project, the student will try to transfer the information from the architectural survey to a refined architectural 3D model with interiors and exteriors.
- Data Processing and Digitization: Process the collected data to digitally represent the church and the inscriptions.
- Inscription Localization: Analyse the collected data and determine the precise location of each inscription within the 3D model of the church.
- Data Integration: Ensure that the essential information, such as language data, translations, geographical and chronological data, monument and inscription types, bibliographic references, and photographs, are correctly linked to the localized inscription.
Explored methods:
- Proportional analysis
- 3D modelling using Rhino
- 3D segmentation and annotation with the inscription
- Exploration of visualization methodologies for this additionally embedded information
Requirements: previous experience with architectural 3D modelling using Rhino.
[1] A toponym used by local Armenians to refer to the Nagorno-Karabakh territory.
[2] The European Parliament resolution on the destruction of cultural heritage in Nagorno-Karabakh (2022/2582(RSP)), dated 09.03.2022.
Type: MSc Semester project (12 ECTS) or MSc project
Sections: Data Science, Computer Science, Digital Humanities
Supervisors: Emanuela Boros, Maud Ehrmann
Number of students: 1

Figure: Historical Instruction Mining for (finetuning) Large Language Models (Generated with Midjourney)
Context: Historical collections present multiple challenges that stem from the quality of digitization, the need to handle documents deteriorated by time, poor-quality printing materials, or imperfect recognition processes such as optical character recognition (OCR) and optical layout recognition (OLR). Moreover, historical collections pose a further challenge because their documents are distributed over a long enough period of time to be affected by language change. This is especially true for Western European languages, which only acquired their modern spelling standards roughly around the 18th or 19th centuries. At the same time, language models (LMs) have been leveraged for several years now, obtaining state-of-the-art performance on the majority of NLP tasks, generally by being fine-tuned on downstream tasks (such as entity recognition). LLMs, or instruction-following models, have since taken over, with relatively new capabilities for solving some of these tasks in a zero-shot manner through prompt engineering. For example, the generative pre-trained transformer (GPT) family of LLMs refers to a series of increasingly powerful and large-scale neural network architectures developed by OpenAI; starting with GPT, subsequent models such as ChatGPT and GPT-4 have witnessed substantial growth. These increased sizes allow the models to capture more intricate patterns in the training data, resulting in better performance on various tasks (like acing exams). Nevertheless, they seem to fail at understanding and reasoning when handling historical documents. This project aims at building, in a semi-automatic manner, a dataset for improving the application of LLMs to historical data analysis.
Research Questions:
- Is it feasible to create a dataset for training LLMs to better comprehend historical documents, using semi-automatic or automatic methods?
- Can a specialized, resource-efficient LLM effectively process and understand noisy, historical digitized documents?
Objective: To develop an instruction-based dataset to improve LLMs’ capabilities in interpreting historical documents. The focus will be on sourcing and analyzing historical Swiss and Luxembourgish newspapers (spanning 200 years) and other historical collections in ancient Greek or Latin. Two fictive examples:
Instruction/Prompt: When was the Red Cross founded?
Example Answer: 1864
Instruction / Prompt: Given the following excerpt from a Luxembourgish newspaper from 1919, identify the main event and key figures involved. Excerpt: “En 1919, la Grande-Duchesse Charlotte est montée sur le trône du Luxembourg, succédant à sa sœur, la Grande-Duchesse Marie-Adélaïde, qui avait abdiqué en raison de controverses liées à la Première Guerre mondiale.”
Example Response: Grand Duchess Charlotte and her sister, Grand Duchess Marie-Adélaïde.
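Such instruction–response pairs can be stored in a simple JSON-lines file so they can be fed directly to standard fine-tuning tooling. Below is a minimal, hypothetical sketch of what such records could look like; the field names ("instruction", "context", "response") are an illustrative assumption, not a fixed project schema.

```python
import json

# Hypothetical instruction records; the field names are an illustrative assumption.
records = [
    {
        "instruction": "When was the Red Cross founded?",
        "context": "",
        "response": "1864",
    },
    {
        "instruction": ("Given the following excerpt from a Luxembourgish newspaper from 1919, "
                        "identify the main event and key figures involved."),
        "context": ("En 1919, la Grande-Duchesse Charlotte est montée sur le trône du Luxembourg, "
                    "succédant à sa sœur, la Grande-Duchesse Marie-Adélaïde, qui avait abdiqué en "
                    "raison de controverses liées à la Première Guerre mondiale."),
        "response": ("Accession of Grand Duchess Charlotte to the throne of Luxembourg; key figures: "
                     "Grand Duchess Charlotte and her sister, Grand Duchess Marie-Adélaïde."),
    },
]

# One JSON object per line (JSONL), the format most fine-tuning scripts expect.
with open("historical_instructions.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```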
Main Steps:
- Data Curation:
- Collect OCR-based datasets.
- Analyze historical newspaper articles to understand common features and challenges.
- Dataset Creation:
- Decide which types of instructions to generate and use existing LLMs (e.g. T5, BART) to generate instructions or questions from Swiss historical documents.
- Model Training/Fine-Tuning:
- Train or fine-tune a language model like LLaMA on this dataset (a minimal fine-tuning sketch follows this list).
- Evaluation:
- Assess LLMs’ performance on NLP tasks (NER, EL) using historical documents.
- Compare models trained on the new dataset with those trained on standard datasets.
- Employ metrics like accuracy, perplexity, F1 score.
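As a rough illustration of the training/fine-tuning step, the sketch below fine-tunes a causal language model on such a JSONL file with the Hugging Face Trainer. The checkpoint path and hyperparameters are placeholders; a LLaMA-scale model would in practice likely require parameter-efficient fine-tuning (e.g. LoRA) to fit on available hardware.

```python
# A minimal fine-tuning sketch; the checkpoint path and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "path/to/local-llama-checkpoint"          # hypothetical local checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

dataset = load_dataset("json", data_files="historical_instructions.jsonl")["train"]

def to_features(example):
    # Concatenate instruction, context and response into a single training text.
    text = (f"### Instruction:\n{example['instruction']}\n"
            f"### Context:\n{example['context']}\n"
            f"### Response:\n{example['response']}")
    return tokenizer(text, truncation=True, max_length=512)

dataset = dataset.map(to_features, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-model", per_device_train_batch_size=2,
                           num_train_epochs=1, learning_rate=2e-5),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```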
Requirements
- Proficiency in Python, ideally PyTorch.
- Strong writing skills.
- Commitment to the project.
Outputs
- Potential publications in NLP and historical document processing.
- Contribution to advancements in handling historical texts with LLMs.
Deliverables
- A comprehensive dataset for training LLMs on historical texts.
- A report or paper detailing the methodology, findings, and implications of the project.
Fall 2023
Type: MSc Semester project
Sections: Architecture, Digital Humanities, Data Science, Computer Science
Supervisor: Beatrice Vaienti, Hamest Tamrazyan
Keywords: 3D modelling
Number of students: 1–2 (min. 12 ECTS in total)
Context: The Armenian cultural heritage in Artsakh [1] is in danger after explicit threats of irreversible destruction coordinated by the Azerbaijani authorities [2]. As part of a series of actions coordinated by EPFL, the Digital Humanities Institute is currently prototyping methods for the rapid deployment of DH technology in situations of crisis. As part of these actions, over 200 Armenian inscriptions of Artsakh are being systematized and digitized together with essential information such as language data (including diplomatic and interpretive transcriptions), English translations, the location of the inscription on the monument (if applicable), geographical and chronological data, the type of monument, and the type of inscription. This digitized data will not only help preserve these invaluable inscriptions but can also be used for further investigation and research. The aim of this project is to create an accurate 3D model of the church, precisely locate the inscriptions within it, and thereby contribute to preserving, studying, and promoting Armenian cultural heritage in Nagorno-Karabakh.

Figure: By Julian Nyča – Own work, CC BY-SA 3.0
Research questions:
- How can advanced imaging and 3D modelling technologies be utilized to accurately capture and represent the intricate details of the church and the inscriptions?
- What methods can be employed to ensure the precise alignment and placement of the digitized inscriptions within the 3D model of the church?
- How can the digital representation of the church and its inscriptions be effectively integrated with the database of Armenian inscriptions in Nagorno-Karabakh?
- What insights can be gained from analysing the spatial distribution and arrangement of the inscriptions within the church, shedding light on the historical and cultural context in which they were created?
- How can the integration of the 3D model and the digitized inscriptions contribute to the preservation, documentation, and study of Armenian epigraphic heritage in Nagorno-Karabakh?
Objectives: This project aims to deepen the knowledge of the architectural and epigraphic significance of the church, explore innovative techniques for digitizing and visualizing cultural heritage, and contribute to the preservation and accessibility of Armenian inscriptions in Nagorno-Karabakh.
Main steps:
- Research and Planning: Conduct thorough research on the church and its inscriptions, architectural features, and existing documentation, and develop a detailed plan for creating a 3D model and locating the inscriptions within the model.
- Study of the plan, section, and elevation views from a survey: Proportional and compositional study aimed at the 3D reconstruction.
- 3D reconstruction: A low level of detail mesh already exists, but with this project, the student will try to transfer the information from the architectural survey to a refined architectural 3D model with interiors and exteriors.
- Data Processing and Digitization: Process the collected data to digitally represent the church and the inscriptions.
- Inscription Localization: Analyse the collected data and determine the precise location of each inscription within the 3D model of the church.
- Data Integration: Ensure that the essential information, such as language data, translations, geographical and chronological data, monument and inscription types, bibliographic references, and photographs, are correctly linked to the localized inscription.
Explored methods:
- Proportional analysis
- 3D modelling using Rhino
- 3D segmentation and annotation with the inscription
- Exploration of visualization methodologies for this additionally embedded information
Requirements: previous experience with architectural 3D modelling using Rhino.
[1] A toponym used by local Armenians to refer to the Nagorno-Karabakh territory.
[2] The European Parliament resolution on the destruction of cultural heritage in Nagorno-Karabakh (2022/2582(RSP)), dated 09.03.2022.
Type: MSc Semester project (12 ECTS) or MSc project
Sections: Data Science, Computer Science, Digital Humanities
Supervisors: Emanuela Boros, Maud Ehrmann
Number of students: 1–2

Figure: Historical Instruction Mining for (finetuning) Large Language Models (Generated with Midjourney)
Context: Historical collections present multiple challenges that stem from the quality of digitization, the need to handle documents deteriorated by time, poor-quality printing materials, or imperfect recognition processes such as optical character recognition (OCR) and optical layout recognition (OLR). Moreover, historical collections pose a further challenge because their documents are distributed over a long enough period of time to be affected by language change. This is especially true for Western European languages, which only acquired their modern spelling standards roughly around the 18th or 19th centuries. At the same time, language models (LMs) have been leveraged for several years now, obtaining state-of-the-art performance on the majority of NLP tasks, generally by being fine-tuned on downstream tasks (such as entity recognition). LLMs, or instruction-following models, have since taken over, with relatively new capabilities for solving some of these tasks in a zero-shot manner through prompt engineering. For example, the generative pre-trained transformer (GPT) family of LLMs refers to a series of increasingly powerful and large-scale neural network architectures developed by OpenAI; starting with GPT, subsequent models such as ChatGPT and GPT-4 have witnessed substantial growth. These increased sizes allow the models to capture more intricate patterns in the training data, resulting in better performance on various tasks (like acing exams). Nevertheless, they seem to fail at understanding and reasoning when handling historical documents. This project aims at building, in a semi-automatic manner, a dataset for improving the application of LLMs to historical data analysis.
Research Questions:
- Can we create a dataset in a (semi-automatic/automatic) manner for training an LLM to better understand historical documents?
- Can a specialized, resource-efficient LLM effectively process noisy historical digitised documents?
Objective: The objective of this project is to develop an instruction-based dataset to enhance the ability of LLMs to understand and interpret historical documents. This will involve sourcing historical Swiss and Luxembourgish newspapers spanning 200 years, as well as other historical collections such as those in ancient Greek or Latin. Two fictive examples:
Instruction/Prompt: When was the Red Cross founded?
Example Answer: 1864
Instruction / Prompt: Given the following excerpt from a Luxembourgish newspaper from 1919, identify the main event and key figures involved. Excerpt: “En 1919, la Grande-Duchesse Charlotte est montée sur le trône du Luxembourg, succédant à sa sœur, la Grande-Duchesse Marie-Adélaïde, qui avait abdiqué en raison de controverses liées à la Première Guerre mondiale.”
Example Response: Grand Duchess Charlotte and her sister, Grand Duchess Marie-Adélaïde.
Main Steps:
- Data Curation: Gather datasets based on OCR level, and familiarize with the corpus by exploring historical newspaper articles. Identify common features of historical documents and potential difficulties.
- Dataset Creation: Decide on what type of instruction should be generated and utilise other existing LLMs such as T5, BART, etc., to generate instructions (or questions) from Swiss historical documents (a small sketch follows this list).
- (Additional) Train or finetune a LLaMA language model based on this dataset.
- (Additional) Evaluation: Utilise the newly created dataset to evaluate existing LLMs and assess their performance on various NLP tasks such as named entity recognition (NER) and linking (EL) in historical documents. Use standard metrics such as accuracy, perplexity, or F1 score for the evaluation. Compare the performance of models trained with the new dataset against those trained with standard datasets to ascertain the effectiveness of the new dataset.
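For the dataset creation step, one possible approach (a sketch under assumptions, not the project's prescribed method) is to prompt an existing sequence-to-sequence model to draft questions from OCRed passages; the checkpoint and prompt prefix below are placeholders that would need to be replaced by a model actually trained for question generation.

```python
# A minimal sketch of semi-automatic instruction generation with a seq2seq model.
# "t5-base" and the prompt prefix are placeholders; a checkpoint trained for question
# generation would be needed in practice, and outputs must be manually validated.
from transformers import pipeline

generator = pipeline("text2text-generation", model="t5-base")

passage = ("En 1919, la Grande-Duchesse Charlotte est montée sur le trône du Luxembourg, "
           "succédant à sa sœur, la Grande-Duchesse Marie-Adélaïde.")

candidates = generator("generate question: " + passage, max_length=64)
print(candidates[0]["generated_text"])   # candidate instruction, to be reviewed by hand
```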
Requirements: Proficiency in Python, preferably PyTorch, excellent writing skills, and dedication to the project.
Outputs: The project’s results could potentially lead to publications in relevant research areas and would contribute to the field of historical document processing.
Type: MSc Semester project (12 ECTS) or MSc project
Sections: Data Science, Computer Science, Digital Humanities
Supervisors: Emanuela Boros, Maud Ehrmann
Number of students: 1-2

Figure: Forecasting News (Generated with Midjourney)
Context: The rapid evolution and widespread adoption of digital media have led to an explosion of news content. News organizations are continuously publishing articles, making it increasingly challenging to keep track of daily developments and their implications. To navigate this overwhelming amount of information, computational methods capable of processing and making predictions about this content are of significant interest. With the advent of advanced machine learning techniques, such as Generative Adversarial Networks (GANs) and Large Language Models (LLMs), it is possible to forecast future content based on existing articles. This project proposes to leverage the strengths of both GANs and LLMs to predict the content of next-day articles based on current-day news. This approach will not only allow for a better understanding of how events evolve over time, but could also serve as a tool for news agencies to anticipate and prepare for future news developments.
Research Questions:
- Can we design a system that effectively leverages GANs and LLMs to predict next-day article content based on current-day articles?
- How accurate are the generated articles when compared to the actual articles of the next day (how close to reality are they)?
- What are the limits and potential biases of such a system and how can they be mitigated?
Objective: The objective of this project is to design and implement a system that uses a combination of GANs and LLMs to predict the content of next-day news articles. This will be measured by the quality, coherence, and accuracy of the generated articles compared to the actual articles from the following day.
Main Steps:
- Dataset Acquisition: Procure a dataset consisting of sequential daily articles from multiple sources.
- Data Preprocessing: Clean and preprocess the data for training. This involves text normalization, tokenization, and the creation of appropriate training pairs.
- Generator Network Design: Leverage an LLM as the generator network in the GAN. This network will generate the next-day article based on the input from the current-day article.
- Discriminator Network Design: Build a discriminator network capable of distinguishing between the actual next-day article and the generated article.
- GAN Training: Train the GAN system by alternating between training the discriminator to distinguish real vs generated articles, and training the generator to fool the discriminator (a simplified sketch follows this list).
- Evaluation: Assess the generated articles based on measures of text coherence, relevance, and similarity to the actual next-day articles.
- Bias and Limitations: Examine and discuss the potential limitations and biases of the system, proposing ways to address these issues.
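To make the generator/discriminator interplay concrete, here is a deliberately simplified sketch of one possible setup (an assumption for illustration, not the project's prescribed architecture): a seq2seq model drafts a next-day article from a current-day article, and a small classifier over sentence embeddings is trained to tell real from generated follow-ups. Only the discriminator step is shown, because updating the generator through discrete text sampling requires additional tricks (e.g. policy gradients or Gumbel-softmax). Model names are placeholders, and the snippet assumes the transformers and sentence-transformers libraries.

```python
# Simplified discriminator step of a text "GAN" whose generator is a language model.
# Model names are placeholders; the generator update itself is omitted (see note above).
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from sentence_transformers import SentenceTransformer

gen_tok = AutoTokenizer.from_pretrained("t5-base")               # placeholder generator
generator = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
encoder = SentenceTransformer("all-MiniLM-L6-v2")                # 384-dim sentence embeddings

discriminator = nn.Sequential(nn.Linear(384 * 2, 128), nn.ReLU(), nn.Linear(128, 1))
optimizer = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def discriminator_step(today_texts, real_tomorrow_texts):
    # 1) The generator drafts tomorrow's article from today's article.
    inputs = gen_tok(today_texts, return_tensors="pt", padding=True, truncation=True)
    fake_ids = generator.generate(**inputs, max_length=128)
    fake_texts = gen_tok.batch_decode(fake_ids, skip_special_tokens=True)

    # 2) Embed (today, tomorrow) pairs and train the discriminator on real vs generated.
    today = torch.tensor(encoder.encode(today_texts))
    real_pairs = torch.cat([today, torch.tensor(encoder.encode(real_tomorrow_texts))], dim=1)
    fake_pairs = torch.cat([today, torch.tensor(encoder.encode(fake_texts))], dim=1)

    logits = discriminator(torch.cat([real_pairs, fake_pairs]))
    labels = torch.cat([torch.ones(len(today_texts), 1), torch.zeros(len(today_texts), 1)])
    loss = bce(logits, labels)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```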
Master Project Additions:
If the project is taken as a master project, the student will further:
- Refine the Model: Apply advanced training techniques, and perform a detailed hyperparameter search to optimize the GAN’s performance.
- Multi-Source Integration: Extend the model to handle and reconcile articles from multiple sources, aiming to generate a more comprehensive next-day article.
- Long-Term Predictions: Investigate the model’s capabilities and limitations in making longer-term predictions, such as a week or a month in advance.
Requirements: Knowledge of machine learning and deep learning principles, familiarity with GANs and LLMs, proficiency in Python, and experience with a deep learning framework, preferably PyTorch.
Type: BA (8 ECTS) Semester project
Sections: Data Science, Computer Science, Digital Humanities
Supervisors: Emanuela Boros
Number of students: 1–2
Keywords: Document processing, Natural Language Processing (NLP), Information Extraction, Machine Learning, Deep Learning

Figure: Assessing Climate Change Perceptions and Behaviours in Historical Newspapers (Generated with Midjourney)
Context: With emissions in line with current Paris Agreement commitments, global warming is projected to exceed 1.5°C above pre-industrial levels, even if these commitments are supplemented by very challenging increases in the scale and ambition of mitigation after 2030. Even at this seemingly modest level of warming, the consequences are already observable today, with the number and intensity of certain natural hazards continuing to increase (e.g., extreme weather events, floods, forest fires). Near-term warming and the increased frequency, severity, and duration of extreme events will place many terrestrial, freshwater, coastal and marine ecosystems at high or very high risk of biodiversity loss. Exploring historical documents can help address gaps in our understanding of the historical roots of climate change, possibly uncover evidence of early efforts to address environmental issues, and show how environmentalism has evolved over time. This project aims to fill such gaps by examining a corpus of historical Swiss and Luxembourgish newspapers spanning 200 years (i.e., the impresso project’s corpus).
Research Questions:
- How have perceptions of climate change evolved over time, as seen in historical newspapers?
- What behavioural trends towards climate change can be identified from historical newspapers?
- Can we track the frequency and intensity of extreme weather events over time based on historical documents?
- Can we identify any patterns or trends in early efforts to address environmental issues?
- How has the sentiment towards climate change and environmentalism evolved over time?
Objective: This work explores several NLP techniques (text classification, information extraction, etc.) for providing a comprehensive understanding of the evolution and reporting of extreme weather events in historical documents.
Main Steps:
- Data Preparation: Identify relevant keywords and phrases related to climate change and environmentalism, such as “global warming”, “carbon emissions”, “climate policy”, “hurricane”, “flood”, “drought”, or “heat wave”. Compile a training dataset of articles centred around these keywords (a small filtering sketch follows this list).
- Data Analysis: Analyse the data and identify patterns in climate change perceptions and behaviours over time. This includes the identification of changes in the frequency of climate-related terms, changes in sentiment towards climate change, changes in the topics discussed in relation to climate change, the detection of mentions of locations, persons, or events, and the extraction of important keywords in weather forecasting news.
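As a simple illustration of the data preparation and analysis steps, the sketch below filters articles by climate-related keywords and computes their yearly share; the input file and its columns are hypothetical placeholders.

```python
# A minimal sketch: keyword filtering and yearly frequency of climate-related articles.
# The CSV file and its columns ("year", "text") are hypothetical placeholders.
import re
import pandas as pd

KEYWORDS = ["global warming", "carbon emissions", "climate policy",
            "hurricane", "flood", "drought", "heat wave"]
pattern = re.compile("|".join(re.escape(k) for k in KEYWORDS), flags=re.IGNORECASE)

articles = pd.read_csv("newspaper_articles.csv")               # columns: year, text
articles["climate_related"] = articles["text"].fillna("").str.contains(pattern)

# Share of climate-related articles per year: a first look at how coverage evolves over time.
yearly_share = articles.groupby("year")["climate_related"].mean()
print(yearly_share.tail(20))
```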
Requirements: Candidates should have a background in machine learning, data engineering, and data science, with proficiency in NLP techniques such as Named Entity Recognition, Topic Detection, or Sentiment Analysis. A keen interest in climate change, history, and media studies is also beneficial.
Resources:
- Analyzing the impact of climate change on critical infrastructure from the scientific literature: A weakly supervised NLP approach
- Climate of scepticism: US newspaper coverage of the science of climate change
- We provide an nlp-beginner-starter Jupyter notebook.
Type: MSc (12 ECTS) or BA (8 ECTS) Semester project
Sections: Computer Science, Data Science, Digital Humanities
Supervisors: Maud Ehrmann, Emanuela Boros
Number of students: 1–2
Figure: Exploring Large Vision-Language Pre-trained Models for Historical Images Classification and Captioning (Generated with Midjourney)
Context: The impresso project features a dataset of around 90 digitised historical newspapers containing approximately 3 million images. These images have no labels, and only 10% of them have a caption, two aspects that hinder their retrieval.
Two previous student projects focused on the automatic classification of these images, trying to identify 1. the type of image (e.g. photograph, illustration, drawing, graphic, cartoon, map), and 2. in the case of maps, the country or region of the world represented on that map. Good performance on image type classification was achieved by fine-tuning the VGG-16 pre-trained model (see report).
Objective: On the basis of these initial experiments and the annotated dataset compiled on this occasion, the present project will explore recent large-scale language-vision pre-trained models. Specifically, the project will attempt to:
- Evaluate zero-shot image type classification of the available dataset using the CLIP and LLaMA-Adapter-V2 Multi-modal models, and compare with previous performances;
- Explore and evaluate image captioning of the same dataset, including trying to identify countries or regions of the world. This part will require adding caption information on the test part of the dataset. In addition to the fluency and accuracy of the generated captions, a specific aspect that might be taken into account is distinctiveness, i.e. whether the image contains details that differentiate it from similar images.
Given the recency of the models, further avenues of exploration may emerge during the course of the project, which is exploratory in nature.
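For the zero-shot image-type classification part, a minimal sketch with the openai/CLIP package could look as follows; the candidate labels simply reuse the image types listed above, and the image path is a placeholder.

```python
# Minimal zero-shot image-type classification sketch with CLIP (image path is a placeholder).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

labels = ["a photograph", "an illustration", "a drawing", "a graphic", "a cartoon", "a map"]
text = clip.tokenize(labels).to(device)
image = preprocess(Image.open("newspaper_image.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()[0]

for label, p in sorted(zip(labels, probs), key=lambda x: -x[1]):
    print(f"{label}: {p:.3f}")
```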
Requirements: Knowledge of machine learning and deep learning principles, familiarity with computer vision, proficiency in Python, experience with a deep learning framework (preferably PyTorch), and interest in historical data.
References:
- CLIP repository and model: https://github.com/openai/CLIP
- LLaMA-Adapter-V2 Multi-modal: https://github.com/ZrrSkywalker/LLaMA-Adapter/tree/main/llama_adapter_v2_multimodal
- Zhang, Y., Wang, J., Wu, H., Xu, W. (2023). Distinctive Image Captioning via CLIP Guided Group Optimization. In: Karlinsky, L., Michaeli, T., Nishino, K. (eds) Computer Vision – ECCV 2022 Workshops. ECCV 2022. Lecture Notes in Computer Science, vol 13804. Springer, Cham.
- Learning Transferable Visual Models From Natural Language Supervision. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever. Proceedings of the 38th International Conference on Machine Learning, PMLR 139:8748-8763, 2021.
Spring 2023
Type: MSc Semester project
Sections: Digital humanities, Data Science, Computer Science
Supervisor: Rémi Petitpierre
Keywords: Data visualisation, Web design, IIIF, Memetics, History of Cartography
Number of students: 1–2 (min. 12 ECTS in total)
Context: Cultural evolution can be modeled as an ensemble of ideas and conventions that spread from one mind to another. These are referred to as memes, elementary replicators of culture inspired by the concept of a gene. A meme can be a tune, a catch-phrase, or an architectural detail, for instance. If we take the example of maps, a meme can be expressed in the choice of a certain texture, a colour, or a symbol to represent the environment. Thus, memes spread from one cartographer to another, through time and space, and can be used to study cultural evolution. With the help of computer vision, it is now possible to track those memes by finding replicated visual elements across large datasets of digitised historical maps.
Below is an example of visual elements extracted from a 17th century map (original IIIF image). By extending the process to tens of thousands of maps and embedding these elements in a feature space, it becomes possible to estimate which elements correspond to the same replicated visual idea, or meme. This opens up new ways to understand how ideas and technologies spread across time and space.

Despite the immense potential that such data holds for better understanding cultural evolution, it remains difficult to interpret, since it involves tens of thousands of memes, corresponding to millions of visual elements. Somewhat like in genomics research, it is now becoming essential to develop a microscope to observe memetic interactions more closely.
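To give an idea of how replicated visual elements might be grouped into candidate memes from their embeddings, here is a small sketch using scikit-learn; the embedding file, its dimensionality and the clustering thresholds are assumptions for illustration.

```python
# A minimal sketch: grouping visual-element embeddings into candidate "memes".
# The .npy file of element embeddings and the DBSCAN parameters are placeholders.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import normalize

embeddings = np.load("visual_element_embeddings.npy")      # shape: (n_elements, dim)
embeddings = normalize(embeddings)                          # unit norm -> cosine-like distances

# Elements that are close in the feature space are treated as the same replicated idea.
labels = DBSCAN(eps=0.15, min_samples=5).fit(embeddings).labels_   # -1 marks unclustered noise

n_memes = len(set(labels)) - (1 if -1 in labels else 0)
print(f"{n_memes} candidate memes covering {int((labels >= 0).sum())} visual elements")
```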
Objectives: In this project, the student will tackle the challenge of designing and building a prototype interface for exploring memes on the basis of replicated visual elements. The scientific challenge is to create a design that reflects the layered and interconnected nature of the data, composed of visual elements, maps, and larger sets. The student will develop their project by working with embeddings of replicated visual elements and digitised images served through the IIIF framework. The interface will make use of the metadata to visualise how time, space, as well as the development of new technologies, influence cartographic figuration, by filtering the visual elements. Finally, to reflect the multi-layered nature of the data, the design must be transparent and provide the ability to switch between visual elements and their original maps.
The project will draw on an exceptional dataset of tens of thousands of American, French, German, Dutch, and Swiss maps published between 1500 and 1950. Depending on the student’s interests, an interesting case study to demonstrate the benefits of the interface could be to investigate the impact of the invention of lithography, a revolutionary technology from the end of the 18th century, on the development of modern cartographic representations.
Prerequisites: Web Programming, basics of JavaScript.
Type: MSc (12 ECTS) or BA (8 ECTS) Semester project
Sections: Digital humanities, Computer Science
Supervisor: Beatrice Vaienti
Keywords: OCR, database
Number of students: 1
Context: An ongoing project at the Lab is focusing on the creation of a 4D database of the urban evolution of Jerusalem between 1840 and 1940. This database will not only depict the city in time as a 3D model, but also embed additional information and metadata about the architectural objects, thus employing the spatial representation of the city as a relational database. However, the scattered, partial and multilingual nature of the available historical sources underlying the construction of such a database makes it necessary to combine and structure them together, extracting from each one of them the information describing the architectural features of the buildings.
Two architectural catalogues, “Ayyubid Jerusalem” (1187-1250) and “Mamluk Jerusalem: an Architectural Study”, contain respectively 22 and 64 chapters, each one describing a building of the Old City in a thorough way. The information they provide includes for instance the location of the buildings, their history (founder and date), an architectural description, pictures and technical drawings.
Objectives: Starting from the scanned version of these two books, the objective of the project is to develop a pipeline to extract and structure their content in a relational spatial database. The content of each chapter is structured in sections that systematically cover the various aspects of each building’s architecture and history. Along with this already structured information, photos and technical drawings accompany the text: the richness of the images and architectural representations in the books should also be considered and integrated into the project outcomes. Particular emphasis will be placed on the extraction of information about the architectural appearance of buildings, which can then be used for procedural modelling tasks. Given these elements, three main sub-objectives are envisioned:
- OCR;
- Organization of the extracted text into the original sections (or into new ones) and extraction of the information pertaining to the architectural features of the buildings;
- Encoding the information in a spatial database, using the locational information present in each chapter to geolocate the building and possibly associating its position with the existing geolocated building footprints (a small sketch follows this list).
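As an illustration of what the spatial encoding step could look like, here is a minimal sketch using GeoPandas; the building attributes and coordinates are invented placeholders.

```python
# A minimal sketch: storing extracted building descriptions as a spatial database layer.
# Names, attributes and coordinates are invented placeholders.
import geopandas as gpd
from shapely.geometry import Point

buildings = gpd.GeoDataFrame(
    {
        "name": ["Example madrasa"],
        "founder": ["Unknown founder"],
        "foundation_date": ["13th century"],
        "description": ["Text extracted from the catalogue chapter..."],
        "geometry": [Point(35.2345, 31.7767)],      # placeholder lon/lat near the Old City
    },
    crs="EPSG:4326",
)

# Persist as a GeoPackage layer, which can later be joined with geolocated building footprints.
buildings.to_file("jerusalem_catalogue.gpkg", layer="buildings", driver="GPKG")
```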
Prerequisites: basic knowledge of database technologies and Python

Type: MSc Semester project or Master project
Sections: Computer Science, Data Science, Digital humanities
Supervisors: Maud Ehrmann, as well as historians and network specialists from the C2DH Centre at the University of Luxembourg.
Keywords: Document processing, NLP, machine learning
Context: News agencies (e.g. AFP, Reuters) have always played an important role in shaping the news landscape. Created in the 1830s and 1840s by groups of daily newspapers in order to share the costs of news gathering (especially abroad) and transmission, news agencies have gradually become key actors in news production, responsible for providing accurate and factual information in the form of agency releases. While studies exist on the impact of news agency content on contemporary news, little research has been done on the influence of news agency releases over time. During the 19C and 20C, to what extent did journalists rely on agency content to produce their stories? How was agency content used in historical newspapers, as simple verbatims (copy and paste) or with rephrasing? Were they systematically attributed or not? How did news agency releases circulate among newspapers and which ones went viral?
Objective: Based on a corpus of historical Swiss and Luxembourgish newspapers spanning 200 years (i.e., the impresso project’s corpus), the goal of this project is to develop a machine learning-based classifier to distinguish newspaper articles based on agency releases from other articles. In addition to detecting agency releases, the system could also identify the news agency behind the release.
Main steps:
- Ground preparation with a) exploration of the corpus (via the impresso interface) to become familiar with historical newspaper articles and identify typical features of agency content as well as potential difficulties; b) compilation of a list of news agencies active throughout the 19C and 20C.
- Construction of a training corpus (in French), based on:
- sampling and manual annotation;
- a collection of manually labelled agency releases, which could be automatically extended by using previously computed text reuse clusters (data augmentation).
- Training and evaluation of two (or more) agency release classifiers:
- a keyword baseline (i.e. where the article contains the name of the agency; a minimal sketch follows this list);
- a neural-based classifier
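As a starting point, the keyword baseline can be as simple as checking whether an article contains an agency name or one of its usual abbreviations. The sketch below uses a small illustrative subset of agencies, not the full list to be compiled in the first step.

```python
# A minimal keyword-baseline sketch for detecting agency releases.
# The agency list is a small illustrative subset, not the list compiled in step 1.
import re

AGENCIES = {
    "Havas": r"\bHavas\b",
    "ATS": r"\b(?:ATS|Agence télégraphique suisse)\b",
    "Reuters": r"\bReuters?\b",
    "Wolff": r"\bWolff\b",
}

def detect_agencies(article_text):
    """Return the agencies whose name appears in the article (keyword baseline)."""
    return [name for name, pat in AGENCIES.items()
            if re.search(pat, article_text, flags=re.IGNORECASE)]

print(detect_agencies("Berne, 12 mars. (ATS) - Le Conseil fédéral a décidé ..."))  # ['ATS']
```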
Master project – If the project is taken as a master project, the student will also work on the following:
Processing:
- Multilingual processing (French, German, Luxembourgish);
- Systematic evaluation and comparison of different algorithms;
- Identification of the news agency behind the release;
- Fine-grained characterisation of text modification;
- Application of the classifier to the entire corpus.
Analysis:
- Study of the distribution of agency content and its key characteristics over time (in a data science fashion).
- Based on the computed data, study of information flows based on a network representation of news agencies (nodes), news releases (edges) and newspapers (nodes).
Eventually, such work will enable the study of news flows across newspapers, countries, and time.
Requirements: Good knowledge in machine learning, data engineering and data science. Interest in media and history.

Fall 2022
There were only a few projects, and we no longer have the capacity to host projects for that period. Check back around December 2022 to see what will be proposed for Spring 2023!
Spring 2022

Type of project: Semester
Supervisors: Sven Najem-Meyer (DHLAB) Matteo Romanello (UNIL).
Context: Optical Character Recognition aims at transforming images into machine-readable texts. Though neural networks have helped improve performance, historical documents remain extremely challenging. Commentaries to classical Greek literature epitomize this difficulty, as systems must cope with noisy scans, complex layouts and mixed Greek and Latin scripts.
Objective: The objective of the project is to produce a system that can cope with this highly complex material. Depending on your skills and interests, the project can focus on:
- Image pre-processing (a small sketch follows this list)
- Multi-task learning: can we improve OCR by processing tasks like layout analysis or named-entity recognition in parallel?
- Benchmarking and fine-tuning available frameworks
- Optimizing post-processing with NLP
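For the image pre-processing direction, a minimal denoising and binarisation sketch with OpenCV might look as follows; the file name is a placeholder and the parameters would need tuning on real commentary scans.

```python
# A minimal pre-processing sketch: grayscale, denoising, adaptive binarisation.
# The input file is a placeholder; parameters need tuning on real commentary scans.
import cv2

img = cv2.imread("commentary_page.png")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Reduce scanning noise before binarisation.
denoised = cv2.fastNlMeansDenoising(gray, h=10)

# Adaptive thresholding copes better than a global threshold with uneven page illumination.
binary = cv2.adaptiveThreshold(denoised, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                               cv2.THRESH_BINARY, 31, 15)

cv2.imwrite("commentary_page_binarised.png", binary)
```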
Prerequisites:
- Good skills in Python; libraries such as OpenCV or PyTorch are a plus.
- Good knowledge of machine learning is advised; basics in computer vision and image processing would be a real plus.
- No knowledge of ancient Greek/literature is required.
Type of project: Semester
Supervisor: Maud Ehrmann
Objective: Given a large archive of historical newspapers (cf. the impresso project) containing both text and image material, the objective of this project is to:
1/ Learn a model to classify images according to their type (e.g. map, photograph, illustration, comics, ornament, drawing, caricature).
This first step will consist in:
- the definition of the typology by looking at the material – although the targeted typology will be rather coarse;
- the annotation of a small training set;
- the training of a model by fine-tuning an existing visual model;
- the evaluation of the said model.
2/ Apply this model to a large-scale collection of historical newspapers (inference), and possibly produce a statistical profile of the recognized elements through time.
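A minimal sketch of the fine-tuning step with torchvision is shown below; the dataset folder layout (one sub-folder per class), the choice of VGG-16 and the hyperparameters are assumptions for illustration.

```python
# A minimal fine-tuning sketch for image-type classification with torchvision.
# The dataset layout (one sub-folder per class) and hyperparameters are assumptions.
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
train_set = datasets.ImageFolder("annotated_images/train", transform=transform)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
model.classifier[6] = nn.Linear(4096, len(train_set.classes))    # replace the final layer

optimizer = torch.optim.Adam(model.classifier[6].parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for images, labels in loader:                                    # one epoch, for brevity
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```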
Required skills:
- basic knowledge in computer vision
- ideally experience with PyTorch

Type of project: Semester (ideally done in loose collaboration with the other semester project on image classification)
Supervisor: Maud Ehrmann
Objective: Given a large archive of historical newspapers (cf. the impresso project) containing both text and image material, the objective of this project is to:
1/ Learn a model for binary image classification: map vs. non-map.
This first step will consist in:
- the annotation of a small training set (this step is best done in collaboration with project on image classification);
- the training of a model by fine-tuning an existing visual model;
- the evaluation of the said model.
2/ Learn a model for map classification (which country or region of the world is represented)
- first exploration and qualification of map types in the corpus.
- building of a training set, probably with external sources;
- the training of a model by fine-tuning an existing visual model;
- the evaluation of the said model.
Required skills:
- basic knowledge in computer vision
- ideally experience with PyTorch
Type of project: Semester
Supervisors: Matteo Romanello (DHLAB), Maud Ehrmann (DHLAB), Andreas Spitz (DLAB)
Context: Digitized newspapers constitute an extraordinary goldmine of information about our past, and historians are among those who can most benefit from it. Impresso, an ongoing, collaborative research project based at the DHLAB, has been building a large-scale corpus of digitized newspapers: it currently contains 76 newspapers from Switzerland and Luxembourg (written in French, German and Luxembourgish) for a total of 12 billion tokens. This corpus was enriched with several layers of semantic information such as topics, text reuse and named entities (persons, locations, dates and organizations). The latter are particularly useful for historians as co-occurrences of named entities often indicate (and help to identify) historical events. The impresso corpus currently contains some 164 million entity mentions, linked to 500 thousand entities from DBpedia (partly mapped to Wikidata).
Yet, making use of such a large knowledge graph in an interactive tool for historians — such as the tool impresso has been developing — requires an underlying document model that facilitates the retrieval of entity relations, contexts, and related events from the documents effectively and efficiently. This is where LOAD comes into play, which is a graph-based document model that supports browsing, extracting and summarizing real world events in large collections of unstructured text based on named entities such as Locations, Organizations, Actors and Dates.
Objective: The student will build a LOAD model of the impresso corpus. In order to do so, an existing Java implementation, which uses MongoDB as its back-end, can be used as a starting point. Also, the student will have access to the MySQL and Solr instances where impresso semantic annotations have already been stored and indexed. Once the LOAD model is built, an already existing front-end called EVELIN will be used to create a first demonstrator of how LOAD can enable entity-driven exploration of the impresso corpus.
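To illustrate the kind of structure a LOAD model captures, here is a tiny, purely illustrative sketch in Python with networkx (the actual implementation to build on is in Java with MongoDB, and the annotated sentences below are invented placeholders): entities become nodes, and co-occurrence within the same sentence adds weighted edges.

```python
# A tiny illustration of a LOAD-style graph: Locations, Organizations, Actors and Dates
# become nodes; co-occurrence within a sentence adds (or reinforces) a weighted edge.
# The annotated sentences below are invented placeholders.
import itertools
import networkx as nx

annotated_sentences = [
    [("Geneva", "LOC"), ("Red Cross", "ORG"), ("1864", "DAT")],
    [("Henry Dunant", "ACT"), ("Red Cross", "ORG")],
]

G = nx.Graph()
for entities in annotated_sentences:
    for (e1, t1), (e2, t2) in itertools.combinations(entities, 2):
        G.add_node(e1, kind=t1)
        G.add_node(e2, kind=t2)
        if G.has_edge(e1, e2):
            G[e1][e2]["weight"] += 1
        else:
            G.add_edge(e1, e2, weight=1)

print(list(G.edges(data=True)))
```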
Required skills:
- good proficiency in Java or Scala
- familiarity with graph/network theory
- experience with big data technologies (Kubernetes, Spark, etc.)
- experience with PostgreSQL, MySQL or MongoDB
Note for potential candidates: In Spring and Fall 2020, two students already worked on this project, but the research perspectives are far from exhausted and many things remain to be explored. We therefore propose to pursue this work, and the focus of the next edition of the project will be adapted to the candidate’s background and preferences. Do not hesitate to get in touch!
Type of project: Master/Semester thesis
Supervisors: Frédéric Kaplan
Semester of project: Spring 2021
Project summary: This project aims at exploring the relevance of Transformer architectures for 3D point clouds. In a first series of experiments, the student will use the large 3D point clouds produced at the DHLAB for the city of Venice or the city of Sion and try to predict missing parts. In a second series of experiments, the student will use the SRTM (Shuttle Radar Topography Mission, a NASA mission conducted in 2000 to obtain elevation data for most of the world) to encode/decode terrain predictions.
Contact: Prof. Frédéric Kaplan
Type of project: Semester
Supervisors: Didier Dupertuis and Paul Guhennec.
Context: The Federal Office of Topography (Swisstopo) has recently made accessible high-resolution digitizations of historical maps of Switzerland, covering every year between 1844 and 2018. In order to be able to use these assets for geo-historical research and analysis, the information contained in the maps must be converted from its visual representation to an abstract, geometrical form. This conversion from an input raster to an output vector geometry is typically well performed by combining Convolutional Neural Networks for pixelwise classification with standard computer vision techniques, but might prove challenging for datasets with a larger figurative diversity, as in the case of Swiss historical maps, whose style varies over time.
Objective: The ambition of this work is to develop a pipeline capable of transforming the buildings and roads of the Swisstopo set of historical maps into geometries correctly positioned in an appropriate Geographic Coordinates System. The student will have to train a pre-existing semantic segmentation neural network on a manually annotated training set, evaluate its success, and fine-tune it for it to be applicable on the entire dataset. Depending on the interest of the student, some first diachronic analyses of the road network evolution in Switzerland can be considered.
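A minimal sketch of training a pre-existing semantic segmentation network on annotated map tiles is given below; the model choice, number of classes and data loading are assumptions for illustration.

```python
# A minimal semantic segmentation fine-tuning sketch with torchvision.
# The class set (background/building/road), hyperparameters and data loader are assumptions.
import torch
import torch.nn as nn
from torchvision.models.segmentation import deeplabv3_resnet50

model = deeplabv3_resnet50(num_classes=3)              # background, building, road
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def train_one_epoch(loader):
    # `loader` is assumed to yield (image, mask) batches: images as (B, 3, H, W) float
    # tensors, masks as (B, H, W) long tensors with values in {0, 1, 2}.
    model.train()
    for images, masks in loader:
        optimizer.zero_grad()
        logits = model(images)["out"]                  # (B, 3, H, W)
        loss = criterion(logits, masks)
        loss.backward()
        optimizer.step()
```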
Prerequisites:
- Good skills with Python.
- Basics in machine learning and computer vision are a plus.
Type of project: Semester
Supervisors: Rémi Petitpierre (IAGS), Paul Guhennec (DHLAB)
Context: The Bibliothèque Historique de la Ville de Paris digitised more than 700 maps covering in detail the evolution of the city from 1836 (plan Jacoubet) to 1900, including the famous Atlases Alphand. They are of particular interest for urban studies of Paris, which was at the time heavily transfigured by the Haussmannian transformations. For administrative and political reasons, the City of Paris did not benefit from the large cadastration campaigns that occurred throughout Napoleonic Europe at the beginning of the 19th century. Therefore the Atlas’s sheets are the finest source of information available on the city over the 19th century. In order to make use of the great potential of this dataset, the information contained in the maps must be converted from its visual representation to an abstract, geometrical form, on which to base quantitative studies. This conversion from an input raster to an output vector geometry is typically well performed by combining Convolutional Neural Networks for pixelwise classification with standard computer vision techniques. In a second phase, the student will develop quantitative techniques to investigate the transformation of the urban fabric.
Tasks
- Semantically segment the Atlas de Paris maps as well as the 1900 plan: train a pre-existing semantic segmentation neural network on a manually annotated training set, evaluate its success, and fine-tune it so that it is applicable to the entire dataset (a mask-to-polygon sketch follows this list).
- Develop a semi-automatic pipeline to align the vectorised polygons on a geographic coordinate system.
- Depending on the student’s interest, tackle some questions such as:
- analysing the impact of the opening of new streets on mobility within the city;
- detecting the appearance of locally aligned neighbourhoods (lotissements);
- investigating the relation between the unique Parisian hygienist “ilot urbain” and housing salubrity;
- studying the morphology of the city’s infrastructure network (e.g. sewers)
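To illustrate the raster-to-vector conversion that sits between segmentation and alignment, here is a minimal sketch turning a predicted building mask into polygons; the mask file and the filtering thresholds are placeholders.

```python
# A minimal raster-to-vector sketch: predicted building mask -> shapely polygons.
# The mask file (building pixels = 255) and the area threshold are placeholders.
import cv2
from shapely.geometry import Polygon

mask = cv2.imread("predicted_buildings_mask.png", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(mask, 127, 255, cv2.THRESH_BINARY)

contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

polygons = []
for contour in contours:
    if len(contour) < 3:
        continue
    poly = Polygon(contour.squeeze(1))        # still in pixel coordinates; georeferencing comes later
    if poly.is_valid and poly.area > 50:      # drop degenerate or tiny shapes
        polygons.append(poly.simplify(1.0))   # light simplification of the outline

print(f"{len(polygons)} building polygons extracted")
```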
Prerequisites
Good skills with Python.
Bases in machine learning and computer vision are a plus.
Supervisors: Paul Guhennec and Federica Pardini.
Context: In the early days of the 19th century, and made possible by the Napoleonic invasions of Europe, a vast administrative endeavour was started to map as faithfully as possible the geometry of most cities of Europe. What results from these so-called Napoleonic cadasters is a very precious testimony to the state of these cities in the past, at the European scale. For reasons of practicality and detail, most cities are represented on separate sheets of paper, with margin overlaps between them and indications on how to reconstruct the complete picture.
Recent work at the laboratory has shown that it is possible to make use of the great homogeneity in the visual representations of the parcels of the cadaster to automatically vectorize them and classify them according to a fixed typology (private building, public building, roads, etc). However, the process of aligning these cadasters with the “real” world, by positioning them in a Geographic Coordinate System, and thus allowing large-scale quantitative analyses, remains challenging.
Objective: A first problem to tackle is the combination of the different sheets to make up the full city. Building on a previous student project, the student will develop a process to automatically align neighbouring sheets, while accounting for the imperfections and misregistrations in these historical documents. In a second stage, a pipeline will be developed in order to align the combined sheets obtained at the previous step on contemporary geographic data.
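One classical starting point for aligning neighbouring sheets is feature matching on the overlapping margins followed by a robust homography estimate. The OpenCV sketch below illustrates this; the file names are placeholders, and real cadastral sheets would likely require restricting the matching to the overlap region.

```python
# A minimal sheet-alignment sketch: ORB features + RANSAC homography between two sheets.
# File names are placeholders; matching should in practice be restricted to the margin overlap.
import cv2
import numpy as np

sheet_a = cv2.imread("sheet_a.png", cv2.IMREAD_GRAYSCALE)
sheet_b = cv2.imread("sheet_b.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=5000)
kp_a, des_a = orb.detectAndCompute(sheet_a, None)
kp_b, des_b = orb.detectAndCompute(sheet_b, None)

matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des_a, des_b), key=lambda m: m.distance)[:500]

src = np.float32([kp_a[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp_b[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

# Homography mapping sheet A onto sheet B's frame, robust to spurious matches thanks to RANSAC.
H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
warped = cv2.warpPerspective(sheet_a, H, (sheet_b.shape[1], sheet_b.shape[0]))
cv2.imwrite("sheet_a_aligned_to_b.png", warped)
```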
Prerequisites:
- Good skills in Python.
- Experience with computer vision libraries

Type of project: Semester project or Master thesis
Supervisors: Didier Dupertuis and Frédéric Kaplan.
Context: In 1798, after a millennium as a republic, Venice was taken over by Napoleonic armies. A new centralized administration was erected in the former city-state. It went on to create many valuable archival documents giving a precise image of the city and its population.
The DHLAB just finished the digitization of two complementary sets of documents: the cadastral maps and their accompanying registries, the “Sommarioni”. The cadastral maps give an accurate picture of the city with clear delineation of numbered parcels. The Sommarioni registers contain information about each parcel, including a one-line description of its owner(s) and type of usage (housing, business, etc.).
The cadastral maps have been vectorized, with precise geometries and numbering for each parcel. The 230’000 records of the Sommarioni have been transcribed. Resulting datasets have been brought together and can be explored in this interactive platform (only available via EPFL intranet or VPN).
Objective: The next challenge is to extract structured data from the Sommarioni owner descriptions, i.e. to recognize and disambiguate people, business and institution names. The owner description is a noisy text snippet mentioning several relatives’ names; some records only contain the information that they are identical to the previous one; institution names might have different spellings; and there are many homonyms among people names.
The ideal output would be a list of disambiguated institutions and people with, for the latter, the network of their family members.
The main steps are as follows:
- Definition of entity typology (individual, family or collective, business, etc.);
- Entity extraction in each record, handling the specificities of each type;
- Entity disambiguation and linking between records (a naive sketch follows this list);
- Creation of a confidence score for the linking and disambiguation to quantify uncertainty, and of different scenarios for different degrees of uncertainty;
- If time permits, analysis and discussion of results in relation to the Venice of 1808;
- If time permits, integration of the results in the interactive platform.
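As a first, deliberately naive illustration of the extraction and linking steps, the sketch below pulls owner-name candidates from a record with a regular expression and groups near-identical spellings with fuzzy string matching; the record texts, the regex and the similarity threshold are invented placeholders.

```python
# A naive sketch: extract owner-name candidates and group near-identical spellings.
# Record texts, the candidate regex and the similarity threshold are illustrative placeholders.
import re
from difflib import SequenceMatcher

records = [
    "Trevisan Marco quondam Antonio, con la moglie Maria",
    "Trevixan Marco q. Antonio",
    "Arte dei tintori",
]

def extract_candidates(text):
    # Very rough: capitalised word sequences as candidate personal or institution names.
    return re.findall(r"[A-Z][a-zà-ù]+(?:\s+[A-Z][a-zà-ù]+)*", text)

def similar(a, b, threshold=0.8):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

clusters = []   # each cluster holds spelling variants judged to refer to the same entity
for record in records:
    for candidate in extract_candidates(record):
        for cluster in clusters:
            if similar(candidate, cluster[0]):
                cluster.append(candidate)
                break
        else:
            clusters.append([candidate])

print(clusters)   # e.g. [['Trevisan Marco', 'Trevixan Marco'], ['Antonio', 'Antonio'], ...]
```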
Prerequisites:
- Good knowledge of python and data-wrangling;
- No special knowledge of Venetian history is needed;
- Proficiency in Italian is not necessary but would be a plus.
More explorative projects
For students who want to explore exciting uncharted territories in an autonomous manner.
Type of project: Master thesis
Supervisors: Frédéric Kaplan
Project summary: This explorative project studies the properties of text kernels: sentences that are invariant after automatic translation back and forth into a foreign language. The goal is to develop a prototype text editor / transformation pipeline permitting the association of any sentence with its invariant form.
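A minimal sketch of the round-trip translation at the heart of the idea, using publicly available MarianMT checkpoints from the Hugging Face hub (their use here is purely illustrative):

```python
# A minimal round-trip translation sketch (English -> French -> English).
# A sentence is a candidate "text kernel" if it survives the round trip unchanged.
from transformers import MarianMTModel, MarianTokenizer

def translate(sentences, model_name):
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tokenizer(sentences, return_tensors="pt", padding=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

sentence = "The cat is sleeping on the chair."
french = translate([sentence], "Helsinki-NLP/opus-mt-en-fr")
back = translate(french, "Helsinki-NLP/opus-mt-fr-en")

print(back[0])
print("invariant kernel" if back[0] == sentence else "not invariant")
```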
Contact: Prof. Frédéric Kaplan
Type of project: Master thesis
Supervisors: Frédéric Kaplan
Project summary: This project aims at studying the potential of Transformer architectures for new kinds of language games between artificial agents and the evolution of artificial languages. Artificial agents interact with one another about situations in the “world” and autonomously develop their own language on this basis. This project extends a long series of experiments that started in the 2000s.
Contact: Prof. Frédéric Kaplan
Type of project: Master/Semester thesis
Supervisors: Frédéric Kaplan
Project summary: This explorative project consists in mining a large collection of novels to automatically extract characters and create an automatically generated dictionary of the novel characters. The characters will be associated with the novels in which they appear, with the characters they interact with and possibly with specific traits.
Contact: Prof. Frédéric Kaplan
Past projects
Autumn 2022
Project type: Master
Supervisors: Maud Ehrmann and Simon Clematide (UZH)
Context: The impresso project aims at semantically enriching 200 years of newspaper archives by applying a range of NLP techniques (OCR correction, named entity processing, topic modeling, text reuse, etc.). The source material comes from Swiss and Luxembourg national libraries and corresponds to the facsimiles and OCR outputs of ca. 200 newspaper titles in German and French.
Problems
- Automatic processing of these sources is greatly hindered by the sometimes low quality of legacy OCR and OLR (when present) processes: articles are incorrectly transcribed and incorrectly segmented. This has consequences on downstream text processing (e.g. a topic with only badly OCRized tokens). One solution is to filter out elements before they enter the text processing pipeline. This would imply recognizing specific parts of newspaper pages known to be particularly noisy such as: meteo tables, transport schedules, cross-words, etc.
- Additionally, besides advertisement recognition, OLR does not provide any section classification (is this segment a title banner, a feuilleton, an editorial, etc.) and it would be useful to provide basic section/rubrique information.
Objectives: Building on a previous master thesis which explored the interplay between textual and visual features for segment recognition and classification in historical newspapers (see project description, master thesis, and published article), this master project will focus on the development, evaluation, application and release of a documented pipeline for the accurate recognition and fine-grained semantic classification of tables.
Tables present several challenges, among others:
- as usual, difference across time and sources;
- visual clues can be confusing:
- presence of mixed layout: table-like part + normal text
- existence of “quasi-tables” (e.g. lists)
- variety of semantic classes: stock exchanges, TV/Radio program, transport schedules, sport results, events of the day, meteo, etc.
Main objectives will be:
- Creation and evaluation of table recognition and classification models.
- Application of these models on large-scale newspaper archives thanks to a software/pipeline which will be documented and released in view of further usage in academic context. This will support the concrete use case of specific dataset export by scholars.
- (bonus) Statistical profile of the large-scale table extraction data (frequencies, proportion in title/pages, comparison through time and titles).
Spring 2021
Type of project: Semester
Supervisors: Albane Descombes, Frédéric Kaplan.
Photogrammetry is a 3D modelling technique which enables making highly precise models of our built environment, thus collecting lots of digital data on our architectural heritage – at least, as it remains today.
Over the years, a place could have been recorded by drawing, painting, photographing, scanning, depending on the evolution of measuring techniques.
For this reason, one has to mix various media to show the evolution of a building through the centuries. This project proposes to study the techniques which enable overlaying images on 3D photogrammetric models of Venice and of the Louvre museum. The models are provided by the DHLab, and were computed in the past years. The images of Venice come from the photo library of the Cini Foundation, and the images of Paris can be collected on Gallica (the digital French Library).
Eventually this project will deal with the issues of perspective, pattern and surface recognition in 2D and 3D, customizing 3D viewers to overlay images, and showcasing the result on a web page.

Type of project: Master
Supervisors: Albane Descombes, Frédéric Kaplan, Alain Dufaux.
A collection of hundreds of magnetic tapes contains the association’s first radio shows, recorded in the early 90s shortly after its creation. They include interviews and podcasts about Vivapoly, Forum EPFL or Balélec, events which have set the pace of student life on campus for many years.
This project aims at studying the existing methods for digitizing magnetic tapes in the first place, and then at building a browsable database of all the digitized radio shows. The analysis of this audio content will be done using adapted speech recognition models.
This project is done in collaboration with Alain Dufaux, from the Cultural Heritage & Innovation Center.

Type of project: Master/Semester thesis
Supervisors: Paul Guhennec, Fabrice Berger, Frédéric Kaplan
Semester of project: Spring 2021
Project summary: The project consists in the development of a scriptable pipeline for producing procedural architecture using the Houdini 3D procedural environment. The project will start from existing procedural models of Venice in 1808 developed at the DHLAB and automate a pipeline that scripts the model out of the historical information recorded about each parcel.
Contact: Prof. Frédéric Kaplan
Type of project: Master thesis
Supervisors: Maud Ehrmann, Frédéric Kaplan
Semester of project: Spring 2021
Project summary: Thanks to the digitisation and transcription campaign conducted during the Venice Time Machine, a digital collection of secondary sources offers arguably complete coverage of the historiography concerning Venice and its population in the 19th century. Through a manual and automatic process, the project will identify a series of hypotheses concerning the evolution of Venice’s functions, morphology and ownership networks. These hypotheses will be translated into a formal language, keeping a direct link with the books and journals in which they are expressed, and systematically tested against the model of the city of Venice established through the integration of the models of the cadastral maps. In some cases, the data of the computational model of the city may contradict the hypotheses of the database, and this will lead either to a revision of the hypotheses or to a revision of the computational models of Venice established so far.
Type of project: Master/Semester thesis
Supervisors: Didier Dupertuis, Paul Guhennec, Frédéric Kaplan
Semester of project: Spring 2021
Project summary: As part of the ScanVan project, the city of Sion has been digitised in 3D. The goal is now to generate images of the city by day and night and for the different seasons (summer, winter, spring and autumn). A Contrastive Unpaired Translation architecture like the one used for transforming Venice images will be used for this project.
Contact: Prof. Frédéric Kaplan
Type of project: Master/Semester thesis
Supervisors: Albane Descombes, Frédéric Kaplan
Semester of project: Spring 2021
Project summary: This project aims at automatically transforming YouTube videos into 3D models using photogrammetry techniques. It extends the work of several Master/Semester projects that have made significant progress in this direction. The goal here is to design a pipeline that permits georeferencing the extracted models and exploring them with a 4D navigation interface developed at the DHLAB.
Contact: Prof. Frédéric Kaplan