Master and Semester projects

Here is a list of master and semester projects currently proposed at the DHLAB. For most projects, descriptions are initial seeds and the work can be adjusted depending on the skills and the interests of the students. For a list of already completed projects (with code and reports), see this GitHub page.

  • Interested in a project listed below that is marked as available? Write an email to the contact person(s) mentioned in the project description, stating your section and year, and possibly including a record of your latest grades.
  • Want to propose a project, or interested in the work done at the DHLAB? Write an email to Frédéric Kaplan and Maud Ehrmann explaining what you would like to do.

Fall 2026

To express your interest in a project, please complete both steps below:

  1. Write directly to the supervisor(s) listed on the project page
  2. Fill in this form (log in with your EPFL account): https://forms.gle/oQYqpGjmf9yYLMCY8

You may apply to more than one project. For each one, please contact the relevant supervisor(s).

Supervisors will follow up with you individually. 

Type: MSc or BA Semester Project
Sections: Computer Science, Data Science, Digital Humanities
Supervisor: Frederic Kaplan
Number of students: 3-4

Context

“Vintage Language Models”, a term coined by Owain Evans, are language models trained exclusively on historical corpora from a given period rather than on contemporary web-scale data. These models function both as “voices from the past” and as experimental tools for studying reasoning, generalization, prediction, and historical knowledge formation in AI systems.

One of the first large-scale experiments in this direction is talkie-1930, a 13B model trained on approximately 260B tokens of English text published before 1931. The project explores whether models deprived of modern knowledge can nevertheless rediscover scientific ideas, anticipate future inventions, or extrapolate historical trajectories. Early results suggest that vintage models perform significantly worse on factual knowledge tasks while remaining surprisingly competitive on reasoning, numeracy, and language understanding.

Thanks to more than a decade of work on large-scale historical datasets through projects such as Impresso, Time Machine Organisation, and Replica, the DHLAB manages some of the world’s largest “Big Data of the Past” infrastructures. These continuously expanding datasets create a unique environment for developing and evaluating Vintage LLMs.

The project also opens broader questions at the intersection of AI, digital humanities, and epistemology. Could a model trained only on texts from 1989 reinvent concepts developed during the following thirty-five years? Could existing LLMs with a 2022 or 2023 cutoff already serve as “vintage” systems for studying recent geopolitical events such as the evolution of the war in Ukraine?

Objective

The objective of this project is to prepare the methodological and technical foundations for a larger Vintage LLM research program. Students will investigate how temporally restricted language models can be constructed, evaluated, and compared across historical periods, with particular attention to contamination control, chronological training pipelines, synthetic data generation, and historically grounded evaluation benchmarks.

Depending on the composition of the team, the project may combine conceptual and technical components, ranging from corpus preparation and OCR correction to fine-tuning experiments and benchmark design.

Research questions

The project will study how language models behave when restricted to knowledge available before a specific historical date. It will investigate whether vintage models can extrapolate future scientific, technological, political, or cultural developments despite lacking direct access to later information.

Another major focus concerns temporal contamination. The project will explore how leakage occurs during pretraining, post-training, synthetic data generation, or evaluation, and how chronological training forks could reduce these effects.

The project will also examine whether synthetic historical dialogues, instruction datasets, or self-play approaches could bootstrap vintage conversational systems without introducing modern linguistic conventions or hidden contemporary knowledge.

Main steps

The project will begin with a literature review on vintage language models, temporal evaluation, contamination analysis, historical corpora, OCR correction, and synthetic data generation. Students will then study historical corpora available at DHLAB and elsewhere in order to construct chronologically coherent datasets and filtering pipelines. A second phase will focus on designing evaluation protocols for vintage models, including tasks related to scientific rediscovery, historical forecasting, geopolitical extrapolation, or technological anticipation. The project may also investigate synthetic bootstrapping strategies for instruction tuning and conversational alignment under strict temporal constraints. The final stage will consist of documenting the methodology and producing a roadmap for future large-scale Vintage LLM experiments.
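
As an illustration of the chronological filtering step, here is a minimal sketch of a cutoff filter. The `Document` structure, its field names, and the `split_by_cutoff` helper are hypothetical and not part of any existing DHLAB pipeline. The sketch assumes each document carries a publication date, and it rejects undated documents because they cannot be shown to predate the cutoff.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Document:
    # Hypothetical record layout, used for illustration only.
    doc_id: str
    published: Optional[date]  # None when the publication date is unknown
    text: str


def split_by_cutoff(
    docs: Iterable[Document], cutoff: date
) -> Tuple[List[Document], List[Document]]:
    """Split documents into (kept, rejected) for a given temporal cutoff.

    A document is kept only if it has a known date strictly before the cutoff.
    Undated documents are rejected, since they may leak later knowledge.
    """
    kept: List[Document] = []
    rejected: List[Document] = []
    for doc in docs:
        if doc.published is not None and doc.published < cutoff:
            kept.append(doc)
        else:
            rejected.append(doc)
    return kept, rejected
```

A date filter of this kind is only a first line of defence: metadata can be wrong, and later reprints may add modern forewords or annotations, so content-level contamination checks would still be needed.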

Models/References

Requirements

The project requires strong programming skills and familiarity with machine learning workflows. Experience with large language models, NLP pipelines, embeddings, or transformer architectures is highly desirable. Knowledge of historical corpora, OCR systems, or digital humanities methods would be an advantage.

Type: MSc or BA Semester Project
Sections: Computer Science, Data Science, Digital Humanities
Supervisor: Frederic Kaplan
Number of students: 2-4

Context

Most language models treat their training corpora as temporally flat, ignoring the fact that language continuously evolves through changing vocabularies, narratives, meanings, and social contexts. Recent work on Temporal Language Models (TLMs), notably the TLM-1 project developed by Brandon Duderstadt and Hayden Helm, proposes a different paradigm in which time becomes an explicit dimension of language modeling. TLM-1 is a BERT-style transformer trained jointly to predict document contents and document dates on the Corpus of Contemporary American English (COCA) from 1990 to 2019. Instead of treating language as static, the model learns temporal trajectories of words, concepts, narratives, and semantic shifts. The project introduces temporal embeddings, Bayesian querying methods for correcting anachronism bias, and the notion of a “temporal control curve” that appears to reconstruct the ordinal geometry of time within the embedding space.
This research direction connects naturally with DHLAB’s long-term work on historical corpora and large-scale temporal datasets through projects such as Impresso and Time Machine Organisation. These infrastructures create opportunities to explore temporal language dynamics over much longer time spans and across multiple languages, media, and historical contexts.

Objective
The objective of this project is to investigate how language models can explicitly model temporal dynamics in language and cultural evolution. Students will study and extend approaches inspired by TLM-1 in order to analyze semantic drift, narrative evolution, temporal forecasting, and historical language change.
The project may involve both methodological and experimental components, including temporal embeddings, Bayesian temporal querying, diachronic corpora construction, and forecasting experiments based on extrapolation in temporal embedding spaces.


Research Questions
The project will investigate how temporal information can be integrated directly into language model training and querying. It will explore whether temporal embeddings recover meaningful historical trajectories and whether they can be used to model semantic evolution, ideological shifts, or changing cultural narratives.
Another major question concerns forecasting. Can temporal language models extrapolate future linguistic or cultural trends by extending learned temporal curves? Can they detect early signals of political, technological, or societal transformations embedded in large corpora?
The project may also study temporal contamination and anachronism bias, investigating how Bayesian correction methods can disentangle true temporal signals from artifacts introduced by training distributions or prompting.


Main Steps
The project will begin with a review of work on temporal language models, diachronic embeddings, semantic drift detection, and temporal NLP. Students will then study available temporal corpora at DHLAB and elsewhere, including newspapers, books, or archival collections, and prepare chronologically structured datasets. A second phase will focus on reproducing or extending elements of the TLM-1 framework, including temporal embeddings, year tokens, Bayesian temporal querying, or temporal evaluation protocols.
The project may also include experiments on semantic evolution, narrative forecasting, or historical trend analysis using temporal embedding trajectories. The final stage will consist of documenting the methodology, evaluating the results, and proposing future directions for large-scale temporal language modeling.
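
A simple way to quantify semantic drift along a temporal embedding trajectory is to measure the cosine distance between a word's vectors at consecutive time points. The sketch below assumes that the vectors for all years already live in a shared embedding space (for example because they come from a single temporally conditioned model); independently trained embeddings would need to be aligned first. The function names and the input layout are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Return 1 - cosine similarity for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def drift_series(
    vectors_by_year: Dict[int, Sequence[float]]
) -> List[Tuple[int, int, float]]:
    """Return (year_from, year_to, distance) for consecutive years."""
    years = sorted(vectors_by_year)
    return [
        (y0, y1, cosine_distance(vectors_by_year[y0], vectors_by_year[y1]))
        for y0, y1 in zip(years, years[1:])
    ]
```

Peaks in the resulting series are candidate periods of meaning change, to be checked against the underlying texts.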

References

  • “A Model of the Language Process” https://www.calcifercomputing.com/reports/tlm
    TLM-1 Hugging Face repository https://huggingface.co/bstadt/tlm-1
  • Kim et al., Temporal Analysis of Language through Neural Language Models (2014) https://arxiv.org/abs/1405.3515
  • Dhingra et al., Time-Aware Language Models as Temporal Knowledge Bases (2021) https://arxiv.org/abs/2106.15110
  • Loureiro et al., TimeLMs: Diachronic Language Models from Twitter (2022) https://arxiv.org/abs/2202.03829
  • Buntinx et al., Studying Linguistic Changes over 200 Years of Newspapers through Resilient Words Analysis (2017) https://www.frontiersin.org/journals/digital-humanities/articles/10.3389/fdigh.2017.00002/full


Requirements

The project requires strong programming skills and familiarity with machine learning and NLP workflows. Experience with transformers, embeddings, or historical corpora is highly desirable. Interest in computational linguistics, AI interpretability, or digital humanities would be an advantage.

Type: MSc or BA Semester Project
Sections: Computer Science, Data Science, Digital Humanities
Supervisor: Frederic Kaplan
Number of students: 1-3

Context

This project is part of a collaborative research partnership between the EPFL DHLAB and the Flickr Foundation. The student will contribute to the development of real-world tools within a purpose-driven technology non-profit, with their work directly informing product development and strategic initiatives. Several DHLAB students have successfully taken part in this research collaboration in recent semesters.

The Flickr Foundation will provide access to the Flickr Commons dataset and Flickr API, technical consultation, regular supervision sessions, and support for open publication and knowledge sharing.

Flickr hosts tens of billions of photographs documenting more than two decades of global events, cultural moments, and everyday life. Within this massive archive lie remarkable intersections of time and place, collectively recorded by millions of users through cameras and smartphones. Events such as the Notre-Dame de Paris fire, the Women’s March, the 2016 Summer Olympics, or Occupy Wall Street emerge naturally from these collective photographic traces.

However, discovering such moments retrospectively remains extremely difficult because of the scale and heterogeneity of the archive. Current discovery methods rely largely on manual keyword searches and prior knowledge of events. The project explores whether computational methods can automatically surface historically significant “intersections of time and place” through spatiotemporal analysis.

The project addresses a broader challenge faced by GLAM institutions and social media archives: how to transform overwhelming abundance into discoverable cultural memory.

Objective

The objective of the project is to design and prototype a reproducible analytical pipeline capable of detecting historically significant spatiotemporal events within large-scale Flickr archives. The student will first conduct a focused literature review on spatiotemporal clustering, burst detection, tag analysis, landmark timeline construction, and retrospective event detection. Based on this review, the student will identify the most promising methodological approaches for Flickr-scale datasets.

The project will then investigate unsupervised approaches for detecting anomalies in normal spatiotemporal activity patterns, allowing events to emerge computationally rather than through predefined keywords or categories. To validate the methodology, the student will identify and document a set of particularly compelling “star” examples of historical events discovered through the proposed pipeline.


Research Questions
The project investigates which computational methods are most effective for identifying historically significant intersections of time and place within large-scale social media archives. It will explore how spatiotemporal clustering, anomaly detection, tag dynamics, visual similarity, and metadata analysis can reveal events retrospectively without relying on explicit prior knowledge. The project may also investigate how social media traces contribute to the construction of contemporary collective memory and how platforms such as Flickr function as distributed archives of recent history.


Main Steps

The project will begin with a literature review on spatiotemporal event detection, retrospective anomaly discovery, burst analysis, geospatial clustering, and social media archives. The student will then define a bounded subset of the Flickr archive, based on a chosen geographical or temporal scope, and prepare the associated metadata and API workflows. A second phase will focus on prototyping analytical pipelines for detecting significant spatiotemporal anomalies. Different approaches may include clustering methods, density analysis, temporal burst detection, tag co-occurrence analysis, or multimodal embeddings. The student will then evaluate the results by identifying and documenting several historically meaningful events surfaced through the methodology.
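
One possible baseline for the temporal burst detection mentioned above is to bin geotagged photos into a latitude/longitude grid, count photos per cell and per day, and flag days whose count deviates strongly from that cell's usual activity. The sketch below is only an illustration: the input layout `(day, lat, lon)`, the grid size, and the z-score threshold are assumptions, not properties of the Flickr API or of any existing pipeline.

```python
import math
from collections import Counter, defaultdict
from datetime import date
from statistics import mean, pstdev
from typing import Dict, Iterable, List, Tuple

Cell = Tuple[int, int]


def cell_of(lat: float, lon: float, size_deg: float = 0.1) -> Cell:
    """Map a coordinate to a square grid cell of `size_deg` degrees."""
    return (math.floor(lat / size_deg), math.floor(lon / size_deg))


def detect_bursts(
    photos: Iterable[Tuple[date, float, float]],
    size_deg: float = 0.1,
    z_threshold: float = 3.0,
) -> List[Tuple[Cell, date, float]]:
    """Return (cell, day, z_score) for days with unusually many photos."""
    counts: Dict[Cell, Counter] = defaultdict(Counter)
    days = set()
    for day, lat, lon in photos:
        counts[cell_of(lat, lon, size_deg)][day] += 1
        days.add(day)
    ordered_days = sorted(days)
    bursts = []
    for cell, per_day in counts.items():
        # Days without photos in this cell count as zero activity.
        series = [per_day.get(d, 0) for d in ordered_days]
        mu, sigma = mean(series), pstdev(series)
        if sigma == 0:
            continue  # perfectly flat activity, nothing to flag
        for d, n in zip(ordered_days, series):
            z = (n - mu) / sigma
            if z >= z_threshold:
                bursts.append((cell, d, z))
    return bursts
```

Because the burst day itself contributes to the mean and standard deviation, this z-score is conservative on short time windows; a rolling baseline or seasonal model would be a natural refinement.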

The final stage will consist of documenting the full workflow, producing visualizations and case studies, and proposing recommendations for future large-scale deployment within Flickr Foundation initiatives and broader OpenGLAM infrastructures.

References

Flickr Commons and Flickr API documentation.

Previous DHLAB student work on the Flickr dataset


Requirements

The project requires good programming and data analysis skills, ideally in Python. Familiarity with APIs, geospatial analysis, temporal data analysis, or machine learning is highly desirable. The candidate should be comfortable with experimentation, exploratory workflows, and large-scale datasets. Interest in cultural heritage, archives, digital memory, or computational history would be an advantage.

Strong documentation and visualization skills are also important, as the project aims to produce reproducible methods accessible to both technical and non-technical audiences.

Type: MSc or BA Semester Project
Sections: Computer Science, Data Science, Digital Humanities
Supervisor: Frederic Kaplan
Number of students: 1-3

Context

Pleiades is one of the most important open digital resources for the study of the ancient world. Developed as a community-built gazetteer of ancient places, it provides stable identifiers, spatial coordinates, historical names, bibliographic references, and semantic relationships for thousands of ancient locations across the Mediterranean and beyond.

At the same time, the Time Atlas infrastructure aims to build large-scale spatiotemporal representations of the past by integrating maps, texts, images, archival records, and historical datasets into a unified temporal knowledge infrastructure.

Linking Pleiades information into the Time Atlas would create a bridge between classical antiquity datasets and broader spatiotemporal historical infrastructures. Such an integration would enable temporal navigation across ancient sources, semantic linking between datasets, visualization of evolving place identities, and interoperability with other geohistorical resources managed by DHLAB and partner institutions.

The project addresses both technical and epistemological challenges related to historical gazetteers, temporal uncertainty, semantic alignment, and the representation of evolving places through time.

Objective

The objective of this project is to design and prototype the integration of Pleiades data into the Time Atlas infrastructure. The student will study the structure and semantics of the Pleiades dataset, design mapping strategies toward the Time Atlas data model, and develop prototype ingestion and visualization pipelines.

The project may also explore how ancient places can be represented within broader spatiotemporal knowledge graphs, including uncertainty, changing place names, shifting boundaries, and historical relationships between places.

Research Questions
The project will investigate how ancient gazetteers such as Pleiades can be integrated into large-scale temporal atlas infrastructures while preserving semantic richness and historical uncertainty. It will explore how spatial entities from antiquity can be aligned with modern geospatial systems and linked to other historical datasets, maps, texts, or archival collections. Another important question concerns temporality. How can changing place identities, uncertain locations, and evolving political or cultural geographies be represented computationally within a unified atlas framework? The project may also investigate interoperability standards for historical gazetteers and linked open data infrastructures in digital humanities.


Main Steps

The project will begin with a review of Pleiades data structures, APIs, ontologies, and linked open data standards used in historical gazetteers. The student will then analyze the Time Atlas data model and identify mapping strategies between the two infrastructures. A second phase will focus on developing prototype ingestion pipelines capable of importing and transforming Pleiades entities into the Time Atlas ecosystem.
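
To illustrate what a mapping strategy could look like, the sketch below converts a Pleiades-style place record into a flat target record. Both sides are assumptions: the input field names (`id`, `title`, `reprPoint` as `[lon, lat]`, and `names` entries with `romanized`, `start`, and `end`) should be checked against the actual Pleiades export, and the output schema is hypothetical, since the Time Atlas data model is not described here.

```python
from typing import Any, Dict


def pleiades_to_atlas(place: Dict[str, Any]) -> Dict[str, Any]:
    """Map one Pleiades-style place record to a hypothetical atlas record."""
    point = place.get("reprPoint")  # assumed to be [longitude, latitude]
    names = place.get("names", [])
    starts = [n["start"] for n in names if n.get("start") is not None]
    ends = [n["end"] for n in names if n.get("end") is not None]
    return {
        "source": "pleiades",
        "source_id": str(place["id"]),
        "label": place.get("title"),
        "lon": point[0] if point else None,
        "lat": point[1] if point else None,
        # Historical name variants, deduplicated and sorted for stability.
        "alt_labels": sorted({n["romanized"] for n in names if n.get("romanized")}),
        # Years are kept as signed integers (negative values for BCE).
        "valid_from": min(starts) if starts else None,
        "valid_to": max(ends) if ends else None,
    }
```

Collapsing attested name periods into a single validity interval loses information; a fuller model would keep one interval per name and record how certain each attestation is.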

The project may also include experiments in visualization, temporal navigation, semantic querying, or graph exploration of ancient places and their relationships.

The final stage will consist of documenting the integration pipeline, evaluating interoperability challenges, and proposing recommendations for future large-scale integration of ancient world datasets into Time Atlas.

References

Pleiades https://pleiades.stoa.org/

Time Atlas http://timeatlas.eu


Requirements

The project requires programming skills and familiarity with data processing workflows. Experience with geospatial data, APIs, linked open data, RDF, GIS systems, or databases would be highly valuable. Interest in digital humanities, ancient history, historical geography, or knowledge graphs would be an advantage.

Type: MSc or BA Semester Project
Sections: Computer Science, Data Science, Digital Humanities
Supervisor: Frederic Kaplan
Number of students: 1-2

Context

Large collections of digitized manuscripts are now accessible through the International Image Interoperability Framework (IIIF), but many remain difficult to search, analyze, and connect to other historical datasets. Images of manuscript pages often contain rich textual, visual, and spatial information, yet this content is rarely available in structured form. This project explores how recent language–vision models can transform manuscript images into searchable text and annotations. By combining image retrieval through IIIF, automatic transcription, layout analysis, and entity recognition, manuscript pages can become interoperable historical sources linked to places, periods, people, and events in the Time Atlas.

Objective

The objective of the project is to design and prototype a pipeline that turns digitized manuscript pages into structured, searchable, and annotatable data. The student will fetch manuscript images through IIIF, process them with language–vision models, extract text and page structure, identify named entities and spatiotemporal information, and prepare the results for integration with maps and temporal layers in the Atlas.

Research Questions

The project will investigate how effectively language–vision models can transcribe and structure handwritten or early printed manuscript pages. It will also examine how layout, marginalia, headings, tables, illustrations, and other page elements can be detected and represented.

A central question is how extracted entities such as places, people, dates, and institutions can be linked to external gazetteers, chronological systems, and Time Atlas.

The project will also study how uncertainty should be represented, especially when transcription, entity extraction, or dating remain ambiguous.


Main Steps

The project will begin with a review of IIIF workflows, manuscript transcription methods, document layout analysis, HTR/OCR systems, and language–vision models. The student will then select one or several manuscript collections accessible through IIIF and build a small ingestion pipeline for retrieving images and metadata. A second phase will focus on processing the pages with suitable models for transcription, layout segmentation, and entity extraction. The project will then design a data model for storing text, annotations, coordinates, confidence scores, and links to places or periods.
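
The image retrieval step can be sketched as follows. IIIF Image API requests follow the pattern `{service}/{region}/{size}/{rotation}/{quality}.{format}`, and a Presentation API 3.0 manifest nests canvases, annotation pages, and annotations under successive `items` keys. The helper names are hypothetical, and the sketch assumes that each painting annotation has a single image body advertising an image service; real manifests (including version 2 manifests, which use a different structure) would need additional handling.

```python
from typing import Any, Dict, Iterator, Tuple


def iiif_image_url(
    service_id: str,
    region: str = "full",
    size: str = "max",
    rotation: str = "0",
    quality: str = "default",
    fmt: str = "jpg",
) -> str:
    """Build an IIIF Image API request URL from an image service base URI."""
    return f"{service_id.rstrip('/')}/{region}/{size}/{rotation}/{quality}.{fmt}"


def image_services(manifest: Dict[str, Any]) -> Iterator[Tuple[str, str]]:
    """Yield (canvas_id, image_service_id) pairs from a v3-style manifest."""
    for canvas in manifest.get("items", []):
        for page in canvas.get("items", []):
            for annotation in page.get("items", []):
                body = annotation.get("body", {})
                if not isinstance(body, dict):
                    continue  # e.g. multiple bodies, not handled in this sketch
                services = body.get("service", [])
                if isinstance(services, dict):
                    services = [services]
                for service in services:
                    # Older image services use "@id" instead of "id".
                    service_id = service.get("id") or service.get("@id")
                    if service_id:
                        yield canvas.get("id"), service_id
```

Requesting a smaller `size` (for example `"1000,"` for a width of 1000 pixels) is usually sufficient for transcription experiments and reduces load on the hosting institution's servers.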

The final stage will consist of building a searchable prototype, documenting the pipeline, and preparing recommendations for integration into the Time Atlas.

References

International Image Interoperability Framework (IIIF) : https://iiif.io/

Time Atlas : timeatlas.eu


Requirements

The project requires good programming skills, preferably in Python. Experience with APIs, image processing, OCR/HTR, or language–vision models would be useful. Interest in digital humanities, historical sources, or cultural heritage infrastructures would be an advantage.

Type: MSc or BA Semester Project
Sections: Computer Science, Data Science, Digital Humanities
Supervisor: Frederic Kaplan
Number of students: 1-2

Context

Large cultural heritage collections contain millions of paintings, engravings, drawings, and illustrations associated with museums, archives, and collections. While many artworks are already geolocated according to where they are conserved today, much less information is available about the places they actually depict. Recent advances in vision-language models (VLMs) open new possibilities for automatically identifying landscapes, monuments, urban scenes, rivers, mountains, or architectural structures represented in artworks. Such approaches could transform artworks into spatial historical documents that can be explored directly within the Time Atlas infrastructure.

The Time Atlas already contains geolocated artworks linked to their conservation institutions. The goal of this project is to extend this information by associating artworks with the places they represent. The dataset may also be expanded using additional paintings, engravings, and illustrations from Europeana and related cultural heritage collections.

Objective

The objective of the project is to design and prototype methods for geolocating artworks based on their visual content. The student will investigate how recent vision-language models and multimodal embeddings can identify depicted places, landscapes, monuments, or geographic features in historical artworks and connect them to spatial entities within the Time Atlas. The project may combine computer vision, geospatial analysis, multimodal retrieval, and historical interpretation.

Research Questions

The project will investigate how effectively vision-language models can identify locations represented in paintings, engravings, and historical images despite artistic stylization, temporal transformations, or incomplete visual information. It will explore whether multimodal embeddings can connect artworks with contemporary photographs, maps, satellite imagery, or textual place descriptions. Another important question concerns ambiguity and uncertainty. How should the system represent multiple possible locations, symbolic landscapes, or historically transformed urban environments?

The project will also investigate how spatially indexed artworks can enrich geohistorical navigation and visual exploration within the Time Atlas.


Main Steps

The project will begin with a review of vision-language models, multimodal retrieval systems, image embeddings, and geolocation methods in computer vision. The student will then study the current artwork collections available in the Time Atlas and prepare an extended corpus using datasets from Europeana and related sources.

A second phase will focus on testing different VLM approaches for matching artworks with geographic locations, landmarks, or environmental features.

The project may also include experiments combining image embeddings, textual metadata, historical maps, and geographic gazetteers.
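
One way to frame such experiments is as nearest-neighbour retrieval in a shared embedding space: the artwork and each candidate place (represented by a photograph, a map tile, or a textual description) are embedded by the same multimodal model, and candidates are ranked by cosine similarity. The sketch below assumes the embeddings have already been computed and are non-zero; the function names are hypothetical. Returning the top-k candidates with their scores, rather than a single answer, keeps ambiguity visible.

```python
import math
from typing import Dict, List, Sequence, Tuple


def _unit(vector: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def rank_places(
    artwork_vec: Sequence[float],
    place_vecs: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k candidate places most similar to the artwork embedding."""
    query = _unit(artwork_vec)
    scored = [
        (name, sum(a * b for a, b in zip(query, _unit(vec))))
        for name, vec in place_vecs.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]
```

A small gap between the first and second scores is a useful signal that the depicted location is uncertain or that the landscape may be symbolic rather than real.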

The final stage will consist of building a prototype visualization and retrieval interface, documenting the methodology, and proposing recommendations for large-scale integration into the Time Atlas.

References

  • Time Atlas : timeatlas.eu
  • Where on Earth? A Vision-Language Benchmark for Probing Model Geolocation Skills Across Scales https://arxiv.org/html/2510.10880v1
  • Image-Based Geolocation Using Large Vision-Language Models https://arxiv.org/abs/2408.09474


Requirements

The project requires good programming skills, preferably in Python. Experience with computer vision, deep learning, embeddings, or multimodal AI would be highly valuable. Interest in cultural heritage, art history, historical geography, or digital humanities would be an advantage.

Type: MSc or BA Semester Project
Sections: Computer Science, Data Science, Digital Humanities
Supervisor: Frederic Kaplan and Maud Ehrmann
Number of students: 1-2

Description coming soon

Type: MSc Semester or Master Project
Sections: Data Science, Digital Humanities, Computer Science
Supervisor: Frederic Kaplan and Maud Ehrmann
Number of students: 3-4

Context

Large language models are currently trained primarily on contemporary web data, while vast quantities of historical and cultural heritage material remain underrepresented in modern foundation models. At the same time, institutions such as libraries, archives, museums, and research infrastructures have digitized enormous collections of books, newspapers, manuscripts, maps, and archival documents that could significantly enrich future open language models. This project is connected to the preparation of a future training cycle of Apertus, an open large language model initiative aiming to integrate broader and more diverse historical and cultural datasets into foundation model training.

The project will build upon the unique historical corpora curated by the EPFL DHLAB through projects such as Impresso, Time Machine Organisation, and related digitization collaborations. Additional sources may include datasets such as the Institutional Books collection, Internet Archive collections, and components of The Pile.

The project addresses both technical and epistemological challenges related to historical OCR quality, metadata normalization, multilingual corpora, temporal balancing, contamination, copyright filtering, and large-scale dataset curation for foundation models.

Objective

The objective of the project is to prepare and structure historical training corpora for a future Apertus training cycle. The student will investigate how heterogeneous historical datasets can be collected, filtered, cleaned, normalized, documented, and transformed into high-quality training material suitable for large language model pretraining. The project may include corpus engineering, OCR quality evaluation, metadata harmonization, deduplication, multilingual balancing, temporal analysis, and contamination detection.

Research Questions

The project will investigate which types of historical corpora are most valuable for enriching foundation models and how these datasets can be integrated while preserving provenance and historical diversity. Another major question concerns quality control. How can noisy OCR, duplicated material, corrupted metadata, or temporally inconsistent documents be detected and corrected at scale? The project will also explore how to balance historical corpora across languages, centuries, genres, and geographical regions in order to avoid introducing new historical or cultural biases into training datasets. A further question concerns interoperability and reproducibility. How should historical training datasets be documented and versioned to support transparent and reusable foundation model training pipelines?

Main Steps

The project will begin with a review of existing large-scale language model datasets, historical corpora, OCR pipelines, and dataset curation practices for foundation models.

The student will then identify and evaluate candidate datasets from DHLAB infrastructures and external repositories such as Institutional Books, Internet Archive, and The Pile.

A second phase will focus on preprocessing workflows including metadata extraction, OCR quality analysis, language identification, temporal normalization, deduplication, and contamination filtering.
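
Two of these preprocessing steps can be sketched with the standard library alone: exact deduplication after text normalization, and a crude OCR quality score based on the share of purely alphabetic tokens. The function names, the `(doc_id, text)` input layout, and the punctuation set are illustrative assumptions; production pipelines typically add near-duplicate detection and language-aware quality models.

```python
import hashlib
import re
import unicodedata
from typing import Iterable, List, Tuple


def normalize(text: str) -> str:
    """Apply Unicode normalization, lowercase, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()


def fingerprint(text: str) -> str:
    """Hash the normalized text so trivially different copies collide."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def deduplicate(docs: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep the first occurrence of each distinct normalized text."""
    seen = set()
    unique = []
    for doc_id, text in docs:
        fp = fingerprint(text)
        if fp not in seen:
            seen.add(fp)
            unique.append((doc_id, text))
    return unique


def alpha_token_ratio(text: str) -> float:
    """Share of tokens that are purely alphabetic once punctuation is stripped.

    Low values often indicate noisy OCR, but also tables or numeric content.
    """
    tokens = text.split()
    if not tokens:
        return 0.0
    clean = sum(t.strip(".,;:!?\"'()").isalpha() for t in tokens)
    return clean / len(tokens)
```

Exact hashing only catches copies that differ in case or whitespace; reprinted newspaper articles with different OCR errors would need a similarity-based method such as MinHash.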

The project may also include experiments in corpus balancing, document segmentation, dataset packaging, and training-ready export pipelines.

The final stage will consist of documenting the proposed methodology, evaluating corpus quality, and producing recommendations for future Apertus training cycles.

References

  • Institutional Books dataset: https://huggingface.co/datasets/institutional/institutio

  • Internet Archive: https://archive.org/

  • The Pile: https://pile.eleuther.ai/


Requirements

The project requires good programming skills, preferably in Python. Experience with NLP pipelines, large datasets, OCR processing, data engineering, or machine learning workflows would be highly valuable.

Interest in language models, historical corpora, digital humanities, or cultural heritage infrastructures would be an advantage.

Type: MSc Project (PDM)
Sections: Computer Science, Data Science, Digital Humanities
Supervisor: Frederic Kaplan 
Number of students: 2-4

Context

Over the last years, the EPFL campus has progressively become a rich experimental environment for urban digital twin research. Through the Urban Digital Twin course and related DHLAB activities, multiple datasets and computational models describing the campus have been developed, including 3D campus structures, temporal occupation dynamics, food and restaurant data, mobility traces, institutional data, and educational information from IS-Academia.

These heterogeneous datasets provide the basis for a new type of experiment: the creation of a “foundation model” of the EPFL campus capable of learning the dynamics of the institution across space and time. Rather than modeling a single subsystem independently, the project aims to investigate whether a unified multimodal model can capture relationships between buildings, flows, schedules, food consumption, academic activities, mobility, and social rhythms.

Such a model could not only support forecasting tasks — predicting future states of the campus under varying conditions — but also enable counterfactual exploration. What would have happened if specific infrastructures had changed? How would different schedules, policies, weather conditions, or food strategies have altered campus dynamics? The project therefore sits at the intersection of urban simulation, multimodal machine learning, digital twins, and computational social science.

Objective

The objective of this project is to prototype a first EPFL Digital Twin Foundation Model by reusing datasets and models developed within the Urban Digital Twin ecosystem.

The student team will investigate how heterogeneous campus datasets can be aligned into a unified spatiotemporal representation and how machine learning architectures can model the evolving state of the campus.

The project may include forecasting experiments, representation learning, multimodal embeddings, graph neural networks, sequence modeling, and counterfactual simulation workflows.

Research Questions

The project will investigate how a university campus can be represented as a unified spatiotemporal system integrating buildings, mobility, food consumption, academic schedules, and institutional activities.

Another central question concerns multimodal representation learning. Can a shared embedding space capture the dynamics of campus life across heterogeneous datasets and temporal scales?

The project will also explore forecasting and simulation capacities. To what extent can the model predict future campus states, detect anomalies, or simulate alternative historical trajectories under modified conditions?

A further question concerns interpretability and governance. How can foundation models for urban environments remain explainable, auditable, and useful for institutional decision-making?

Main Steps

The project will begin with a review of foundation models, urban digital twins, multimodal representation learning and spatiotemporal forecasting methods.

The student team will then identify and harmonize available EPFL datasets, including campus geometry, IS-Academia data, restaurant and food system datasets, mobility traces, schedules, and other temporal infrastructures developed during the Urban Digital Twin course.

A second phase will focus on designing a unified spatiotemporal representation of the campus and testing suitable modeling approaches such as graph-based architectures, transformers, multimodal embeddings, or sequence prediction systems.
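
As an illustration of what a unified spatiotemporal representation could look like at its simplest, the sketch below bins records from several sources onto a shared (building, hour) grid. The record layout, the source names (`meals_sold`, `course_seats`), the building codes and all values are hypothetical placeholders rather than the actual Urban Digital Twin datasets, and summing values within a cell is only one possible aggregation choice.

```python
from collections import defaultdict
from datetime import datetime


def hour_bin(timestamp: str) -> str:
    """Truncate an ISO timestamp to the hour so all sources share one time grid."""
    return datetime.fromisoformat(timestamp).strftime("%Y-%m-%dT%H:00")


def align(sources):
    """Merge records into one feature dict per (building, hour) cell.

    Each record is assumed to look like
    {"building": "BC", "time": "2026-03-02T12:15:00", "value": 3.0}.
    Values from the same source that fall in the same cell are summed.
    """
    grid = defaultdict(lambda: defaultdict(float))
    for source_name, records in sources.items():
        for record in records:
            cell = (record["building"], hour_bin(record["time"]))
            grid[cell][source_name] += record["value"]
    return {cell: dict(features) for cell, features in grid.items()}


# Hypothetical toy records, for illustration only.
sources = {
    "meals_sold": [
        {"building": "BC", "time": "2026-03-02T12:15:00", "value": 40.0},
        {"building": "BC", "time": "2026-03-02T12:45:00", "value": 25.0},
    ],
    "course_seats": [
        {"building": "BC", "time": "2026-03-02T12:00:00", "value": 120.0},
        {"building": "CM", "time": "2026-03-02T13:00:00", "value": 300.0},
    ],
}

state = align(sources)
for cell in sorted(state):
    print(cell, state[cell])
```

Each cell of the resulting grid can then serve as a node-time feature vector for graph-based or sequence models. Cells with missing sources are simply left without that key, which makes gaps in coverage explicit.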

The project may also include experiments in forecasting occupancy, mobility, restaurant usage, or energy-related dynamics, as well as counterfactual simulations exploring alternative campus scenarios.

The final stage will consist of documenting the architecture, evaluating predictive and simulation capacities, and proposing a roadmap for future large-scale EPFL digital twin foundation models.

References

Urban Digital Twin course: https://edu.epfl.ch/coursebook/en/urban-digital-twins-URB-410


Requirements

The project requires strong programming and machine learning skills. Experience with Python, deep learning frameworks, graph neural networks, time-series analysis, or multimodal AI would be highly valuable.

Type: MSc Research (Semester) Project
Sections: Computer Science, Data Science, Digital Humanities
Supervisor:  Hamest Tamrazyan, Camil Hamdane
Number of students: 1

Context

Large language models (LLMs) are increasingly used for text classification in sensitive domains, including the analysis of cultural heritage and historical narratives. However, tasks such as identifying propaganda/neutrality or manipulation of narratives are normative, as they depend on context-specific interpretations shaped by geography, politics, and history. While prior research on LLM bias has focused mainly on demographic or political dimensions, less attention has been given to how these models handle culturally situated narratives in classification settings. This project addresses this gap by examining how open-source LLMs classify potential manipulation of cultural heritage across multiple geographical contexts, with the aim of identifying systematic biases, cross-regional differences, and the extent to which model outputs align with dominant historical narratives. 

In this project, we are working with a corpus of Wikipedia edits related to the cultural heritage of Armenia and Ukraine that LLMs have classified as potentially manipulative. We define a Wikipedia edit as weaponized if the change alters the meaning, framing, interpretation, contextual understanding, or readability of the text in a way that may be derogatory, manipulative, or ideologically significant. An interdisciplinary team from UNIL and EPFL is actively producing human-labeled data to serve as ground truth. 

As an extension, this project also aims to explore ways of mitigating this bias. Recent work has shown that prompt engineering and multi-perspective or multi-agent systems can reduce bias in LLM generation on social issues. In the broader study of cultural heritage manipulation on Wikipedia, reducing bias is a necessary step towards a fair assessment of weaponization.

Objectives (Semester): 

  • Develop an understanding of the biases of LLMs in the context of cultural heritage weaponization. 
  • Implement a robust evaluation paradigm for bias analysis. 
  • If time permits, implement a simple bias mitigation pipeline to improve classification against human-labeled data. 
  • Collaborate with the CROSS 2026 researchers to help complete an understanding of cultural heritage weaponization in Wikipedia 

Research Questions: 

  • Are some LLMs more biased than others?
  • Are some LLMs more oriented towards a certain cultural background?
  • Are the differences in classification due to performance or to political/cultural bias?
  • What kind of strategies can we use to mitigate bias in this classification task? 

Main Steps: 

  • Clarifying metrics and production/expansion of the dataset
  • Bias analysis with humans in the loop
  • Eventual refinement of the classification pipeline 
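
One possible building block for the bias analysis is to measure chance-corrected agreement between model labels and human labels separately for each geographical context, and then compare the scores. The sketch below uses Cohen's kappa for this. The group names, the label set (`weaponized` / `neutral`) and all rows are made-up placeholders, not project data, and kappa is only one candidate metric among those the project would need to clarify.

```python
from collections import Counter, defaultdict


def cohen_kappa(human, model):
    """Chance-corrected agreement between two label sequences of equal length."""
    n = len(human)
    observed = sum(h == m for h, m in zip(human, model)) / n
    human_counts, model_counts = Counter(human), Counter(model)
    expected = sum(human_counts[label] * model_counts[label]
                   for label in human_counts) / (n * n)
    if expected == 1.0:  # both sides used one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


def kappa_by_group(rows):
    """Compute kappa separately for each value of the 'group' field."""
    grouped = defaultdict(lambda: ([], []))
    for row in rows:
        human, model = grouped[row["group"]]
        human.append(row["human"])
        model.append(row["model"])
    return {group: cohen_kappa(h, m) for group, (h, m) in grouped.items()}


# Placeholder rows: 'group' stands in for a geographical context.
rows = [
    {"group": "group_a", "human": "weaponized", "model": "weaponized"},
    {"group": "group_a", "human": "weaponized", "model": "weaponized"},
    {"group": "group_a", "human": "neutral", "model": "neutral"},
    {"group": "group_a", "human": "neutral", "model": "neutral"},
    {"group": "group_b", "human": "weaponized", "model": "weaponized"},
    {"group": "group_b", "human": "weaponized", "model": "neutral"},
    {"group": "group_b", "human": "neutral", "model": "neutral"},
    {"group": "group_b", "human": "neutral", "model": "neutral"},
]

scores = kappa_by_group(rows)
print(scores)
```

A persistent gap in agreement between groups would be one signal worth investigating, although it cannot by itself separate general performance problems from cultural or political bias.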

References: 

  • Mushtaq, Abdullah, et al. “WorldView-Bench: A Benchmark for Evaluating Global Cultural Perspectives in Large Language Models.” Journal of Artificial Intelligence Research, vol. 85, Apr. 2026. www.jair.org, https://doi.org/10.1613/jair.1.19001 
  • Sukiennik, Nicholas, et al. “An Evaluation of Cultural Value Alignment in LLM.” arXiv:2504.08863, arXiv, 11 Apr. 2025. arXiv.org, https://doi.org/10.48550/arXiv.2504.08863. 
  • Abdullah, Ateeb Ather M, Kolesnikova O, Sidorov G. Detection of Biased Phrases in the Wiki Neutrality Corpus for Fairer Digital Content Management Using Artificial Intelligence. Big Data and Cognitive Computing. 2025; 9(7):190. https://doi.org/10.3390/bdcc9070190 
  • Ashkinaze, Joshua, et al. “Seeing Like an AI: How LLMs Apply (and Misapply) Wikipedia Neutrality Norms.” arXiv:2407.04183, arXiv, 9 Apr. 2026. arXiv.org, https://doi.org/10.48550/arXiv.2407.04183. 

Requirements: 

  • Strong Python programming skills, experience with version control (e.g., Git) and common ML/NLP libraries (e.g., PyTorch, Hugging Face Transformers). 
  • Experience with LLMs (OpenAI framework, API calling…) and NLP techniques (prompt engineering, zero-shot, few-shot, multi-perspective…) 
  • Experience with text classification workflows (data preprocessing, prompting, inference). Ability to write clean, reproducible, and well-documented code for research purposes.
  • General knowledge of the Wikipedia editing framework
  • Interest in digital cultural heritage, or political and historical narratives, is a plus.
  • Experience with implementing LLM fine-tuning or MAS (Multi-Agent Systems) is a plus 

Type: MSc Research (Semester) Project
Sections: Computer Science, Data Science, Digital Humanities
Supervisor: Hamest Tamrazyan, Camil Hamdane
Number of students: 3

Overview

This call proposes a set of coordinated Master-level projects within the ArmEpiC (Armenian Epigraphic Corpus) initiative, a research framework developed at EPFL for the standardized representation, integration, and analysis of epigraphic data.

The projects focus on the creation and expansion of a new corpus:

IndiaArmEpiC — Armenian Epigraphic Heritage of India (EAP1721)

This corpus documents inscriptions from Armenian communities in Madras (Chennai), Mumbai, Kolkata, and Hyderabad, spanning from the 17th century onward. The dataset is based on the EAP1721 project (British Library, Endangered Archives Programme).

Scientific Context

Armenian inscriptions represent a critical source for understanding diasporic networks, trade, religion, and cultural identity. However, such materials are often:

  • fragmented across archives
  • inconsistently documented
  • difficult to access computationally

The ArmEpiC framework addresses this by providing:

  • a TEI/EpiDoc-based data model
  • an authority-driven structure (multi-tier)
  • geospatial and temporal integration
  • interoperability with large-scale infrastructures

The present projects aim to:

  • transform the India corpus into a fully structured digital dataset
  • develop AI-assisted workflows for data enrichment
  • create visual and analytical access tools
  • contribute to international infrastructures such as the Time Machine ecosystem

Project Structure

The call is organized into four complementary MSc semester projects, which can be undertaken independently or in coordination:

  1. EpiDoc Corpus Creation & Geolocation
  2. AI-Assisted Authority Files & Validation Dashboard
  3. Standalone Web Visualisation
  4. Armenian Editorial Signs Converter

Each project produces a distinct, assessable output while contributing to a shared infrastructure.

References

Requirements

  • Python (pandas, lxml)
  • Basic XML/TEI knowledge
  • Interest in cultural heritage data

Type: MSc, BA Semester project, or Master Project
Sections: Computer Science, Data Science, Digital Humanities
Supervisor: Alexander Rusnak
Number of students: 1

Context:

With regard to vision-language foundation models, there is a persistent tension between dense (per-patch or per-pixel) features and global features that represent the semantic content of an entire image. Two emblematic examples are DINO-style models for dense features and CLIP-style models for global features. Recently, techniques such as Talk2DINO, TIPS, and RADSeg have tried to bridge this gap, while agglomerative models like C-RADIO or EUPE have attempted to create more functional foundation models by distilling knowledge from multiple other models. To properly use multimodal queries at the native semantic resolution of C-RADIO or EUPE, and to improve SOTA open-vocabulary semantic segmentation, it is necessary to build a dense language encoder for these agglomerative models.

Objective: The primary objective of this project is to build a dense text encoder, similar to Talk2DINO but with improvements from other approaches, particularly for the EUPE family of models.

Research Questions:

  1. What is the optimal adapter architecture and dataset mixture for training a dense text encoder for EUPE? 
  2. Can this same general training recipe be applied equally successfully to other foundation models (e.g. C-RADIOv4, DINOv3)? 
  3. Can the same dense encoder be applied across various backbones without retraining, or with a small adapter for transfer learning, under the assumption that stronger agglomerative models will produce more similar feature maps?
  4. Can we upscale the language adapter from patch-level features to pixel-level features using techniques like AnyUp?

Main Steps:

  1. Read and understand relevant research papers
  2. Assemble public open-vocabulary semantic segmentation datasets
  3. Implement a dense text encoder, using techniques from relevant research, that aligns text features with the EUPE dense feature sets at appropriate dimension sizes, so that zero-shot OVSS can be performed with cosine similarity
  4. Test on benchmarks
  5. Test transferability between various models with/without additional training 
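
The zero-shot OVSS step mentioned above can be sketched as follows: once text features live in the same space as the dense patch features, each patch is assigned the class whose text embedding has the highest cosine similarity. This is a minimal pure-Python illustration under that assumption. The 2-D vectors and class names are toy placeholders and do not come from EUPE or any real encoder, and a real implementation would operate on PyTorch tensors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def segment(patch_features, text_features):
    """Label every patch with the class whose text embedding is most similar.

    patch_features: H x W grid (list of rows) of feature vectors.
    text_features:  dict mapping class name -> feature vector of the same size.
    """
    return [
        [max(text_features, key=lambda name: cosine(patch, text_features[name]))
         for patch in row]
        for row in patch_features
    ]


# Toy 2 x 2 patch grid and two hypothetical class embeddings.
text_features = {"sky": [1.0, 0.0], "building": [0.0, 1.0]}
patch_features = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.1, 1.0], [0.7, 0.3]],
]

label_map = segment(patch_features, text_features)
print(label_map)
```

The label map has the resolution of the patch grid, so reaching pixel-level output would still require an upsampling step.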

References:

Requirements: Strong ability with PyTorch

Type: MSc, BA Semester project, or Master Project
Sections: Computer Science, Data Science, Digital Humanities
Supervisor: Alexander Rusnak
Number of students: 1

Context: Recent work in joint embedding architectures has framed violations of predictive expectation when presented with physically impossible inputs as a method for measuring how strong a developed “world model” is. Similarly, embedding-space trajectories in literature have been used as a measure of semantic diversity and as a tool for plot analysis. These techniques have not been applied to embedded urban trajectories as a method for measuring the formal congruity of the built environment – either for autonomous driving or for architectural understanding – though similar techniques have demonstrated usefulness in route planning for autonomous navigation and collision detection. 

Objective: The primary goal of this project is to develop a suite of embedding-space semantic trajectory tools based on recent research, and then to apply them to urban trajectories derived from several possible sources: driving videos embedded with V-JEPA, street-view imagery embedded with various image encoders, LiDAR scans or other 3D city models embedded with Utonia or other encoders, or text data describing various buildings. The predictions will then be compared to see whether they are geographically consistent across modalities.

Research Questions:

  1. Can urban trajectory analysis under a “violation of expectation” framework reliably differentiate between different urban forms (e.g. styles of architecture or city layouts) across different input modalities?
  2. Do varying modalities agree on which areas of the city are most “incongruous” or difficult to predict?
  3. How simple is it to train a trajectory prediction head on top of static embeddings derived from data which lacks an inherent time dimension? 

Main Steps:

  1. Replicate relevant semantic trajectory research into a generalized set of tools which can take embedding sequences of different types.
  2. Utilize various pre-trained models to test semantic trajectory consistency in an urban context – on public benchmarks and internal lab datasets
  3. Compare levels of high semantic surprise or irregularity for geographic clusters
  4. Train a bespoke world model for multimodal urban trajectory prediction
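
To make the “violation of expectation” idea concrete, the sketch below scores each step of an embedding trajectory by how far the observed embedding departs from a prediction. The predictor used here is a naive constant-velocity extrapolation, which is purely an assumption standing in for a learned prediction head, and the 2-D trajectory is a toy placeholder rather than real V-JEPA or street-view embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def surprise_scores(trajectory):
    """Return one surprise value per step, starting at the third embedding.

    The next embedding is predicted by linear extrapolation of the previous
    two (a placeholder for a trained predictor). Surprise is the cosine
    distance between the prediction and the embedding actually observed.
    """
    scores = []
    for t in range(2, len(trajectory)):
        before, last, actual = trajectory[t - 2], trajectory[t - 1], trajectory[t]
        predicted = [2 * b - a for a, b in zip(before, last)]
        scores.append(1.0 - cosine(predicted, actual))
    return scores


# Toy trajectory: steady drift for three steps, then an abrupt change.
trajectory = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [-2.0, 1.0]]
scores = surprise_scores(trajectory)
print(scores)
```

Mapping high-surprise steps back to their geographic positions is what would allow the comparison across modalities described in the research questions.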

References: 

Requirements: Strong programming skills with PyTorch

Type: MSc, BA Semester project, or Master Project
Sections: Computer Science, Data Science, Digital Humanities
Supervisor: Alexander Rusnak
Number of students: 1

Context: The rise of zero-shot 3D geometry prediction using neural networks has opened 3D reconstruction tasks where traditional SfM approaches have failed. Simultaneously, the DHLAB has collected and spatiotemporally aligned a large collection of historical photographs, drawings, and paintings of buildings and cityscapes – some of which also have associated 3D models when the buildings depicted are still standing. 

Objective: The goal of this project is to test the out-of-the-box ability of SOTA geometry transformers to generate coherent 3D models from images with a variety of rendering styles or color palettes, and then to fine-tune a model to maximize this ability. This model will be applied to the real historical image sets available in the lab to generate synthetic 3D models of destroyed or lost buildings. 

Research Questions:

  1. How good are geometry transformers at dealing with non-photographic inputs or input sets that mix multiple styles?
  2. Can we improve their reconstruction performance by fine-tuning or training on image / 3D reconstruction datasets where the images have been subject to style transfer?
  3. How effective are these models at reproducing historical buildings from our available real-world datasets? 

Main Steps:

  1. Implement relevant feed-forward reconstruction models (e.g. MapAnything, VGGT, NOVA3R)
  2. Test reconstruction performance on public 3D datasets with natural vs. stylized images
  3. Fine-tune or re-train the best-performing model to improve its multi-style performance
  4. Use the best model to create historical reconstructions of buildings or cityscapes

References:

Requirements: Strong programming skills with PyTorch

Type: MSc, BA Semester project, or Master Project
Sections: Computer Science, Data Science, Digital Humanities
Supervisor: Alexander Rusnak
Number of students: 1

Context: 3D-LLMs have shown a strong ability to answer questions about small scale scenes or rooms – but have yet to be successfully deployed on large scale 3D models. Utilizing aligned, semantically enriched 3D and historical data from the DHLAB’s Time Atlas software, and the novel multimodal query architecture designed in prior works, this project is poised to expand 3D-MLLMs to city size scenes.

Objective:  This project aims to implement a 3D-MLLM for multimodal question answering at city-scale, and integrate it into the Time Atlas software.

Research Questions:

  1. How can we efficiently surface relevant data to query scenes of this size without processing every single data point?
  2. Can we use the aligned historical data to create a unique city-scale 3D query benchmark dataset?
  3. What is the best way to do 3D feature fusion with an existing MLLM to enable effective and accurate 3D and multimodal queries?

Main Steps:

  1. Utilize existing multimodal urban data to construct a relevant benchmark for city-scale 3D QA
  2. Given projected VLM features, train a LLaVA style MLLM adapter for 3D VQA on the benchmark / other 3D benchmarks.
  3. Improve architecture or techniques to maximize accuracy and speed
  4. Improve pre-LLM token filtering to improve computational efficiency 

References: 

Requirements: Strong programming skills with PyTorch

Type: MSc, BA Semester project, or Master Project
Sections: Computer Science, Data Science, Digital Humanities
Supervisor: Alexander Rusnak
Number of students: 1

Context: Recent research in representation learning has revealed several interesting phenomena in the interpretation of self-supervised models and their associated representational geometries. First, conceptual abstraction plays an important role in producing generalizable representations. Second, trained auxiliary reconstruction models that enforce sparsity onto very wide layers (i.e. sparse autoencoders or cross-layer transcoders) are an effective vehicle for examining these conceptual abstractions. Third, the structure of the representations produced by VLMs and LLMs appears to be converging to a shared set of coherent local neighborhood relations. Despite these parallel developments, most research into representational convergence is applied to the raw activations produced by various layers of neural networks, rather than to the conceptual abstractions surfaced by auxiliary interpretation models.

Objective: The objective of this project is to develop a novel representation convergence comparison technique which evaluates the similarity of conceptual abstractions produced by a trained sparse autoencoder or cross layer transcoder across models. 

Research Questions:

  1. Do SAEs based on self-supervised vision models trained across different datasets produce similar representational geometries?
  2. Is this similarity tied to model benchmark performance?
  3. Is there similarity across modalities as well?
  4. Can we train the same SAE or CLT across multiple VLMs, or apply it to representations produced by models for which it wasn’t trained with success?
  5. Can we discover hierarchical structures within these representations?  

Main Steps:

  1. Train an SAE across multiple SSL models, using multiple distinct datasets.
  2. Adopt representational convergence techniques such as k-NN neighborhood similarity or centered kernel alignment and apply them to sparsified abstractions
  3. Develop a unique representational convergence measure based on archetypal geometry in representation space
  4. Test how well the SAE transfers between models
  5. Train a Matryoshka SAE and determine if hierarchical representations produce more convergent or performant representations
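
As a rough illustration of the k-NN neighborhood similarity mentioned in the steps above, the sketch below compares two sets of feature vectors computed for the same inputs: for each input it finds the k nearest neighbours in each space and reports the average overlap. The toy vectors are placeholders. In the project they would be SAE or CLT activations from two models, and the use of squared Euclidean distance is an assumption rather than a prescribed choice.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def knn_sets(features, k):
    """For each row, return the set of indices of its k nearest other rows."""
    neighbours = []
    for i, u in enumerate(features):
        others = sorted(
            (j for j in range(len(features)) if j != i),
            key=lambda j: sq_dist(u, features[j]),
        )
        neighbours.append(set(others[:k]))
    return neighbours


def mutual_knn(features_a, features_b, k):
    """Average fraction of shared k-nearest neighbours across two spaces.

    Row i of both inputs must describe the same underlying sample.
    Returns a value between 0.0 (no overlap) and 1.0 (identical neighbourhoods).
    """
    sets_a, sets_b = knn_sets(features_a, k), knn_sets(features_b, k)
    shared = sum(len(a & b) for a, b in zip(sets_a, sets_b))
    return shared / (k * len(features_a))


# Toy codes for four samples. Spaces a and b share neighbourhood structure
# despite different dimensionality; space c pairs the samples differently.
codes_a = [[0.0], [1.0], [10.0], [11.0]]
codes_b = [[0.0, 0.0], [0.0, 2.0], [20.0, 0.0], [20.0, 2.0]]
codes_c = [[0.0], [10.0], [1.0], [11.0]]

print(mutual_knn(codes_a, codes_b, k=1))
print(mutual_knn(codes_a, codes_c, k=1))
```

Because the measure depends only on neighbourhood structure, it can compare spaces of different dimensionality, which is convenient when SAE widths differ between models.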

References: 

Requirements: Strong ability with PyTorch, prior experience with representation learning

Spring 2026

Type: MSc or BA Semester Project
Sections: Computer Science, Data Science, Digital Humanities
Supervisor: Alexander Rusnak (EPFL, DHLAB)
Number of students: 1 student

Context / Objective

This project aims to examine recent research in multimodal representational convergence, most notably the recent PMLR paper “Platonic Representation Hypothesis” and its descendants, by framing the debate regarding the structure and source of knowledge discovered by deep learning models as a conflict between realist and relativist aesthetic theories. More concretely, this project will test whether, how, and in what contexts the representations and representational geometry produced by vision models are influenced by the aesthetics of images. This project builds on preliminary results demonstrating stronger cross-model representational geometry convergence on the AVA and APDD aesthetics datasets. 

Research questions

  • Are representations or representational geometry for aesthetic images distinct from unaesthetic images for models trained without linguistic labels (e.g. DINO, VICReg)?
  • Does constructing an internationally and historically balanced aesthetics dataset eliminate the observed convergence effect or reproduce it? 
  • Is this effect caused primarily by aesthetic images occupying a smaller semantic subset of the image space (i.e. the majority of images are of mountains, portraits, classical architecture, renaissance paintings, etc) or by aesthetics as a distinct semantic grouping?

Main steps

  • Expand a dataset in development of international and historic art to create a large-scale aesthetics dataset which is not overly Western or modern in origin. 
  • Validate existing techniques (e.g. mutual nearest neighbors, unpaired latent-to-latent translation) for testing representational convergence and investigate potential new techniques arising in deep learning and neuroscience research.
  • Apply these techniques to the international dataset, and also to a large children’s art dataset, under the assumption that artist age approximately maps to the ability to realize their vision of aesthetics in their artwork. 
  • Devise novel methods to semantically disentangle image datasets in order to isolate particular variables (e.g. aesthetics) for convergence testing.

Models/References

Requirements

  • Python, PyTorch, OpenCV

Type: MSc or BA Semester Project
Sections: Computer Science, Data Science, Digital Humanities
Supervisor: Prof. Frédéric Kaplan (EPFL, DHLAB)
Number of students: 1–2 students (working together or on complementary sub-tasks)

Context
In collaboration with the University of Lausanne, the DHLAB is starting a new project dedicated to analysing visual motifs across large collections of historical Tarot decks. The project aims to create a generic, scalable infrastructure capable of detecting repeated visual patterns and tracing their circulation across time, regions, and production networks. A first bootstrapping phase is required to prepare the corpora, define the data structures, build baseline models and evaluation tools, and explore early hypotheses.

Objective
The goal of this semester project is to do the preparatory work for the large-scale computational analysis of Tarot images.

The student(s) will:

  • Identify, collect and structure a first set of Tarot decks and metadata (public-domain priority).
  • Establish an initial taxonomy of visual motifs and their variants.
  • Design, implement and evaluate prototype computational tools to detect and classify motifs.
  • Provide recommendations for the larger infrastructure to be further developed.

Research questions

  • What is the minimal data model for representing Tarot decks, cards, and sub-motifs?
  • Which computer-vision methods (embeddings, keypoints, segmentations) perform best for detecting and classifying recurring micro-motifs across heterogeneous decks?
  • How can we evaluate motif similarity in a way that is meaningful for art-historical interpretation?
  • What preprocessing pipeline is needed to ensure scale, resolution and color consistency across decks?

Main steps

  • Corpus acquisition 
  • Corpus study and literature review to establish a first motif taxonomy
  • First skeleton for data structure and management (data model, naming conventions, metadata schemas, storage formats).
  • Review of computer vision literature for visual motif extraction and classification
  • Design and implementation of approaches for the detection and classification of repeated motifs (e.g. with Vision Transformer models such as CLIP, DINO, OpenCLIP)
  • Evaluation: composition of test data, implementation of evaluation script, evaluation of approaches.
  • Recommendations for the full project: Propose a data pipeline skeleton. Identify weak points and required future modules. 
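
As a starting point for discussing the data model, a first skeleton might be as small as the sketch below. All class names, field names and the example deck are hypothetical and only illustrate one way to link decks, cards and sub-motifs. The actual schema would follow from the corpus study and the motif taxonomy.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Motif:
    label: str    # entry in the motif taxonomy
    bbox: tuple   # (x, y, width, height) in pixels on the card image


@dataclass
class Card:
    card_id: str
    title: str
    image_path: str
    motifs: list = field(default_factory=list)


@dataclass
class Deck:
    deck_id: str
    name: str
    year: Optional[int] = None    # left empty when the date is unknown
    place: Optional[str] = None   # left empty when the origin is unknown
    cards: list = field(default_factory=list)

    def motif_index(self):
        """Map each motif label to the ids of the cards where it occurs."""
        index = {}
        for card in self.cards:
            for motif in card.motifs:
                index.setdefault(motif.label, []).append(card.card_id)
        return index


# Illustrative example with made-up identifiers, paths and annotations.
deck = Deck(deck_id="deck-001", name="Example deck")
deck.cards.append(Card("c18", "The Moon", "deck-001/c18.png",
                       [Motif("crescent", (40, 10, 60, 60)),
                        Motif("tower", (5, 90, 30, 80))]))
deck.cards.append(Card("c16", "The Tower", "deck-001/c16.png",
                       [Motif("tower", (50, 20, 80, 200))]))

index = deck.motif_index()
print(index)
```

An index of this kind already supports a basic query of the form “on which cards does this motif occur?”, which could serve as a sanity check for detection outputs before more elaborate similarity evaluation is in place.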

Models/References

  • CLIP
  • DINO
  • Segment Anything Model

Requirements

  • Basic knowledge of Python and deep learning frameworks (PyTorch or TensorFlow).
  • Interest in computer vision, cultural heritage, or visual studies.
  • Interest in the subject.

Type: MSc, BA Semester project, or Master project 
Sections: Digital Humanities, Data Science, Computer Science
Supervisor: Alexander Rusnak
Number of students: 1-2

Context

As part of a SwissAI grant, we will be training a foundation-scale model for open-vocabulary 3D scene understanding – with a specific focus on very large scenes of cities or culturally important buildings. For further details on the technical basis for this project, see the references section.

As part of this project, substantial work will be needed to prepare datasets and testing pipelines, and to validate ablations on the primary model / datasets.

Furthermore, this large dataset we are constructing will offer an opportunity to a motivated student to work on extending the Visual Geometry Grounded Transformer paradigm to direct egocentric semantic understanding.

Objective

  • Support the training of a large model for open-vocabulary panoptic segmentation of city-scale 3D models.
  • Research neural reconstruction methods for egocentric semantic understanding using the associated dataset.

Research Questions

  • Can synthetic data derived from large point clouds increase the capability of neural reconstruction methods?
  • What is the limit for 3D-scene scale for these reconstruction models?
  • Can the general 3D representations be easily transferred to predict multimodal semantic vectors?

Main Steps

  • Assist in dataset pipelining and implementation for our larger project
  • Implement VGGT and train an adapter for semantic understanding on a public dataset (e.g. ScanNet or similar)
  • Test VGGT reconstruction / semantic degradation as a function of scene size

References

Type: MSc, BA Semester project, or Master project 
Sections: Digital Humanities, Data Science, Computer Science
Supervisor: Alexander Rusnak
Number of students: 1-2

Context

As part of a SwissAI grant, we will be training a foundation-scale model for open-vocabulary 3D scene understanding – with a specific focus on very large scenes of cities or culturally important buildings. For further details on the technical basis for this project, see the references section.

As part of this project, substantial work will be needed to prepare datasets and testing pipelines, and to validate ablations on the primary model / datasets.

Furthermore, this large dataset we are constructing will offer an opportunity to a motivated student to work on building hierarchical scene representations which facilitate more granular understanding for search and robotic interaction.

Objective

  • Support the training of a large model for open-vocabulary panoptic segmentation of city-scale 3D models.
  • Research hierarchical scene representation methods based on point clouds, gaussian splats, or other primitives.

Research Questions

  • Can synthetic data derived from large point clouds increase the capability of neural reconstruction methods?
  • What is the limit for 3D-scene scale for these reconstruction models?
  • Can the general 3D representations be easily transferred to predict multimodal semantic vectors?

Main Steps

  • Assist in dataset pipelining and implementation for our larger project
  • Investigate various techniques for creating object-part hierarchies in 3D data, particularly those derived directly from per-3D-point open-vocabulary semantic labels
  • Implement multiple forms of representations of these hierarchies to test their efficacy, particularly when utilizing hyperbolic embeddings
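
For readers unfamiliar with hyperbolic embeddings, the sketch below computes the geodesic distance in the Poincaré ball model, one common choice for embedding hierarchies. The coordinates are toy placeholders in which a root sits at the origin and two leaves sit near the boundary. How object-part hierarchies would actually be embedded is left open by the project description.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball.

    d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    """
    def sq_norm(x):
        return sum(a * a for a in x)

    diff = sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - sq_norm(u)) * (1.0 - sq_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)


# Toy hierarchy: the root at the origin, two leaves near the boundary.
root = [0.0, 0.0]
leaf_a = [0.9, 0.0]
leaf_b = [0.0, 0.9]

print(poincare_distance(root, leaf_a))
print(poincare_distance(leaf_a, leaf_b))
```

Distances grow rapidly towards the boundary, so the path between two leaves is much longer than their Euclidean separation suggests and is close to the combined distance through the root. That tree-like behaviour is the usual motivation for hyperbolic embeddings of hierarchies.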

References

Requirements: Python, PyTorch, Open3D

Type: MSc, BA Semester project, or Master project 
Sections: Digital Humanities, Data Science, Computer Science
Supervisor: Alexander Rusnak
Number of students: 1-2

Context

As part of a SwissAI grant, we will be training a foundation-scale model for open-vocabulary 3D scene understanding – with a specific focus on very large scenes of cities or culturally important buildings. For further details on the technical basis for this project, see the references section. 

As part of this project, substantial work will be needed to prepare datasets and testing pipelines, and to validate ablations on the primary model / datasets.

Furthermore, the models and datasets we are constructing will offer an opportunity to a motivated student to build cutting-edge user interfaces to enable interaction with and manipulation of large-scale 3D scenes.

Objective

  • Support the training of a large model for open-vocabulary panoptic segmentation of city-scale 3D models.
  • Research optimal graphical user interface design for interaction with large scale 3D scenes with multimodal querying and associated multimodal datasets (historical documents, sustainability data, civil engineering data, etc). 

Research Questions

  • How can we resolve multimodal queries and surface results in a coherent manner for users? 
  • How can we facilitate “3D native” querying (i.e. subselecting parts of a 3D scene and using them to search for other semantically similar 3D components or multimodal results from an associated dataset)?
  • How can we integrate 3D search with a chat interface and MLLM? What are the possibilities for a RAG-esque chat functionality but enabled by 3D data?

Main Steps

  • Assist in dataset pipelining and implementation for our larger project
  • Build an integrated front-end with the TimeAtlas beta for 3D native querying (https://timeatlas.eu/)
  • Research optimal multimodal querying techniques and visualizations
  • Explore retrieval augmented generation (RAG) with multimodal datasets and the interaction of MLLM outputs with city-scale point clouds

References

Requirements: Python, PyTorch, Open3D

Type: MSc, BA Semester project, or Master project 
Sections: Digital Humanities, Data Science, Computer Science
Supervisor: Frederic Kaplan
Number of students: 1-2

Context

This project is part of a collaborative research partnership between DHLAB and Flickr Foundation. Flickr Commons hosts millions of historical images contributed by cultural heritage institutions. Alongside the images, users generate “social metadata” such as tags, comments, and curated galleries. This community-produced information often contains contextual knowledge (identifications, places, events, interpretations) that is missing from institutional catalogues.

Recent image-to-text AI models can automatically generate descriptions, but these typically rely only on visual input and overlook existing social context. This project explores whether combining visual models with social metadata can lead to richer, more accurate descriptions of archival images.

Objective

The objective of the project is to develop and evaluate prototype methods for social metadata-enhanced description of archival images. 

Research Questions

  • How do current image description models perform on historical photographs from Flickr Commons?
  • What types of contextual information appear in social metadata but are missing from model outputs?
  • Can social metadata be used to improve the accuracy, specificity, or cultural relevance of generated descriptions?
  • What are the limitations or risks (e.g., bias, misinformation) of incorporating community-generated metadata?

Dataset

While Flickr contains tens of billions of user-uploaded images, the project will focus on the Flickr Commons collection for this bounded study. This collection offers several advantages:

  • Pre-vetted content: images have been curated by trusted institutional partners of the Commons, reducing the risk of encountering harmful material
  • All images are designated with No Known Copyright Restrictions
  • Many images contain social metadata accumulated over years of community engagement on Flickr

Main Steps

  • Literature and tool review: survey existing image-to-text systems and prior work on social metadata in GLAM collections (Flickr has already done studies).
  • Dataset preparation: select a representative subset of Flickr Commons images and collect the associated social metadata.
  • Baseline evaluation: generate descriptions using state-of-the-art models and assess their performance.
  • Metadata integration: design and implement methods to incorporate tags, comments, or galleries into the description process.
  • Quantitative + qualitative evaluation:  compare baseline and enhanced descriptions with respect to completeness, accuracy, and archival usefulness.
  • Documentation and project report: write up findings and prepare recommendations for cultural heritage applications.
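
One simple way to approach the metadata integration step is to pass the social metadata to the captioning model as explicitly unverified context. The sketch below only assembles such a prompt; the function name, the prompt wording, and the cap on the number of comments are illustrative assumptions, and no particular model or API is implied.

```python
def build_description_prompt(tags, comments, max_comments=3):
    """Assemble a text prompt that exposes social metadata as unverified context.

    `tags` and `comments` are plain strings collected for one image
    (a hypothetical structure, not an official Flickr data format).
    """
    lines = [
        "Describe the attached archival photograph.",
        "Community-contributed context (unverified, may contain errors):",
    ]
    if tags:
        # Deduplicate and sort tags so the prompt is deterministic.
        lines.append("Tags: " + ", ".join(sorted(set(tags))))
    for comment in comments[:max_comments]:
        # Collapse runs of whitespace inside each comment.
        lines.append("Comment: " + " ".join(comment.split()))
    lines.append("Use the context only where it is consistent with what is visible.")
    return "\n".join(lines)


prompt = build_description_prompt(
    ["harbour", "1905", "harbour"],
    ["Taken   near the old pier.", "Second comment", "Third comment", "Fourth comment"],
)
```

Marking the metadata as unverified in the prompt is one way to address the misinformation risk raised in the research questions; whether it is sufficient is something the evaluation would need to establish.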

References

Recent archival initiatives include FLAME (2024-25), PAAG (2023-24), Harvard Art Museums AI Explorer (2016-present), Rijksmuseum x Microsoft Azure (2023), Heritage Connector (2020-21) and SherlockNet (2016). These examples show how generated descriptions can improve accessibility and discoverability in archives and collections management.

Requirements

  • Experience with Python and working with large multimodal APIs
  • Familiarity with machine learning and multimodal AI models
  • Ability to design and evaluate experiments
  • Interest in digital cultural heritage, archives, or social metadata
  • Awareness of ethical considerations in AI and GLAM contexts

Fall 2025

Type: MSc (12 ECTS) semester project 
Sections: Digital Humanities, Data Science, Computer Science
Supervisor: Hamest Tamrazyan, Emanuela Boros
Number of students: 1

Context

Large Language Models (LLMs) exhibit significant challenges in capturing culture-specific meaning, especially for low-resource languages like Armenian or Ukrainian. Research shows that LLMs are not individuals but rather superpositions of cultural perspectives, and their outputs risk cultural erasure by oversimplifying or omitting diverse cultural realities.

Moreover, while methods like CultureLLM attempt to incorporate cultural diversity through data augmentation, a gap remains in fine-grained annotation of psychosocial dimensions of meaning within bilingual corpora.

This project aims to create structured, culturally aware annotations to support both the evaluation and improvement of LLMs and MT systems.

Objective

  • Develop a bilingual dataset (e.g., Armenian-English) focused on culturally embedded expressions.
  • Apply the Psychosocial Categorisation Model (PSCM) to annotate literal, categorical, emotional, and contextual meanings.
  • Investigate statistical signals (e.g., valence shifts, collocation patterns) that identify culture-loaded expressions.
  • Evaluate LLM/MT performance changes with exposure to culturally annotated data.
  • Contribute to the broader effort of resisting cultural erasure in AI systems.

Research Questions

  • What linguistic features statistically signal cultural and semantic density across bilingual corpora?
  • How can PSCM-based annotation systematically capture culture-specific meanings beyond literal translation?
  • Can embedding-based, statistical, and content analysis methods automatically assist in selecting candidates for cultural annotation?
  • Does exposure to PSCM-annotated data improve LLM/MT outputs for low-resource languages, or explain unexpected failures (e.g., perspective shifts)?

Main Steps

  • Literature review: Cognitive semantics, cultural linguistics, LLM evaluation, cultural bias in AI.
  • Data selection: Choose human- and machine-translated bilingual texts rich in cultural material (idioms, folklore, social discourse).
  • Statistical analysis:
    • Valence and emotional scoring.
    • Collocation strength and frequency shifts.
    • Semantic clustering (using embeddings).
  • Candidate selection: Identify units for PSCM annotation using statistical signals and manual verification.
  • PSCM schema design: Define annotation guidelines for literal, categorical, emotional, and contextual levels.
  • Manual annotation: Apply the PSCM schema to selected data; refine based on pilot annotations.
  • Similarity and divergence analysis: Use embedding-based methods to measure shifts between human, machine, and culturally annotated data.
  • LLM/MT evaluation:
    • Compare model outputs with baseline vs. PSCM-enriched prompts.
    • Analyse unexpected perspective shifts and cultural omissions.
    • Analysis of cultural meaning retention: Interpret how models succeed or fail to represent cultural semantics.
  • Reporting: Deliver annotated dataset, analysis results, and recommendations for culturally-aware NLP development.
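
The “collocation strength” signal in the statistical analysis step could, for instance, be measured with pointwise mutual information (PMI) over adjacent word pairs. The sketch below is a minimal version under that assumption; the choice of PMI, the whitespace tokenisation, and the `min_count` threshold are illustrative and not prescribed by the project.

```python
import math
from collections import Counter


def pmi_scores(tokens, min_count=2):
    """Compute PMI (base 2) for adjacent token pairs seen at least `min_count` times."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_unigrams
        p_w2 = unigrams[w2] / n_unigrams
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


tokens = "bread and salt welcome bread and salt table".split()
scores = pmi_scores(tokens)
```

High-PMI pairs are only candidates: the manual verification step listed above would still decide which of them are actually culture-loaded.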

References

Requirements

  • Background in NLP, data science, or computational linguistics.
  • Skills in Python and common NLP libraries (spaCy, NLTK, sklearn, HuggingFace).
  • Knowledge of LLM interpretability is a plus.
  • Knowledge of basic annotation practices; familiarity with tools like Prodigy, Doccano, or custom scripts.
  • Understanding of cultural linguistics, cognitive semantics, or interest in psycholinguistics.
  • (Optional) Knowledge of Armenian or Ukrainian — otherwise translation and interpretive support will be provided.

Type: MSc (12 ECTS) Semester project 
Sections: Digital Humanities, Data Science, Computer Science
Supervisor: Hamest Tamrazyan, Emanuela Boros
Number of students: 2

Context

This project aims to explore, retrieve, and analyze data from various sources to monitor and understand Russia’s manipulation and weaponisation of cultural heritage in Armenia. By employing data analysis techniques, this project seeks to document and provide insights into these actions, contributing to the preservation of cultural heritage and supporting international awareness and policy-making.

Objective and (Possible) Main Steps:

One can choose to analyse Wikipedia to:

  • Track Editing Histories: The revision history of contentious articles may reveal politically or ideologically motivated edits (e.g., articles about Armenian cities or cultural artifacts)
  • Automated Page Tracking: Use tools like Wikimedia’s API to monitor changes in articles about cultural heritage in real time.
  • Cross-Check Narratives: Compare Wikipedia content with scholarly sources and publications from multiple perspectives.
  • Investigate Talk Pages: The discussions on an article’s talk page often reveal disputes and biases.
  • Web Scraping for Content and Metadata: Use scraping libraries like BeautifulSoup or scrapy to collect article text, editor information, and metadata not available through the API.
  • Revision Analysis:

    • Compare successive revisions of articles using diff algorithms (e.g., difflib) to detect content additions, deletions, or modifications.

    • Highlight changes in sentiment, bias, or framing.

  • Semantic Page Selection: Employ embedding models like Alibaba-NLP/gte-multilingual to identify articles with semantic relevance to “cultural heritage” or “cultural manipulation.”
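
The revision analysis step can be prototyped with Python's standard `difflib` module. The sketch below compares two revision texts line by line and collects added and removed lines; the function name and the returned dictionary layout are assumptions made for illustration.

```python
import difflib


def diff_revisions(old_text, new_text):
    """Return the lines added and removed between two revision texts."""
    added, removed = [], []
    for line in difflib.ndiff(old_text.splitlines(), new_text.splitlines()):
        if line.startswith("+ "):
            added.append(line[2:])
        elif line.startswith("- "):
            removed.append(line[2:])
        # Lines starting with "  " are unchanged; "? " lines are ndiff hints.
    return {"added": added, "removed": removed}


old_revision = "The monastery was founded in the 12th century.\nIt is located on a hill."
new_revision = "The monastery was founded in the 12th century.\nIt is located in a valley."
changes = diff_revisions(old_revision, new_revision)
```

The added and removed lines can then be passed to a sentiment or framing classifier, which is outside the scope of this sketch.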

Requirements: Excellent Python knowledge, scraping, large language models knowledge

Significance:

This project will contribute to understanding cultural heritage manipulation in Armenia, providing valuable insights and digital resources for future academic and cultural research. It will also contribute to the creation of strategies to protect cultural heritage in conflict zones. The methodologies developed in this project can also be applied to other conflict areas, enhancing global efforts to safeguard cultural heritage.

Taken

Type: MSc (12 ECTS) semester project 
Sections: Digital Humanities, Data Science, Computer Science
Supervisor: Hamest Tamrazyan, Emanuela Boros
Number of students: 1

Context

The Corpus of Armenian Inscriptions is a foundational printed resource documenting Armenian epigraphic heritage. Its digitization and encoding in EpiDoc TEI/XML format will enhance accessibility, interoperability, and preservation. However, manual encoding is time-consuming. This project aims to develop a semi-automated pipeline to extract structured data from the PDF corpus and generate valid EpiDoc TEI/XML files.

Objective

  • To build a computational workflow that extracts relevant metadata and texts from the Armenian inscriptions corpus PDF and produces EpiDoc-compliant TEI/XML files according to EpiDoc guidelines and schemas.

Research Questions

  • How can textual and metadata content be automatically extracted from a scanned or born-digital PDF of Armenian inscriptions?
  • What natural language processing or rule-based techniques are effective for identifying epigraphic metadata in Armenian?
  • How can the extracted information be mapped to the EpiDoc TEI/XML schema to produce valid, reusable digital editions?
  • What are the limitations and accuracy challenges posed by OCR and automated extraction in this context?

Main Steps

  • Analyze the PDF corpus to determine the nature of its content (text layer vs. scanned images).
  • Apply OCR (using Armenian OCR tools like Calfa hye-tesseract) if necessary, and clean the extracted text.
  • Segment the text into individual inscription entries and identify key metadata fields (e.g., provenance, material, date, language, transcription).
  • Develop rule-based or NLP methods to extract structured information.
  • Design and implement a script to generate valid EpiDoc TEI/XML files from the extracted data, following EpiDoc schema and templates.
  • Validate generated XML files against the EpiDoc schema.
  • Document the process and the challenges encountered, and provide sample output files.
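
To give an idea of the XML generation step, the sketch below builds a skeletal TEI document with an EpiDoc-style edition division using only the standard library. The input dictionary keys are hypothetical, the output has not been validated against the EpiDoc schema, and a real pipeline would need a much richer header (for example a full manuscript description).

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
ET.register_namespace("", TEI_NS)  # serialise TEI as the default namespace


def tei(tag):
    """Qualify a tag name with the TEI namespace."""
    return f"{{{TEI_NS}}}{tag}"


def build_epidoc(entry):
    """Build a skeletal TEI/EpiDoc string from one extracted inscription entry.

    `entry` is a hypothetical dict with the keys
    'title', 'provenance', 'language' and 'transcription'.
    """
    root = ET.Element(tei("TEI"))
    header = ET.SubElement(root, tei("teiHeader"))
    file_desc = ET.SubElement(header, tei("fileDesc"))
    title_stmt = ET.SubElement(file_desc, tei("titleStmt"))
    ET.SubElement(title_stmt, tei("title")).text = entry["title"]
    publication = ET.SubElement(file_desc, tei("publicationStmt"))
    ET.SubElement(publication, tei("p")).text = "Automatically generated draft."
    source = ET.SubElement(file_desc, tei("sourceDesc"))
    ET.SubElement(source, tei("p")).text = entry["provenance"]

    text = ET.SubElement(root, tei("text"))
    body = ET.SubElement(text, tei("body"))
    edition = ET.SubElement(
        body, tei("div"), {"type": "edition", XML_LANG: entry["language"]}
    )
    ET.SubElement(edition, tei("ab")).text = entry["transcription"]
    return ET.tostring(root, encoding="unicode")


sample_xml = build_epidoc({
    "title": "Inscription 1 (placeholder)",
    "provenance": "Placeholder provenance",
    "language": "hy",
    "transcription": "PLACEHOLDER TEXT",
})
```

The generated string can then be checked against the EpiDoc schema in the validation step, for instance with `lxml`, which is listed among the suggested libraries.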

References

  • EpiDoc Guidelines and Schema: https://epidoc.sf.net/
  • Calfa hye-tesseract OCR: https://github.com/Calfa/hye-tesseract
  • TEI Consortium, TEI Guidelines: https://tei-c.org/release/doc/tei-p5-doc/en/
  • Relevant publication on Armenian epigraphy (http://serials.flib.sci.am/openreader/vimagrutyun_5/book/content.html)
  • Python libraries: pdfminer.six, lxml, spaCy (or other NLP tools)

Requirements

  • Proficiency in Python programming
  • Basic knowledge of XML, TEI.

Type: Master thesis project (30 ECTS)
Sections: Computer Science, Data Science, Digital Humanities
Supervisors: Frédéric Kaplan, Emanuela Boros, Maud Ehrmann
Number of students: 1

Context: The Impresso Project enriches large collections of radio and newspaper archives using image and text processing techniques. Traditional named entity recognition and linking is one of the many processing steps applied to a dataset of over 130 digitised historical newspapers containing millions of pages and images (15B tokens). While location names are recognised and linked to Wikidata, a crucial dimension is still missing: accurately georeferencing relevant location names, so that the spatial dimension can be integrated with the temporal one.

Objective. This project aims to scale multilingual location detection and georeferencing across the Impresso corpus, with a particular focus on sub-city levels.

Main Steps

  • Familiarising yourself with the Impresso project and data.
  • Literature Review: Explore existing research on multilingual location detection and georeferencing.
  • Data Analysis: Examine location names already recognised in the corpus, analyzing their statistical profiles, common errors, and areas for improvement, particularly at the sub-city level. Additionally, identify which location entities in historical newspapers are most relevant for mapping purposes.
  • System Implementation: Develop or adapt a system for fine-grained location name recognition and linking. Various directions could be followed.
  • Relevance Filtering: Design a method to determine which recognised place names are meaningful and should be georeferenced.
  • Evaluation: Assess the system’s performance using appropriate metrics and benchmarks.
  • Application: Deploy the system on the entire Impresso corpus.

The student can leverage tools such as the T-Res library, DeezyMatch, the HIPE entity evaluation pipeline, and the TopRes19th dataset.
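
For the evaluation step, one metric commonly used in georeferencing work is the share of predictions that fall within a given distance of the gold coordinates. The sketch below computes it with the haversine formula; the threshold value is an assumption, and for sub-city georeferencing a much smaller threshold than the one in the example would be appropriate.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def accuracy_at_km(predicted, gold, threshold_km):
    """Fraction of predictions within `threshold_km` of the gold coordinates."""
    hits = sum(
        1 for p, g in zip(predicted, gold)
        if haversine_km(p[0], p[1], g[0], g[1]) <= threshold_km
    )
    return hits / len(gold)


lausanne = (46.5197, 6.6323)
geneva = (46.2044, 6.1432)
paris = (48.8566, 2.3522)
score = accuracy_at_km([geneva, paris], [lausanne, lausanne], threshold_km=161.0)
```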

Requirements: Knowledge of machine learning (ML) and deep learning (DL), familiarity with natural language processing, proficiency in Python, experience with a DL framework (preferably PyTorch), interest in historical data.

A few references:

  • Ardanuy, M. C., Nanni, F., Beelen, K., & Hare, L. (2023). The past is a foreign place: Improving toponym linking for historical newspapers. CEUR Workshop Proceedings (http://ceur-ws.org), ISSN 1613-0073.
  • Meijers, E., & Peris, A. (2019). Using toponym co-occurrences to measure relationships between places: Review, application and evaluation. International Journal of Urban Sciences, 23(2), 246-268.

Type: BA or MSc (8-12 ECTS) semester project
Sections: Computer Science, Data Science, Digital Humanities
Supervisors: Maud Ehrmann, Pauline Conti, Emanuela Boros
Number of students: 1

Context: The Impresso Project features a dataset of around 135 digitised historical newspapers containing approximately 4 million images. These images have no labels, and only 10% of them have a caption, two aspects that hinder their retrieval.

Objective. The objective of this project is to generate informative and accurate textual descriptions (ideally in English, French and German) of these images and to determine an evaluation method to assess the quality of the descriptions. This process is referred to as image captioning. Descriptions can have the format of captions or be slightly longer.

Challenges: 1) the images were extracted from historical newspapers spanning 200 years and are therefore of very uneven quality; 2) the images depict very diverse content, and we would like high-quality descriptions across the full range of topics and styles.

Main steps: The project could follow the following steps:  

  1. Investigating which large vision-language models are good candidates for the task. Examples include Flamingo, PaliGemma (Google), BLIP (Salesforce), CLIP-ViT (OpenAI), GIT, Florence, and Phi (Microsoft), and LLama3v;
  2. Building a test dataset from images (of a certain type) that already have a caption and from images that do not have a caption;
  3. Determining an evaluation method. Basically, what do we define as a good description? Criteria could be: language correctness; accuracy, i.e. the caption provides a correct description of the image; informativeness, i.e. the caption provides elements of information that are useful and interesting; ‘texture’/tone: the tone of the caption is adapted to the image; for those images that already have captions: how close are the generated ones to the original ones.
  4. Determining a baseline and a series of experiments that make sense given the context of the project and the historical nature of the material
  5. Analysing the results and drawing conclusions on what works best in which setting for this type of material.
  6. If the project is accepted as an MSc semester project, an additional step of fine-tuning (training) an LLM for image captioning might be considered.
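
For the images that already have a caption, the closeness criterion in step 3 could be approximated, as a first baseline, by token-level F1 between the generated and the original caption. This is only one possible choice and an assumption of this sketch; it ignores synonyms and word order, and it does not cover the other criteria (informativeness, tone).

```python
from collections import Counter


def token_f1(generated, reference):
    """Token-overlap F1 between a generated caption and a reference caption."""
    gen_tokens = generated.lower().split()
    ref_tokens = reference.lower().split()
    if not gen_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it
    # appears in both captions.
    overlap = sum((Counter(gen_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


example_score = token_f1("a map of europe", "map of france")
```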

Additional material:

  • a dataset of 7200 images annotated with their types (drawing, map, photo, graphs, etc.)
  • for each image, pre-computed embeddings from four different models.

Requirements: Knowledge of machine learning (ML) and deep learning (DL), familiarity with natural language processing, proficiency in Python, experience with a DL framework (preferably PyTorch).

Type: MSc (12 ECTS) semester project
Sections: Computer Science, Data Science, Digital Humanities
Supervisors: Alexander Rusnak
Number of students: 1

Context: There is a current paradigm in 3D scene understanding leveraging vision models, particularly ones that produce semantic vectors like CLIP, on the images that are used to create 3D scans before projecting the derived features onto the 3D scene structure.

However, for many “in the wild” 3D scenes or datasets, the images from which the scene was derived are unavailable, yet being able to use these scenes as training data is key to creating a foundation-scale model. To apply the same projection approach to a point cloud (or mesh), one has to render synthetic images of the 3D model. For many scenes, especially where the points are sparse, these renders look like pictures of a point cloud rather than realistic images. When a VLM trained only on natural images is applied, the semantic vector reflects this: given the features from a picture of a point cloud of a chair, the closest text in the embedding space would be “a point cloud of a chair”, whereas for optimal projection performance the vector should simply represent “a chair”.

Objective: Evaluate the performance differential between real and synthetic images of 3D scenes with base VLMs and then finetune the VLMs to improve their performance on synthetic images. 

Research Questions:

  • Which VLMs are most effective out of the box on synthetic images? 
  • How much can we improve their performance on synthetic scenes? 
  • What is the best way to develop pixel-level features (cropping around segments or using a default pixelwise encoder)?

Main Steps:

  1. Take some 3D datasets which also have the natural images and the poses of the images.
  2. Take synthetic images of the point cloud from the same poses as the natural images.
  3. Finetune some VLM(s) using these paired natural and synthetic images to minimize the distance between the two embeddings.
  4. Publish a paper about the results and open-source the best model on Hugging Face.
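
The fine-tuning objective in step 3 can be read as minimising a distance between the embedding of a natural image and the embedding of the synthetic render taken from the same pose. The sketch below computes a mean cosine-distance loss over such pairs in plain Python; the choice of cosine distance is an assumption, and in practice the same quantity would be computed on tensors inside a PyTorch training loop.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def alignment_loss(natural_embeddings, synthetic_embeddings):
    """Mean cosine distance over paired natural/synthetic image embeddings."""
    pairs = list(zip(natural_embeddings, synthetic_embeddings))
    return sum(1.0 - cosine_similarity(n, s) for n, s in pairs) / len(pairs)


# Two toy pairs: one perfectly aligned, one orthogonal.
loss = alignment_loss([[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
```

One design question this leaves open is whether the natural-image branch is frozen as a fixed target or trained jointly with the synthetic branch.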

References:

Requirements: Python, PyTorch, Open3D, etc

Type: MSc (12 ECTS) semester project
Sections: Computer Science, Data Science, Digital Humanities
Supervisors: Alexander Rusnak
Number of students: 1

Context: 

There is quite a bit of research on image aesthetic assessment (IAA) but far less on the assessment of 3D data, even though 3D data often represents highly aesthetically oriented objects (e.g. sculptures, cathedrals, civic buildings) and the aesthetic quality of the built landscape seems to have a strong effect on the wellbeing of its inhabitants.

The goal of the project would be to take a 3D dataset and build a pseudo-labeling pipeline to create aesthetic labels for the various points, and then train a model to predict these labels just from the point cloud structure. The labeling pipeline would likely capture images of the 3D scene, use an IAA model or vision foundation model to produce pixel-wise labels, and then project them back into the 3D scene.

Objective: Develop a model for the 3D aesthetic assessment of scenes and objects at point or superpoint level granularity.

Research Questions:

  • How capable are projected VLM features of capturing abstract categories like beauty in 3D scenes? 
  • Are IAA models more accurate for this task than generalist VLMs?
  • What is the best way of creating pixelwise features from whole image features?
  • How effective is distillation of these features from a model which evaluates 3D structure? Does this model work truly out of sample, i.e. does a model trained on sculptures work on buildings and vice versa?

Main Steps:

  1. Literature review on IAA and 3DAA
  2. Project features from VLMs and bespoke IAA models into 3D scenes and evaluate their agreement. 
  3. Distill a model for predicting these features.
  4. Test this model on out-of-domain 3D data to assess the applicability of 3D structure aesthetics across various objects / scenes.
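
Projecting pixel-wise labels back into the 3D scene amounts to finding, for each 3D point, the pixel it falls on in a rendered view. The sketch below does this with a simple pinhole camera model for points already expressed in camera coordinates. The intrinsics, the score map, and the function names are illustrative assumptions, and occlusion handling (a depth test) is deliberately left out.

```python
def project_point(point, fx, fy, cx, cy):
    """Project a 3D point in camera coordinates to pixel coordinates (u, v)."""
    x, y, z = point
    if z <= 0:
        return None  # behind the camera
    return (fx * x / z + cx, fy * y / z + cy)


def label_points(points, score_map, fx, fy, cx, cy):
    """Assign each 3D point the score of the pixel it projects to, or None."""
    height, width = len(score_map), len(score_map[0])
    labels = []
    for point in points:
        uv = project_point(point, fx, fy, cx, cy)
        if uv is None:
            labels.append(None)
            continue
        col, row = int(round(uv[0])), int(round(uv[1]))
        inside = 0 <= row < height and 0 <= col < width
        labels.append(score_map[row][col] if inside else None)
    return labels


# Toy 2x2 "aesthetic score" image and four points.
score_map = [[0.1, 0.2], [0.3, 0.4]]
points = [(0, 0, 1), (1, 1, 1), (0, 0, -1), (5, 5, 1)]
labels = label_points(points, score_map, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
```

With several views per scene, the per-point labels from each view would additionally need to be aggregated, for example by averaging.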

References:

Requirements: Python, PyTorch, Open3D, etc

Type: MSc (12 ECTS) semester project
Sections: Computer Science, Data Science, Digital Humanities
Supervisors: Alexander Rusnak
Number of students: 1

Context: Recent advances in mechanistic interpretability have substantially increased the clarity of the internal reasoning of large transformer based neural networks, allowing researchers to disentangle steps of conceptual reasoning at high levels of abstraction. In order to unlock the feature graphs which make this sort of analysis possible, it is necessary to replace certain layers within the LLM with a connected transcoder architecture and train this new system to replicate the behaviour of the LLM while attached to the remaining frozen layers. This novel approach has thus far not been applied to many different models, or in a multimodal context.

Objective: Train a transcoder replacement model to replicate the behaviour of a vision-language model, then use it to perform various forms of analysis on the internal reasoning of the VLM.

Research Questions:

  • Is it possible to train a transcoder to replicate the behaviour of a VLM?
  • Does the VLM encode modality agnostic representations of concepts in an analogous way to multilingual conceptual features?
  • Can we use the transcoder features to determine logical reasoning steps on the analysis of images, and in particular, images of text heavy documents?

Main Steps:

  1. Literature review and model selection
  2. Implement layer replacement with transcoder within candidate model or models.
  3. Run training of the transcoder replacement model.
  4. Use modified model to identify modality-unified features if possible. 
  5. Use modified model to test reasoning steps on text heavy documents.

References:

Requirements: Python, PyTorch, etc

Type: MSc (12 ECTS) semester project 
Sections: Digital Humanities, Data Science, Computer Science
Supervisor: Hamest Tamrazyan, Emanuela Boros
Number of students: 1–2

Context

This project aims to explore, retrieve, and analyse data from academic resources, focusing on a select collection of books related to epigraphy and cultural heritage. The primary objective is to gain insights into Armenian epigraphic and cultural heritage through detailed data analysis, term extraction, and database management.

Objective

This project aims to deepen the knowledge of the architectural and epigraphic significance of the church, explore innovative techniques for digitizing and visualizing cultural heritage, and contribute to the preservation and accessibility of Armenian inscriptions in Nagorno-Karabakh.

Objectives and Main steps

  1. Data Retrieval: Collect and aggregate data from academic resources, specifically targeting books and resources about epigraphy and cultural heritage.
  2. Data Cleaning and Formatting: Implement data preprocessing techniques to ensure data quality. This includes removing irrelevant or corrupt data, handling missing values, and standardizing formats.
  3. Database Setup: Design and implement a database to store and manage retrieved data efficiently. The database should allow easy access and manipulation of the data for analysis.
  4. Term Extraction and Analysis: Employ natural language processing (NLP) techniques to extract key terms, concepts, and thematic elements from the texts. This will help us understand the predominant themes and patterns in Armenian epigraphy and cultural heritage.
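
A first pass at the term extraction step could rank the words of each text by TF-IDF, so that terms frequent in one document but rare across the collection surface first. The sketch below is a minimal standard-library version under that assumption; the tokenisation is naive, and a real pipeline would add lemmatisation and multi-word terms.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())


def top_terms(documents, k=3):
    """Return the k highest-scoring TF-IDF terms for each document."""
    tokenized = [tokenize(doc) for doc in documents]
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(documents)
    results = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        results.append(ranked[:k])
    return results


documents = [
    "the inscription on the khachkar",
    "the inscription on the wall",
    "the church",
]
key_terms = top_terms(documents, k=1)
```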

Requirements

  • Proficiency in Python, knowledge of NLP techniques.

Significance

This project will contribute to the understanding of Armenia’s rich cultural heritage. It will provide valuable digital resources for future academic and cultural research in this field.

Spring 2025

Taken

Type: MSc (12 ECTS) Semester project 
Sections: Computer Science, Data Science, Digital Humanities
Supervisor: Alexander Rusnak
Number of students: 1

Context: The field of AI ethics has become increasingly relevant as language models have proliferated into the public sphere. 

Objective: Find novel ways of quantifying normative ethics and persistence of ethical frameworks across various scenarios. 

Type of project: Semester

Supervisor: Maud Ehrmann

Objective: Given a large archive of historical newspapers (cf. the impresso project) containing both text and image material, the objective of this project is to:

1/ Learn a model for binary image classification: map vs. non-map.
This first step will consist in:

  • the annotation of a small training set (this step is best done in collaboration with the project on image classification);
  • the training of a model by fine-tuning an existing visual model;
  • the evaluation of the said model.

UPDATE: an annotated dataset already exists.

2/ Learn a model for map classification (which country or region of the world is represented)

  • first exploration and qualification of map types in the corpus.
  • building of a training set, probably with external sources;
  • the training of a model by fine-tuning an existing visual model;
  • the evaluation of the said model.

Required skills:

  • basic knowledge in computer vision
  • ideally experience with PyTorch

Type: MSc (12 ECTS), BA (8 ECTS) Semester project 
Sections: Computer Science, Data Science, Digital Humanities
Supervisor: Paul Guhennec
Number of students: 1–2 (or more – together or separate)

Type: MSc (12 ECTS), BA (8 ECTS) Semester project 
Sections: Computer Science, Data Science, Digital Humanities
Supervisor: Paul Guhennec
Number of students: 1–2 (or more – together or separate)

Type: MSc (12 ECTS), BA (8 ECTS) Semester project 
Sections: Computer Science, Data Science, Digital Humanities
Supervisor: Paul Guhennec
Number of students: 1–2 (or more – together or separate)

Type: MSc (12 ECTS), BA (8 ECTS) Semester project 
Sections: Computer Science, Data Science, Digital Humanities
Supervisor: Paul Guhennec
Number of students: 1–2 (or more – together or separate)

Type: MSc (12 ECTS) Semester project 
Sections: Digital Humanities, Data Science, Computer Science
Supervisor: Hamest Tamrazyan, Emanuela Boros
Number of students: 1–2

Context

This project aims to explore, retrieve, and analyze data from various sources to monitor and understand Russia’s manipulation and weaponisation of cultural heritage in Ukraine. By employing data analysis techniques, this project seeks to document and provide insights into these actions, contributing to the preservation of cultural heritage and supporting international awareness and policy-making.

Objective and (Possible) Main Steps:

One can choose to analyse Wikipedia to:

  • Track Editing Histories: The revision history of contentious articles may reveal politically or ideologically motivated edits (e.g., articles about Ukrainian cities or cultural artifacts)
  • Automated Page Tracking: Use tools like Wikimedia’s API to monitor changes in articles about cultural heritage in real time.
  • Cross-Check Narratives: Compare Wikipedia content with scholarly sources and publications from multiple perspectives.
  • Investigate Talk Pages: The discussions on an article’s talk page often reveal disputes and biases.
  • Web Scraping for Content and Metadata: Use scraping libraries like BeautifulSoup or scrapy to collect article text, editor information, and metadata not available through the API.
  • Revision Analysis:

    • Compare successive revisions of articles using diff algorithms (e.g., difflib) to detect content additions, deletions, or modifications.

    • Highlight changes in sentiment, bias, or framing.

  • Semantic Page Selection: Employ embedding models like Alibaba-NLP/gte-multilingual to identify articles with semantic relevance to “cultural heritage” or “cultural manipulation.”
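
For the automated page tracking step, revision histories can be requested from the MediaWiki Action API with `action=query` and `prop=revisions`. The sketch below only builds the request URL and performs no network call; the choice of the English Wikipedia endpoint, the selected `rvprop` fields, and the function name are assumptions made for illustration.

```python
from urllib.parse import urlencode

# English Wikipedia endpoint; other language editions use their own host name.
API_URL = "https://en.wikipedia.org/w/api.php"


def revisions_query_url(title, limit=50):
    """Build a MediaWiki API URL that lists recent revisions of one article."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp|user|comment",
        "rvlimit": limit,
        "format": "json",
    }
    return f"{API_URL}?{urlencode(params)}"


url = revisions_query_url("Kyiv Pechersk Lavra", limit=10)
```

Polling such a URL on a schedule and storing the returned revision identifiers is one simple way to detect new edits; longer histories require following the API's continuation parameters.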

Requirements: Excellent Python knowledge, scraping, large language models knowledge

Significance:

This project will contribute to understanding cultural heritage manipulation in Ukraine, providing valuable insights and digital resources for future academic and cultural research. It will also contribute to the creation of strategies to protect cultural heritage in conflict zones. The methodologies developed in this project can also be applied to other conflict areas, enhancing global efforts to safeguard cultural heritage.

Fall 2024

Taken

Type: MA (12 ECTS) Research project
Sections: Computer Science, Data Science, Digital Humanities
Supervisors: Tristan Karch
Number of students: 1

Context: 

The study of urban history is a complex and multidimensional field that involves analyzing various types of historical data, including cadastre (land registry) records. Traditionally, this process has been manual and time-consuming. However, with the advent of Large Language Models (LLMs) and their ability to process and analyze vast amounts of data, there is an opportunity to automate and enhance historical discoveries. Identifying divergences between present and past data is a critical starting point for many historical investigations, as it allows researchers to uncover patterns, transformations, and anomalies in the urban landscape over time.

Cadastre Data:

Cadastre data typically includes detailed information about land ownership, property boundaries, land use, and the value of properties. This data is crucial for understanding the historical layout and development of urban areas. Importantly, all data points in cadastre records are geolocalized, which facilitates direct comparison with today’s data from sources like OpenStreetMap.

Objective:

The primary objective of this project is to develop an automated system that leverages LLM agents to compare historical cadastre data with present-day data. The LLM agent would rely on a coding assistant, as in [1], to convert natural-language hypotheses into Python programs that efficiently operate on tabular data.

Main Steps: 

  1. Data Collection: Gather historical cadastre data and current urban data from various sources.
  2. Preprocessing: Clean and preprocess the collected data to ensure compatibility and accuracy.
  3. LLM Integration: Integrate LLM agents to analyze and compare the historical and contemporary datasets.
  4. Analysis: Conduct a detailed analysis to identify significant changes and patterns in the urban landscape.
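
The kind of program an LLM agent might generate for a hypothesis such as “land use changed in this area” can be very small. The sketch below compares one field across two tables keyed by a shared location identifier; the record layout, the identifiers, and the field name are hypothetical, and real cadastre and OpenStreetMap data would first need spatial matching to obtain such shared keys.

```python
def find_divergences(historical, current, field):
    """List locations whose `field` value differs between the two datasets.

    Both inputs map a shared location identifier (an assumption of this
    sketch) to a record dict. Locations missing today are reported with None.
    """
    divergences = []
    for location_id, old_record in historical.items():
        new_record = current.get(location_id)
        old_value = old_record.get(field)
        new_value = new_record.get(field) if new_record else None
        if old_value != new_value:
            divergences.append((location_id, old_value, new_value))
    return divergences


historical = {
    "cell_12": {"land_use": "garden"},
    "cell_13": {"land_use": "house"},
    "cell_14": {"land_use": "church"},
}
current = {
    "cell_12": {"land_use": "building"},
    "cell_13": {"land_use": "house"},
}
changes = find_divergences(historical, current, "land_use")
```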

Additional Comparisons:

In addition to comparing cadastre data with present-day geolocalized data from OpenStreetMap, other comparisons can be envisioned. For instance, leveraging genealogical databases or other open registers from today can provide further insights into the socio-economic transformations and population dynamics over time.

References:

[1] Majumder, Bodhisattwa Prasad, Harshit Surana, Dhruv Agarwal, Sanchaita Hazra, Ashish Sabharwal, and Peter Clark. “Data-Driven Discovery with Large Generative Models.” arXiv, February 21, 2024. http://arxiv.org/abs/2402.13610.

Requirements:

Proficiency in data science and machine learning. Familiarity with LLMs and natural language processing. Experience with LangChain, Hugging Face, or AutoGen is a plus.

Type: Master or Bachelor research project (12/8 ECTS)
Sections: Data Science
Supervisors: Pauline Conti, Maud Ehrmann 
Number of students: 1

Ideal as an optional semester project for a data science student.

Context: The Impresso project semantically enriches large collections of radio and newspaper archives by applying image and text processing techniques. A complex pipeline of data preparation and processing steps is applied to millions of content elements, creating and manipulating millions of data points.

Objective: The aim of this project is to implement a data visualisation dashboard to enable monitoring and quality control of the different data and their processing steps. Based on different sources of information, i.e. data processing manifests, inventories and statistics, the dashboard should provide an overview of what data is at what stage of the pipeline, allow a comparative view of different processing stages and support general understanding.

The solution adopted should ideally be modular and lightweight, and will ultimately be deployed online to allow everyone from the project (and perhaps more) to follow the data processing pipeline.


Steps:

  • Understanding of the different Impresso processes and data, and the associated visualisation needs
  • Detailed review of existing open-source dashboard data visualisation tools 
  • Implementation of tools, customisation to meet needs, visualisation proposals based on opportunities
  • Test/revision loop
  • Online deployment
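
To make the "what data is at what stage" overview concrete, the following sketch aggregates processing manifests into a per-title coverage table. The manifest fields (`title`, `stage`, `items`) and the stage names are assumptions made for illustration; the actual Impresso manifests may be structured differently.

```python
from collections import defaultdict

# Hypothetical manifest entries: one per (newspaper title, processing stage).
manifests = [
    {"title": "title_a", "stage": "canonical", "items": 1200},
    {"title": "title_a", "stage": "rebuilt", "items": 1200},
    {"title": "title_a", "stage": "ner", "items": 900},
    {"title": "title_b", "stage": "canonical", "items": 500},
]
# Assumed pipeline order; the first stage serves as the reference count.
STAGES = ["canonical", "rebuilt", "ner"]


def coverage_table(entries, stages):
    """For each title, compute the share of first-stage items that reached each stage."""
    counts = defaultdict(dict)
    for entry in entries:
        counts[entry["title"]][entry["stage"]] = entry["items"]
    table = {}
    for title, by_stage in counts.items():
        base = by_stage.get(stages[0], 0)
        table[title] = {
            stage: (by_stage.get(stage, 0) / base if base else 0.0) for stage in stages
        }
    return table


table = coverage_table(manifests, STAGES)
for title, row in table.items():
    print(title, row)
```

A table of this shape can be handed directly to most dashboard tools as the data source of a heatmap or progress view.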

Requirements:

Background in data science and data visualisation, basics of software engineering, good knowledge of Python, interest in data management.

Organisation of work

  • Weekly meeting with supervisor(s)
  • The student is asked to submit a detailed project plan (envisaged steps, milestones) by the end of week 2.
  • The student is advised to keep a regular logbook of his/her work and to record progress updates, questions or problems in it before the weekly meeting (at least 4 hours before).
  • A Slack channel is used for communication outside the weekly meeting.
  • The student is advised to start his/her project report between 3 and 2 weeks before the end of the project. The report is written on Overleaf using the EPFL template.

Type: Master thesis project (30 ECTS)
Sections: Computer Science, Data Science, Digital Humanities
Supervisors: Pauline Conti, Emanuela Boros, Maud Ehrmann
Number of students: 1

Context: The Impresso project comprises a dataset of 130 digitised historical newspapers, totalling approximately 7.5 million pages. About half of these newspapers were digitised with only optical character recognition (OCR), while the rest also underwent optical layout recognition (OLR), separating text zones (lines, paragraphs, margins) and organising and labelling them into logical units corresponding to the various areas of the page (articles, headlines, section heads, tables, footnotes, etc).

For newspapers lacking OLR, the text from different content units is not differentiated, which negatively impacts the performance of NLP tools (and often compromises their relevance when applied to mixed contents). Identifying the bounding regions of the various content areas on newspaper pages could help us disentangle their respective texts and allow for separate processing.

Source: Luxemburger Wort – June 1st 1950, page 6. An example of a newspaper page from the Impresso corpus, which had OLR and where the various content areas identified are visible with blue squares.

The objective of the project is to investigate the ability of Large Vision Models (LVMs) to interpret the physical layout of digitised print documents, in this case historical newspaper facsimiles. Specifically, the project aims to test and evaluate different models at segmenting and labelling logical units on pages, such as text paragraphs, titles, subtitles, tables and images (using either semantic or instance segmentation). The project will benefit from existing OLRed data from the Impresso corpus, which could be sampled to create a training set. The project will address, among others, the following research questions:

  • Can LVMs (multimodal, vision-only) accurately recognise the layout of historical newspaper pages, and which approach is best suited?
  • Can the identified approach be generalised to a large-scale dataset spanning around 300 years with significant variation in layout?

Main Steps: 

  • Familiarise with the Impresso project and data to understand the specific needs
  • Review literature on document layout recognition and instance segmentation with the goal of identifying most promising approaches and recent models.
  • Programmatically create a dataset based on existing OLR data, showcasing layout variety across newspaper titles and over time.
  • Explore, apply, and evaluate selected multimodal and/or large vision model(s)
  • Depending on results and progress, potentially explore post-processing to order or group regions corresponding to the same articles or contents.
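
For the evaluation step, one common way to score predicted regions against the existing OLR annotations is intersection over union (IoU) with greedy matching. The sketch below assumes axis-aligned boxes given as `(x_min, y_min, x_max, y_max)` and a 0.5 threshold; both choices are illustrative, and the project may well settle on other metrics.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(predicted, ground_truth, threshold=0.5):
    """Greedily match (label, box) predictions to unmatched ground-truth regions."""
    unmatched = list(ground_truth)
    true_pos = 0
    for label, box in predicted:
        best = None
        for cand in unmatched:
            if cand[0] == label and iou(box, cand[1]) >= threshold:
                if best is None or iou(box, cand[1]) > iou(box, best[1]):
                    best = cand
        if best is not None:
            unmatched.remove(best)  # each ground-truth region is matched at most once
            true_pos += 1
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Hypothetical page: two predictions, three annotated regions.
pred = [("title", (0, 0, 10, 2)), ("paragraph", (0, 3, 10, 20))]
gold = [("title", (0, 0, 10, 2)), ("paragraph", (0, 4, 10, 20)), ("image", (12, 0, 20, 10))]
print(precision_recall(pred, gold))
```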

Requirements: Knowledge of machine learning (ML) and deep learning (DL), familiarity with computer vision, proficiency in Python, experience with a DL framework (preferably Pytorch), interest in historical data.

Type: MA (12 ECTS) or BA (8 ECTS) Research project
Sections: Computer Science, Data Science, Digital Humanities
Supervisors: Tristan Karch
Number of students: 1–2

Context: 

Large Language Models (LLMs) have revolutionised how we interact with and extract information from vast textual datasets. These models have been trained on extensive corpora, incorporating a broad spectrum of knowledge. However, a significant challenge arises in determining whether a given piece of text presents new information or reflects content that the model has already encountered during its training. 

Evaluating the novelty of texts is crucial for applications in historical research. As the volume of historical literature grows, particularly with the digitization of vast archives, historians face the challenge of navigating and synthesising information from these extensive datasets. 

Competent LLMs, adept at recognizing and integrating new knowledge, can significantly enhance this process. They would allow historians to uncover previously unseen patterns, connections, and insights, leading to groundbreaking historical discoveries and more robust applications in the digital humanities. 

Objective:

This project aims to develop algorithms that can effectively evaluate whether textual sources represent new pieces of knowledge that were never distilled in open-source LLMs during pre-training.

Research Questions:

  • What algorithms can be developed to assess the novelty of texts with respect to LLM training data?
  • How can these algorithms be combined with standard retrieval approaches [1] to improve retrieval in the domain of interest?

Main Steps:

  • Literature Review: Conduct a comprehensive review of existing approaches to novelty detection [2,3,4], hallucination detection [5]  and knowledge evaluation [6] in the context of LLMs.
  • Problem Definition: Formally define knowledge in the context of textual data: information (content) vs novel pattern of language (form)
  • Data Preparation: 
    • Novel data selection (Sources curated by EPFL – Secondary sources about Venice, EPFL thesis or new data)
    • Standard statistical analysis of data (unsupervised NLP techniques)
  • Algorithm Development: Develop algorithms for novelty detection that may include:
    • Statistical comparison methods.
    • Embedding-based similarity measures.
    • Anomaly detection techniques.
    • Token / phrase-level /  chunk  analysis.
  • Testing and Evaluation: Test the developed algorithms using various datasets to assess their accuracy and effectiveness in identifying novel texts. This includes:
    • Benchmarking against known LLM training data.
    • Evaluating performance across different genres and languages.
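
As one concrete instance of token-level analysis, the sketch below computes a score in the spirit of the Min-K% Prob method from [2]: the average log-probability of the k% least likely tokens of a text under the model. A low score suggests the text was probably not seen during pre-training. The log-probabilities here are hypothetical values; in practice they would come from the open-source LLM under study.

```python
def min_k_percent_score(token_logprobs, k=0.2):
    """Average log-probability of the k fraction of lowest-probability tokens."""
    if not token_logprobs:
        raise ValueError("token_logprobs must not be empty")
    n = max(1, round(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n


# Hypothetical per-token log-probabilities for two texts.
familiar_text = [-0.1, -0.2, -0.3, -0.1, -0.2]  # every token is well predicted
novel_text = [-0.1, -5.0, -0.3, -7.0, -0.2]     # a few highly surprising tokens

print(min_k_percent_score(familiar_text, k=0.4))
print(min_k_percent_score(novel_text, k=0.4))
```

Turning the score into a novelty decision requires a threshold, which would have to be calibrated on texts whose membership in the training data is known.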

References

[1] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks –  NeurIPS 2020.

[2] Shi, Weijia, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, and Luke Zettlemoyer. “Detecting Pretraining Data from Large Language Models.” arXiv, March 9, 2024. http://arxiv.org/abs/2310.16789.

[3] Golchin, Shahriar, and Mihai Surdeanu. “Data Contamination Quiz: A Tool to Detect and Estimate Contamination in Large Language Models.” arXiv, February 10, 2024. http://arxiv.org/abs/2311.06233.

[4] Hartmann, Valentin, Anshuman Suri, Vincent Bindschaedler, David Evans, Shruti Tople, and Robert West. “SoK: Memorization in General-Purpose Large Language Models.” arXiv, October 24, 2023. https://doi.org/10.48550/arXiv.2310.18362.

[5] Farquhar, Sebastian, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. “Detecting Hallucinations in Large Language Models Using Semantic Entropy.” Nature 630, no. 8017 (June 2024): 625–30. https://doi.org/10.1038/s41586-024-07421-0.

[6] Wang, Cunxiang, Sirui Cheng, Qipeng Guo, Yuanhao Yue, Bowen Ding, Zhikun Xu, Yidong Wang, Xiangkun Hu, Zheng Zhang, and Yue Zhang. “Evaluating Open-QA Evaluation.” arXiv, October 23, 2023. https://doi.org/10.48550/arXiv.2305.12421.

Requirements: Good programming skills, a strong interest in LLM research, experience with LLMs is a plus.

Type: MA Research project or MA thesis
Sections: Computer Science, Data Science, Digital Humanities
Supervisors: Pauline Conti, Emanuela Boros, Maud Ehrmann
Number of students: 1

Excerpt of the Swiss Press Bibliography of Fritz Blaser (categories on top, entries at the bottom)


The Impresso project applies natural language and computer vision processing techniques to enrich large collections of radio and newspaper archives and develops new ways for historians to explore and use them. Exploring, analysing and interpreting historical media sources and their enrichment is only possible with contextual information about the sources themselves (e.g. what is the political orientation of a newspaper), and the processes applied to them (what is the accuracy of the tools that produced this or that enrichment).

Information on newspapers, or metadata, already exists but can always be supplemented. In this respect, the “Swiss Press Bibliography”, published by Fritz Blaser in 1956, is a treasure trove of information on the origins and history of Swiss newspapers. This bibliography covers 483 newspapers and around a thousand periodicals published in Switzerland between 1803 and 1958 and documents them in great detail according to a given template – a database on paper.

The objective of the project is to extract the semi-structured information from Blaser’s newspaper bibliography (PDF files) and build a lightweight database (possibly in JSON only, or graph DB).

The extracted information will be used to

  • document the newspapers present in the Impresso web application;
  • support the study of the newspaper ecosystem in Switzerland at that time, e.g. by studying clusters of publications by political orientation over time, tracking publishers or editors, etc.

Steps 

  • Review tools that can be used to correct/redo the OCR of PDF files and select one;
  • Define a data model based on the information contained in the bibliography;
  • Extract and systematically store the information;
  • Devise a way of assessing the quality of the extraction process;
  • If time permits, carry out an initial analysis of the database created.
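
To give an idea of what a lightweight JSON-only database could look like, here is a minimal sketch of a data model with a simple completeness check, which could serve as one indicator of extraction quality. All field names and the sample record are hypothetical; the actual model has to be derived from the template used in the bibliography.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Publication:
    """One bibliography entry; every field name here is an assumption."""
    title: str
    place: Optional[str] = None
    first_year: Optional[int] = None
    last_year: Optional[int] = None
    political_orientation: Optional[str] = None
    publishers: List[str] = field(default_factory=list)
    source_page: Optional[int] = None  # page of the PDF the entry was extracted from


def to_json(records):
    """Serialise records to a JSON string, keeping accented characters readable."""
    return json.dumps([asdict(r) for r in records], ensure_ascii=False, indent=2)


def completeness(records, field_name):
    """Share of records in which the given field is filled in."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if getattr(r, field_name) not in (None, [], ""))
    return filled / len(records)


# Placeholder entries, not real bibliography data.
entries = [
    Publication(title="Example Gazette", place="Example Town", first_year=1850, source_page=12),
    Publication(title="Another Paper", source_page=13),
]
print(to_json(entries))
print(completeness(entries, "place"))
```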

If taken as a Master project:

  • Additional steps:
    • Perform named entity recognition and linking on the information present in some descriptive fields
    • Conduct a first analysis of the database, e.g.
      • Map printing locations  of newspapers in Switzerland, and their evolution through time
      • Create a network of main actors (editors, publishers)
      • …and more, this is a very rich source.
  • Similar sources at the European level could be integrated.

This project will be done in collaboration with researchers from the History Department of UNIL, members of the Impresso project.

Requirements

Good knowledge of Python, basics of software and data engineering, interest in historical data. Medium to good knowledge of French or German is required.

Organisation of work

  • Weekly meeting with supervisor(s)
  • The student is asked to submit a detailed project plan (envisaged steps, milestones) by the end of week 2.
  • The student is advised to document his/her work in a logbook regularly. Updates on progress, potential questions or problems will be listed in the logbook before the weekly meeting (at least 4 hours before).
  • A Slack channel is used for communication outside the weekly meeting.
  • The student is advised to start his/her project report between 3 and 2 weeks before the end of the project. The report is written on Overleaf using the EPFL template.

Type: MSc (12 ECTS), BA (8 ECTS) Research project
Sections: Computer Science, Data Science, Digital Humanities
Supervisors: Isabella di Lenardo, Raimund Schnürer 
Number of students: 1–2

Context: 

In recent years, various features have been extracted from historical maps thanks to advancements in machine learning. While the content of maps is relatively well studied, the elements around the maps still deserve some attention. These elements are used, amongst others, for decoration (e.g. ornamentation, cartouches), orientation (e.g. scale bar, wind rose, north arrow), illustration (e.g. heraldry, figures, landscape scenes), and description (e.g. title, explanations, legend). The analysis of the style and arrangement of these elements will give valuable hints about the cartographer’s background.

Objective:

In this project, map layout elements shall be analysed in depth using a given dataset of 400,000 historical maps.

Main Steps:

  • Review literature about extracting map layout elements
  • Detect map layout elements in historical maps using artificial neural networks (e.g. segmentation)
  • Find similar elements between maps (e.g. by t-SNE)
  • Identify clusters among authors, between different regions and time periods
  • Visualize these connections
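
The similarity step can be sketched as a nearest-neighbour search over feature vectors of the detected elements. Note that this uses plain cosine similarity rather than t-SNE, which is mainly a projection technique for visualisation. The element identifiers and three-dimensional vectors are hypothetical; real embeddings would come from a neural network and be much larger.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(query_id, embeddings, top_n=2):
    """Rank all other elements by cosine similarity to the query element."""
    query = embeddings[query_id]
    scored = [
        (other_id, cosine(query, vec))
        for other_id, vec in embeddings.items()
        if other_id != query_id
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical embeddings of layout elements cropped from three maps.
embeddings = {
    "map1_cartouche": [1.0, 0.0, 0.2],
    "map2_cartouche": [0.9, 0.1, 0.25],
    "map3_scalebar": [0.0, 1.0, 0.0],
}
print(most_similar("map1_cartouche", embeddings))
```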

Research Questions:

  • How accurately can the elements be detected on historic maps?
  • Which visual properties are suited to find similarities between the elements?
  • Which connections exist between different maps?

Requirements: 

Good programming skills, familiarity with machine learning, interest in historical maps

Type: MSc research project (12 ECTS) or MSc project
Sections: Data Science, Computer Science, Digital Humanities
Supervisors: Emanuela Boros, Maud Ehrmann 
Number of students: 1

Figure: Historical Instruction Mining for (finetuning) Large Language Models (Generated with Midjourney)

Context

Historical collections present multiple challenges that stem from the quality of digitization, the deterioration of documents over time, poor-quality printing materials, or inaccurate automatic processes such as optical character recognition (OCR) or optical layout recognition (OLR). Moreover, historical collections pose another challenge because the documents are spread over a period long enough to be affected by language change. This is especially true in the case of Western European languages, which only acquired their modern spelling standards roughly around the 18th or 19th centuries.

At the same time, language models (LMs) have been leveraged for several years now, obtaining state-of-the-art performance in the majority of NLP tasks, generally by being fine-tuned on downstream tasks (such as entity recognition). LLMs, or instruction-following models, have since taken over, with relatively new capabilities for solving some of these tasks in a zero-shot manner through prompt engineering. For example, the generative pre-trained transformer (GPT) family of LLMs is a series of increasingly powerful and large-scale neural network architectures developed by OpenAI. Starting with GPT, subsequent models such as ChatGPT and GPT-4 have grown substantially in size. This increase in size allows the models to capture more intricate patterns in the training data, resulting in better performance on various tasks (like acing exams). Nevertheless, they seem to fail at understanding and reasoning when they handle historical documents.

This project aims at building a dataset in a semi-automatic manner to improve the application of LLMs to historical data analysis, in the context of the project impresso – Media Monitoring of the Past II. Beyond Borders: Connecting Historical Newspapers and Radio.

Research Questions

  • Is it feasible to create a dataset for training LLMs to better comprehend historical documents, using semi-automatic or automatic methods?
  • Can a specialized, resource-efficient LLM effectively process and understand noisy, historical digitized documents?

Objective

To develop an instruction-based dataset to improve LLMs’ capabilities in interpreting historical documents. The focus will be on sourcing and analyzing historical Swiss and Luxembourgish newspapers (spanning 200 years) and other historical collections in ancient Greek or Latin.

Instruction/Prompt: When was the Red Cross founded?

Example Answer: 1864

Instruction / Prompt: Given the following excerpt from a Luxembourgish newspaper from 1919, identify the main event and key figures involved. Excerpt: “En 1919, la Grande-Duchesse Charlotte est montée sur le trône du Luxembourg, succédant à sa sœur, la Grande-Duchesse Marie-Adélaïde, qui avait abdiqué en raison de controverses liées à la Première Guerre mondiale.”

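Example Response: Main event: the accession of Grand Duchess Charlotte to the throne of Luxembourg in 1919. Key figures: Grand Duchess Charlotte and her sister, Grand Duchess Marie-Adélaïde.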
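
Instruction/response pairs like the ones above are commonly stored one JSON object per line (JSONL). The sketch below shows such a serialisation with a basic validity check; the field names `instruction`, `input` and `output` follow a widespread convention but are an assumption here, not a format fixed by the project.

```python
import json

REQUIRED_FIELDS = ("instruction", "output")

records = [
    {"instruction": "When was the Red Cross founded?", "input": "", "output": "1864"},
]


def is_valid(record):
    """A record is usable if its instruction and output are non-empty strings."""
    return all(
        isinstance(record.get(name), str) and record[name].strip()
        for name in REQUIRED_FIELDS
    )


def to_jsonl(rows):
    """Serialise valid records only, one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in rows if is_valid(r))


# An automatically generated record with an empty answer is filtered out.
incomplete = {"instruction": "Who abdicated in 1919?", "input": "", "output": ""}
jsonl = to_jsonl(records + [incomplete])
print(jsonl)
```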

Main Steps

  1. Data Curation:
    • Collect OCR-based datasets.
    • Analyze historical newspaper articles to understand common features and challenges.
  2. Dataset Creation:
    • Decide on what type of instruction should be generated and utilise other existing LLMs such as T5, BART, etc., to generate instructions (or questions) from Swiss historical documents, similar to the method presented in ArchivalQA.
  3. Model Training/Fine-Tuning:
    • Train or fine-tune a language model like LLaMA on this dataset.
  4. Evaluation:
    • Assess LLMs’ performance on NLP tasks (NER, EL) using historical documents.
    • Compare models trained on the new dataset with those trained on standard datasets.
    • Employ metrics like accuracy, perplexity, F1 score.
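
Two of the metrics mentioned above can be sketched in a few lines. The F1 variant shown is strict entity-level matching (same span and same type), and the entity spans and log-probabilities are hypothetical; the actual evaluation protocol may use relaxed matching or existing scoring tools.

```python
import math


def precision_recall_f1(predicted, gold):
    """Strict entity-level scores over sets of (start, end, type) tuples."""
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def perplexity(token_logprobs):
    """Exponential of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical NER output versus gold annotations for one article.
pred_entities = {(0, 2, "PER"), (5, 6, "LOC"), (8, 9, "ORG")}
gold_entities = {(0, 2, "PER"), (5, 6, "LOC"), (10, 11, "PER"), (12, 13, "LOC")}
print(precision_recall_f1(pred_entities, gold_entities))

# A model assigning probability 0.5 to every token has perplexity 2.
print(perplexity([math.log(0.5)] * 4))
```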

Requirements

  • Proficiency in Python, ideally PyTorch.
  • Strong writing skills.
  • Commitment to the project.

Output

  • Potential publications in NLP and historical document processing.
  • Contribution to advancements in handling historical texts with LLMs.

Deliverables

  • A comprehensive dataset for training LLMs on historical texts.
  • A report or paper detailing the methodology, findings, and implications of the project.

Postponed

Type: MA (12 ECTS), BA (8 ECTS) Research project
Sections: Computer Science, Data Science, Digital Humanities
Supervisors: Tristan Karch
Number of students: 1–2

Context: 

Language shapes the way we take actions. We use it all the time, to plan our day, organize our work and most importantly, to think about the world. Through language, we not only communicate with others but also internally process and understand our experiences. This ability to think about the world through language is also what enables us to engage in counterfactual reasoning [1], imagining alternative scenarios and outcomes to refine and enhance our experience of the world.

There is a growing body of work that investigates the deep interactions between language and decision-making systems [2]. In this context, large language models (LLMs) are used to design autonomous agents [3,4] that achieve complex tasks involving different reasoning patterns in textual interactive environments [5,6]. Such agents are equipped with different mechanisms such as reflexive [7] and memory [8] modules to continually adapt to new sets of tasks and foster generalisation. These modules rely on efficient prompts that help agents combine their environmental trajectories with their foundational knowledge of the world to solve advanced tasks.

Objective:

The objective of this project is to design and evaluate counterfactual learning mechanisms in LLM agents evolving in textual environments.

Research Questions:

  • Can we design reflexive mechanisms that autonomously generate counterfactuals from behavioral traces of agents evolving in textual environments?
  • What is the effect of counterfactuals on exploration? Adaptation? Generalization?

Main Steps:

  • Interdisciplinary literature review (LLM agents, language and reasoning, counterfactual reasoning);
  • Get familiar with benchmarks (ScienceWorld [6], ALFWorld [5], others?);
  • Re-implement baselines: Reflexion [7] and Clin [8];
  • Design counterfactual generation;
  • Derive metrics to analyze the impact of the proposed approach.

Requirements: Good programming skills; experience working with RL and LLMs is a plus.

References:

[1] The Functional Theory of Counterfactual Thinking – K. Epstude and N. Roese, Pers Soc Psychol Rev. 2008 May;12(2):168-92. doi: 10.1177/1088868308316091. PMID: 18453477; PMCID: PMC2408534.

[2] Language and Culture Internalisation for Human-Like Autotelic AI – Cédric Colas, Tristan Karch, Clément Moulin-Frier, Pierre-Yves Oudeyer. Nature Machine Intelligence.

[3] ReAct: Synergizing Reasoning and Acting in Language Models – Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao, https://arxiv.org/abs/2210.03629

[4] Language Models are Few-Shot Butlers – Vincent Micheli, Francois Fleuret, https://arxiv.org/abs/2104.07972

[5] ALFWorld: Aligning Text and Embodied Environments for Interactive Learning – Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, Matthew Hausknecht, https://arxiv.org/abs/2010.03768

[6] ScienceWorld: Is your Agent Smarter than a 5th Grader? – Ruoyao Wang, Peter Jansen, Marc-Alexandre Côté, Prithviraj Ammanabrolu, https://arxiv.org/abs/2203.07540

[7] Reflexion: Language Agents with Verbal Reinforcement Learning – Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, Shunyu Yao, https://arxiv.org/abs/2303.11366

[8] CLIN: A Continually Learning Language Agent for Rapid Task Adaptation and Generalization – Bodhisattwa Prasad Majumder, Bhavana Dalvi Mishra, Peter Jansen, Oyvind Tafjord, Niket Tandon, Li Zhang, Chris Callison-Burch, Peter Clark. https://arxiv.org/pdf/2310.10134

Spring 2024

Available

Type: MSc Semester project
Sections: Architecture, Digital Humanities, Data Science, Computer Science
Supervisors: Beatrice Vaienti, Hamest Tamrazyan
Keywords: 3D modelling
Number of students: 1–2 (min. 12 ECTS in total)

Context

The Armenian cultural heritage in Artsakh [1] is in danger after explicit threats of irreversible destruction coordinated by Azerbaijan authorities [2]. As part of a series of actions coordinated by the EPFL, the Digital Humanities Institute is currently prototyping methods to offer rapid deployment of DH technology in situations of crisis. As part of these actions, over 200 Armenian inscriptions from monuments of Armenian cultural heritage in Nagorno-Karabakh have been systematized and digitized, together with essential information: language data (including diplomatic and interpretive transcriptions), the translation into English, the location of the inscription on the monument (if applicable), geographical and chronological data, the type of monument, and the type of inscription. This digitized data will not only help preserve the invaluable inscriptions but can also be used for further investigation and research. The aim of this project is to create an accurate 3D model of the church, precisely locate the inscriptions within it, and contribute to preserving, studying, and promoting Armenian cultural heritage in Nagorno-Karabakh.

Figure: By Julian Nyča – Own work, CC BY-SA 3.0

Research questions

  • How can advanced imaging and 3D modelling technologies be utilized to accurately capture and represent the intricate details of the church and the inscriptions?
  • What methods can be employed to ensure the precise alignment and placement of the digitized inscriptions within the 3D model of the church?
  • How can the digital representation of the church and its inscriptions be effectively integrated with the database of Armenian inscriptions in Nagorno-Karabakh?
  • What insights can be gained from analysing the spatial distribution and arrangement of the inscriptions within the church, shedding light on the historical and cultural context in which they were created?
  • How can the integration of the 3D model and the digitized inscriptions contribute to the preservation, documentation, and study of Armenian epigraphic heritage in Nagorno-Karabakh?

Objective

This project aims to deepen the knowledge of the architectural and epigraphic significance of the church, explore innovative techniques for digitizing and visualizing cultural heritage, and contribute to the preservation and accessibility of Armenian inscriptions in Nagorno-Karabakh.

Main steps

  1. Research and Planning: Conduct thorough research on the church and its inscriptions, architectural features, and existing documentation, and develop a detailed plan for creating a 3D model and locating the inscriptions within the model.
  2. Study of the plan, section, and elevation views from a survey: Proportional and compositional study aimed at the 3D reconstruction.
  3. 3D reconstruction: A low level of detail mesh already exists, but with this project, the student will try to transfer the information from the architectural survey to a refined architectural 3D model with interiors and exteriors.
  4. Data Processing and Digitization: Process the collected data to digitally represent the church and the inscriptions.
  5. Inscription Localization: Analyse the collected data and determine the precise location of each inscription within the 3D model of the church.
  6. Data Integration: Ensure that the essential information, such as language data, translations, geographical and chronological data, monument and inscription types, bibliographic references, and photographs, are correctly linked to the localised inscription.

Explored methods

  • Proportional analysis
  • 3D modelling using Rhino
  • 3D segmentation and annotation with the inscription
  • Exploration of visualization methodologies for this additionally embedded information

Requirements

  • Previous experience with architectural 3D modelling using Rhino.


[1] A toponym used by the local Armenians to refer to the Nagorno-Karabakh territory.
[2] European Parliament resolution on the destruction of cultural heritage in Nagorno-Karabakh (2022/2582(RSP)), dated 09.03.2022.

Type: MSc Semester project (12 ECTS) or MSc project
Sections: Data Science, Computer Science, Digital Humanities
Supervisors: Frédéric Kaplan, Emanuela Boros, Maud Ehrmann 
Number of students: 1

Context

This project aims to explore the application of complex systems theory to Large Language Models (LLMs), like GPT. It will focus on understanding how concepts such as trajectory divergence, attractors, and chaotic sequences manifest in these advanced AI models. The project will use an LLM trained via reinforcement learning, providing a unique lens to examine the behavior and characteristics of these complex systems.
 
Objectives
  1. To investigate trajectory divergence in LLMs: We will study how minor variations in input (such as small changes in text) can lead to significantly different outputs, illustrating sensitivity to initial conditions.
  2. To identify attractors in LLMs: We will explore if there are recurring themes or patterns in the model’s outputs that act as attractors, regardless of varied inputs.
  3. To analyze chaotic sequences in model responses: By feeding a series of chaotic or nonlinear inputs, we aim to understand how the model’s responses demonstrate characteristics of chaotic systems.
  4. To utilize reinforcement learning in training LLMs: To observe how the introduction of reward-based training influences the development of these complex behaviors.
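
One simple way to quantify trajectory divergence is to compare, token by token, the outputs produced for two nearly identical inputs. The sketch below is a minimal illustration under that assumption; the two "outputs" are hand-written stand-ins for model generations, and the measure itself is only one of many possible choices.

```python
def first_divergence(seq_a, seq_b):
    """Index of the first position where two token sequences differ, or None."""
    for i, (x, y) in enumerate(zip(seq_a, seq_b)):
        if x != y:
            return i
    # Identical on the common prefix: they differ only if lengths differ.
    return None if len(seq_a) == len(seq_b) else min(len(seq_a), len(seq_b))


def divergence_curve(seq_a, seq_b):
    """Running fraction of mismatching positions after each generation step."""
    curve, mismatches = [], 0
    for step, (x, y) in enumerate(zip(seq_a, seq_b), start=1):
        mismatches += x != y
        curve.append(mismatches / step)
    return curve


# Hypothetical outputs for two prompts that differ by a single word.
output_a = "the cat sat on the mat".split()
output_b = "the cat lay on a rug".split()

print(first_divergence(output_a, output_b))
print(divergence_curve(output_a, output_b))
```

A curve that stays near zero indicates stable behaviour, while one that rises quickly after a small input change is the kind of sensitivity to initial conditions the first objective refers to.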
 
Methodology
 
  1. Data Collection and Preparation: We will generate a diverse set of input data to feed into the LLM, ensuring a range that can test for trajectory divergence and chaotic behavior.
  2. Model Training: An LLM will be trained using reinforcement learning techniques to adapt its response strategy based on predefined reward systems.
  3. Experimentation: The trained model will be subjected to various tests, including slight input modifications and chaotic input sequences, to observe the outcomes and patterns.
  4. Analysis and Visualization: Data analysis tools will be used to interpret the results, and visualization techniques will be applied to illustrate the complex dynamics observed.

Expected Outcomes
  • A deeper understanding of how complex system theories apply to LLMs.
  • Insights into the stability, variability, and predictability of LLMs.
  • Identification of potential attractor themes or patterns in LLM outputs.
  • A contribution to the broader discussion on AI behavior and its implications.

Requirements

Excellent technical skills, previous practical experience with LLMs and passion for the subject.

Type: MSc Semester project (12 ECTS) or MSc project
Sections: Data Science, Computer Science, Digital Humanities
Supervisors: Frédéric Kaplan, Emanuela Boros, Maud Ehrmann 
Number of students: 1

Context

The project aims to develop a compiler for GPT models that transcends traditional prompt engineering, enabling the creation of more structured and complex written pieces. Drawing parallels from the early days of computer programming, where efficiency and compactness in commands were crucial due to memory and space constraints, this project seeks to elevate the way we interact with and utilize Large Language Models (LLMs) for text generation.

Objectives

  1. To Create a Compiler for Enhanced Text Generation: Develop a compiler that translates user intentions into complex, structured narratives, moving beyond simple prompt responses.
  2. To Establish ‘Libraries’ for Complex Writing Projects: Similar to programming libraries, these would contain comprehensive information about characters, settings, and narrative logic, which can be loaded at the start of a writing session.
  3. To Facilitate Hierarchical Abstraction in Writing: Implement a system that allows for the creation of high-level abstractions in storytelling, akin to programming.
  4. To Enable Specialization in Narrative Elements: Support the development of specialized modules for characters, settings, narrative logic, and stylistic effects.

Methodology

  • Compiler Design: Designing a compiler capable of interpreting and translating complex narrative instructions into executable text generation tasks for LLMs like GPT.
  • Library Development: Creating a framework for users to build and store detailed narrative elements (characters, settings, etc.) that can be referenced by the compiler.
  • Abstraction Layers Implementation: Developing a system to manage and utilize different levels of narrative abstraction.
  • Integration with Various LLMs: Ensuring the compiler is adaptable to different LLMs, including OpenAI, Google, or open-source models.
  • Testing and Iteration: Conducting extensive testing to refine the compiler and its ability to handle complex writing tasks.
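
The library and compiler ideas can be illustrated with a very small sketch: narrative elements are stored as reusable objects, and a "compile" step resolves references to them and emits a prompt for an LLM. Every class, function and field name below is hypothetical and only meant to convey the analogy with programming libraries, not a proposed design.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Character:
    name: str
    traits: List[str]


@dataclass
class Setting:
    name: str
    description: str


@dataclass
class Library:
    """Reusable narrative elements, loaded at the start of a writing session."""
    characters: Dict[str, Character]
    settings: Dict[str, Setting]


def compile_scene(library, setting_id, character_ids, beat):
    """Resolve references against the library and emit a prompt string.

    An unknown identifier raises KeyError, much like an unresolved symbol.
    """
    setting = library.settings[setting_id]
    cast = [library.characters[cid] for cid in character_ids]
    lines = [f"Setting: {setting.name}. {setting.description}", "Characters:"]
    lines += [f"- {c.name}: {', '.join(c.traits)}" for c in cast]
    lines.append(f"Write the following scene: {beat}")
    return "\n".join(lines)


library = Library(
    characters={"ada": Character("Ada", ["curious", "stubborn"])},
    settings={"harbour": Setting("The harbour", "A foggy port at dawn.")},
)
prompt = compile_scene(library, "harbour", ["ada"], "Ada finds a sealed letter.")
print(prompt)
```

The resulting string would then be sent to whichever LLM backend is configured, which keeps the library and compilation logic independent of the model.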

Expected Outcomes

  • A tool that allows for the creation of detailed and structured written works using LLMs.
  • A new approach to text generation that mirrors the evolution and specialization seen in computer programming.
  • Contributions to the field of AI-driven creative writing, enabling more complex and nuanced storytelling.
 
Requirements

Excellent technical skills, previous practical experience with LLMs and passion for the subject. 
  • guidance is a programming paradigm that offers superior control and efficiency compared to conventional prompting and chaining.

Type: MSc Semester project (12 ECTS) or MSc project
Sections: Data Science, Computer Science, Digital Humanities
Supervisors: Emanuela Boros, Maud Ehrmann 
Number of students: 1

Figure: Historical Instruction Mining for (finetuning) Large Language Models (Generated with Midjourney)

Context

Historical collections present multiple challenges that stem from the quality of digitization, the deterioration of documents over time, poor-quality printing materials, or inaccurate processing such as optical character recognition (OCR) or optical layout recognition (OLR). Moreover, historical collections pose another challenge because documents are distributed over a long enough period of time to be affected by language change. This is especially true in the case of Western European languages, which only acquired their modern spelling standards roughly around the 18th or 19th centuries. At the same time, language models (LMs) have been leveraged for several years now, obtaining state-of-the-art performance in the majority of NLP tasks, generally by being fine-tuned on downstream tasks (such as entity recognition). LLMs or instruction-following models have taken over with relatively new capabilities in solving some of these tasks in a zero-shot manner through prompt engineering. For example, the generative pre-trained transformer (GPT) family of LLMs refers to a series of increasingly powerful and large-scale neural network architectures developed by OpenAI. Starting with GPT, subsequent models such as ChatGPT and GPT-4 have grown substantially in size. These increased sizes allow the models to capture more intricate patterns in the training data, resulting in better performance on various tasks (like acing exams). Nevertheless, they seem to fail in understanding and reasoning when handling historical documents. This project aims at building a dataset in a semi-automatic manner to improve the application of LLMs to historical data analysis, in the context of the impresso – Media Monitoring of the Past II. Beyond Borders: Connecting Historical Newspapers and Radio project.

Research Questions

  • Is it feasible to create a dataset for training LLMs to better comprehend historical documents, using semi-automatic or automatic methods?
  • Can a specialized, resource-efficient LLM effectively process and understand noisy, historical digitized documents?

Objective

To develop an instruction-based dataset to improve LLMs’ capabilities in interpreting historical documents. The focus will be on sourcing and analyzing historical Swiss and Luxembourgish newspapers (spanning 200 years) and other historical collections in ancient Greek or Latin.

Instruction/Prompt: When was the Red Cross founded?

Example Answer: 1864

Instruction / Prompt: Given the following excerpt from a Luxembourgish newspaper from 1919, identify the main event and key figures involved. Excerpt: “En 1919, la Grande-Duchesse Charlotte est montée sur le trône du Luxembourg, succédant à sa sœur, la Grande-Duchesse Marie-Adélaïde, qui avait abdiqué en raison de controverses liées à la Première Guerre mondiale.”

Example Response: Grand Duchess Charlotte and her sister, Grand Duchess Marie-Adélaïde.
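
As a rough illustration of how such instruction/response pairs could be stored, the sketch below serializes the two examples above as JSON Lines. The field names (`instruction`, `input`, `output`) are an assumed convention, not a format prescribed by the project.

```python
import json

# Field names are an assumption; the project does not fix a schema.
records = [
    {
        "instruction": "When was the Red Cross founded?",
        "input": "",
        "output": "1864",
    },
    {
        "instruction": (
            "Given the following excerpt from a Luxembourgish newspaper from "
            "1919, identify the main event and key figures involved."
        ),
        "input": (
            "En 1919, la Grande-Duchesse Charlotte est montée sur le trône du "
            "Luxembourg, succédant à sa sœur, la Grande-Duchesse "
            "Marie-Adélaïde, qui avait abdiqué en raison de controverses "
            "liées à la Première Guerre mondiale."
        ),
        "output": (
            "Grand Duchess Charlotte and her sister, "
            "Grand Duchess Marie-Adélaïde."
        ),
    },
]


def to_jsonl(rows):
    """Serialize one record per line, keeping accented characters readable."""
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows)


jsonl = to_jsonl(records)
```

Separating the excerpt (`input`) from the task description (`instruction`) makes it possible to reuse one instruction template across many newspaper excerpts.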

Main Steps

  1. Data Curation:
    • Collect OCR-based datasets.
    • Analyze historical newspaper articles to understand common features and challenges.
  2. Dataset Creation:
    • Decide on what type of instruction should be generated and utilise other existing LLMs such as T5, BART, etc., to generate instructions (or questions) from Swiss historical documents, similar to the method presented in ArchivalQA.
  3. Model Training/Fine-Tuning:
    • Train or fine-tune a language model like LLaMA on this dataset.
  4. Evaluation:
    • Assess LLMs’ performance on NLP tasks (NER, EL) using historical documents.
    • Compare models trained on the new dataset with those trained on standard datasets.
    • Employ metrics like accuracy, perplexity, F1 score.
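
For the NER part of the evaluation, precision, recall, and F1 are commonly computed over sets of predicted and gold entities. The sketch below assumes exact matching on (surface form, type) pairs, which is one of several possible matching regimes; the entity sets are toy data.

```python
def entity_f1(gold: set, predicted: set) -> dict:
    """Exact-match precision, recall and F1 over sets of (mention, type)."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy example: one correct entity, one with the wrong type, one missed.
gold = {("Charlotte", "PER"), ("Luxembourg", "LOC"), ("Marie-Adélaïde", "PER")}
predicted = {("Charlotte", "PER"), ("Luxembourg", "ORG")}
scores = entity_f1(gold, predicted)
```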

Requirements

  • Proficiency in Python, ideally PyTorch.
  • Strong writing skills.
  • Commitment to the project.

Output

  • Potential publications in NLP and historical document processing.
  • Contribution to advancements in handling historical texts with LLMs.

Deliverables

  • A comprehensive dataset for training LLMs on historical texts.
  • A report or paper detailing the methodology, findings, and implications of the project.

Taken

Type: MSc Semester project (12 ECTS) or MSc project
Sections: Data Science, Computer Science, Digital Humanities
Supervisors: Frédéric Kaplan, Emanuela Boros, Maud Ehrmann 
Number of students: 1

Context

This project aims to design and develop an innovative interface for iterative text composition, leveraging the capabilities of Large Language Models (LLMs) like GPT. The interface will enable users to collaboratively compose texts with the LLM, providing control and flexibility in the creative process.

Objectives

  1. To create a user-friendly interface for text composition: The interface should allow users to input, modify, and refine text generated by the LLM.
  2. To enable iterative interaction: Users should be able to interact iteratively with the LLM, adjusting and fine-tuning the generated text according to their needs and preferences.
  3. To incorporate customization options: The system should offer options to tailor the style, tone, and thematic elements of the generated text.

Methodology

  • Interface Design: Designing a user-friendly interface that allows for easy input and manipulation of text generated by the LLM.
  • LLM Integration: Integrating a LLM into the interface to generate text based on user inputs and interactions.
  • Customization and Control Features: Implementing features that allow users to customize the style and tone of the text and maintain control over the content.
  • User Testing and Feedback: Conducting user testing sessions to gather feedback and refine the interface and its functionalities.

Expected Outcomes

  • A functional interface that allows for collaborative text composition with a LLM.
  • Enhanced user experience in text creation, providing a blend of AI-generated content and human creativity.
  • Insights into how users interact with AI in creative processes.
 
Requirements

Excellent technical skills, previous practical experience with LLMs and passion for the subject. 
  • guidance is a programming paradigm that offers superior control and efficiency compared to conventional prompting and chaining.

Type: BA (8 ECTS) Semester project, MSc (12 ECTS)
Sections: Digital Humanities, Data Science, Computer Science
Supervisors: Hamest Tamrazyan, Emanuela Boros
Number of students: 1–2

Context

This project aims to explore, retrieve, and analyse data from the Digital Laboratory of Ukraine, focusing on a select collection of approximately ten books related to epigraphy and cultural heritage. The primary objective is to gain insights into Ukraine’s epigraphic and cultural heritage through detailed data analysis, term extraction, and database management.

Objective

This project aims to deepen the knowledge of Ukraine’s epigraphic and cultural heritage by retrieving, cleaning, and analysing a select collection of books from the Digital Laboratory of Ukraine, and by making the extracted data accessible through a structured database.

Objectives and Main steps

  1. Data Retrieval: Collect and aggregate data from the Digital Laboratory of Ukraine, specifically targeting books and resources about epigraphy and cultural heritage.
  2. Data Cleaning and Formatting: Implement data preprocessing techniques to ensure data quality. This includes removing irrelevant or corrupt data, handling missing values, and standardizing formats.
  3. Database Setup: Design and implement a database to store and manage retrieved data efficiently. The database should allow easy access and manipulation of the data for analysis.
  4. Term Extraction and Analysis: Employ natural language processing (NLP) techniques to extract key terms, concepts, and thematic elements from the texts. This will help us understand the predominant themes and patterns in Ukrainian epigraphy and cultural heritage.
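
One simple baseline for the term-extraction step is TF-IDF ranking. The sketch below is a dependency-free illustration on toy English sentences. It is not the project's prescribed method, and a real pipeline would likely need language-specific tokenization and stop-word handling for Ukrainian texts.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase word tokenizer; \\w also matches Cyrillic letters."""
    return re.findall(r"\w+", text.lower())


def top_terms(documents: list[str], k: int = 3) -> list[list[str]]:
    """Return the k highest-scoring TF-IDF terms for each document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    results = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        }
        # Sort by descending score, then alphabetically for stable ties.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        results.append(ranked[:k])
    return results


# Toy documents standing in for book passages.
docs = [
    "inscription on the church wall",
    "inscription on the stone cross",
    "the church bell",
]
terms = top_terms(docs)
```

Terms that occur in every document (here "the") receive a score of zero, so the ranking favours vocabulary that is distinctive for a given passage.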

Requirements

  • Proficiency in Python, knowledge of NLP techniques.

Significance

This project will contribute to the understanding of Ukraine’s rich cultural heritage. It will provide valuable digital resources for future academic and cultural research in this field.

Type: MSc Semester project (12 ECTS)
Sections: Data Science, Computer Science, Digital Humanities
Supervisors: Alexander Rusnak
Number of students: 1-3

Figure: Illustration of the SALMON training pipeline.

Context

All language models are taught some form of ethical system, whether implicitly through the curation of the dataset or explicitly through a training and prompting scheme. One type of ethical guidance framework is the Constitutional AI system proposed by the Anthropic team. In this approach, a language model is prompted to revise its own responses relative to a set of values; the revised responses are then used to retrain the model through supervised finetuning and reinforcement learning guided by a preference model, as in the standard RLHF setup. This approach has shown very strong results in improving both the ‘harmlessness’ (i.e. ethical behaviour) and ‘helpfulness’ of language models. However, the deontological ethical system they utilised has some key drawbacks.

Objective

This project will attempt to encode a virtue ethics framework into the model, both in the selection of the values by which the responses are revised and in the architectural structure itself. Virtue ethics focuses on three types of evaluation: the ethicality of the action itself, the motivation behind the action, and the utility of the action in promoting a virtuous character in the agent. To this end, the student will implement a separate preference model for each of these three avenues of moral evaluation, which will then be used for RL training of an LLM assistant. This should result in a model that improves in harmlessness and helpfulness, and also in explainability.

Main Steps

  1. Curate a dataset of adversarial prompts and ethics-oriented prompts to be used for training.
  2. Implement a reinforcement learning from AI feedback training structure following from Anthropic’s Claude or IBM’s SALMON.
  3. Create a custom prompting pipeline for the virtuous action preference model, motivational explanation preference model, and virtue formation preference model.
  4. Train the chatbot using each of the preference models separately and finally combined, and measure their comparative performance on difficult ethical questions.
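
The last step mentions combining the three preference models but does not say how. One simple possibility, shown here purely as an assumption, is a weighted sum of the three scores. The names `VirtueScores`, `combined_reward`, and `weakest_dimension` are hypothetical, and the scores are placeholders for real preference-model outputs.

```python
from dataclasses import dataclass


@dataclass
class VirtueScores:
    """Placeholder outputs of the three preference models, each in [0, 1]."""
    action: float      # ethicality of the action itself
    motivation: float  # quality of the stated motivation
    character: float   # contribution to a virtuous character


def combined_reward(scores: VirtueScores,
                    weights: tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3)
                    ) -> float:
    """Weighted sum of the three scores (an assumed combination rule)."""
    w_action, w_motivation, w_character = weights
    return (w_action * scores.action
            + w_motivation * scores.motivation
            + w_character * scores.character)


def weakest_dimension(scores: VirtueScores) -> str:
    """Name the lowest-scoring dimension, as a crude explanation signal."""
    as_dict = {"action": scores.action,
               "motivation": scores.motivation,
               "character": scores.character}
    return min(as_dict, key=as_dict.get)


example = VirtueScores(action=0.9, motivation=0.6, character=0.3)
reward = combined_reward(example)
weakest = weakest_dimension(example)
```

Keeping the three scores separate until the final sum is what would allow the per-dimension comparison described in the step above, and it gives a simple hook for explainability.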

Requirements

  • Knowledge of machine learning and deep learning principles, familiarity with language models, proficiency in Python and PyTorch, and interest in ethics and philosophy.

Fall 2023

Available

Type: MSc Semester project
Sections: Architecture, Digital Humanities, Data Science, Computer Science
Supervisors: Beatrice Vaienti, Hamest Tamrazyan
Keywords: 3D modelling

Number of students: 1–2 (min. 12 ECTS in total)

Context: The Armenian cultural heritage in Artsakh [1] is in danger after explicit threats of irreversible destruction coordinated by Azerbaijani authorities [2]. As part of a series of actions coordinated by EPFL, the Digital Humanities Institute is currently prototyping methods to offer rapid deployment of DH technology in situations of crisis. As part of these actions, over 200 Armenian inscriptions of Artsakh are being systematized and digitized, together with essential information: language data including diplomatic and interpretive transcriptions, the translation into English, the location of the inscription on the monument (if applicable), geographical and chronological data, the type of monument, and the type of inscription. This digitized data will not only help to preserve the invaluable inscriptions but can also be used for further investigations and research. The aim of this project is to create a 3D model of the church, accurately locate the inscriptions within it, and contribute to preserving, studying, and promoting Armenian cultural heritage in Nagorno-Karabakh.

Figure: By Julian Nyča – Own work, CC BY-SA 3.0

Research questions:

  • How can advanced imaging and 3D modelling technologies be utilized to accurately capture and represent the intricate details of the church and the inscriptions?
  • What methods can be employed to ensure the precise alignment and placement of the digitized inscriptions within the 3D model of the church?
  • How can the digital representation of the church and its inscriptions be effectively integrated with the database of Armenian inscriptions in Nagorno-Karabakh?
  • What insights can be gained from analysing the spatial distribution and arrangement of the inscriptions within the church, shedding light on the historical and cultural context in which they were created?
  • How can the integration of the 3D model and the digitized inscriptions contribute to the preservation, documentation, and study of Armenian epigraphic heritage in Nagorno-Karabakh?

Objectives: This project aims to deepen the knowledge of the architectural and epigraphic significance of the church, explore innovative techniques for digitizing and visualizing cultural heritage, and contribute to the preservation and accessibility of Armenian inscriptions in Nagorno-Karabakh.

Main steps:

  • Research and Planning: Conduct thorough research on the church and its inscriptions, architectural features, and existing documentation, and develop a detailed plan for creating a 3D model and locating the inscriptions within the model.
  • Study of the plan, section, and elevation views from a survey: Proportional and compositional study aimed at the 3D reconstruction.
  • 3D reconstruction: A low level of detail mesh already exists, but with this project, the student will try to transfer the information from the architectural survey to a refined architectural 3D model with interiors and exteriors.
  • Data Processing and Digitization: Process the collected data to digitally represent the church and the inscriptions.
  • Inscription Localization: Analyse the collected data and determine the precise location of each inscription within the 3D model of the church.
  • Data Integration: Ensure that the essential information, such as language data, translations, geographical and chronological data, monument and inscription types, bibliographic references, and photographs, are correctly linked to the localized inscription.
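
To make the data-integration step more concrete, the sketch below shows one way an inscription record could be tied to a position in the 3D model. The class, field names, and coordinates are hypothetical placeholders; the actual database schema is not described here.

```python
from dataclasses import dataclass
from math import dist


@dataclass
class LocatedInscription:
    """Hypothetical record linking inscription metadata to the 3D model."""
    inscription_id: str
    transcription: str
    translation_en: str
    inscription_type: str
    position: tuple[float, float, float]  # x, y, z in model coordinates


def nearest_inscription(records: list[LocatedInscription],
                        point: tuple[float, float, float]
                        ) -> LocatedInscription:
    """Return the record whose anchor point is closest to a picked point."""
    return min(records, key=lambda record: dist(record.position, point))


# Placeholder records; real values would come from the inscription database.
records = [
    LocatedInscription("INS-001", "(transcription)", "(translation)",
                       "(type)", (1.0, 0.5, 2.0)),
    LocatedInscription("INS-002", "(transcription)", "(translation)",
                       "(type)", (4.0, 0.0, 1.5)),
]
hit = nearest_inscription(records, (3.8, 0.2, 1.4))
```

A lookup of this kind could support a viewer in which clicking on the model retrieves the closest inscription and its metadata.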

Explored methods:

  • Proportional analysis
  • 3D modelling using Rhino
  • 3D segmentation and annotation with the inscription
  • Exploration of visualization methodologies for this additionally embedded information

Requirements: previous experience with architectural 3D modelling using Rhino.


[1] A toponym used by the local Armenians to refer to the Nagorno-Karabakh territory.
[2] The European Parliament resolution on the destruction of cultural heritage in Nagorno-Karabakh (2022/2582(RSP)) dated 09.03.2022.

Type: MSc Semester project (12 ECTS) or MSc project
Sections: Data Science, Computer Science, Digital Humanities
Supervisors: Emanuela Boros, Maud Ehrmann
Number of students: 1–2

Figure: Historical Instruction Mining for (finetuning) Large Language Models (Generated with Midjourney)

Context: Historical collections present multiple challenges that stem from the quality of digitization, the deterioration of documents over time, poor-quality printing materials, or inaccurate processing such as optical character recognition (OCR) or optical layout recognition (OLR). Moreover, historical collections pose another challenge because documents are distributed over a long enough period of time to be affected by language change. This is especially true in the case of Western European languages, which only acquired their modern spelling standards roughly around the 18th or 19th centuries. At the same time, language models (LMs) have been leveraged for several years now, obtaining state-of-the-art performance in the majority of NLP tasks, generally by being fine-tuned on downstream tasks (such as entity recognition). LLMs or instruction-following models have taken over with relatively new capabilities in solving some of these tasks in a zero-shot manner through prompt engineering. For example, the generative pre-trained transformer (GPT) family of LLMs refers to a series of increasingly powerful and large-scale neural network architectures developed by OpenAI. Starting with GPT, subsequent models such as ChatGPT and GPT-4 have grown substantially in size. These increased sizes allow the models to capture more intricate patterns in the training data, resulting in better performance on various tasks (like acing exams). Nevertheless, they seem to fail in understanding and reasoning when handling historical documents. This project aims at building a dataset in a semi-automatic manner to improve the application of LLMs to historical data analysis.

Research Questions:

  • Can we create a dataset in a (semi-automatic/automatic) manner for training an LLM to better understand historical documents?
  • Can a specialized, resource-efficient LLM effectively process noisy historical digitised documents?

Objective: The objective of this project is to develop an instruction-based dataset to enhance the ability of LLMs to understand and interpret historical documents. This will involve sourcing historical Swiss and Luxembourgish newspapers spanning 200 years, as well as other historical collections such as those in ancient Greek or Latin. Two fictitious examples:

Instruction/Prompt: When was the Red Cross founded?

Example Answer: 1864

Instruction / Prompt: Given the following excerpt from a Luxembourgish newspaper from 1919, identify the main event and key figures involved. Excerpt: “En 1919, la Grande-Duchesse Charlotte est montée sur le trône du Luxembourg, succédant à sa sœur, la Grande-Duchesse Marie-Adélaïde, qui avait abdiqué en raison de controverses liées à la Première Guerre mondiale.”

Example Response: Grand Duchess Charlotte and her sister, Grand Duchess Marie-Adélaïde.

Main Steps:

  • Data Curation: Gather datasets based on OCR level, and become familiar with the corpus by exploring historical newspaper articles. Identify common features of historical documents and potential difficulties.
  • Dataset Creation: Decide on what type of instruction should be generated and utilise other existing LLMs such as T5, BART, etc., to generate instructions (or questions) from Swiss historical documents.
  • (Additional) Train or finetune a LLaMA language model based on this dataset.
  • (Additional) Evaluation: Utilise the newly created dataset to evaluate existing LLMs and assess their performance on various NLP tasks such as named entity recognition (NER) and linking (EL) in historical documents. Use standard metrics such as accuracy, perplexity, or F1 score for the evaluation. Compare the performance of models trained with the new dataset against those trained with standard datasets to ascertain the effectiveness of the new dataset.

Requirements: Proficiency in Python, preferably PyTorch, excellent writing skills, and dedication to the project.

Outputs: The project’s results could potentially lead to publications in relevant research areas and would contribute to the field of historical document processing.

Type: MSc Semester project (12 ECTS) or MSc project
Sections: Data Science, Computer Science, Digital Humanities
Supervisors: Emanuela Boros, Maud Ehrmann
Number of students: 1-2

Figure: Forecasting News (Generated with Midjourney)

Context: The rapid evolution and widespread adoption of digital media have led to an explosion of news content. News organizations are continuously publishing articles, making it increasingly challenging to keep track of daily developments and their implications. To navigate this overwhelming amount of information, computational methods capable of processing and making predictions about this content are of significant interest. With the advent of advanced machine learning techniques, such as Generative Adversarial Networks (GANs), and Large Language Models (LLMs), it’s possible to forecast future content based on existing articles. This project proposes to leverage the strengths of both GANs and LLMs to predict the content of next-day articles based on current-day news. This approach will not only allow for a better understanding of how events evolve over time, but also could serve as a tool for news agencies to anticipate and prepare for future news developments.

Research Questions:

  1. Can we design a system that effectively leverages GANs and LLMs to predict next-day article content based on current-day articles?
  2. How accurate are the generated articles when compared to the actual articles of the next day (how close to reality are they)?
  3. What are the limits and potential biases of such a system and how can they be mitigated?

Objective: The objective of this project is to design and implement a system that uses a combination of GANs and LLMs to predict the content of next-day news articles. This will be measured by the quality, coherence, and accuracy of the generated articles compared to the actual articles from the following day.

Main Steps:

  • Dataset Acquisition: Procure a dataset consisting of sequential daily articles from multiple sources.
  • Data Preprocessing: Clean and preprocess the data for training. This involves text normalization, tokenization, and the creation of appropriate training pairs.
  • Generator Network Design: Leverage an LLM as the generator network in the GAN. This network will generate the next-day article based on the input from the current-day article.
  • Discriminator Network Design: Build a discriminator network capable of distinguishing between the actual next-day article and the generated article.
  • GAN Training: Train the GAN system by alternating between training the discriminator to distinguish real vs generated articles, and training the generator to fool the discriminator.
  • Evaluation: Assess the generated articles based on measures of text coherence, relevance, and similarity to the actual next-day articles.
  • Bias and Limitations: Examine and discuss the potential limitations and biases of the system, proposing ways to address these issues.
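
The "creation of appropriate training pairs" in the preprocessing step can be sketched as follows: each article is paired with the article published on the following calendar day, and days without a successor are skipped. The function name and the one-article-per-day simplification are assumptions made for illustration.

```python
from datetime import date, timedelta


def make_next_day_pairs(articles: dict[date, str]) -> list[tuple[str, str]]:
    """Pair each day's text with the text of the following calendar day."""
    pairs = []
    for day in sorted(articles):
        next_day = day + timedelta(days=1)
        # Skip days for which no next-day article exists (gaps in the data).
        if next_day in articles:
            pairs.append((articles[day], articles[next_day]))
    return pairs


# Toy data: 3 January is missing, so only one consecutive pair exists.
articles = {
    date(2023, 1, 1): "Talks begin.",
    date(2023, 1, 2): "Talks continue.",
    date(2023, 1, 4): "Agreement signed.",
}
pairs = make_next_day_pairs(articles)
```

In a multi-source setting, the dictionary value would more plausibly be a list of articles per day, and the pairing would also need some notion of which articles cover the same story.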

Master Project Additions:

If the project is taken as a master project, the student will further:

  • Refine the Model: Apply advanced training techniques, and perform a detailed hyperparameter search to optimize the GAN’s performance.
  • Multi-Source Integration: Extend the model to handle and reconcile articles from multiple sources, aiming to generate a more comprehensive next-day article.
  • Long-Term Predictions: Investigate the model’s capabilities and limitations in making longer-term predictions, such as a week or a month in advance.

Requirements: Knowledge of machine learning and deep learning principles, familiarity with GANs and LLMs, proficiency in Python, and experience with a deep learning framework, preferably PyTorch.

Taken

Type: BA (8 ECTS) Semester project
Sections: Data Science, Computer Science, Digital Humanities
Supervisors: Emanuela Boros
Number of students: 1–2

Keywords: Document processing, Natural Language Processing (NLP), Information Extraction, Machine Learning, Deep Learning

Figure: Assessing Climate Change Perceptions and Behaviours in Historical Newspapers (Generated with Midjourney)

With emissions in line with current Paris Agreement commitments, global warming is projected to exceed 1.5°C above pre-industrial levels, even if these commitments are complemented by very challenging increases in the scale and ambition of mitigation after 2030. Although this increase may appear slight, the consequences of global warming are already observable today, with the number and intensity of certain natural hazards continuing to increase (e.g., extreme weather events, floods, forest fires). Near-term warming and increased frequency, severity, and duration of extreme events will place many terrestrial, freshwater, coastal and marine ecosystems at high or very high risks of biodiversity loss. Exploring historical documents can help to address gaps in our understanding of the historical roots of climate change, and possibly uncover evidence of early efforts to address environmental issues, as well as explore how environmentalism has evolved over time. This project aims to fill gaps in our understanding by examining a corpus of historical Swiss and Luxembourgish newspapers spanning 200 years (i.e., the impresso project’s corpus).

Research Questions:

  • How have perceptions of climate change evolved over time, as seen in historical newspapers?
  • What behavioural trends towards climate change can be identified from historical newspapers?
  • Can we track the frequency and intensity of extreme weather events over time based on historical documents?
  • Can we identify any patterns or trends in early efforts to address environmental issues?
  • How has the sentiment towards climate change and environmentalism evolved over time?

Objective: This work explores several NLP techniques (text classification, information extraction, etc.) for providing a comprehensive understanding of the evolution and reporting of extreme weather events in historical documents.

Main Steps:

  • Data Preparation: Identify relevant keywords and phrases related to climate change and environmentalism, such as “global warming”, “carbon emissions”, “climate policy”, or “hurricane”, “flood”, “drought”, “heat wave”, and others. Compile a training dataset of articles that contain these keywords.
  • Data Analysis: Analyse the data and identify patterns in climate change perceptions and behaviours over time. This includes the identification of changes in the frequency of climate-related terms, changes in sentiment towards climate change, changes in the topics discussed in relation to climate change, the detection of mentions of locations, persons, or events, and the extraction of important keywords in weather forecasting news.
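
Tracking "changes in the frequency of climate-related terms" can start from something as simple as counting keyword occurrences per decade. The sketch below uses a few of the English keywords listed above and invented toy sentences. For the actual corpus, the keyword list would need equivalents in the languages of the newspapers, and counts should be normalized by the amount of text per decade.

```python
from collections import Counter, defaultdict

# A small subset of the keywords mentioned in the data preparation step.
KEYWORDS = ("flood", "drought", "heat wave")


def keyword_counts_by_decade(articles):
    """Count keyword occurrences per decade from (year, text) tuples."""
    counts = defaultdict(Counter)
    for year, text in articles:
        decade = year // 10 * 10
        lowered = text.lower()
        for keyword in KEYWORDS:
            # Substring matching, so "flooded" also counts towards "flood".
            counts[decade][keyword] += lowered.count(keyword)
    return counts


# Invented toy articles, for illustration only.
articles = [
    (1876, "A great flood struck the valley; the flood destroyed two bridges."),
    (1879, "Drought threatens the harvest."),
    (1921, "Heat wave in Geneva."),
]
counts = keyword_counts_by_decade(articles)
```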

Requirements: Candidates should have a background in machine learning, data engineering, and data science, with proficiency in NLP techniques such as Named Entity Recognition, Topic Detection, or Sentiment Analysis. A keen interest in climate change, history, and media studies is also beneficial.

Resources:

  1. Analyzing the impact of climate change on critical infrastructure from the scientific literature: A weakly supervised NLP approach
  2. Climate of scepticism: US newspaper coverage of the science of climate change
  3. We provide an nlp-beginner-starter Jupyter notebook.

Type: MSc (12 ECTS) or BA (8 ECTS) Semester project
Sections: Computer Science, Data Science, Digital Humanities
Supervisors: Maud Ehrmann, Emanuela Boros
Number of students: 1–2

Figure: Exploring Large Vision-Language Pre-trained Models for Historical Images Classification and Captioning (Generated with Midjourney)

Context: The impresso project features a dataset of around 90 digitised historical newspapers containing approximately 3 million images. These images have no labels, and only 10% of them have a caption, two aspects that hinder their retrieval.

Two previous student projects focused on the automatic classification of these images, trying to identify 1. the type of image (e.g. photograph, illustration, drawing, graphic, cartoon, map), and 2. in the case of maps, the country or region of the world represented on that map. Good performances on image type classification were achieved by fine-tuning the VGG-16 pre-trained model (see report).

Objective: On the basis of these initial experiments and the annotated dataset compiled on this occasion, the present project will explore recent large-scale language-vision pre-trained models. Specifically, the project will attempt to: 

  1. Evaluate zero-shot image type classification of the available dataset using the CLIP and LLaMA-Adapter-V2 Multi-modal models, and compare with previous performances;
  2. Explore and evaluate image captioning of the same dataset, including trying to identify countries or regions of the world. This part will require adding caption information on the test part of the dataset. In addition to the fluency and accuracy of the generated captions, a specific aspect that might be taken into account is distinctiveness, i.e. whether the image contains details that differentiate it from similar images.

Given the recency of the models, further avenues of exploration may emerge during the course of the project, which is exploratory in nature.
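
The decision rule behind zero-shot classification with a CLIP-style model is to embed the image and one text prompt per label, then pick the label whose embedding is most similar to the image embedding. The sketch below illustrates only that rule, using hand-written toy vectors in place of real model embeddings; it does not call CLIP or any other model.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_label(image_vec: list[float],
                    label_vecs: dict[str, list[float]]) -> str:
    """Return the label whose embedding is most similar to the image's."""
    return max(label_vecs, key=lambda label: cosine(image_vec, label_vecs[label]))


# Toy 3-dimensional "embeddings"; real ones would come from the model's
# text encoder (for label prompts) and image encoder.
label_vecs = {
    "photograph": [0.9, 0.1, 0.0],
    "map": [0.1, 0.9, 0.1],
    "cartoon": [0.0, 0.2, 0.9],
}
image_vec = [0.2, 0.8, 0.1]
predicted = zero_shot_label(image_vec, label_vecs)
```

Because the labels enter only through their text embeddings, the label set (photograph, illustration, map, and so on) can be changed without retraining, which is what makes the comparison with the fine-tuned baseline interesting.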

Requirements: Knowledge of machine learning and deep learning principles, familiarity with computer vision, proficiency in Python, experience with a deep learning framework (preferably PyTorch), and interest in historical data.

Spring 2023

Available

Type: MSc Semester project or Master project
Sections: Computer Science, Data Science, Digital humanities, Architecture
Supervisor: Frédéric Kaplan (CDH DHLAB) and Katrin Beyer (ENAC EESD)
Keywords: Historical architectural drawing, Automatic segmentation, Information Extraction. 
 
The goal of this Master or Semester project is to extract information from historical architectural drawings. It is part of a larger project whose goal is to develop a data acquisition and post-processing pipeline for deriving the exterior and interior geometry of historical buildings in terms of 3D point clouds. While images used for photogrammetry modelling contain a lot of information, they do not contain all the geometric information of a structure that is relevant for architectural and structural engineering applications. For historical stone masonry buildings, examples are the embedment length of floor beams in walls, or floor beam orientation, beam size and spacing in the case of suspended ceilings. Such information can sometimes be found in historical architectural drawings. Comparing the as-built model to historical architectural drawings can also point to modifications to the structure and therefore serve as input for 4D geometric representations of the model. Furthermore, floor plans can serve as input when planning the data acquisition of interior spaces. For these reasons, we will develop approaches for automated reading of these features from historical architectural drawings.
 
To extract information from historical architectural drawings, we will build on methods for automated reading of modern construction floorplans and the methods developed by the DHLAB for automated vectorisation of historical cadastral maps to develop methods for historical floorplans and historical sections. The goal is to extract information on the floor plan and floor beam geometry, orientation, spacing and embedment length. For this purpose, we will retrieve and where necessary digitize historical architectural drawings of stone masonry buildings with timber floors in Swiss cultural heritage archives and complement this Swiss data with drawings from the many online architectural archives. As a first case study we may investigate the Old Hospital of Sion, which is owned by the city of Sion. The building was first mentioned in 1163 and has been extended and modified over the centuries. For this building, a large number of documents in the form of texts, drawings and photos are available and have recently been reviewed for a seismic safety assessment of the building.
 
 
 

Taken

Type: MSc Semester project
Sections: Digital humanities, Data Science, Computer Science
Supervisor: Rémi Petitpierre
Keywords: Data visualisation, Web design, IIIF, Memetics, History of Cartography
Number of students: 1–2 (min. 12 ECTS in total)

Context: Cultural evolution can be modeled as an ensemble of ideas and conventions that spread from one mind to another. These are referred to as memes, elementary replicators of culture inspired by the concept of a gene. A meme can be a tune, a catch-phrase, or an architectural detail, for instance. If we take the example of maps, a meme can be expressed in the choice of a certain texture, a colour, or a symbol to represent the environment. Thus, memes spread from one cartographer to another, through time and space, and can be used to study cultural evolution. With the help of computer vision, it is now possible to track those memes by finding replicated visual elements in large datasets of digitised historical maps.

Below is an example of visual elements extracted from a 17th century map (original IIIF image). By extending the process to tens of thousands of maps and embedding these elements in a feature space, it becomes possible to estimate which elements correspond to the same replicated visual idea, or meme. This opens up new ways to understand how ideas and technologies spread across time and space.

Example extraction of the elementary visual elements from a 17th century French map.
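
As a rough illustration of how replicated elements could be grouped in a feature space, the sketch below links elements whose embeddings have a cosine similarity above a threshold and returns the connected groups as candidate memes. The toy vectors, the 0.9 threshold and the function names are assumptions made for this example; they do not describe the lab's actual pipeline.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def group_elements(embeddings, threshold=0.9):
    """Group element ids whose embeddings are linked by similarity >= threshold.

    Uses a simple union-find over all pairs (quadratic, fine for a toy example).
    """
    ids = list(embeddings)
    parent = {i: i for i in ids}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for n, a in enumerate(ids):
        for b in ids[n + 1:]:
            if cosine(embeddings[a], embeddings[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical 3-dimensional embeddings of four visual elements.
toy = {
    "tree_a": [1.0, 0.1, 0.0],
    "tree_b": [0.9, 0.2, 0.0],
    "hatch_a": [0.0, 1.0, 0.1],
    "hatch_b": [0.1, 0.9, 0.0],
}
candidate_memes = group_elements(toy)
```

With millions of elements, an all-pairs comparison would not scale; an approximate nearest-neighbour index would likely be needed instead.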

Despite the immense potential that such data holds for better understanding cultural evolution, it remains difficult to interpret, since it involves tens of thousands of memes, corresponding to millions of visual elements. Somewhat like genomics research, it is now becoming essential to develop a microscope to observe memetic interactions more closely.

Objectives: In this project, the student will tackle the challenge of designing and building a prototype interface for exploring memes on the basis of replicated visual elements. The scientific challenge is to create a design that reflects the layered and interconnected nature of the data, composed of visual elements, maps, and larger sets. The student will develop their project by working with digital embeddings of replicated visual elements and digitised images in the IIIF framework. The interface will make use of the metadata to visualise how time, space, and the development of new technologies influence cartographic figuration, by filtering the visual elements. Finally, to reflect the multi-layered nature of the data, the design must be transparent and provide the ability to switch between visual elements and their original maps.
The project will draw on an exceptional dataset of tens of thousands of American, French, German, Dutch, and Swiss maps published between 1500 and 1950. Depending on the student’s interests, an interesting case study to demonstrate the benefits of the interface could be to investigate the impact of the invention of lithography, a revolutionary technology from the end of the 18th century, on the development of modern cartographic representations.

Prerequisites: Web Programming, basics of JavaScript.

Type: MSc (12 ECTS) or BA (8 ECTS) Semester project
Sections: Digital humanities, Computer Science
Supervisor: Beatrice Vaienti
Keywords: OCR, database
Number of students: 1

Context: An ongoing project at the Lab is focusing on the creation of a 4D database of the urban evolution of Jerusalem between 1840 and 1940. This database will not only depict the city in time as a 3D model, but also embed additional information and metadata about the architectural objects, thus employing the spatial representation of the city as a relational database. However, the scattered, partial and multilingual nature of the available historical sources underlying the construction of such a database makes it necessary to combine and structure them together, extracting from each one of them the information describing the architectural features of the buildings. 

Two architectural catalogues, “Ayyubid Jerusalem” (1187-1250) and “Mamluk Jerusalem: an Architectural Study”, contain respectively 22 and 64 chapters, each one describing a building of the Old City in a thorough way. The information they provide includes for instance the location of the buildings, their history (founder and date), an architectural description, pictures and technical drawings. 

Objectives: Starting from the scanned version of these two books, the objective of the project is to develop a pipeline to extract and structure their content in a relational spatial database. The content of each chapter is structured in sections that systematically cover the various aspects of each building’s architecture and history. Photos and technical drawings accompany this already structured text: the richness of the images and architectural representations in the books should also be considered and integrated in the project outcomes. Particular emphasis will be placed on the extraction of information about the architectural appearance of buildings, which can then be used for procedural modelling tasks. Given these elements, three main sub-objectives are envisioned:

  1. OCR ;
  2. Organization of the extracted text in the original sections (or in new ways) and extraction of the information pertaining to the architectural features of the buildings;
  3. Encoding the information in a spatial DB, using the locational information present in each chapter to geolocate the building, possibly associating its position with the existing geolocated building footprints.
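
To make sub-objectives 2 and 3 more concrete, here is a minimal sketch that splits OCR output into sections using heading lines and wraps the result in a GeoJSON-style point feature. The heading names, the sample text and the coordinates are placeholders invented for the example; the real catalogues may be organised differently.

```python
import json
import re

# Hypothetical section headings; the real catalogues may use different ones.
HEADINGS = ["LOCATION", "HISTORY", "ARCHITECTURE"]

def split_sections(ocr_text, headings=HEADINGS):
    """Split OCR output into {heading: text}, using heading lines as delimiters."""
    alternatives = "|".join(map(re.escape, headings))
    pattern = re.compile(r"^(%s)[ \t]*$" % alternatives, re.MULTILINE)
    parts = pattern.split(ocr_text)
    # parts = [preamble, heading1, body1, heading2, body2, ...]
    return {parts[i]: parts[i + 1].strip() for i in range(1, len(parts) - 1, 2)}

def to_feature(name, sections, lon, lat):
    """Wrap the extracted sections in a GeoJSON-style point feature."""
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": {"name": name, **sections},
    }

sample = """Chapter 1
LOCATION
North side of the street.
HISTORY
Founded by an unnamed patron.
ARCHITECTURE
Two-storey stone building with a vaulted entrance.
"""
sections = split_sections(sample)
# Placeholder coordinates, for illustration only.
feature = to_feature("Example building", sections, 35.2345, 31.7780)
print(json.dumps(feature, indent=2))
```

Features of this shape can be loaded into most spatial databases, where the point could later be matched against existing building footprints.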

Prerequisites: basic knowledge of database technologies and Python 

Type: MSc Semester project or Master project
Sections: Computer Science, Data Science, Digital humanities
Supervisor: Maud Ehrmann, as well as historians and network specialists from the C2DH Center from Luxembourg University.
Keywords: Document processing, NLP, machine learning

Context: News agencies (e.g. AFP, Reuters) have always played an important role in shaping the news landscape. Created in the 1830s and 1840s by groups of daily newspapers in order to share the costs of news gathering (especially abroad) and transmission, news agencies have gradually become key actors in news production, responsible for providing accurate and factual information in the form of agency releases. While studies exist on the impact of news agency content on contemporary news, little research has been done on the influence of news agency releases over time. During the 19C and 20C, to what extent did journalists rely on agency content to produce their stories? How was agency content used in historical newspapers, as simple verbatims (copy and paste) or with rephrasing? Were they systematically attributed or not? How did news agency releases circulate among newspapers and which ones went viral?

Objective: Based on a corpus of historical Swiss and Luxembourgish newspapers spanning 200 years (i.e., the impresso project’s corpus), the goal of this project is to develop a machine learning-based classifier to distinguish newspaper articles based on agency releases from other articles. In addition to detecting agency releases, the system could also identify the news agency behind the release.

Main steps:

  • Ground preparation with a) exploration of the corpus (via the impresso interface) to become familiar with historical newspaper articles and identify typical features of agency content as well as potential difficulties; b) compilation of a list of news agencies active throughout the 19C and 20C.
  • Construction of a training corpus (in French), based on:
    • sampling and manual annotation;
    • a collection of manually labelled agency releases, which could be automatically extended by using previously computed text reuse clusters (data augmentation).
  • Training and evaluation of two (or more) agency release classifiers:
    • a keyword baseline (i.e. where the article contains the name of the agency);
    • a neural-based classifier
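
The keyword baseline mentioned above could look roughly like the following sketch. The agency list and the regular expressions are illustrative and incomplete (compiling the real list is one of the project steps), and the function names are chosen for this example only.

```python
import re

# Illustrative, incomplete list of agency name patterns.
AGENCY_PATTERNS = {
    "AFP": r"\bAFP\b|Agence France-Presse",
    "Reuters": r"\bReuters?\b",
    "Havas": r"\bHavas\b",
}

def detect_agencies(article_text):
    """Return the sorted list of agencies whose name appears in the article."""
    return sorted(
        name for name, pattern in AGENCY_PATTERNS.items()
        if re.search(pattern, article_text, flags=re.IGNORECASE)
    )

def is_agency_release(article_text):
    """Keyword baseline: positive if any agency name is mentioned."""
    return bool(detect_agencies(article_text))

print(detect_agencies("Paris, 3 mai (Havas). Le ministre a déclaré ..."))
```

Such a baseline will miss unattributed releases and names damaged by OCR, which is precisely what the learned classifier is meant to address.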

Master project – If the project is taken as a master project, the student will also work on the following: 

Processing:

  • Multilingual processing (French, German, Luxembourgish);
  • Systematic evaluation and comparison of different algorithms;
  • Identification of the news agency behind the release; 
  • Fine-grained characterisation of text modification;
  • Application of the classifier to the entire corpus. 

Analysis:

  • Study of the distribution of agency content and its key characteristics over time (in a data science fashion).
  • Based on the computed data, study of information flows based on a network representation of news agencies (nodes), news releases (edges) and newspapers (nodes).
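
As a small illustration of the network representation, the sketch below turns classifier output into weighted agency-to-newspaper edges and computes how many releases each agency contributes. The records and names are hypothetical.

```python
from collections import Counter

# Hypothetical classifier output: (agency, newspaper, year) for each detected release.
detections = [
    ("Havas", "Newspaper A", 1900),
    ("Havas", "Newspaper B", 1900),
    ("Reuters", "Newspaper A", 1901),
    ("Havas", "Newspaper A", 1901),
]

def build_edges(records):
    """Count releases per (agency, newspaper) pair; each pair is a weighted edge."""
    return Counter((agency, paper) for agency, paper, _year in records)

def out_strength(edges):
    """Total number of releases attributed to each agency node."""
    totals = Counter()
    for (agency, _paper), weight in edges.items():
        totals[agency] += weight
    return totals

edges = build_edges(detections)
strength = out_strength(edges)
```

Filtering the records by year before building the edges would give one network per period, which is one way to follow the flows over time.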

Eventually, such work will enable the study of news flows across newspapers, countries, and time.

Requirements: Good knowledge in machine learning, data engineering and data science. Interest in media and history.

Fall 2022

There were only a few projects, and we no longer have the capacity to host projects for that period. Check back around December 2022 to see what will be proposed for Spring 2023!

Spring 2022

Available

Type of project : Master/Semester thesis
Supervisors: Frédéric Kaplan
Semester of project: Spring 2021
Project summary: This project aims at exploring the relevance of Transformer architectures for 3D point clouds. In a first series of experiments, the student will use the large 3D point clouds of models produced at the DHLAB for the city of Venice or the city of Sion and try to predict missing parts. In a second series of experiments, the student will use SRTM data (Shuttle Radar Topography Mission, a NASA mission conducted in 2000 to obtain elevation data for most of the world) to encode / decode terrain prediction.

Contact: Prof. Frédéric Kaplan

Type of project: Semester

Supervisors: Didier Dupertuis and Paul Guhennec.

Context: The Federal Office of Topography (Swisstopo) has recently made accessible high-resolution digitizations of historical maps of Switzerland, covering every year between 1844 and 2018. In order to be able to use these assets for geo-historical research and analysis, the information contained in the maps must be converted from its visual representation to an abstract, geometrical form. This conversion from an input raster to an output vector geometry is typically performed well by combining Convolutional Neural Networks for pixelwise classification with standard computer vision techniques, but might prove challenging for datasets with a larger figurative diversity, as in the case of Swiss historical maps, whose style varies over time.


Objective: The ambition of this work is to develop a pipeline capable of transforming the buildings and roads of the Swisstopo set of historical maps into geometries correctly positioned in an appropriate Geographic Coordinate System. The student will have to train a pre-existing semantic segmentation neural network on a manually annotated training set, evaluate its success, and fine-tune it so that it can be applied to the entire dataset. Depending on the interest of the student, some first diachronic analyses of the road network evolution in Switzerland can be considered.
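
Positioning vectorised geometries in a coordinate system usually comes down to an affine mapping from pixel to world coordinates. The sketch below uses the six-coefficient ordering found in GDAL geotransforms; the coefficient values and the building polygon are made up for the example.

```python
def pixel_to_world(col, row, gt):
    """Map pixel (col, row) to world (x, y) with a 6-coefficient affine transform."""
    x = gt[0] + col * gt[1] + row * gt[2]
    y = gt[3] + col * gt[4] + row * gt[5]
    return x, y

def polygon_to_world(pixel_polygon, gt):
    """Convert every vertex of a polygon from pixel to world coordinates."""
    return [pixel_to_world(c, r, gt) for c, r in pixel_polygon]

# Made-up geotransform: top-left corner at (2600000, 1200000), 1.25 m pixels, north-up.
GT = (2600000.0, 1.25, 0.0, 1200000.0, 0.0, -1.25)
building_px = [(100, 200), (140, 200), (140, 260), (100, 260)]
building_world = polygon_to_world(building_px, GT)
```

The negative last coefficient reflects that image rows grow downwards while northing grows upwards.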

Prerequisites:

  • Good skills with Python.
  • Basics in machine learning and computer vision are a plus.


Supervisors: Paul Guhennec and Federica Pardini.

Context: In the early 19th century, made possible by the Napoleonic invasions of Europe, a large-scale administrative endeavour set out to map as faithfully as possible the geometry of most cities of Europe. These so-called Napoleonic cadasters are a precious testimony of the state of these cities in the past, at the European scale. For reasons of practicality and detail, most cities are represented on separate sheets of paper, with margin overlaps between them and indications on how to reconstruct the complete picture.
Recent work at the laboratory has shown that it is possible to make use of the great homogeneity in the visual representations of the parcels of the cadaster to automatically vectorize them and classify them according to a fixed typology (private building, public building, roads, etc). However, the process of aligning these cadasters with the “real” world, by positioning them in a Geographic Coordinate System, and thus allowing large-scale quantitative analyses, remains challenging.


Objective: A first problem to tackle is the combination of the different sheets to make up the full city. Building on a previous student project, the student will develop a process to automatically align neighbouring sheets, while accounting for the imperfections and misregistrations in these historical documents. In a second stage, a pipeline will be developed in order to align the combined sheets obtained at the previous step with contemporary geographic data.
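
One possible building block for sheet alignment is estimating a similarity transform (scale, rotation, translation) from points matched in the overlap of two sheets. The sketch below is a plain least-squares fit on hypothetical control points; it assumes the matches are already known and ignores outliers and local distortions, which the project would have to handle.

```python
import cmath

def fit_similarity(src, dst):
    """Least-squares similarity transform mapping src points onto dst points.

    Points are treated as complex numbers, so the transform is w = a * z + b.
    """
    z = [complex(x, y) for x, y in src]
    w = [complex(x, y) for x, y in dst]
    z_mean = sum(z) / len(z)
    w_mean = sum(w) / len(w)
    num = sum((wi - w_mean) * (zi - z_mean).conjugate() for zi, wi in zip(z, w))
    den = sum(abs(zi - z_mean) ** 2 for zi in z)
    a = num / den
    b = w_mean - a * z_mean
    return a, b

def apply_similarity(a, b, point):
    """Apply w = a * z + b to a single (x, y) point."""
    w = a * complex(*point) + b
    return w.real, w.imag

# Hypothetical matched points in the overlap of two sheets (pixel coordinates).
sheet_a = [(0, 0), (10, 0), (10, 10), (0, 10)]
sheet_b = [(5, 3), (5, 13), (-5, 13), (-5, 3)]  # sheet_a rotated by 90 degrees, then shifted
a, b = fit_similarity(sheet_a, sheet_b)
scale, angle = abs(a), cmath.phase(a)
```

Here `scale` is the relative scale between the sheets and `angle` the rotation in radians; a robust estimator would be a natural next step for noisy matches.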

Prerequisites:

  • Good skills in Python.
  • Experience with computer vision libraries

Type of project: Semester project or Master thesis

Supervisors: Didier Dupertuis and Frédéric Kaplan.

Context: In 1798, after a millennium as a republic, Venice was taken over by Napoleonic armies. A new centralized administration was erected in the former city-state. It went on to create many valuable archival documents giving a precise image of the city and its population.

The DHLAB has just finished digitizing two complementary sets of documents: the cadastral maps and their accompanying registers, the “Sommarioni”. The cadastral maps give an accurate picture of the city with clear delineation of numbered parcels. The Sommarioni registers contain information about each parcel, including a one-line description of its owner(s) and type of usage (housing, business, etc.).

The cadastral maps have been vectorized, with precise geometries and numbering for each parcel. The 230’000 records of the Sommarioni have been transcribed. Resulting datasets have been brought together and can be explored in this interactive platform (only available via EPFL intranet or VPN).

Objective: The next challenge is to extract structured data from the Sommarioni owner descriptions, i.e. to recognize and disambiguate people, business and institution names. The owner description is a noisy text snippet mentioning several relatives’ names; some records only contain the information that they are identical to the previous one; institution names might have different spellings; and there are many homonyms among personal names.

The ideal output would be a list of disambiguated institutions and people with, for the latter, the network of their family members.

The main steps are as follows:

  • Definition of entity typology (individual, family or collective, business, etc.);
  • Entity extraction in each record, handling the specificities of each type;
  • Entity disambiguation and linking between records;
  • Creation of a confidence score for the linking and disambiguation to quantify uncertainty, and of different scenarios for different degrees of uncertainty;
  • If time permits, analysis and discussion of results in relation to the Venice of 1808;
  • If time permits, integration of the results in the interactive platform.
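
To illustrate the linking and confidence-score steps, the sketch below clusters name variants with a string-similarity score from Python's difflib. The threshold, the greedy strategy and the example names are assumptions for illustration; real records would need type-specific handling and would have to deal with homonyms, which string similarity alone cannot separate.

```python
from difflib import SequenceMatcher

def normalise(name):
    """Lowercase and collapse whitespace before comparison."""
    return " ".join(name.lower().split())

def link_confidence(name_a, name_b):
    """Similarity in [0, 1], used here as a stand-in confidence score."""
    return SequenceMatcher(None, normalise(name_a), normalise(name_b)).ratio()

def link_records(names, threshold=0.85):
    """Greedily assign each name to the first cluster whose representative is similar enough."""
    clusters = []  # list of (representative, [members])
    for name in names:
        for representative, members in clusters:
            if link_confidence(name, representative) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((name, [name]))
    return [members for _rep, members in clusters]

# Hypothetical owner strings.
owners = ["Ospedale Civile", "ospedale  civile", "Zuanne Contarini"]
linked = link_records(owners)
```

Running the linking with several thresholds is one simple way to produce the different scenarios for different degrees of uncertainty.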

Prerequisites:

  • Good knowledge of Python and data-wrangling;
  • No special knowledge of Venetian history is needed;
  • Proficiency in Italian is not necessary but would be a plus.

Taken

Type of project: Semester

Supervisors: Sven Najem-Meyer (DHLAB), Matteo Romanello (UNIL).

Context: Optical Character Recognition aims at transforming images into machine-readable texts. Though neural networks helped to improve performances, historical documents remain extremely challenging. Commentaries to classical Greek literature epitomize this difficulty, as systems must cope with noisy scans, complex layouts and mixed Greek and Latin scripts.

Objective: The objective of the project is to produce a system that can cope with this highly complex data. Depending on your skills and interests, the project can focus on:

  • Image pre-processing
  • Multitasking: can we improve OCR by processing tasks such as layout analysis or named-entity recognition in parallel?
  • Benchmarking and fine-tuning available frameworks
  • Optimizing post-processing with NLP
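
Whichever focus is chosen, benchmarking OCR output typically relies on a metric such as the character error rate (CER). Below is a minimal sketch of that metric; the NFC normalisation step is an assumption that seems sensible for polytonic Greek, where the same accented letter can be encoded in several ways.

```python
import unicodedata

def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]

def character_error_rate(reference, hypothesis):
    """CER = edit distance / reference length, after Unicode NFC normalisation."""
    reference = unicodedata.normalize("NFC", reference)
    hypothesis = unicodedata.normalize("NFC", hypothesis)
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)

print(character_error_rate("abcd", "abxd"))
```

Reporting the CER separately for Greek and Latin script passages would show where a given framework struggles most.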

Prerequisites:

  • Good skills in Python; libraries such as OpenCV or PyTorch are a plus.
  • Good knowledge of machine learning is advised; basics of computer vision and image processing would be a real plus.
  • No knowledge of ancient Greek/literature is required.

Type of project: Semester

Supervisor: Maud Ehrmann

Objective: Given a large archive of historical newspapers (cf. the impresso project) containing both text and image material, the objective of this project is to:

1/ Learn a model to classify images according to their types (e.g. map, photograph, illustration, comics, ornament, drawing, caricature).
This first step will consist in:

  • the definition of the typology by looking at the material – although the targeted typology will be rather coarse;
  • the annotation of a small training set;
  • the training of a model by fine-tuning an existing visual model;
  • the evaluation of the said model.

2/ Apply this model to a large-scale collection of historical newspapers (inference), and possibly build a statistical profile of the recognized elements through time.
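
The statistical profile could start from something as simple as counting predicted image types per decade, as in the sketch below. The prediction list is hypothetical model output.

```python
from collections import Counter, defaultdict

def profile_by_decade(predictions):
    """Count predicted image types per decade.

    `predictions` is an iterable of (year, label) pairs.
    """
    profile = defaultdict(Counter)
    for year, label in predictions:
        profile[year // 10 * 10][label] += 1
    return {decade: dict(counts) for decade, counts in sorted(profile.items())}

# Hypothetical model output.
predictions = [
    (1898, "illustration"),
    (1903, "map"),
    (1907, "photograph"),
    (1909, "photograph"),
]
profile = profile_by_decade(predictions)
```

Normalising the counts by the number of pages published per decade would make periods with different corpus sizes comparable.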

Required skills:

  • basic knowledge in computer vision
  • ideally experience with PyTorch

Type of project: Semester (ideally done in loose collaboration with the other semester project on image classification)

Supervisor: Maud Ehrmann

Objective: Given a large archive of historical newspapers (cf. the impresso project) containing both text and image material, the objective of this project is to:

1/ Learn a model for binary image classification: map vs. non-map.
This first step will consist in:

  • the annotation of a small training set (this step is best done in collaboration with the project on image classification);
  • the training of a model by fine-tuning an existing visual model;
  • the evaluation of the said model.

2/ Learn a model for map classification (which country or region of the world is represented)

  • first exploration and qualification of map types in the corpus.
  • building of a training set, probably with external sources;
  • the training of a model by fine-tuning an existing visual model;
  • the evaluation of the said model.

Required skills:

  • basic knowledge in computer vision
  • ideally experience with PyTorch
Type of project : Master or semester project. 
 
Supervisors : Frederic Kaplan (DHLAB), Julien Fargeot (LCAV)
 
Context : Since 1994, the Centre for UNESCO in the French city of Troyes has organised an annual international drawing competition. Each year, a theme is proposed for the competition, which sees the participation of young people from all over the world, aged between 3 and 25 years. The winners’ works are then exhibited and will soon be highlighted by the opening of a dedicated museum, but all the drawings have been preserved and recently digitised. The 115,000 or so works from 150 countries constitute an exceptional collection and a window on the imagination of young people over more than 25 years. Recent machine learning and data analysis techniques will make it possible to explore this unique database, an exploration which this project aims to initiate.
 
Objective : The ambition of the project is to mine the drawing database to find patterns linked with other works from art history. The algorithms and methods developed in the DHLAB Replica project for searching morphological patterns in large-scale databases of artworks will serve as a starting point for this research. The goal will be to explore how the geographical origin and the age of the young artists impact the use of certain kinds of references or drawing techniques.
 
 

Type of project: Semester

Supervisors: Matteo Romanello (DHLAB), Maud Ehrmann (DHLAB), Andreas Spitz (DLAB)

Context: Digitized newspapers constitute an extraordinary goldmine of information about our past, and historians are among those who can most benefit from it. Impresso, an ongoing, collaborative research project based at the DHLAB, has been building a large-scale corpus of digitized newspapers: it currently contains 76 newspapers from Switzerland and Luxembourg (written in French, German and Luxembourgish) for a total of 12 billion tokens. This corpus was enriched with several layers of semantic information such as topics, text reuse and named entities (persons, locations, dates and organizations). The latter are particularly useful for historians as co-occurrences of named entities often indicate (and help to identify) historical events. The impresso corpus currently contains some 164 million entity mentions, linked to 500 thousand entities from DBpedia (partly mapped to Wikidata).

Yet, making use of such a large knowledge graph in an interactive tool for historians — such as the tool impresso has been developing — requires an underlying document model that facilitates the retrieval of entity relations, contexts, and related events from the documents effectively and efficiently. This is where LOAD comes into play, which is a graph-based document model that supports browsing, extracting and summarizing real world events in large collections of unstructured text based on named entities such as Locations, Organizations, Actors and Dates.

Objective: The student will build a LOAD model of the impresso corpus. In order to do so, an existing Java implementation, which uses MongoDB as its back-end, can be used as a starting point. Also, the student will have access to the MySQL and Solr instances where impresso semantic annotations have already been stored and indexed. Once the LOAD model is built, an already existing front-end called EVELIN will be used to create a first demonstrator of how LOAD can enable entity-driven exploration of the impresso corpus.
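
As a very reduced illustration of an entity graph of this kind, the sketch below counts how often entities co-occur in the same sentence and ranks the neighbours of a given entity. The actual LOAD model is richer than this; the input format, type tags and entity names are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_graph(sentences):
    """Build weighted edges between entities that appear in the same sentence.

    `sentences` is a list of lists of (entity_type, entity_name) tuples.
    """
    edges = Counter()
    for entities in sentences:
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges

def neighbours(edges, entity):
    """Entities co-occurring with `entity`, most frequent first."""
    linked = Counter()
    for (a, b), weight in edges.items():
        if a == entity:
            linked[b] += weight
        elif b == entity:
            linked[a] += weight
    return linked.most_common()

# Hypothetical annotated sentences.
sentences = [
    [("LOC", "Place A"), ("ORG", "Org B"), ("DAT", "1900")],
    [("LOC", "Place A"), ("ORG", "Org B")],
    [("ACT", "Person C"), ("LOC", "Place A")],
]
graph = cooccurrence_graph(sentences)
ranked = neighbours(graph, ("LOC", "Place A"))
```

At the scale of the impresso corpus, the same counting would have to be distributed, which is where the big data technologies listed below come in.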

Required skills:

  • good proficiency in Java or Scala
  • familiarity with graph/network theory
  • experience with big data technologies (Kubernetes, Spark, etc.)
  • experience with PostgreSQL, MySQL or MongoDB

Note for potential candidates: In Spring and Fall 2020, two students already worked on this project, but work and research perspectives are far from being exhausted and many things remain to be explored. We therefore propose to continue it, and the focus of the next edition of the project will be adapted according to the candidate’s background and preferences. Do not hesitate to get in touch!

Type of project: Semester

Supervisors: Rémi Petitpierre (IAGS), Paul Guhennec (DHLAB)

Context: The Bibliothèque Historique de la Ville de Paris digitised more than 700 maps covering in detail the evolution of the city from 1836 (plan Jacoubet) to 1900, including the famous Atlases Alphand. They are of particular interest for the urban studies of Paris, which was at the time heavily transfigured by Haussmannian transformations. For administrative and political reasons, the City of Paris did not benefit from the large cadastration campaigns that occurred throughout Napoleonic Europe at the beginning of the 19th century. Therefore the Atlas’s sheets are the finest source of information available on the city over the 19th century. In order to make use of the great potential of this dataset, the information contained in the maps must be converted from its visual representation to an abstract, geometrical form, on which to base quantitative studies. This conversion from an input raster to an output vector geometry is typically performed well by combining Convolutional Neural Networks for pixelwise classification with standard computer vision techniques. In a second phase, the student will develop quantitative techniques to investigate the transformation of the urban fabric.

Tasks

  • Semantically segment the Atlas de Paris maps as well as the 1900 plan: train a pre-existing semantic segmentation neural network on a manually annotated training set, evaluate its success, and fine-tune it so that it can be applied to the entire dataset.
  • Develop a semi-automatic pipeline to align the vectorised polygons on a geographic coordinate system.
  • Depending on the student’s interest, tackle some questions such as:
    • analysing the impact of the opening of new streets on mobility within the city;
    • detecting the appearance of locally aligned neighbourhoods (lotissements);
    • investigating the relation between the unique Parisian hygienist “ilot urbain” and the housing salubrity
    • studying the morphology of the city’s infrastructure network (e.g. sewers)

Prerequisites
Good skills with Python.
Basics of machine learning and computer vision are a plus.

More explorative projects 

For students who want to explore exciting uncharted territories in an autonomous manner. 

Type of project : Master thesis
Supervisors: Frédéric Kaplan
Project summary: This explorative project studies the properties of text kernels, sentences that are invariant after automatic translation back and forth into a foreign language. The goal is to develop a prototype text editor / transformation pipeline that associates any sentence with its invariant form.
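
The notion of an invariant form can be read as a fixed point of the round-trip translation. The sketch below iterates a round-trip function until the sentence stops changing; `toy_round_trip` is only a toy stand-in for a real translation service, and the iteration cap is an assumption, since convergence is not guaranteed in general.

```python
def find_kernel(sentence, round_trip, max_iterations=10):
    """Apply `round_trip` repeatedly until the sentence stops changing.

    Returns (kernel, iterations), or (None, max_iterations) if no fixed point is reached.
    """
    current = sentence
    for iteration in range(max_iterations):
        translated = round_trip(current)
        if translated == current:
            return current, iteration
        current = translated
    return None, max_iterations

# Toy stand-in for a real translation service: it only simplifies a few words.
SIMPLIFICATIONS = {"commence": "begin", "begin": "start", "purchase": "buy"}

def toy_round_trip(sentence):
    """Pretend round-trip translation that rewrites some words."""
    return " ".join(SIMPLIFICATIONS.get(word, word) for word in sentence.split())

kernel, steps = find_kernel("we commence to purchase bread", toy_round_trip)
print(kernel, steps)
```

With a real translation system, cycles between two or more sentences are possible, so a production version would probably need to detect them rather than rely on the iteration cap alone.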

Contact: Prof. Frédéric Kaplan

Type of project : Master thesis
Supervisors: Frédéric Kaplan
Project summary: This project aims at studying the potential of Transformer architectures for new kinds of language games between artificial agents and for the evolution of artificial languages. Artificial agents interact with one another about situations in the “world” and autonomously develop their own language on this basis. This project extends a long series of experiments that started in the 2000s.

Contact: Prof. Frédéric Kaplan

Type of project : Master/Semester thesis
Supervisors: Frédéric Kaplan
Project summary: This explorative project consists in mining a large collection of novels to automatically extract characters and create an automatically generated dictionary of the novel characters. The characters will be associated with the novels in which they appear, with the characters they interact with and possibly with specific traits. 

Contact: Prof. Frédéric Kaplan

Fall 2022

Project type: Master

Supervisors: Maud Ehrmann and Simon Clematide (UZH)

Context: The impresso project aims at semantically enriching 200 years of newspaper archives by applying a range of NLP techniques (OCR correction, named entity processing, topic modeling, text reuse, etc.). The source material comes from Swiss and Luxembourg national libraries and corresponds to the facsimiles and OCR outputs of ca. 200 newspaper titles in German and French.

Problems

  • Automatic processing of these sources is greatly hindered by the sometimes low quality of legacy OCR and OLR (when present) processes: articles are incorrectly transcribed and incorrectly segmented. This has consequences on downstream text processing (e.g. a topic with only badly OCRed tokens). One solution is to filter out elements before they enter the text processing pipeline. This would imply recognizing specific parts of newspaper pages known to be particularly noisy, such as weather tables, transport schedules, crosswords, etc.
  • Additionally, besides advertisement recognition, OLR does not provide any section classification (is this segment a title banner, a feuilleton, an editorial, etc.) and it would be useful to provide basic section/rubrique information.
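
To give an idea of what filtering before the text pipeline could mean in its simplest form, the sketch below flags segments in which most tokens contain no letters. This is only a naive text-only heuristic with an arbitrary threshold, not the approach of this project, which builds on textual and visual features.

```python
import re

TOKEN = re.compile(r"\S+")

def non_word_ratio(text):
    """Share of tokens that contain no alphabetic character (numbers, dots, dashes, ...)."""
    tokens = TOKEN.findall(text)
    if not tokens:
        return 0.0
    non_words = sum(1 for token in tokens if not any(ch.isalpha() for ch in token))
    return non_words / len(tokens)

def looks_tabular(text, threshold=0.5):
    """Flag a segment as table-like when most of its tokens are not words."""
    return non_word_ratio(text) >= threshold

print(looks_tabular("Lausanne 7.15 8.30 9.45"))
```

Such a rule would confuse lists of names with running text and numeric prose with tables, which is part of why learned models are needed.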

Objectives: Building on a previous master thesis which explored the interplay between textual and visual features for segment recognition and classification in historical newspapers (see project description, master thesis, and published article), this master project will focus on the development, evaluation, application and release of a documented pipeline for the accurate recognition and fine-grained semantic classification of tables.

Tables present several challenges, among others:

  • as usual, difference across time and sources;
  • visual clues can be confusing:
    • presence of mixed layout: table-like part + normal text
    • existence of “quasi-tables” (e.g. lists)
  • variety of semantic classes: stock exchanges, TV/Radio program, transport schedules, sport results, events of the day, weather, etc.

Main objectives will be:

  1. Creation and evaluation of table recognition and classification models.
  2. Application of these models on large-scale newspaper archives thanks to a software/pipeline which will be documented and released in view of further usage in academic context. This will support the concrete use case of specific dataset export by scholars.
  3. (bonus) Statistical profile of the large-scale table extraction data (frequencies, proportion in title/pages, comparison through time and titles).

Spring 2021

Type of project: Semester 

Supervisors: Albane Descombes, Frédéric Kaplan.

Description:

Photogrammetry is a 3D modelling technique which enables making highly precise models of our built environment, thus collecting lots of digital data on our architectural heritage – at least, as it remains today.

Over the years, a place could have been recorded by drawing, painting, photographing, scanning, depending on the evolution of measuring techniques.

For this reason, one has to mix various media to show the evolution of a building through the centuries. This project proposes to study techniques for overlaying images on 3D photogrammetric models of Venice and of the Louvre museum. The models are provided by the DHLAB and were computed in the past years. The images of Venice come from the photo library of the Cini Foundation, and the images of Paris can be collected on Gallica (the digital French Library).

Eventually this project will deal with the issues of perspective, pattern and surface recognition in 2D and 3D, customizing 3D viewers to overlay images, and showcasing the result on a web page.

Example of image and photogrammetric overlay.

Type of project: Master

Supervisors: Albane Descombes, Frédéric Kaplan, Alain Dufaux.

Description:
 
Fréquence Banane is one of the oldest student associations on the UNIL-EPFL campus and has thus collected plenty of audio content over the years on various media: recording tapes, CDs, hard disks, NAS.

A collection of hundreds of magnetic tapes contains the association’s first radio shows, recorded in the early 90s after its creation. They include interviews and podcasts about Vivapoly, Forum EPFL or Balélec, events that have set the pace of student life on campus for many years.

This project aims at studying the existing methods for digitizing magnetic tapes in the first place, then building a browsable database with all the digitized radio shows. The analysis of this audio content will be done using adapted speech recognition models.

This project is done in collaboration with Alain Dufaux, from the Cultural Heritage & Innovation Center.

Type of project : Master/Semester thesis
Supervisors: Paul Guhennec, Fabrice Berger, Frédéric Kaplan
Semester of project: Spring 2021
Project summary: The project consists in the development of a scriptable pipeline for producing procedural architecture using the Houdini 3D procedural environment. The project will start from existing procedural models of Venice in 1808 developed at the DHLAB and automate a pipeline to script the model from the historical information recorded about each parcel.
Contact: Prof. Frédéric Kaplan

Type of project : Master thesis
Supervisors: Maud Ehrmann, Frédéric Kaplan
Semester of project: Spring 2021
Project summary: Thanks to the digitisation and transcription campaign conducted during the Venice Time Machine, a digital collection of secondary sources offers arguably complete coverage of the historiography concerning Venice and its population in the 19th century. Through a manual and automatic process, the project will identify a series of hypotheses concerning the evolution of Venice’s functions, morphology and network of property owners. These hypotheses will be translated into a formal language, keeping a direct link with the books and journals where they are expressed, and systematically tested against the model of the city of Venice established through the integration of the models of the cadastral maps. In some cases, the data of the computational model of the city may contradict the hypotheses of the database, and this will lead either to a revision of the hypotheses or to a revision of the computational models of Venice established so far.

Type of project : Master/Semester thesis
Supervisors: Didier Dupertuis, Paul Guhennec, Frédéric Kaplan
Semester of project: Spring 2021
Project summary: As part of the ScanVan project, the city of Sion has been digitised in 3D. The goal is now to generate images of the city by day and by night and in the different seasons (summer, winter, spring and autumn). A Contrastive Unpaired Translation architecture, like the one used for transforming Venice images, will be used for this project.

Contact: Prof. Frédéric Kaplan

Type of project : Master/Semester thesis
Supervisors: Albanes Descombes, Frédéric Kaplan
Semester of project: Spring 2021
Project summary: This project aims to automatically transform YouTube videos into 3D models using photogrammetry techniques. It extends the work of several Master / Semester projects that have made significant progress in this direction. The goal here is to design a pipeline that makes it possible to georeference the extracted models and explore them with a 4D navigation interface developed at the DHLAB.

Contact: Prof. Frédéric Kaplan

Spring 2026

Type: MSc, BA Semester project, or Master project 
Sections: Digital Humanities, Data Science, Computer Science
Supervisor: Alexander Rusnak
Number of students: 1-2

Context

As part of a SwissAI grant, we will be training a foundation-scale model for open-vocabulary 3D scene understanding – with a specific focus on very large scenes of cities or culturally important buildings. For further details on the technical basis for this project, see the references section.

As part of this project, substantial work is needed to prepare datasets and test pipelines, as well as to validate ablations on the primary model / datasets.

Furthermore, this large dataset we are constructing will offer an opportunity to a motivated student to work on extending the Visual Geometry Grounded Transformer paradigm to direct egocentric semantic understanding.

Objective

  • Support the training of a large model for open-vocabulary panoptic segmentation of city-scale 3D models.
  • Research neural reconstruction methods for egocentric semantic understanding using the associated dataset.

Research Questions

  • Can synthetic data derived from large point clouds increase the capability of neural reconstruction methods?
  • What is the limit for 3D-scene scale for these reconstruction models?
  • Can the general 3D representations be easily transferred to predict multimodal semantic vectors?

Main Steps

  • Assist in dataset pipelining and implementation for our larger project
  • Implement VGGT and train an adapter for semantic understanding on a public dataset (e.g. ScanNet or similar)
  • Test VGGT reconstruction / semantic degradation as a function of scene size

References

Type: MSc, BA Semester project, or Master project 
Sections: Digital Humanities, Data Science, Computer Science
Supervisor: Alexander Rusnak
Number of students: 1-2

Context

As part of a SwissAI grant, we will be training a foundation-scale model for open-vocabulary 3D scene understanding – with a specific focus on very large scenes of cities or culturally important buildings. For further details on the technical basis for this project, see the references section.

As part of this project, substantial work is needed to prepare datasets and test pipelines, as well as to validate ablations on the primary model / datasets.

Furthermore, this large dataset we are constructing will offer an opportunity to a motivated student to work on building hierarchical scene representations which facilitate more granular understanding for search and robotic interaction.

Objective

  • Support the training of a large model for open-vocabulary panoptic segmentation of city-scale 3D models.
  • Research hierarchical scene representation methods based on point clouds, Gaussian splats, or other primitives.

Research Questions

  • Can synthetic data derived from large point clouds increase the capability of neural reconstruction methods?
  • What is the limit for 3D-scene scale for these reconstruction models?
  • Can the general 3D representations be easily transferred to predict multimodal semantic vectors?

Main Steps

  • Assist in dataset pipelining and implementation for our larger project
  • Investigate various techniques for creating object-part hierarchies in 3D data, particularly those derived directly from per-3D-point open-vocabulary semantic labels
  • Implement multiple forms of representations of these hierarchies to test their efficacy, particularly when utilizing hyperbolic embeddings
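
As an illustration of why hyperbolic embeddings suit object-part hierarchies, here is a minimal sketch of the distance function of the Poincaré ball model, one common choice of hyperbolic space. The 2D coordinates are invented for the example and are not taken from the project data; a real embedding would be learned.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)

# Hypothetical toy hierarchy: the parent sits at the origin,
# its parts sit closer to the boundary of the ball.
building = (0.0, 0.0)
window = (0.85, 0.0)
door = (0.0, 0.85)

print(poincare_distance(building, window))  # parent to part
print(poincare_distance(window, door))      # part to sibling part
```

Distances grow rapidly near the boundary, so sibling parts end up far from each other while staying comparatively close to their parent, which is the property that makes this geometry attractive for tree-like structures.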

References

Requirements: Python, PyTorch, Open3D

Type: MSc, BA Semester project, or Master project 
Sections: Digital Humanities, Data Science, Computer Science
Supervisor: Alexander Rusnak
Number of students: 1-2

Context

As part of a SwissAI grant, we will be training a foundation-scale model for open-vocabulary 3D scene understanding – with a specific focus on very large scenes of cities or culturally important buildings. For further details on the technical basis for this project, see the references section.

As part of this project, substantial work is needed to prepare datasets and test pipelines, as well as to validate ablations on the primary model / datasets.

Furthermore, the models and datasets we are constructing will offer an opportunity to a motivated student to build cutting-edge user interfaces to enable interaction with and manipulation of large-scale 3D scenes.

Objective

  • Support the training of a large model for open-vocabulary panoptic segmentation of city-scale 3D models.
  • Research optimal graphical user interface design for interaction with large scale 3D scenes with multimodal querying and associated multimodal datasets (historical documents, sustainability data, civil engineering data, etc). 

Research Questions

  • How can we resolve multimodal queries and surface results in a coherent manner for users?
  • How can we facilitate “3D native” querying (i.e. subselecting parts of a 3D scene and using it to search for other semantically similar 3D components or multimodal results from an associated dataset)?
  • How can we integrate 3D search with a chat interface and MLLM? What are the possibilities for a RAG-esque chat functionality but enabled by 3D data?

Main Steps

  • Assist in dataset pipelining and implementation for our larger project
  • Build an integrated front-end with the TimeAtlas beta for 3D native querying (https://timeatlas.eu/)
  • Research optimal multimodal querying techniques and visualizations
  • Explore retrieval augmented generation (RAG) with multimodal datasets and the interaction of MLLM outputs with city-scale point clouds

References

Requirements: Python, PyTorch, Open3D

Type: MSc, BA Semester project, or Master project 
Sections: Digital Humanities, Data Science, Computer Science
Supervisor: Frederic Kaplan
Number of students: 1-2

Context

This project is part of a collaborative research partnership between DHLAB and Flickr Foundation. Flickr Commons hosts millions of historical images contributed by cultural heritage institutions. Alongside the images, users generate “social metadata” such as tags, comments, and curated galleries. This community-produced information often contains contextual knowledge (identifications, places, events, interpretations) that is missing from institutional catalogues.

Recent image-to-text AI models can automatically generate descriptions, but these typically rely only on visual input and overlook existing social context. This project explores whether combining visual models with social metadata can lead to richer, more accurate descriptions of archival images.

Objective

The objective of the project is to develop and evaluate prototype methods for social metadata-enhanced description of archival images. 

Research Questions

  • How do current image description models perform on historical photographs from Flickr Commons?
  • What types of contextual information appear in social metadata but are missing from model outputs?
  • Can social metadata be used to improve the accuracy, specificity, or cultural relevance of generated descriptions?
  • What are the limitations or risks (e.g., bias, misinformation) of incorporating community-generated metadata?

Dataset

While Flickr contains tens of billions of user-uploaded images, the project will focus on the Flickr Commons collection for this bounded study. This collection offers several advantages:

  • Pre-vetted content: images have been curated by trusted institutional partners of the Commons, reducing the risk of encountering harmful material
  • All images are designated with No Known Copyright Restrictions
  • Many images contain social metadata accumulated over years of community engagement on Flickr

Main Steps

  • Literature and tool review: survey existing image-to-text systems and prior work on social metadata in GLAM collections (Flickr has already conducted studies on this).
  • Dataset preparation: select a representative subset of Flickr Commons images and collect the associated social metadata.
  • Baseline evaluation: generate descriptions using state-of-the-art models and assess their performance.
  • Metadata integration: design and implement methods to incorporate tags, comments, or galleries into the description process.
  • Quantitative + qualitative evaluation:  compare baseline and enhanced descriptions with respect to completeness, accuracy, and archival usefulness.
  • Documentation and project report: write up findings and prepare recommendations for cultural heritage applications.
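
One simple way to approach the metadata integration step is to pass the social metadata to the image-to-text model as part of the text prompt. The sketch below only assembles such a prompt. The function name, the field names and the prompt wording are assumptions made for illustration, and the call to an actual model is left out.

```python
def build_description_prompt(title, tags, comments, max_comments=3):
    """Assemble a text prompt that passes social metadata to an image-to-text model."""
    lines = ["Describe this archival photograph for a catalogue record."]
    if title:
        lines.append(f"Institutional title: {title}")
    if tags:
        # De-duplicate tags case-insensitively while keeping first-seen order.
        seen, unique = set(), []
        for tag in tags:
            key = tag.strip().lower()
            if key and key not in seen:
                seen.add(key)
                unique.append(tag.strip())
        lines.append("Community tags: " + ", ".join(unique))
    # Keep only the first few comments to bound the prompt length.
    for comment in comments[:max_comments]:
        lines.append(f"Community comment: {comment.strip()}")
    lines.append(
        "Treat community metadata as unverified hints; "
        "do not state it as fact unless the image supports it."
    )
    return "\n".join(lines)

# Invented example record, not real Flickr Commons data.
prompt = build_description_prompt(
    "Harbour view",
    ["Boat", "boat ", "Sydney"],
    ["Taken near the quay.", "Possibly the 1920s.", "Nice shot!", "Fourth comment"],
)
print(prompt)
```

The final instruction line reflects the risk noted in the research questions: community metadata can carry errors, so the prompt treats it as a hint rather than as ground truth.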

References

Recent archival initiatives have included: FLAME (2024-25), PAAG (2023-24), Harvard Art Museums AI Explorer (2016-present), Rijksmuseum x Microsoft Azure (2023), Heritage Connector (2020-21) and SherlockNet (2016). These examples show how generated descriptions can improve accessibility and discoverability in archives and collections management.

Requirements

  • Experience with Python and working with large multimodal APIs
  • Familiarity with machine learning and multimodal AI models
  • Ability to design and evaluate experiments
  • Interest in digital cultural heritage, archives, or social metadata
  • Awareness of ethical considerations in AI and GLAM contexts

Fall 2025

Type: MSc (12 ECTS) semester project 
Sections: Digital Humanities, Data Science, Computer Science
Supervisor: Hamest Tamrazyan, Emanuela Boros
Number of students: 1

Context

Large Language Models (LLMs) face significant challenges in capturing culture-specific meaning, especially for low-resource languages like Armenian or Ukrainian. Research shows that LLMs are not individuals but rather superpositions of cultural perspectives, and their outputs risk cultural erasure by oversimplifying or omitting diverse cultural realities.

Moreover, while methods like CultureLLM attempt to incorporate cultural diversity through data augmentation, a gap remains in fine-grained annotation of psychosocial dimensions of meaning within bilingual corpora.

This project aims to create structured, culturally aware annotations to support both the evaluation and improvement of LLMs and MT systems.

Objective

  • Develop a bilingual dataset (e.g., Armenian-English) focused on culturally embedded expressions.
  • Apply the Psychosocial Categorisation Model (PSCM) to annotate literal, categorical, emotional, and contextual meanings.
  • Investigate statistical signals (e.g., valence shifts, collocation patterns) that identify culture-loaded expressions.
  • Evaluate LLM/MT performance changes with exposure to culturally annotated data.
  • Contribute to the broader effort of resisting cultural erasure in AI systems.

Research Questions

  • What linguistic features statistically signal cultural and semantic density across bilingual corpora?
  • How can PSCM-based annotation systematically capture culture-specific meanings beyond literal translation?
  • Can embedding-based, statistical, and content analysis methods automatically assist in selecting candidates for cultural annotation?
  • Does exposure to PSCM-annotated data improve LLM/MT outputs for low-resource languages, or explain unexpected failures (e.g., perspective shifts)?

Main Steps

  • Literature review: Cognitive semantics, cultural linguistics, LLM evaluation, cultural bias in AI.
  • Data selection: Choose human- and machine-translated bilingual texts rich in cultural material (idioms, folklore, social discourse).
  • Statistical analysis:
    • Valence and emotional scoring.
    • Collocation strength and frequency shifts.
    • Semantic clustering (using embeddings).
  • Candidate selection: Identify units for PSCM annotation using statistical signals and manual verification.
  • PSCM schema design: Define annotation guidelines for literal, categorical, emotional, and contextual levels.
  • Manual annotation: Apply the PSCM schema to selected data; refine based on pilot annotations.
  • Similarity and divergence analysis: Use embedding-based methods to measure shifts between human, machine, and culturally annotated data.
  • LLM/MT evaluation:
    • Compare model outputs with baseline vs. PSCM-enriched prompts.
    • Analyse unexpected perspective shifts and cultural omissions.
    • Analysis of cultural meaning retention: Interpret how models succeed or fail to represent cultural semantics.
  • Reporting: Deliver annotated dataset, analysis results, and recommendations for culturally-aware NLP development.
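
For the collocation-strength signal mentioned in the statistical analysis step, a minimal sketch using pointwise mutual information (PMI) over adjacent word pairs is given below. PMI is only one of several possible association measures, and the English token sequence is an invented toy example rather than corpus data.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=2):
    """Pointwise mutual information (log base 2) for adjacent word pairs."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

# Invented toy text, used only to exercise the function.
tokens = "the evil eye and the evil eye again the house and the eye".split()
for pair, score in sorted(bigram_pmi(tokens).items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

Pairs with high PMI co-occur more often than their individual frequencies would predict, which makes them candidates for manual inspection as culture-loaded expressions.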

References

Requirements

  • Background in NLP, data science, or computational linguistics.
  • Skills in Python and common NLP libraries (spaCy, NLTK, sklearn, HuggingFace).
  • Preferably, knowledge of LLM interpretability.
  • Knowledge of basic annotation practices; familiarity with tools like Prodigy, Doccano, or custom scripts.
  • Understanding of cultural linguistics, cognitive semantics, or interest in psycholinguistics.
  • (Optional) Knowledge of Armenian or Ukrainian — otherwise translation and interpretive support will be provided.

Type: MSc (12 ECTS) Semester project 
Sections: Digital Humanities, Data Science, Computer Science
Supervisor: Hamest Tamrazyan, Emanuela Boros
Number of students: 2

Context

This project aims to explore, retrieve, and analyze data from various sources to monitor and understand Russia’s manipulation and weaponisation of cultural heritage in Armenia. By employing data analysis techniques, this project seeks to document and provide insights into these actions, contributing to the preservation of cultural heritage and supporting international awareness and policy-making.

Objective and (Possible) Main Steps:

One can choose to analyse Wikipedia to:

  • Track Editing Histories: The revision history of contentious articles (e.g., articles about Armenian cities or cultural artifacts) may reveal politically or ideologically motivated edits.
  • Automated Page Tracking: Use tools like Wikimedia’s API to monitor changes in articles about cultural heritage in real-time.
  • Cross-Check Narratives: Compare Wikipedia content with scholarly sources and publications from multiple perspectives.
  • Investigate Talk Pages: The discussions on an article’s talk page often reveal disputes and biases.
  • Web Scraping for Content and Metadata: Use scraping libraries like BeautifulSoup or Scrapy to collect article text, editor information, and metadata not available through the API.
  • Revision Analysis:
    • Compare successive revisions of articles using diff algorithms (e.g., difflib) to detect content additions, deletions, or modifications.
    • Highlight changes in sentiment, bias, or framing.
  • Semantic Page Selection: Employ embedding models like Alibaba-NLP/gte-multilingual to identify articles with semantic relevance to “cultural heritage” or “cultural manipulation.”
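
The revision comparison mentioned above can be sketched with Python’s standard difflib module. The two revision texts are invented placeholders, and retrieving real revisions (for example through the Wikimedia API) is left out of this sketch.

```python
import difflib

def diff_revisions(old_text, new_text):
    """Return the lines removed from and added to an article between two revisions."""
    removed, added = [], []
    for line in difflib.ndiff(old_text.splitlines(), new_text.splitlines()):
        if line.startswith("- "):
            removed.append(line[2:])
        elif line.startswith("+ "):
            added.append(line[2:])
        # Lines starting with "  " are unchanged and "? " lines are hints; both are skipped.
    return removed, added

# Invented placeholder revisions, not real Wikipedia content.
old = "The monastery was founded in the 9th century.\nIt is located in the region."
new = "The monastery was founded in the 12th century.\nIt is located in the region."

removed, added = diff_revisions(old, new)
print("removed:", removed)
print("added:", added)
```

The removed and added lines can then be passed to downstream steps such as sentiment or framing analysis, so that only the changed text is examined.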

Requirements: Excellent Python knowledge, scraping, large language models knowledge

Significance:

This project will contribute to understanding cultural heritage manipulation in Armenia, providing valuable insights and digital resources for future academic and cultural research. It will also contribute to the creation of strategies to protect cultural heritage in conflict zones. The methodologies developed in this project can also be applied to other conflict areas, enhancing global efforts to safeguard cultural heritage.

Taken

Type: MSc (12 ECTS) semester project 
Sections: Digital Humanities, Data Science, Computer Science
Supervisor: Hamest Tamrazyan, Emanuela Boros
Number of students: 1

Context

The Corpus of Armenian Inscriptions is a foundational printed resource documenting Armenian epigraphic heritage. Its digitization and encoding in EpiDoc TEI/XML format will enhance accessibility, interoperability, and preservation. However, manual encoding is time-consuming. This project aims to develop a semi-automated pipeline to extract structured data from the PDF corpus and generate valid EpiDoc TEI/XML files.

Objective

  • To build a computational workflow that extracts relevant metadata and texts from the Armenian inscriptions corpus PDF and produces EpiDoc-compliant TEI/XML files according to EpiDoc guidelines and schemas.

Research Questions

  • How can textual and metadata content be automatically extracted from a scanned or born-digital PDF of Armenian inscriptions?
  • What natural language processing or rule-based techniques are effective for identifying epigraphic metadata in Armenian?
  • How can the extracted information be mapped to the EpiDoc TEI/XML schema to produce valid, reusable digital editions?
  • What are the limitations and accuracy challenges posed by OCR and automated extraction in this context?

Main Steps

  • Analyze the PDF corpus to determine the nature of its content (text layer vs. scanned images).
  • Apply OCR (using Armenian OCR tools like Calfa hye-tesseract) if necessary, and clean the extracted text.
  • Segment the text into individual inscription entries and identify key metadata fields (e.g., provenance, material, date, language, transcription).
  • Develop rule-based or NLP methods to extract structured information.
  • Design and implement a script to generate valid EpiDoc TEI/XML files from the extracted data, following EpiDoc schema and templates.
  • Validate generated XML files against the EpiDoc schema.
  • Document the process, challenges, and provide sample output files.
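
As a starting point for the generation step, the sketch below builds a skeleton TEI file with an EpiDoc-style edition division from one extracted record. It uses the standard library’s ElementTree rather than lxml to stay dependency-free. The record keys and values are hypothetical placeholders, the element selection is deliberately minimal, and the output has not been validated against the EpiDoc schema.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"
ET.register_namespace("", TEI_NS)  # serialise TEI as the default namespace

def tei(tag):
    """Qualify a tag name with the TEI namespace."""
    return f"{{{TEI_NS}}}{tag}"

def build_epidoc(record):
    """Build a minimal TEI document from one inscription record (a dict)."""
    root = ET.Element(tei("TEI"))
    file_desc = ET.SubElement(ET.SubElement(root, tei("teiHeader")), tei("fileDesc"))
    title_stmt = ET.SubElement(file_desc, tei("titleStmt"))
    ET.SubElement(title_stmt, tei("title")).text = record["title"]
    pub_stmt = ET.SubElement(file_desc, tei("publicationStmt"))
    ET.SubElement(pub_stmt, tei("p")).text = "Generated draft, not yet validated."
    ms_desc = ET.SubElement(ET.SubElement(file_desc, tei("sourceDesc")), tei("msDesc"))
    ms_id = ET.SubElement(ms_desc, tei("msIdentifier"))
    ET.SubElement(ms_id, tei("idno")).text = record["id"]
    origin = ET.SubElement(ET.SubElement(ms_desc, tei("history")), tei("origin"))
    ET.SubElement(origin, tei("origPlace")).text = record["provenance"]
    ET.SubElement(origin, tei("origDate")).text = record["date"]
    body = ET.SubElement(ET.SubElement(root, tei("text")), tei("body"))
    edition = ET.SubElement(
        body, tei("div"), {"type": "edition", f"{{{XML_NS}}}lang": record["lang"]}
    )
    ET.SubElement(edition, tei("ab")).text = record["transcription"]
    return ET.tostring(root, encoding="unicode")

# Hypothetical record; all values are placeholders.
record = {
    "id": "CAI-0001",
    "title": "Sample inscription",
    "provenance": "[place]",
    "date": "[date]",
    "lang": "hy",
    "transcription": "[transcription text]",
}
print(build_epidoc(record))
```

Fields such as material and support description would go into further msDesc children, and the real structure should follow the EpiDoc templates; validation against the schema remains a separate step.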

References

  • EpiDoc Guidelines and Schema: https://epidoc.sf.net/
  • Calfa hye-tesseract OCR: https://github.com/Calfa/hye-tesseract
  • TEI Consortium, TEI Guidelines: https://tei-c.org/release/doc/tei-p5-doc/en/
  • Relevant publication on Armenian epigraphy (http://serials.flib.sci.am/openreader/vimagrutyun_5/book/content.html)
  • Python libraries: pdfminer.six, lxml, spaCy (or other NLP tools)

Requirements

  • Proficiency in Python programming
  • Basic knowledge of XML, TEI.

Type: Master thesis project (30 ECTS)
Sections: Computer Science, Data Science, Digital Humanities
Supervisors: Frédéric Kaplan, Emanuela Boros, Maud Ehrmann
Number of students: 1

Context: The Impresso Project enriches large collections of radio and newspaper archives using image and text processing techniques. Among the many processing steps applied to a dataset of over 130 digitised historical newspapers containing millions of pages and images (15B tokens) is traditional named entity recognition and linking. While location names are recognised and linked to Wikidata, a crucial missing step is to accurately georeference the relevant location names, so that the spatial dimension can be integrated with the temporal one.

Objective. This project aims to scale multilingual location detection and georeferencing across the Impresso corpus, with a particular focus on sub-city levels.

Main Steps

  • Familiarising yourself with the Impresso project and data.
  • Literature Review: Explore existing research on multilingual location detection and georeferencing.
  • Data Analysis: Examine location names already recognised in the corpus, analyzing their statistical profiles, common errors, and areas for improvement, particularly at the sub-city level. Additionally, identify which location entities in historical newspapers are most relevant for mapping purposes.
  • System Implementation: Develop or adapt a system for fine-grained location name recognition and linking. Various directions could be followed.
  • Relevance Filtering: Design a method to determine which recognised place names are meaningful and should be georeferenced.
  • Evaluation: Assess the system’s performance using appropriate metrics and benchmarks.
  • Application: Deploy the system on the entire Impresso corpus.

The student can leverage tools such as the T-Res library, DeezyMatch, the HIPE entity evaluation pipeline, and the TopRes19th dataset.
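
For the evaluation step, one metric often used for georeferencing is the share of predictions that fall within a given distance of the gold coordinates. The sketch below computes it with the haversine formula. The coordinates are approximate and purely illustrative, and the thresholds are assumptions; sub-city georeferencing would call for small thresholds.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def accuracy_within(predicted, gold, threshold_km):
    """Fraction of predictions lying within threshold_km of the gold coordinates."""
    hits = sum(1 for p, g in zip(predicted, gold) if haversine_km(*p, *g) <= threshold_km)
    return hits / len(gold)

# Approximate, illustrative coordinates: gold is central Lausanne twice;
# predictions are a point a few kilometres west and a point in Geneva.
gold = [(46.5197, 6.6323), (46.5197, 6.6323)]
predicted = [(46.5191, 6.5668), (46.2044, 6.1432)]

print(accuracy_within(predicted, gold, threshold_km=10))
print(accuracy_within(predicted, gold, threshold_km=161))
```

Reporting the metric at several thresholds shows whether errors are near misses within a city or confusions between distant places.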

Requirements: Knowledge of machine learning (ML) and deep learning (DL), familiarity with natural language processing, proficiency in Python, experience with a DL framework (preferably PyTorch), interest in historical data.

A few references:

  • Ardanuy, M. C., Nanni, F., Beelen, K., & Hare, L. (2023). The past is a foreign place: Improving toponym linking for historical newspapers. CEUR Workshop Proceedings (http://ceur-ws.org), ISSN 1613-0073.
  • Meijers, E., & Peris, A. (2019). Using toponym co-occurrences to measure relationships between places: Review, application and evaluation. International Journal of Urban Sciences, 23(2), 246-268.

Type: BA or MSc (8-12 ECTS) semester project
Sections: Computer Science, Data Science, Digital Humanities
Supervisors: Maud Ehrmann, Pauline Conti, Emanuela Boros
Number of students: 1

Context: The Impresso Project features a dataset of around 135 digitised historical newspapers containing approximately 4 million images. These images have no labels, and only 10% of them have a caption, two aspects that hinder their retrieval.

Objective. The objective of this project is to generate informative and accurate textual descriptions (ideally in English, French and German) of these images and to determine an evaluation method to assess the quality of the descriptions. This process is referred to as image captioning. Descriptions can have the format of captions or be slightly longer.

Challenges: 1) the images are extracted from historical newspapers spanning 200 years and are therefore of very uneven quality; 2) what the images represent is very diverse, and we would like high-quality descriptions across the full range of topics and styles.

Main steps: The project could proceed as follows:

  1. Investigating which large vision-language models are good candidates for the task. Examples include Flamingo, PaliGemma (Google), BLIP (Salesforce), CLIP-ViT (OpenAI), GIT, Florence, and Phi (Microsoft), LLama3v;
  2. Building a test dataset from images (of a certain type) that already have a caption and from images that do not have a caption;
  3. Determining an evaluation method. Basically, what do we define as a good description? Criteria could be: language correctness; accuracy, i.e. the caption provides a correct description of the image; informativeness, i.e. the caption provides elements of information that are useful and interesting; ‘texture’/tone: the tone of the caption is adapted to the image; for those images that already have captions: how close are the generated ones to the original ones.
  4. Determining a baseline and a series of experiments that make sense given the context of the project and the historical nature of the material
  5. Analysing the results and drawing conclusions on what works best in which setting for this type of material.
  6. If the project is accepted as an MSc semester project, an additional step for finetuning (training) an LLM for image captioning might be considered.
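
For images that already have a caption, closeness between the generated and the original caption can be approximated, as a first baseline, by token overlap. The sketch below computes a token-level F1 score. This is an assumption about one possible measure, not the project’s chosen metric, and it ignores synonyms and paraphrase; the captions are invented.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into word tokens (Unicode-aware, so accents are kept)."""
    return re.findall(r"\w+", text.lower())

def token_f1(generated, reference):
    """Harmonic mean of token precision and recall between two captions."""
    gen, ref = Counter(tokenize(generated)), Counter(tokenize(reference))
    overlap = sum((gen & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Invented captions for illustration.
print(token_f1("A tram in the street", "Tram in Geneva"))
print(token_f1("Un tramway à Genève", "Le tramway de Genève"))
```

Since newspaper captions are often short and allusive, a low overlap does not necessarily mean a poor description, which is why the other criteria listed above still require human or model-based judgement.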

Additional material:

  • a dataset of 7200 images annotated with their types (drawing, map, photo, graphs, etc.)
  • for each image, pre-computed embeddings from four different models.

Requirements: Knowledge of machine learning (ML) and deep learning (DL), familiarity with natural language processing, proficiency in Python, experience with a DL framework (preferably PyTorch).

Type: MSc (12 ECTS) semester project
Sections: Computer Science, Data Science, Digital Humanities
Supervisors: Alexander Rusnak
Number of students: 1

Context: There is a current paradigm in 3D scene understanding leveraging vision models, particularly ones that produce semantic vectors like CLIP, on the images that are used to create 3D scans before projecting the derived features onto the 3D scene structure.

However, for many “in the wild” 3D scenes or datasets, the images from which the scene was derived are unavailable, yet being able to use these 3D scenes as training data is key to creating a foundation-scale model. To apply the same projection approach to a point cloud (or mesh), one has to take synthetic images of the 3D model. For many scenes, especially where the points are sparse, these images look like pictures of a point cloud rather than realistic images. When a VLM trained only on natural images is applied, the semantic vector reflects this (i.e. with the features from a picture of a point cloud of a chair, the closest text in the embedding space would be “a point cloud of a chair”, whereas for optimal projection performance the vector should just represent “a chair”).

Objective: Evaluate the performance differential between real and synthetic images of 3D scenes with base VLMs and then finetune the VLMs to improve their performance on synthetic images. 

Research Questions:

  • Which VLMs are most effective out of the box on synthetic images? 
  • How much can we improve their performance on synthetic scenes? 
  • What is the best way to develop pixel-level features (cropping around segments or using a default pixelwise encoder)?

Main Steps:

  1. Take some 3D datasets which also have the natural images and the poses of the images.
  2. Take synthetic images of the point cloud from the same poses as the natural images.
  3. Finetune some VLM(s) using these paired natural and synthetic images to minimize the distance between the two embeddings.
  4. Publish a paper about the results / open-source the best model on Hugging Face
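
The quantity that step 3 seeks to minimize can be made concrete with a small sketch: the mean cosine distance between the embedding of each natural image and the embedding of the synthetic image taken from the same pose. The vectors below are toy values rather than real VLM outputs, and in an actual finetuning run this would be written as a differentiable loss in a deep learning framework.

```python
import math

def cosine_distance(a, b):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def mean_pair_distance(natural_embs, synthetic_embs):
    """Average cosine distance over pose-matched (natural, synthetic) embedding pairs."""
    distances = [cosine_distance(n, s) for n, s in zip(natural_embs, synthetic_embs)]
    return sum(distances) / len(distances)

# Toy 2D embeddings: the first pair agrees, the second pair is orthogonal.
natural = [[1.0, 0.0], [0.0, 1.0]]
synthetic = [[1.0, 0.0], [1.0, 0.0]]
print(mean_pair_distance(natural, synthetic))
```

The same function, computed before and after finetuning on held-out pairs, gives a simple measure of the performance differential between real and synthetic images.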

References:

Requirements: Python, PyTorch, Open3D, etc

Type: MSc (12 ECTS) semester project
Sections: Computer Science, Data Science, Digital Humanities
Supervisors: Alexander Rusnak
Number of students: 1

Context: 

There is quite a bit of research on image aesthetic assessment (IAA) but much less on the assessment of 3D data, even though 3D data often represents highly aesthetically oriented objects (e.g. sculptures, cathedrals, civic buildings) and the aesthetic quality of the built landscape seems to have a strong effect on the wellbeing of its inhabitants.

The goal of the project would be to take a 3D dataset and build a pseudo-labeling pipeline to create aesthetic labels for the various points, and then train a model to predict these labels from the point cloud structure alone. The labeling pipeline would likely capture images of the 3D scene, use an IAA model / vision foundation model to produce pixel-wise labels, and then project these labels back into the 3D scene.
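
The back-projection step of such a labeling pipeline can be sketched with a pinhole camera model: each 3D point is projected into the image and receives the label of the pixel it lands on. The sketch assumes points already expressed in camera coordinates, hypothetical intrinsics (fx, fy, cx, cy), and a label map stored as a nested list; it ignores occlusion and the fusion of labels from several views.

```python
def project_labels_to_points(points_cam, label_map, fx, fy, cx, cy):
    """Assign each 3D point the label of the pixel it projects to, or None if unseen."""
    height, width = len(label_map), len(label_map[0])
    labels = []
    for x, y, z in points_cam:
        if z <= 0:
            labels.append(None)  # point is behind the camera
            continue
        u = int(round(fx * x / z + cx))  # pixel column
        v = int(round(fy * y / z + cy))  # pixel row
        if 0 <= u < width and 0 <= v < height:
            labels.append(label_map[v][u])
        else:
            labels.append(None)  # point falls outside the image
    return labels

# Toy 3x3 map of aesthetic scores and a few points in camera coordinates.
scores = [[0.1, 0.2, 0.3],
          [0.4, 0.5, 0.6],
          [0.7, 0.8, 0.9]]
points = [(0.0, 0.0, 1.0), (1.0, 1.0, 1.0), (0.0, 0.0, -1.0), (5.0, 0.0, 1.0)]
print(project_labels_to_points(points, scores, fx=1.0, fy=1.0, cx=1.0, cy=1.0))
```

In practice a depth test is needed so that points hidden behind other geometry do not inherit labels from the surface in front of them, and scores from multiple views would be averaged per point.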

Objective: Develop a model for the 3D aesthetic assessment of scenes and objects at point or superpoint level granularity.

Research Questions:

  • How capable are projected VLM features of capturing abstract categories like beauty in 3D scenes? 
  • Are IAA models more accurate for this task than generalist VLMs?
  • What is the best way of creating pixelwise features from whole image features?
  • How effective is distillation of these features from a model which evaluates 3D structure? Does this model work truly out of sample, i.e. does a model trained on sculptures work on buildings and vice versa?

Main Steps:

  1. Literature review on IAA and 3DAA
  2. Project features from VLMs and bespoke IAA models into 3D scenes and evaluate their agreement. 
  3. Distill a model for predicting these features.
  4. Test this model on out-of-domain 3D data to assess the applicability of 3D structure aesthetics across various objects / scenes.

References:

Requirements: Python, PyTorch, Open3D, etc

Type: MSc (12 ECTS) semester project
Sections: Computer Science, Data Science, Digital Humanities
Supervisors: Alexander Rusnak
Number of students: 1

Context: Recent advances in mechanistic interpretability have substantially increased the clarity of the internal reasoning of large transformer based neural networks, allowing researchers to disentangle steps of conceptual reasoning at high levels of abstraction. In order to unlock the feature graphs which make this sort of analysis possible, it is necessary to replace certain layers within the LLM with a connected transcoder architecture and train this new system to replicate the behaviour of the LLM while attached to the remaining frozen layers. This novel approach has thus far not been applied to many different models, or in a multimodal context.

Objective: Train a transcoder replacement model to replicate the behaviour of a vision-language model, then use it to perform various forms of analysis on the internal reasoning of the VLM.

Research Questions:

  • Is it possible to train a transcoder to replicate the behaviour of a VLM?
  • Does the VLM encode modality agnostic representations of concepts in an analogous way to multilingual conceptual features?
  • Can we use the transcoder features to determine logical reasoning steps on the analysis of images, and in particular, images of text heavy documents?

Main Steps:

  1. Literature review and model selection
  2. Implement layer replacement with transcoder within candidate model or models.
  3. Run training of the transcoder replacement model.
  4. Use modified model to identify modality-unified features if possible. 
  5. Use modified model to test reasoning steps on text heavy documents.

References:

Requirements: Python, PyTorch, etc

Type: MSc (12 ECTS) semester project 
Sections: Digital Humanities, Data Science, Computer Science
Supervisor: Hamest Tamrazyan, Emanuela Boros
Number of students: 1–2

Context

This project aims to explore, retrieve, and analyse data from the resources, focusing on a select collection of books related to epigraphy and cultural heritage. The primary objective is to gain insights into Armenian epigraphic and cultural heritage through detailed data analysis, term extraction, and database management.

Objective

This project aims to deepen the knowledge of the architectural and epigraphic significance of the church, explore innovative techniques for digitizing and visualizing cultural heritage, and contribute to the preservation and accessibility of Armenian inscriptions in Nagorno-Karabakh.

Objectives and Main steps

  1. Data Retrieval: Collect and aggregate data from academic resources, specifically targeting books and resources about epigraphy and cultural heritage.
  2. Data Cleaning and Formatting: Implement data preprocessing techniques to ensure data quality. This includes removing irrelevant or corrupt data, handling missing values, and standardizing formats.
  3. Database Setup: Design and implement a database to store and manage retrieved data efficiently. The database should allow easy access and manipulation of the data for analysis.
  4. Term Extraction and Analysis: Employ natural language processing (NLP) techniques to extract key terms, concepts, and thematic elements from the texts. This will help us understand the predominant themes and patterns in Armenian epigraphy and cultural heritage.
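
As a starting point for steps 3 and 4, the following is a minimal sketch using only the Python standard library. It stores text passages in an in-memory SQLite database and counts candidate terms. The table layout, the stop-word list and the sample sentences are illustrative assumptions, not project data; a real pipeline would likely rely on a dedicated NLP toolkit for tokenisation and term extraction.

```python
import re
import sqlite3
from collections import Counter

# Hypothetical stop-word list; a real project would use a fuller one.
STOP_WORDS = {"the", "of", "and", "in", "on", "a", "is", "was", "to"}


def create_db():
    """Create an in-memory database with a single table of text passages."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE passages (id INTEGER PRIMARY KEY, source TEXT, text TEXT)"
    )
    return conn


def add_passage(conn, source, text):
    """Insert one passage together with the name of its source."""
    conn.execute("INSERT INTO passages (source, text) VALUES (?, ?)", (source, text))
    conn.commit()


def top_terms(conn, n=5):
    """Return the n most frequent tokens that are not stop words."""
    counts = Counter()
    for (text,) in conn.execute("SELECT text FROM passages ORDER BY id"):
        tokens = re.findall(r"\w+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS and len(t) > 2)
    return counts.most_common(n)


# Invented sample passages, for illustration only.
conn = create_db()
add_passage(conn, "book_a", "The inscription on the khachkar mentions the donor.")
add_passage(conn, "book_b", "A khachkar inscription was found in the monastery.")
print(top_terms(conn, n=2))
```

Raw frequency counts are only a first approximation of term extraction; the same table can later feed more elaborate methods without changing the storage layer.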

Requirements

  • Proficiency in Python, knowledge of NLP techniques.

Significance

This project will contribute to the understanding of Armenia’s rich cultural heritage. It will provide valuable digital resources for future academic and cultural research in this field.

Spring 2025

Taken

Type: MSc (12 ECTS) Semester project 
Sections: Computer Science, Data Science, Digital Humanities
Supervisor: Alexander Rusnak
Number of students: 1

Context: The field of AI ethics has become increasingly relevant as language models have proliferated into the public sphere. 

Objective: Find novel ways of quantifying normative ethics and persistence of ethical frameworks across various scenarios. 

Type of project: Semester

Supervisor: Maud Ehrmann

Objective: Given a large archive of historical newspapers (cf. the impresso project) containing both text and image material, the objective of this project is to:

1/ Learn a model for binary image classification: map vs. non-map.
This first step will consist in:

  • the annotation of a small training set (this step is best done in collaboration with the project on image classification);
  • the training of a model by fine-tuning an existing visual model;
  • the evaluation of the said model.

UPDATE: an annotated dataset already exists.

2/ Learn a model for map classification (which country or region of the world is represented)

  • a first exploration and qualification of map types in the corpus;
  • the building of a training set, probably with external sources;
  • the training of a model by fine-tuning an existing visual model;
  • the evaluation of the said model.

Required skills:

  • basic knowledge in computer vision
  • ideally experience with PyTorch

Type: MSc (12 ECTS), BA (8 ECTS) Semester project
Sections: Computer Science, Data Science, Digital Humanities
Supervisor: Paul Guhennec
Number of students: 1–2 (or more – together or separate)

Type: MSc (12 ECTS) Semester project 
Sections: Digital Humanities, Data Science, Computer Science
Supervisor: Hamest Tamrazyan, Emanuela Boros
Number of students: 1–2

Context

This project aims to explore, retrieve, and analyze data from various sources to monitor and understand Russia’s manipulation and weaponisation of cultural heritage in Ukraine. By employing data analysis techniques, this project seeks to document and provide insights into these actions, contributing to the preservation of cultural heritage and supporting international awareness and policy-making.

Objective and (Possible) Main Steps:

One can choose to analyse Wikipedia to:

  • Track Editing Histories: The revision history of contentious articles (e.g., articles about Ukrainian cities or cultural artifacts) may reveal politically or ideologically motivated edits.
  • Automated Page Tracking: Use tools like Wikimedia’s API to monitor changes in articles about cultural heritage in real time.
  • Cross-Check Narratives: Compare Wikipedia content with scholarly sources and publications from multiple perspectives.
  • Investigate Talk Pages: The discussions on an article’s talk page often reveal disputes and biases.
  • Web Scraping for Content and Metadata: Use scraping libraries like BeautifulSoup or Scrapy to collect article text, editor information, and metadata not available through the API.
  • Revision Analysis:

    • Compare successive revisions of articles using diff algorithms (e.g., difflib) to detect content additions, deletions, or modifications.

    • Highlight changes in sentiment, bias, or framing.

  • Semantic Page Selection: Employ embedding models like Alibaba-NLP/gte-multilingual to identify articles with semantic relevance to “cultural heritage” or “cultural manipulation.”
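
The revision analysis step can be sketched with `difflib` from the Python standard library. The example below compares two versions of an article line by line and lists what was added and removed. The two revision texts are invented for illustration; in practice they would come from the article's revision history.

```python
import difflib


def diff_revisions(old_text, new_text):
    """Return (added, removed) lines between two revisions of an article."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    added, removed = [], []
    for line in difflib.ndiff(old_lines, new_lines):
        if line.startswith("+ "):
            added.append(line[2:])
        elif line.startswith("- "):
            removed.append(line[2:])
        # Lines starting with "  " are unchanged; "? " lines are ndiff hints.
    return added, removed


def similarity(old_text, new_text):
    """Character-level similarity between two revisions, from 0.0 to 1.0."""
    return difflib.SequenceMatcher(None, old_text, new_text).ratio()


# Invented revisions, for illustration only.
old_rev = "The cathedral was built in the 11th century.\nIt is a protected monument."
new_rev = "The cathedral was built in the 11th century.\nIt is a monument."

added, removed = diff_revisions(old_rev, new_rev)
print("added:", added)
print("removed:", removed)
```

The added and removed lines are natural inputs for the follow-up step of looking at changes in sentiment, bias, or framing.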

Requirements: Excellent knowledge of Python, experience with web scraping, knowledge of large language models

Significance:

This project will contribute to understanding cultural heritage manipulation in Ukraine, providing valuable insights and digital resources for future academic and cultural research. It will also contribute to the creation of strategies to protect cultural heritage in conflict zones. The methodologies developed in this project can also be applied to other conflict areas, enhancing global efforts to safeguard cultural heritage.

Fall 2024

Taken

Type: MA (12 ECTS) Research project
Sections: Computer Science, Data Science, Digital Humanities
Supervisors: Tristan Karch
Number of students: 1

Context: 

The study of urban history is a complex and multidimensional field that involves analyzing various types of historical data, including cadastre (land registry) records. Traditionally, this process has been manual and time-consuming. However, with the advent of Large Language Models (LLMs) and their ability to process and analyze vast amounts of data, there is an opportunity to automate and enhance historical discoveries. Identifying divergences between present and past data is a critical starting point for many historical investigations, as it allows researchers to uncover patterns, transformations, and anomalies in the urban landscape over time.

Cadastre Data:

Cadastre data typically includes detailed information about land ownership, property boundaries, land use, and the value of properties. This data is crucial for understanding the historical layout and development of urban areas. Importantly, all data points in cadastre records are geolocalized, which facilitates direct comparison with today’s data from sources like OpenStreetMap.

Objective:

The primary objective of this project is to develop an automated system that leverages LLM agents to compare historical cadastre data with present-day data. The LLM agent would rely on a coding assistant, as in [1], to convert hypotheses expressed in natural language into Python programs that operate on tabular data.

Main Steps: 

  1. Data Collection: Gather historical cadastre data and current urban data from various sources.
  2. Preprocessing: Clean and preprocess the collected data to ensure compatibility and accuracy.
  3. LLM Integration: Integrate LLM agents to analyze and compare the historical and contemporary datasets.
  4. Analysis: Conduct a detailed analysis to identify significant changes and patterns in the urban landscape.
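
To make the comparison concrete, here is a minimal sketch of the kind of program such an agent might produce for the hypothesis "land use has changed at this location". It pairs each historical record with the nearest present-day record and reports differences. The record fields (`id`, `lat`, `lon`, `use`), the 25 m matching radius and the sample coordinates are all assumptions made for illustration; actual cadastre and OpenStreetMap data would need their own schema and matching rules.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    radius = 6371000.0  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius * math.asin(math.sqrt(a))


def find_divergences(historical, present, max_dist_m=25.0):
    """Match each historical record to its nearest present-day record.

    Returns (changed, unmatched): records whose land use differs from the
    matched present-day record, and records with no match within max_dist_m.
    """
    changed, unmatched = [], []
    for h in historical:
        best, best_dist = None, max_dist_m
        for p in present:
            d = haversine_m(h["lat"], h["lon"], p["lat"], p["lon"])
            if d <= best_dist:
                best, best_dist = p, d
        if best is None:
            unmatched.append(h["id"])
        elif best["use"] != h["use"]:
            changed.append((h["id"], h["use"], best["use"]))
    return changed, unmatched


# Invented records, for illustration only.
historical = [
    {"id": "H1", "lat": 45.43400, "lon": 12.33800, "use": "workshop"},
    {"id": "H2", "lat": 45.43500, "lon": 12.34000, "use": "house"},
    {"id": "H3", "lat": 45.45000, "lon": 12.30000, "use": "garden"},
]
present = [
    {"id": "P1", "lat": 45.43401, "lon": 12.33801, "use": "shop"},
    {"id": "P2", "lat": 45.43501, "lon": 12.34001, "use": "house"},
]
print(find_divergences(historical, present))
```

Nearest-point matching is the simplest possible choice; parcel polygons would call for geometric overlap instead.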

Additional Comparisons:

In addition to comparing cadastre data with present-day geolocalized data from OpenStreetMap, other comparisons can be envisioned. For instance, leveraging genealogical databases or other open registers from today can provide further insights into the socio-economic transformations and population dynamics over time.

References:

[1] Majumder, Bodhisattwa Prasad, Harshit Surana, Dhruv Agarwal, Sanchaita Hazra, Ashish Sabharwal, and Peter Clark. “Data-Driven Discovery with Large Generative Models.” arXiv, February 21, 2024. http://arxiv.org/abs/2402.13610.

Requirements:

Proficiency in data science and machine learning. Familiarity with LLMs and natural language processing. Experience with LangChain, Hugging Face or AutoGen is a plus.

Type: Master or Bachelor research project (12/8 ECTS)
Sections: Data Science
Supervisors: Pauline Conti, Maud Ehrmann 
Number of students: 1

Ideal as an optional semester project for a data science student.

Context: The Impresso project semantically enriches large collections of radio and newspaper archives by applying image and text processing techniques. A complex pipeline of data preparation and processing steps is applied to millions of content elements, creating and manipulating millions of data points.

Objective: The aim of this project is to implement a data visualisation dashboard to enable monitoring and quality control of the different data and their processing steps. Based on different sources of information, i.e. data processing manifests, inventories and statistics, the dashboard should provide an overview of what data is at what stage of the pipeline, allow a comparative view of different processing stages and support general understanding.

The solution adopted should ideally be modular and lightweight, and will ultimately be deployed online to allow everyone from the project (and perhaps more) to follow the data processing pipeline.


Steps:

  • Understanding of the different Impresso processes and data, and the associated visualisation needs
  • Detailed review of existing open-source dashboard data visualisation tools 
  • Implementation of tools, customisation to meet needs, visualisation proposals based on opportunities
  • Test/revision loop
  • Online deployment
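
Before choosing a dashboard tool, the underlying aggregation can be prototyped in a few lines. The sketch below turns a list of processing manifests into a per-title overview and computes how much of the data has reached a given stage. The manifest fields (`title`, `stage`, `n_items`) and the stage names are hypothetical and do not reflect the actual Impresso manifest format.

```python
from collections import defaultdict


def stage_overview(manifests):
    """Sum the number of processed items per newspaper title and stage."""
    table = defaultdict(lambda: defaultdict(int))
    for m in manifests:
        table[m["title"]][m["stage"]] += m["n_items"]
    return {title: dict(stages) for title, stages in table.items()}


def coverage(overview, stage, reference_stage):
    """Per title, the fraction of reference-stage items that reached `stage`."""
    result = {}
    for title, stages in overview.items():
        reference = stages.get(reference_stage, 0)
        result[title] = stages.get(stage, 0) / reference if reference else 0.0
    return result


# Invented manifests, for illustration only.
manifests = [
    {"title": "title_a", "stage": "canonical", "n_items": 1000},
    {"title": "title_a", "stage": "ner", "n_items": 750},
    {"title": "title_b", "stage": "canonical", "n_items": 400},
]
overview = stage_overview(manifests)
print(coverage(overview, "ner", "canonical"))
```

A table of this shape is what most open-source dashboard tools expect as input, so the choice of tool can stay open while the data side is worked out.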

Requirements:

Background in data science and data visualisation, basics of software engineering, good knowledge of Python, interest in data management.

Organisation of work

  • Weekly meeting with supervisor(s)
  • The student is asked to submit a detailed project plan (envisaged steps, milestones) by the end of week 2.
  • The student is advised to document his/her work in a logbook regularly and to document updates on progress, potential questions or problems in the logbook before the weekly meeting (at least 4 hours before).
  • A slack channel is used for communication outside the weekly meeting.
  • The student is advised to start his/her project report two to three weeks before the end of the project. The report is written on Overleaf using the EPFL template.

Type: Master thesis project (30 ECTS)
Sections: Computer Science, Data Science, Digital Humanities
Supervisors: Pauline Conti, Emanuela Boros, Maud Ehrmann
Number of students: 1

Context: The Impresso project comprises a dataset of 130 digitised historical newspapers, totalling approximately 7.5 million pages. About half of these newspapers were digitised with only optical character recognition (OCR), while the rest also underwent optical layout recognition (OLR), separating text zones (lines, paragraphs, margins) and organising and labelling them into logical units corresponding to the various areas of the page (articles, headlines, section heads, tables, footnotes, etc).

For newspapers lacking OLR, the text from different content units is not differentiated, which negatively impacts the performance of NLP tools (and often compromises their relevance when applied to mixed contents). Identifying the bounding regions of the various content areas on newspaper pages could help us disentangle their respective texts and allow for separate processing.

Source: Luxemburger Wort – June 1st 1950, page 6. An example of a newspaper page from the Impresso corpus, which had OLR and where the various content areas identified are visible with blue squares.

The objective of the project is to investigate the ability of Large Vision Models (LVMs) to interpret the physical layout of digitised print documents, in this case historical newspaper facsimiles. Specifically, the project aims to test and evaluate how well different models segment and label logical units on pages, such as text-paragraph, title, subtitle, table and image (either semantic or instance segmentation). The project will benefit from existing OLRed data from the Impresso corpus, which could be sampled to create a training set. The project will address, among others, the following research questions:

  • Can LVMs (multimodal, vision-only) accurately recognise the layout of historical newspaper pages, and which approach is best suited?
  • Can the identified approach be generalised to a large-scale dataset spanning around 300 years with significant variation in layout?

Main Steps: 

  • Familiarise with the Impresso project and data to understand the specific needs
  • Review literature on document layout recognition and instance segmentation with the goal of identifying most promising approaches and recent models.
  • Programmatically create a dataset based on existing OLR data, showcasing layout variety across newspaper titles and over time.
  • Explore, apply, and evaluate selected multimodal and/or large vision model(s)
  • Depending on results and progress, potentially explore post-processing to order or group regions corresponding to the same articles or contents.
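
For the evaluation step, one common way to score predicted regions against existing OLR regions is intersection over union (IoU) with a matching threshold. The sketch below is one possible setup, not a prescribed protocol: the box format `(x0, y0, x1, y1)`, the 0.5 threshold, the greedy matching and the sample regions are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def match_regions(predicted, reference, threshold=0.5):
    """Greedily match predicted (label, box) regions to reference regions.

    A prediction counts as correct when a still unmatched reference region
    with the same label overlaps it with IoU >= threshold.
    Returns (true_positives, false_positives, false_negatives).
    """
    unmatched = list(reference)
    true_pos = 0
    for label, box in predicted:
        best, best_score = None, threshold
        for ref in unmatched:
            score = iou(box, ref[1])
            if ref[0] == label and score >= best_score:
                best, best_score = ref, score
        if best is not None:
            unmatched.remove(best)
            true_pos += 1
    return true_pos, len(predicted) - true_pos, len(unmatched)


# Invented regions, for illustration only.
reference = [("title", (0, 0, 100, 20)), ("paragraph", (0, 30, 100, 200))]
predicted = [
    ("title", (0, 0, 100, 25)),
    ("paragraph", (0, 120, 100, 200)),
    ("table", (200, 0, 300, 50)),
]
print(match_regions(predicted, reference))
```

Precision and recall follow directly from the three counts, and the threshold can be varied to see how sensitive a model's score is to boundary accuracy.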

Requirements: Knowledge of machine learning (ML) and deep learning (DL), familiarity with computer vision, proficiency in Python, experience with a DL framework (preferably PyTorch), interest in historical data.

Type: MA (12 ECTS) or BA (8 ECTS) Research project
Sections: Computer Science, Data Science, Digital Humanities
Supervisors: Tristan Karch
Number of students: 1–2

Context: 

Large Language Models (LLMs) have revolutionised how we interact with and extract information from vast textual datasets. These models have been trained on extensive corpora, incorporating a broad spectrum of knowledge. However, a significant challenge arises in determining whether a given piece of text presents new information or reflects content that the model has already encountered during its training. 

Evaluating the novelty of texts is crucial for applications in historical research. As the volume of historical literature grows, particularly with the digitization of vast archives, historians face the challenge of navigating and synthesising information from these extensive datasets. 

Competent LLMs, adept at recognizing and integrating new knowledge, can significantly enhance this process. They would allow historians to uncover previously unseen patterns, connections, and insights, leading to groundbreaking historical discoveries and more robust applications in the digital humanities. 

Objective:

This project aims to develop algorithms that can effectively evaluate whether textual sources represent new pieces of knowledge that were never distilled in open-source LLMs during pre-training.

Research Questions:

  • What algorithms can be developed to assess the novelty of texts with respect to LLM training data?
  • How can these algorithms be combined with standard retrieval approaches [1] to improve retrieval in the domain of interest?

Main Steps:

  • Literature Review: Conduct a comprehensive review of existing approaches to novelty detection [2,3,4], hallucination detection [5] and knowledge evaluation [6] in the context of LLMs.
  • Problem Definition: Formally define knowledge in the context of textual data: information (content) vs. novel patterns of language (form).
  • Data Preparation: 
    • Novel data selection (sources curated by EPFL: secondary sources about Venice, EPFL theses, or new data)
    • Standard statistical analysis of data (unsupervised NLP techniques)
  • Algorithm Development: Develop algorithms for novelty detection that may include:
    • Statistical comparison methods.
    • Embedding-based similarity measures.
    • Anomaly detection techniques.
    • Token-level, phrase-level and chunk-level analysis.
  • Testing and Evaluation: Test the developed algorithms using various datasets to assess their accuracy and effectiveness in identifying novel texts. This includes:
    • Benchmarking against known LLM training data.
    • Evaluating performance across different genres and languages.
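
One token-level baseline from the literature is the Min-K% Prob score proposed in [2]: average the log-probabilities of the k% least likely tokens of a text under the model, with a low average suggesting the text was not seen during pre-training. The sketch below assumes the per-token log-probabilities have already been obtained from an open-source LLM; the numbers and the decision threshold are invented and would need calibration on real data.

```python
import math


def min_k_percent_score(token_logprobs, k=0.2):
    """Mean log-probability of the fraction k of least likely tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    n = max(1, math.ceil(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n


def looks_novel(token_logprobs, threshold, k=0.2):
    """Flag a text as likely unseen when its score falls below the threshold."""
    return min_k_percent_score(token_logprobs, k) < threshold


# Invented log-probabilities, for illustration only.
familiar = [-0.1, -0.3, -0.2, -0.5]
unfamiliar = [-0.1, -6.0, -0.2, -7.0]
print(min_k_percent_score(familiar, k=0.5))
print(min_k_percent_score(unfamiliar, k=0.5))
print(looks_novel(unfamiliar, threshold=-3.0, k=0.5))
```

Scores of this kind capture surprise at the level of form; distinguishing novel content from novel phrasing, as raised in the problem definition step, would require additional signals.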

References

[1] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks –  NeurIPS 2020.

[2] Shi, Weijia, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, and Luke Zettlemoyer. “Detecting Pretraining Data from Large Language Models.” arXiv, March 9, 2024. http://arxiv.org/abs/2310.16789.

[3] Golchin, Shahriar, and Mihai Surdeanu. “Data Contamination Quiz: A Tool to Detect and Estimate Contamination in Large Language Models.” arXiv, February 10, 2024. http://arxiv.org/abs/2311.06233.

[4] Hartmann, Valentin, Anshuman Suri, Vincent Bindschaedler, David Evans, Shruti Tople, and Robert West. “SoK: Memorization in General-Purpose Large Language Models.” arXiv, October 24, 2023. https://doi.org/10.48550/arXiv.2310.18362.

[5] Farquhar, Sebastian, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. “Detecting Hallucinations in Large Language Models Using Semantic Entropy.” Nature 630, no. 8017 (June 2024): 625–30. https://doi.org/10.1038/s41586-024-07421-0.

[6] Wang, Cunxiang, Sirui Cheng, Qipeng Guo, Yuanhao Yue, Bowen Ding, Zhikun Xu, Yidong Wang, Xiangkun Hu, Zheng Zhang, and Yue Zhang. “Evaluating Open-QA Evaluation.” arXiv, October 23, 2023. https://doi.org/10.48550/arXiv.2305.12421.

Requirements: Good programming skills, a strong interest in LLM research, experience with LLMs is a plus.

Type: MA Research project or MA thesis
Sections: Computer Science, Data Science, Digital Humanities
Supervisors: Pauline Conti, Emanuela Boros, Maud Ehrmann
Number of students: 1

Excerpt of the Swiss Press Bibliography of Fritz Blaser (categories on top, entries at the bottom)


The Impresso project applies natural language and computer vision processing techniques to enrich large collections of radio and newspaper archives and develops new ways for historians to explore and use them. Exploring, analysing and interpreting historical media sources and their enrichment is only possible with contextual information about the sources themselves (e.g. what is the political orientation of a newspaper), and the processes applied to them (what is the accuracy of the tools that produced this or that enrichment).

Information on newspapers, or metadata, already exists but can always be supplemented. In this respect, the “Swiss Press Bibliography”, published by Fritz Blaser in 1956, is a treasure trove of information on the origins and history of Swiss newspapers. This bibliography covers 483 and around a thousand periodicals published in Switzerland between 1803 and 1958 and documents them in great detail according to a given template – a database on paper.

The objective of the project is to extract the semi-structured information from Blaser’s newspaper bibliography (PDF files) and build a lightweight database (possibly in JSON only, or graph DB).

The extracted information will be used to

  • document the newspapers present in the Impresso web application;
  • support the study of the newspaper ecosystem in Switzerland at that time, e.g. by studying clusters of publications by political orientation over time, tracking publishers or editors, etc.

Steps 

  • Review tools that can be used to correct/redo the OCR of PDF files and select one;
  • Define a data model based on the information contained in the bibliography;
  • Extract and systematically store the information
  • Devise a way of assessing the quality of the extraction process.
  • If time permits, carry out an initial analysis of the database created.
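
As an illustration of the data model and quality assessment steps, the sketch below defines one possible record type, serialises it to JSON, and scores an extracted entry against a hand-checked one field by field. All field names and the sample entry are hypothetical; the actual model should be derived from the template used in the bibliography.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class NewspaperEntry:
    """One bibliography entry; the field names are illustrative assumptions."""
    title: str
    place: Optional[str] = None
    first_year: Optional[int] = None
    last_year: Optional[int] = None
    political_orientation: Optional[str] = None
    editors: List[str] = field(default_factory=list)


def to_json(entries):
    """Serialise a list of entries to a JSON string."""
    return json.dumps([asdict(e) for e in entries], ensure_ascii=False, indent=2)


def field_accuracy(extracted, gold):
    """Share of fields of a hand-checked entry that the extraction got right."""
    gold_fields, extracted_fields = asdict(gold), asdict(extracted)
    correct = sum(1 for name, value in gold_fields.items()
                  if extracted_fields[name] == value)
    return correct / len(gold_fields)


# Invented entry, for illustration only.
gold = NewspaperEntry("Beispiel-Zeitung", "Bern", 1850, 1901, "liberal", ["A. Muster"])
extracted = NewspaperEntry("Beispiel-Zeitung", "Bern", 1850, 1907, "liberal", ["A. Muster"])
print(to_json([extracted]))
print(field_accuracy(extracted, gold))
```

Averaging this score over a small hand-checked sample gives a first estimate of extraction quality and shows which fields are most error-prone.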

If taken as a Master project:

  • Additional steps:
    • Perform named entity recognition and linking on the information present in some descriptive fields
    • Conduct a first analysis of the database, e.g.
      • Map printing locations  of newspapers in Switzerland, and their evolution through time
      • Create a network of main actors (editors, publishers)
      • …and more, this is a very rich source.
  • Similar sources at the European level could be integrated.

This project will be done in collaboration with researchers from the History Department of UNIL, members of the Impresso project.

Requirements

Good knowledge of Python, basics of software and data engineering, interest in historical data. Medium to good knowledge of French or German is required.

Organisation of work

  • Weekly meeting with supervisor(s)
  • The student is asked to submit a detailed project plan (envisaged steps, milestones) by the end of week 2.
  • The student is advised to document his/her work in a logbook regularly. Updates on progress, potential questions or problems will be listed in the logbook before the weekly meeting (at least 4 hours before).
  • A Slack channel is used for communication outside the weekly meeting.
  • The student is advised to start his/her project report two to three weeks before the end of the project. The report is written on Overleaf using the EPFL template.

Type: MSc (12 ECTS), BA (8 ECTS) Research project
Sections: Computer Science, Data Science, Digital Humanities
Supervisors: Isabella di Lenardo, Raimund Schnürer 
Number of students: 1–2

Context: 

In recent years, various features have been extracted from historical maps thanks to advancements in machine learning. While the content of maps is relatively well studied, elements around the maps still deserve some attention. These elements are used, amongst others, for decoration (e.g. ornamentation, cartouches), orientation (e.g. scale bar, wind rose, north arrow), illustration (e.g. heraldic, figures, landscape scenes), and description (e.g. title, explanations, legend). The analysis of the style and arrangement of these elements will give valuable hints about the cartographer’s background.

Objective:

In this project, map layout elements shall be analysed in depth using a given dataset of 400,000 historical maps.

Main Steps:

  • Review literature about extracting map layout elements
  • Detect map layout elements in historical maps using artificial neural networks (e.g. segmentation)
  • Find similar elements between maps (e.g. by t-SNE)
  • Identify clusters among authors, between different regions and time periods
  • Visualize these connections

Research Questions:

  • How accurately can the elements be detected on historic maps?
  • Which visual properties are suited to find similarities between the elements?
  • Which connections exist between different maps?

Requirements: 

Good programming skills, familiarity with machine learning, interest in historical maps

Type: MSc research project (12 ECTS) or MSc project
Sections: Data Science, Computer Science, Digital Humanities
Supervisors: Emanuela Boros, Maud Ehrmann 
Number of students: 1

Figure: Historical Instruction Mining for (finetuning) Large Language Models (Generated with Midjourney)

Context

Historical collections present multiple challenges that stem from the quality of digitization, the deterioration of documents over time, poor-quality printing materials, or inaccurate scanning processes such as optical character recognition (OCR) or optical layout recognition (OLR). Moreover, historical collections pose another challenge because documents are distributed over a period long enough to be affected by language change. This is especially true of Western European languages, which only acquired their modern spelling standards around the 18th or 19th centuries.

At the same time, language models (LMs) have been leveraged for several years now, obtaining state-of-the-art performance in the majority of NLP tasks, generally by being fine-tuned on downstream tasks (such as entity recognition). LLMs, or instruction-following models, have taken over with relatively new capabilities for solving some of these tasks in a zero-shot manner through prompt engineering. For example, the generative pre-trained transformer (GPT) family of LLMs is a series of increasingly powerful and large-scale neural network architectures developed by OpenAI. Starting with GPT, the subsequent models, such as ChatGPT and GPT-4, have grown substantially in size. These increased sizes allow the models to capture more intricate patterns in the training data, resulting in better performance on various tasks (like acing exams). Nevertheless, they seem to fail at understanding and reasoning when handling historical documents.

This project aims to build a dataset in a semi-automatic manner to improve the application of LLMs to historical data analysis, in the context of the impresso – Media Monitoring of the Past II. Beyond Borders: Connecting Historical Newspapers and Radio project.

Research Questions

  • Is it feasible to create a dataset for training LLMs to better comprehend historical documents, using semi-automatic or automatic methods?
  • Can a specialized, resource-efficient LLM effectively process and understand noisy, historical digitized documents?

Objective

To develop an instruction-based dataset to improve LLMs’ capabilities in interpreting historical documents. The focus will be on sourcing and analyzing historical Swiss and Luxembourgish newspapers (spanning 200 years) and other historical collections in ancient Greek or Latin.

Instruction/Prompt: When was the Red Cross founded?

Example Answer: 1864

Instruction/Prompt: Given the following excerpt from a Luxembourgish newspaper from 1919, identify the main event and key figures involved. Excerpt: “En 1919, la Grande-Duchesse Charlotte est montée sur le trône du Luxembourg, succédant à sa sœur, la Grande-Duchesse Marie-Adélaïde, qui avait abdiqué en raison de controverses liées à la Première Guerre mondiale.”

Example Response: Grand Duchess Charlotte and her sister, Grand Duchess Marie-Adélaïde.

Main Steps

  1. Data Curation:
    • Collect OCR-based datasets.
    • Analyze historical newspaper articles to understand common features and challenges.
  2. Dataset Creation:
    • Decide on the type of instructions to generate and use existing language models such as T5, BART, etc., to generate instructions (or questions) from Swiss historical documents, similar to the method presented in ArchivalQA.
  3. Model Training/Fine-Tuning:
    • Train or fine-tune a language model like LLaMA on this dataset.
  4. Evaluation:
    • Assess LLMs’ performance on NLP tasks (NER, EL) using historical documents.
    • Compare models trained on the new dataset with those trained on standard datasets.
    • Employ metrics like accuracy, perplexity, F1 score.
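
To illustrate the dataset creation and evaluation steps, the sketch below writes one instruction record as a JSON line and computes entity-level precision, recall and F1 for an NER output. The record keys (`instruction`, `input`, `output`), the entity offsets and the labels are assumptions chosen for the example, not a fixed format for the project.

```python
import json


def make_record(instruction, output, context=""):
    """Build one instruction-tuning record as a JSON line (hypothetical keys)."""
    record = {"instruction": instruction, "input": context, "output": output}
    return json.dumps(record, ensure_ascii=False)


def entity_f1(predicted, gold):
    """Entity-level precision, recall and F1 over (start, end, type) tuples."""
    predicted_set, gold_set = set(predicted), set(gold)
    true_pos = len(predicted_set & gold_set)
    precision = true_pos / len(predicted_set) if predicted_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# The instruction and answer are taken from the example above.
print(make_record("When was the Red Cross founded?", "1864"))

# Invented entity annotations, for illustration only.
gold = [(3, 27, "PER"), (50, 59, "LOC")]
predicted = [(3, 27, "PER"), (50, 59, "ORG"), (70, 75, "LOC")]
print(entity_f1(predicted, gold))
```

Exact-match scoring is strict: an entity with the right span but the wrong type counts as both a false positive and a false negative, which matters on noisy OCR text where boundaries are often uncertain.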

Requirements

  • Proficiency in Python, ideally PyTorch.
  • Strong writing skills.
  • Commitment to the project.

Output

  • Potential publications in NLP and historical document processing.
  • Contribution to advancements in handling historical texts with LLMs.

Deliverables

  • A comprehensive dataset for training LLMs on historical texts.
  • A report or paper detailing the methodology, findings, and implications of the project.

Postponed

Type: MA (12 ECTS), BA (8 ECTS) Research project
Sections: Computer Science, Data Science, Digital Humanities
Supervisors: Tristan Karch
Number of students: 1–2

Context: 

Language shapes the way we take actions. We use it all the time, to plan our day, organize our work and most importantly, to think about the world. Through language, we not only communicate with others but also internally process and understand our experiences. This ability to think about the world through language is also what enables us to engage in counterfactual reasoning [1], imagining alternative scenarios and outcomes to refine and enhance our experience of the world.

There is a growing body of work that investigates the deep interactions between language and decision-making systems [2]. In this context, large language models (LLMs) are used to design autonomous agents [3,4] that achieve complex tasks involving different reasoning patterns in textual interactive environments [5,6]. Such agents are equipped with different mechanisms such as reflexive [7] and memory [8] modules to continually adapt to new sets of tasks and foster generalisation. These modules rely on efficient prompts that help agents combine their environmental trajectories with their foundational knowledge of the world to solve advanced tasks.

Objective:

The objective of this project is to design and evaluate counterfactual learning mechanisms in LLM agents evolving in textual environments.

Research Questions:

  • Can we design reflexive mechanisms that autonomously generate counterfactuals from behavioral traces of agents evolving in textual environments?
  • What is the effect of counterfactuals on exploration? Adaptation? Generalization?

Main Steps:

  • Interdisciplinary literature review (LLM agents, language and reasoning, counterfactual reasoning);
  • Get familiar with benchmarks (ScienceWorld, ALFWorld, others?);
  • Re-implement baselines: Reflexion [7] and Clin [8];
  • Design counterfactual generation;
  • Derive metrics to analyze the impact of the proposed approach.

Requirements: Good programming skills; experience working with RL and LLMs is a plus.

References:

[1] The Functional Theory of Counterfactual Thinking – K. Epstude and N. Roese, Pers Soc Psychol Rev. 2008 May;12(2):168-92. doi: 10.1177/1088868308316091. PMID: 18453477; PMCID: PMC2408534.

[2] Language and Culture Internalisation for Human-Like Autotelic AI – Cédric Colas, Tristan Karch, Clément Moulin-Frier, Pierre-Yves Oudeyer. Nature Machine Intelligence.

[3] ReAct: Synergizing Reasoning and Acting in Language Models – Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao, https://arxiv.org/abs/2210.03629

[4] Language Models are Few-Shot Butlers – Vincent Micheli, Francois Fleuret, https://arxiv.org/abs/2104.07972

[5] ALFWorld: Aligning Text and Embodied Environments for Interactive Learning – Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, Matthew Hausknecht, https://arxiv.org/abs/2010.03768

[6] ScienceWorld: Is your Agent Smarter than a 5th Grader? – Ruoyao Wang, Peter Jansen, Marc-Alexandre Côté, Prithviraj Ammanabrolu, https://arxiv.org/abs/2203.07540

[7] Reflexion: Language Agents with Verbal Reinforcement Learning – Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, Shunyu Yao, https://arxiv.org/abs/2303.11366

[8] CLIN: A Continually Learning Language Agent for Rapid Task Adaptation and Generalization – Bodhisattwa Prasad Majumder, Bhavana Dalvi Mishra, Peter Jansen, Oyvind Tafjord, Niket Tandon, Li Zhang, Chris Callison-Burch, Peter Clark. https://arxiv.org/pdf/2310.10134

Spring 2024

Available

Type: MSc Semester project
Sections: Architecture, Digital Humanities, Data Science, Computer Science
Supervisor: Beatrice Vaienti, Hamest Tamrazyan
Keywords: 3D modelling
Number of students: 1–2 (min. 12 ECTS in total)

Context

The Armenian cultural heritage in Artsakh [1] is in danger after explicit threats of irreversible destruction coordinated by Azerbaijani authorities [2]. As part of a series of actions coordinated by EPFL, the Digital Humanities Institute is currently prototyping methods for the rapid deployment of DH technology in situations of crisis. As part of these actions, a dataset of over 200 Armenian inscriptions of Artsakh systematizes and digitizes the inscriptions on the monuments of Armenian cultural heritage in Nagorno-Karabakh. Each inscription comes with essential information such as language data (diplomatic and interpretive transcriptions), an English translation, the location of the inscription on the monument (if applicable), geographical and chronological data, the type of monument, and the type of inscription. This digitized data will not only help preserve the invaluable inscriptions but can also be used for further investigation and research. The aim of this project is to create an accurate 3D model of the church, precisely locate the inscriptions within it, and contribute to preserving, studying, and promoting Armenian cultural heritage in Nagorno-Karabakh.

Figure: By Julian Nyča – Own work, CC BY-SA 3.0

Research questions

  • How can advanced imaging and 3D modelling technologies be utilized to accurately capture and represent the intricate details of the church and the inscriptions?
  • What methods can be employed to ensure the precise alignment and placement of the digitized inscriptions within the 3D model of the church?
  • How can the digital representation of the church and its inscriptions be effectively integrated with the database of Armenian inscriptions in Nagorno-Karabakh?
  • What insights can be gained from analysing the spatial distribution and arrangement of the inscriptions within the church, shedding light on the historical and cultural context in which they were created?
  • How can the integration of the 3D model and the digitized inscriptions contribute to the preservation, documentation, and study of Armenian epigraphic heritage in Nagorno-Karabakh?

Objective

This project aims to deepen the knowledge of the architectural and epigraphic significance of the church, explore innovative techniques for digitizing and visualizing cultural heritage, and contribute to the preservation and accessibility of Armenian inscriptions in Nagorno-Karabakh.

Main steps

  1. Research and Planning: Conduct thorough research on the church and its inscriptions, architectural features, and existing documentation, and develop a detailed plan for creating a 3D model and locating the inscriptions within the model.
  2. Study of the plan, section, and elevation views from a survey: Proportional and compositional study aimed at the 3D reconstruction.
  3. 3D reconstruction: A low level of detail mesh already exists, but with this project, the student will try to transfer the information from the architectural survey to a refined architectural 3D model with interiors and exteriors.
  4. Data Processing and Digitization: Process the collected data to digitally represent the church and the inscriptions.
  5. Inscription Localization: Analyse the collected data and determine the precise location of each inscription within the 3D model of the church.
  6. Data Integration: Ensure that the essential information, such as language data, translations, geographical and chronological data, monument and inscription types, bibliographic references, and photographs, are correctly linked to the localised inscription.
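As a rough illustration of the Data Integration step, the sketch below shows one way an inscription's metadata could be tied to a point in the 3D model. Every field name, the coordinate convention (model space, metres assumed), and the helper functions are assumptions made for this example; the actual database schema is not specified here.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class Inscription:
    """One inscription record linked to a location in the 3D model.

    All field names are hypothetical; the project database may differ.
    """
    inscription_id: str
    transcription: str        # diplomatic or interpretive transcription
    translation_en: str
    monument_type: str
    inscription_type: str
    position_xyz: tuple       # anchor point in model coordinates (assumed metres)
    normal_xyz: tuple         # outward normal of the surface at that point
    photos: list = field(default_factory=list)
    references: list = field(default_factory=list)


def missing_fields(record):
    """Return the names of required text fields that are empty."""
    required = ["transcription", "translation_en",
                "monument_type", "inscription_type"]
    return [name for name in required if not getattr(record, name).strip()]


def to_json(records):
    """Serialise records so they can be loaded next to the 3D model."""
    return json.dumps([asdict(r) for r in records],
                      ensure_ascii=False, indent=2)


# Placeholder record; the values are invented for illustration only.
example = Inscription(
    inscription_id="INS-001",
    transcription="(placeholder transcription)",
    translation_en="(placeholder translation)",
    monument_type="church",
    inscription_type="",
    position_xyz=(2.4, 0.0, 1.6),
    normal_xyz=(0.0, -1.0, 0.0),
)

print(missing_fields(example))
```

A check such as `missing_fields` could be run before export, so that incomplete records are flagged rather than silently attached to the model.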

Explored methods

  • Proportional analysis
  • 3D modelling using Rhino
  • 3D segmentation and annotation with the inscription
  • Exploration of visualization methodologies for this additionally embedded information

Requirements

  • Previous experience with architectural 3D modelling using Rhino.


[1] A toponym used by local Armenians to refer to the territory of Nagorno-Karabakh
[2] European Parliament resolution on the destruction of cultural heritage in Nagorno-Karabakh (2022/2582(RSP)), dated 09.03.2022.

Type: MSc Semester project (12 ECTS) or MSc project
Sections: Data Science, Computer Science, Digital Humanities
Supervisors: Frédéric Kaplan, Emanuela Boros, Maud Ehrmann 
Number of students: 1

Context

This project aims to explore the application of complex systems theory to Large Language Models (LLMs), like GPT. It will focus on understanding how concepts such as trajectory divergence, attractors, and chaotic sequences manifest in these advanced AI models. The project will use an LLM trained via reinforcement learning, providing a unique lens to examine the behavior and characteristics of these complex systems.
 
Objectives

  1. To investigate trajectory divergence in LLMs: We will study how minor variations in input (such as small changes in text) can lead to significantly different outputs, illustrating sensitivity to initial conditions.
  2. To identify attractors in LLMs: We will explore if there are recurring themes or patterns in the model’s outputs that act as attractors, regardless of varied inputs.
  3. To analyze chaotic sequences in model responses: By feeding a series of chaotic or nonlinear inputs, we aim to understand how the model’s responses demonstrate characteristics of chaotic systems.
  4. To utilize reinforcement learning in training LLMs: To observe how the introduction of reward-based training influences the development of these complex behaviors.
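The first objective can be made concrete with a very small measurement. The sketch below assumes that two generations have already been obtained from the model, one for an original prompt and one for a slightly perturbed prompt, and compares them token by token. The function names and the position-wise mismatch measure are assumptions chosen for simplicity; a real study might prefer embedding distances or edit distance.

```python
def first_divergence(a, b):
    """Index of the first position where two token sequences differ.

    Returns min(len(a), len(b)) when one sequence is a prefix of the other.
    """
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return min(len(a), len(b))


def divergence_curve(a, b):
    """Running fraction of mismatched positions up to each step.

    A crude analogue of the separation between two trajectories.
    """
    n = max(len(a), len(b))
    mismatches = 0
    curve = []
    for t in range(n):
        x = a[t] if t < len(a) else None
        y = b[t] if t < len(b) else None
        if x != y:
            mismatches += 1
        curve.append(mismatches / (t + 1))
    return curve


# Invented outputs standing in for two model generations.
base = "the river flows to the sea".split()
perturbed = "the river flows into a lake".split()

print(first_divergence(base, perturbed))
print(divergence_curve(base, perturbed))
```

Plotting such curves for many prompt pairs would give a first picture of how quickly, and how often, small input changes lead to diverging outputs.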
 
Methodology
 
  1. Data Collection and Preparation: We will generate a diverse set of input data to feed into the LLM, ensuring a range that can test for trajectory divergence and chaotic behavior.
  2. Model Training: An LLM will be trained using reinforcement learning techniques to adapt its response strategy based on predefined reward systems.
  3. Experimentation: The trained model will be subjected to various tests, including slight input modifications and chaotic input sequences, to observe the outcomes and patterns.
  4. Analysis and Visualization: Data analysis tools will be used to interpret the results, and visualization techniques will be applied to illustrate the complex dynamics observed.

Expected Outcomes

  • A deeper understanding of how complex system theories apply to LLMs.
  • Insights into the stability, variability, and predictability of LLMs.
  • Identification of potential attractor themes or patterns in LLM outputs.
  • A contribution to the broader discussion on AI behavior and its implications.

Requirements

Excellent technical skills, previous practical experience with LLMs and passion for the subject.

Type: MSc Semester project (12 ECTS) or MSc project
Sections: Data Science, Computer Science, Digital Humanities
Supervisors: Frédéric Kaplan, Emanuela Boros, Maud Ehrmann 
Number of students: 1

Context

The project aims to develop a compiler for GPT models that transcends traditional prompt engineering, enabling the creation of more structured and complex written pieces. Drawing parallels from the early days of computer programming, where efficiency and compactness in commands were crucial due to memory and space constraints, this project seeks to elevate the way we interact with and utilize Large Language Models (LLMs) for text generation.

Objectives

  1. To Create a Compiler for Enhanced Text Generation: Develop a compiler that translates user intentions into complex, structured narratives, moving beyond simple prompt responses.
  2. To Establish ‘Libraries’ for Complex Writing Projects: Similar to programming libraries, these would contain comprehensive information about characters, settings, and narrative logic, which can be loaded at the start of a writing session.
  3. To Facilitate Hierarchical Abstraction in Writing: Implement a system that allows for the creation of high-level abstractions in storytelling, akin to programming.
  4. To Enable Specialization in Narrative Elements: Support the development of specialized modules for characters, settings, narrative logic, and stylistic effects.

Methodology

  • Compiler Design: Designing a compiler capable of interpreting and translating complex narrative instructions into executable text generation tasks for LLMs like GPT.
  • Library Development: Creating a framework for users to build and store detailed narrative elements (characters, settings, etc.) that can be referenced by the compiler.
  • Abstraction Layers Implementation: Developing a system to manage and utilize different levels of narrative abstraction.
  • Integration with Various LLMs: Ensuring the compiler is adaptable to different LLMs, including OpenAI, Google, or open-source models.
  • Testing and Iteration: Conducting extensive testing to refine the compiler and its ability to handle complex writing tasks.
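To give a sense of what "compiling" a narrative might mean, the sketch below expands a high-level scene description into a flat prompt by resolving names against a library of characters and settings. The class names, the prompt layout, and the choice to fail on unresolved names (much like a linker error) are all assumptions made for illustration, not a design prescribed by the project.

```python
from dataclasses import dataclass


@dataclass
class Character:
    name: str
    traits: str


@dataclass
class Setting:
    name: str
    description: str


@dataclass
class Scene:
    goal: str          # high-level intention for the scene
    characters: list   # character names, resolved in the library
    setting: str       # setting name, resolved in the library


class Library:
    """Hypothetical store of reusable narrative elements."""

    def __init__(self):
        self.characters = {}
        self.settings = {}

    def add(self, item):
        if isinstance(item, Character):
            self.characters[item.name] = item
        elif isinstance(item, Setting):
            self.settings[item.name] = item
        else:
            raise TypeError("unsupported library item")


def compile_scene(scene, library):
    """Expand a scene into a prompt; raises KeyError on unknown names."""
    setting = library.settings[scene.setting]
    lines = [f"Setting: {setting.name}. {setting.description}"]
    for name in scene.characters:
        character = library.characters[name]
        lines.append(f"Character: {character.name}. {character.traits}")
    lines.append(f"Write a scene in which {scene.goal}")
    return "\n".join(lines)


lib = Library()
lib.add(Character("Ada", "A cautious harbour pilot."))
lib.add(Setting("Harbour", "A fog-bound port at dawn."))
scene = Scene("Ada refuses to guide a ship.", ["Ada"], "Harbour")
prompt = compile_scene(scene, lib)
print(prompt)
```

The resulting string could be sent to any LLM backend, which keeps the compiler independent of a particular provider.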

Expected Outcomes

  • A tool that allows for the creation of detailed and structured written works using LLMs.
  • A new approach to text generation that mirrors the evolution and specialization seen in computer programming.
  • Contributions to the field of AI-driven creative writing, enabling more complex and nuanced storytelling.
 
Requirements

Excellent technical skills, previous practical experience with LLMs and passion for the subject. 
  • guidance is a programming paradigm that offers superior control and efficiency compared to conventional prompting and chaining.

Type: MSc Semester project (12 ECTS) or MSc project
Sections: Data Science, Computer Science, Digital Humanities
Supervisors: Emanuela Boros, Maud Ehrmann 
Number of students: 1

Figure: Historical Instruction Mining for (finetuning) Large Language Models (Generated with Midjourney)

Context

Historical collections present multiple challenges, stemming from the quality of digitization, the deterioration of documents over time, poor-quality printing materials, and inaccurate automated processing such as optical character recognition (OCR) or optical layout recognition (OLR). Moreover, historical collections pose a further challenge because their documents span a period long enough to be affected by language change. This is especially true of Western European languages, which acquired their modern spelling standards only around the 18th or 19th centuries. At the same time, language models (LMs) have been leveraged for several years now, obtaining state-of-the-art performance on the majority of NLP tasks, generally by being fine-tuned on downstream tasks (such as entity recognition). LLMs, or instruction-following models, have since taken over, with relatively new capabilities for solving some of these tasks in a zero-shot manner through prompt engineering. For example, the generative pre-trained transformer (GPT) family of LLMs is a series of increasingly powerful, large-scale neural network architectures developed by OpenAI. Starting with GPT, subsequent models such as ChatGPT and GPT-4 have grown substantially in size. This increased size allows the models to capture more intricate patterns in the training data, resulting in better performance on various tasks (like acing exams). Nevertheless, they seem to fail at understanding and reasoning when handling historical documents. This project aims at building a dataset in a semi-automatic manner to improve the application of LLMs to historical data analysis, in the context of the impresso – Media Monitoring of the Past II. Beyond Borders: Connecting Historical Newspapers and Radio project.

Research Questions

  • Is it feasible to create a dataset for training LLMs to better comprehend historical documents, using semi-automatic or automatic methods?
  • Can a specialized, resource-efficient LLM effectively process and understand noisy, historical digitized documents?

Objective

To develop an instruction-based dataset to improve LLMs’ capabilities in interpreting historical documents. The focus will be on sourcing and analyzing historical Swiss and Luxembourgish newspapers (spanning 200 years) and other historical collections in ancient Greek or Latin.

Instruction/Prompt: When was the Red Cross founded?

Example Answer: 1864

Instruction / Prompt: Given the following excerpt from a Luxembourgish newspaper from 1919, identify the main event and key figures involved. Excerpt: “En 1919, la Grande-Duchesse Charlotte est montée sur le trône du Luxembourg, succédant à sa sœur, la Grande-Duchesse Marie-Adélaïde, qui avait abdiqué en raison de controverses liées à la Première Guerre mondiale.”

Example Response: Grand Duchess Charlotte and her sister, Grand Duchess Marie-Adélaïde.
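For illustration, the two fictive examples above could be stored as instruction records in JSON Lines form, a layout commonly used for instruction tuning. The field names (`instruction`, `response`, `metadata`) and the helper functions are assumptions for this sketch; the project does not prescribe a format.

```python
import json


def make_record(instruction, response, source=None, year=None, language=None):
    """Build one instruction record; metadata keys are hypothetical."""
    record = {
        "instruction": instruction.strip(),
        "response": response.strip(),
    }
    candidates = {"source": source, "year": year, "language": language}
    meta = {key: value for key, value in candidates.items()
            if value is not None}
    if meta:
        record["metadata"] = meta
    return record


def to_jsonl(records):
    """One JSON object per line; non-ASCII characters are kept readable."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


records = [
    make_record("When was the Red Cross founded?", "1864"),
    make_record(
        "Given the following excerpt from a Luxembourgish newspaper from "
        "1919, identify the main event and key figures involved.",
        "Grand Duchess Charlotte and her sister, "
        "Grand Duchess Marie-Adélaïde.",
        year=1919,
        language="fr",
    ),
]

jsonl = to_jsonl(records)
print(jsonl)
```

Keeping the year and language with each record would make it possible to analyse model behaviour by period or by language later on.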

Main Steps

  1. Data Curation:
    • Collect OCR-based datasets.
    • Analyze historical newspaper articles to understand common features and challenges.
  2. Dataset Creation:
    • Decide on what type of instruction should be generated and utilise other existing LLMs such as T5, BART, etc., to generate instructions (or questions) from Swiss historical documents, similar to the method presented in ArchivalQA.
  3. Model Training/Fine-Tuning:
    • Train or fine-tune a language model like LLaMA on this dataset.
  4. Evaluation:
    • Assess LLMs’ performance on NLP tasks (NER, EL) using historical documents.
    • Compare models trained on the new dataset with those trained on standard datasets.
    • Employ metrics like accuracy, perplexity, F1 score.
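As a reminder of how the F1 score mentioned in the evaluation step is computed for a task such as NER, here is a minimal sketch. It assumes strict matching on (surface form, type) pairs, which is a simplification; evaluations on historical text often also use span offsets and fuzzy matching. The example entities are invented.

```python
def precision_recall_f1(gold, predicted):
    """Strict-match precision, recall and F1 over sets of entities."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented annotations for a single sentence.
gold = {("Charlotte", "PER"), ("Marie-Adélaïde", "PER"), ("Luxembourg", "LOC")}
predicted = {("Charlotte", "PER"), ("Luxembourg", "LOC"),
             ("Grande-Duchesse", "PER")}

p, r, f1 = precision_recall_f1(gold, predicted)
print(p, r, f1)
```

Here two of the three predictions are correct and two of the three gold entities are found, so precision, recall, and F1 all equal 2/3.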

Requirements

  • Proficiency in Python, ideally PyTorch.
  • Strong writing skills.
  • Commitment to the project.

Output

  • Potential publications in NLP and historical document processing.
  • Contribution to advancements in handling historical texts with LLMs.

Deliverables

  • A comprehensive dataset for training LLMs on historical texts.
  • A report or paper detailing the methodology, findings, and implications of the project.

References

Optical character recognition (OCR) or optical character reader is the electronic or mechanical conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene photo or from subtitle text superimposed on an image.

Taken

Type: MSc Semester project (12 ECTS) or MSc project
Sections: Data Science, Computer Science, Digital Humanities
Supervisors: Frédéric Kaplan, Emanuela Boros, Maud Ehrmann 
Number of students: 1

Context

This project aims to design and develop an innovative interface for iterative text composition, leveraging the capabilities of Large Language Models (LLMs) like GPT. The interface will enable users to collaboratively compose texts with the LLM, providing control and flexibility in the creative process.

Objectives

  1. To create a user-friendly interface for text composition: The interface should allow users to input, modify, and refine text generated by the LLM.
  2. To enable iterative interaction: Users should be able to interact iteratively with the LLM, adjusting and fine-tuning the generated text according to their needs and preferences.
  3. To incorporate customization options: The system should offer options to tailor the style, tone, and thematic elements of the generated text.

Methodology

  • Interface Design: Designing a user-friendly interface that allows for easy input and manipulation of text generated by the LLM.
  • LLM Integration: Integrating a LLM into the interface to generate text based on user inputs and interactions.
  • Customization and Control Features: Implementing features that allow users to customize the style and tone of the text and maintain control over the content.
  • User Testing and Feedback: Conducting user testing sessions to gather feedback and refine the interface and its functionalities.
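The iterative interaction described above can be pictured as a session that keeps a history of drafts, whether produced by the model or edited by hand. In the sketch below, the class name, the prompt layout, and the `echo_model` stand-in are all assumptions; a real interface would replace `echo_model` with a call to an actual LLM.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class CompositionSession:
    """Hypothetical session state for human and LLM co-writing."""
    generate: Callable[[str], str]   # stand-in for an LLM call
    style: str = "neutral"
    history: List[str] = field(default_factory=list)

    @property
    def current(self):
        """Latest draft, or an empty string before any draft exists."""
        return self.history[-1] if self.history else ""

    def propose(self, instruction):
        """Ask the model for a new draft based on the current one."""
        prompt = (f"[style: {self.style}]\n"
                  f"Current draft:\n{self.current}\n"
                  f"Instruction: {instruction}")
        draft = self.generate(prompt)
        self.history.append(draft)
        return draft

    def edit(self, text):
        """Record a manual user edit as a new version."""
        self.history.append(text)

    def undo(self):
        """Drop the latest version and return the previous one."""
        if self.history:
            self.history.pop()
        return self.current


def echo_model(prompt):
    # Toy model: returns the last prompt line, so the flow is testable.
    return "DRAFT for: " + prompt.splitlines()[-1]


session = CompositionSession(generate=echo_model, style="formal")
first = session.propose("open with the harbour")
session.edit("My own sentence.")
print(session.current)
```

Storing every version makes undo trivial and also yields a log of interactions that could feed the user-testing analysis.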

Expected Outcomes

  • A functional interface that allows for collaborative text composition with a LLM.
  • Enhanced user experience in text creation, providing a blend of AI-generated content and human creativity.
  • Insights into how users interact with AI in creative processes.
 
Requirements

Excellent technical skills, previous practical experience with LLMs and passion for the subject. 
  • guidance is a programming paradigm that offers superior control and efficiency compared to conventional prompting and chaining.

Type: BA (8 ECTS) Semester project, MSc (12 ECTS)
Sections: Digital Humanities, Data Science, Computer Science
Supervisors: Hamest Tamrazyan, Emanuela Boros
Number of students: 1–2

Context

This project aims to explore, retrieve, and analyse data from the Digital Laboratory of Ukraine, focusing on a select collection of approximately ten books related to epigraphy and cultural heritage. The primary objective is to gain insights into Ukraine’s epigraphic and cultural heritage through detailed data analysis, term extraction, and database management.

Objectives and Main steps

  1. Data Retrieval: Collect and aggregate data from the Digital Laboratory of Ukraine, specifically targeting books and resources about epigraphy and cultural heritage.
  2. Data Cleaning and Formatting: Implement data preprocessing techniques to ensure data quality. This includes removing irrelevant or corrupt data, handling missing values, and standardizing formats.
  3. Database Setup: Design and implement a database to store and manage retrieved data efficiently. The database should allow easy access and manipulation of the data for analysis.
  4. Term Extraction and Analysis: Employ natural language processing (NLP) techniques to extract key terms, concepts, and thematic elements from the texts. This will help us understand the predominant themes and patterns in Ukrainian epigraphy and cultural heritage.
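As one possible starting point for the term extraction step, the sketch below ranks the terms of each document by TF-IDF using only the Python standard library. The tokenizer, the scoring variant, and the English placeholder texts are assumptions; real Ukrainian sources would likely need lemmatization and stop-word handling, which are omitted here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; \\w is Unicode-aware, so Cyrillic works."""
    return re.findall(r"\w+", text.lower())


def tfidf_top_terms(docs, k=3):
    """Return the k highest-scoring terms for each document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    results = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        }
        # Sort by descending score, then alphabetically for stable ties.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        results.append(ranked[:k])
    return results


# Placeholder texts, not taken from the actual collection.
docs = [
    "inscription on the church wall",
    "graffiti inscription in the cathedral",
    "the cathedral wall painting",
]

print(tfidf_top_terms(docs, k=2))
```

Terms that occur in every document, such as "the" here, receive a score of zero and fall to the bottom of the ranking.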

Requirements

  • Proficiency in Python, knowledge of NLP techniques.

Significance

This project will contribute to the understanding of Ukraine’s rich cultural heritage. It will provide valuable digital resources for future academic and cultural research in this field.

Type: MSc Semester project (12 ECTS)
Sections: Data Science, Computer Science, Digital Humanities
Supervisor: Alexander Rusnak
Number of students: 1-3

Figure: Illustration of the SALMON training pipeline.

Context

All language models are taught some form of ethical system, whether implicitly through the curation of the dataset or explicitly through a training and prompting scheme. One such ethical guidance framework is the Constitutional AI system proposed by Anthropic. This approach prompts a language model to revise its own responses relative to a set of values; the revised responses are then used to retrain the language model via supervised fine-tuning and reinforcement learning guided by a preference model, as in the standard RLHF setup. This approach has shown very strong results in improving both the ‘harmlessness’ (i.e. ethical behaviour) and the ‘helpfulness’ of language models. However, the deontological ethical system it relies on has some key drawbacks.

Objective

This project will attempt to encode a virtue ethics framework into the model, both in the selection of the values by which the responses are revised and in the architectural structure itself. Virtue ethics focuses on three types of evaluation: the ethicality of the action itself, the motivation behind the action, and the utility of the action in promoting a virtuous character in the agent. To this end, the student will implement a separate preference model for each of these three avenues of moral evaluation; these models will then be used for RL training of an LLM assistant. This should result in a model that improves not only in harmlessness and helpfulness, but also in explainability.

Main Steps

  1. Curate a dataset of adversarial prompts and ethics-oriented prompts to be used for training.
  2. Implement a reinforcement learning from AI feedback (RLAIF) training structure following Anthropic’s Constitutional AI (as used for Claude) or IBM’s SALMON.
  3. Create a custom prompting pipeline for the virtuous action preference model, motivational explanation preference model, and virtue formation preference model.
  4. Train the chatbot using each of the preference models separately and finally combined, and measure their comparative performance on difficult ethical questions.
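The last step mentions training with the three preference models combined. One simple way to combine them, shown below purely as an assumption, is a weighted mean of the three scores, with the per-component scores kept available for explanation. The class and function names are hypothetical, and the scores are invented numbers rather than outputs of real preference models.

```python
from dataclasses import dataclass


@dataclass
class VirtueScores:
    """Scores in [0, 1] from three hypothetical preference models."""
    action: float       # ethicality of the action itself
    motivation: float   # quality of the motivation behind the action
    formation: float    # contribution to a virtuous character


def combined_reward(scores, weights=(1.0, 1.0, 1.0)):
    """Weighted mean of the three scores, used as a scalar RL reward."""
    w_action, w_motivation, w_formation = weights
    total = w_action + w_motivation + w_formation
    weighted = (w_action * scores.action
                + w_motivation * scores.motivation
                + w_formation * scores.formation)
    return weighted / total


def weakest_to_strongest(scores):
    """Component names ordered by score, a basic explanation aid."""
    parts = {
        "action": scores.action,
        "motivation": scores.motivation,
        "formation": scores.formation,
    }
    return sorted(parts, key=parts.get)


example = VirtueScores(action=0.9, motivation=0.3, formation=0.6)
print(combined_reward(example))
print(weakest_to_strongest(example))
```

Setting a weight to zero reproduces the single-model training runs, which makes the separate and combined conditions directly comparable.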

Requirements

  • Knowledge of machine learning and deep learning principles, familiarity with language models, proficiency in Python and PyTorch, and interest in ethics and philosophy.

Fall 2023

Available

Type: MSc Semester project
Sections: Architecture, Digital Humanities, Data Science, Computer Science
Supervisors: Beatrice Vaienti, Hamest Tamrazyan
Keywords: 3D modelling

Number of students: 1–2 (min. 12 ECTS in total)

Context: The Armenian cultural heritage in Artsakh [1] is in danger after explicit threats of irreversible destruction coordinated by Azerbaijani authorities [2]. As part of a series of actions coordinated by EPFL, the Digital Humanities Institute is currently prototyping methods for the rapid deployment of DH technology in situations of crisis. As part of these actions, over 200 Armenian inscriptions of Artsakh have been systematized and digitized, together with essential information: language data (including diplomatic and interpretive transcriptions), the English translation, the location of the inscription on the monument (if applicable), geographical and chronological data, the type of monument, and the type of inscription. This digitized data will not only help preserve these invaluable inscriptions but can also support further investigation and research. The aim of this project is to create an accurate 3D model of the church, precisely locate the inscriptions within it, and contribute to preserving, studying, and promoting Armenian cultural heritage in Nagorno-Karabakh.

Figure: By Julian Nyča – Own work, CC BY-SA 3.0

Research questions:

  • How can advanced imaging and 3D modelling technologies be utilized to accurately capture and represent the intricate details of the church and the inscriptions?
  • What methods can be employed to ensure the precise alignment and placement of the digitized inscriptions within the 3D model of the church?
  • How can the digital representation of the church and its inscriptions be effectively integrated with the database of Armenian inscriptions in Nagorno-Karabakh?
  • What insights can be gained from analysing the spatial distribution and arrangement of the inscriptions within the church, shedding light on the historical and cultural context in which they were created?
  • How can the integration of the 3D model and the digitized inscriptions contribute to the preservation, documentation, and study of Armenian epigraphic heritage in Nagorno-Karabakh?

Objectives: This project aims to deepen the knowledge of the architectural and epigraphic significance of the church, explore innovative techniques for digitizing and visualizing cultural heritage, and contribute to the preservation and accessibility of Armenian inscriptions in Nagorno-Karabakh.

Main steps:

  • Research and Planning: Conduct thorough research on the church and its inscriptions, architectural features, and existing documentation, and develop a detailed plan for creating a 3D model and locating the inscriptions within the model.
  • Study of the plan, section, and elevation views from a survey: Proportional and compositional study aimed at the 3D reconstruction.
  • 3D reconstruction: A low level of detail mesh already exists, but with this project, the student will try to transfer the information from the architectural survey to a refined architectural 3D model with interiors and exteriors.
  • Data Processing and Digitization: Process the collected data to digitally represent the church and the inscriptions.
  • Inscription Localization: Analyse the collected data and determine the precise location of each inscription within the 3D model of the church.
  • Data Integration: Ensure that the essential information, such as language data, translations, geographical and chronological data, monument and inscription types, bibliographic references, and photographs, are correctly linked to the localized inscription.

Explored methods:

  • Proportional analysis
  • 3D modelling using Rhino
  • 3D segmentation and annotation with the inscription
  • Exploration of visualization methodologies for this additionally embedded information

Requirements: previous experience with architectural 3D modelling using Rhino.


[1] A toponym used by local Armenians to refer to the territory of Nagorno-Karabakh
[2] European Parliament resolution on the destruction of cultural heritage in Nagorno-Karabakh (2022/2582(RSP)), dated 09.03.2022.

Type: MSc Semester project (12 ECTS) or MSc project
Sections: Data Science, Computer Science, Digital Humanities
Supervisors: Emanuela Boros, Maud Ehrmann
Number of students: 1–2

Figure: Historical Instruction Mining for (finetuning) Large Language Models (Generated with Midjourney)

Context: Historical collections present multiple challenges, stemming from the quality of digitization, the deterioration of documents over time, poor-quality printing materials, and inaccurate automated processing such as optical character recognition (OCR) or optical layout recognition (OLR). Moreover, historical collections pose a further challenge because their documents span a period long enough to be affected by language change. This is especially true of Western European languages, which acquired their modern spelling standards only around the 18th or 19th centuries. At the same time, language models (LMs) have been leveraged for several years now, obtaining state-of-the-art performance on the majority of NLP tasks, generally by being fine-tuned on downstream tasks (such as entity recognition). LLMs, or instruction-following models, have since taken over, with relatively new capabilities for solving some of these tasks in a zero-shot manner through prompt engineering. For example, the generative pre-trained transformer (GPT) family of LLMs is a series of increasingly powerful, large-scale neural network architectures developed by OpenAI. Starting with GPT, subsequent models such as ChatGPT and GPT-4 have grown substantially in size. This increased size allows the models to capture more intricate patterns in the training data, resulting in better performance on various tasks (like acing exams). Nevertheless, they seem to fail at understanding and reasoning when handling historical documents. This project aims at building a dataset in a semi-automatic manner to improve the application of LLMs to historical data analysis.

Research Questions:

  • Can we create a dataset in a (semi-automatic/automatic) manner for training an LLM to better understand historical documents?
  • Can a specialized, resource-efficient LLM effectively process noisy historical digitised documents?

Objective: The objective of this project is to develop an instruction-based dataset to enhance the ability of LLMs to understand and interpret historical documents. This will involve sourcing historical Swiss and Luxembourgish newspapers spanning 200 years, as well as other historical collections such as those in ancient Greek or Latin. Two fictive examples:

Instruction/Prompt: When was the Red Cross founded?

Example Answer: 1864

Instruction / Prompt: Given the following excerpt from a Luxembourgish newspaper from 1919, identify the main event and key figures involved. Excerpt: “En 1919, la Grande-Duchesse Charlotte est montée sur le trône du Luxembourg, succédant à sa sœur, la Grande-Duchesse Marie-Adélaïde, qui avait abdiqué en raison de controverses liées à la Première Guerre mondiale.”

Example Response: Grand Duchess Charlotte and her sister, Grand Duchess Marie-Adélaïde.

Main Steps:

  • Data Curation: Gather datasets based on OCR level, and familiarize with the corpus by exploring historical newspaper articles. Identify common features of historical documents and potential difficulties.
  • Dataset Creation: Decide on what type of instruction should be generated and utilise other existing LLMs such as T5, BART, etc., to generate instructions (or questions) from Swiss historical documents.
  • (Additional) Train or finetune a LLaMA language model based on this dataset.
  • (Additional) Evaluation: Utilise the newly created dataset to evaluate existing LLMs and assess their performance on various NLP tasks such as named entity recognition (NER) and entity linking (EL) in historical documents. Use standard metrics such as accuracy, perplexity, or F1 score for the evaluation. Compare the performance of models trained with the new dataset against those trained with standard datasets to ascertain the effectiveness of the new dataset.

Requirements: Proficiency in Python, preferably PyTorch, excellent writing skills, and dedication to the project.

Outputs: The project’s results could potentially lead to publications in relevant research areas and would contribute to the field of historical document processing.

Type: MSc Semester project (12 ECTS) or MSc project
Sections: Data Science, Computer Science, Digital Humanities
Supervisors: Emanuela Boros, Maud Ehrmann
Number of students: 1-2

Figure: Forecasting News (Generated with Midjourney)

Context: The rapid evolution and widespread adoption of digital media have led to an explosion of news content. News organizations are continuously publishing articles, making it increasingly challenging to keep track of daily developments and their implications. To navigate this overwhelming amount of information, computational methods capable of processing and making predictions about this content are of significant interest. With the advent of advanced machine learning techniques, such as Generative Adversarial Networks (GANs) and Large Language Models (LLMs), it is possible to forecast future content based on existing articles. This project proposes to leverage the strengths of both GANs and LLMs to predict the content of next-day articles based on current-day news. This approach will not only allow for a better understanding of how events evolve over time, but could also serve as a tool for news agencies to anticipate and prepare for future news developments.

Research Questions:

  1. Can we design a system that effectively leverages GANs and LLMs to predict next-day article content based on current-day articles?
  2. How accurate are the generated articles when compared to the actual articles of the next day (how close to reality are they)?
  3. What are the limits and potential biases of such a system and how can they be mitigated?

Objective: The objective of this project is to design and implement a system that uses a combination of GANs and LLMs to predict the content of next-day news articles. This will be measured by the quality, coherence, and accuracy of the generated articles compared to the actual articles from the following day.

Main Steps:

  • Dataset Acquisition: Procure a dataset consisting of sequential daily articles from multiple sources.
  • Data Preprocessing: Clean and preprocess the data for training. This involves text normalization, tokenization, and the creation of appropriate training pairs.
  • Generator Network Design: Leverage an LLM as the generator network in the GAN. This network will generate the next-day article based on the input from the current-day article.
  • Discriminator Network Design: Build a discriminator network capable of distinguishing between the actual next-day article and the generated article.
  • GAN Training: Train the GAN system by alternating between training the discriminator to distinguish real vs generated articles, and training the generator to fool the discriminator.
  • Evaluation: Assess the generated articles based on measures of text coherence, relevance, and similarity to the actual next-day articles.
  • Bias and Limitations: Examine and discuss the potential limitations and biases of the system, proposing ways to address these issues.
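The "creation of appropriate training pairs" and the similarity-based evaluation can be sketched without any model at all. The code below pairs each article with the article from the same source on the following day and scores two texts by token-level Jaccard overlap. Pairing within a source, assuming one article per source per day, and using Jaccard overlap as the similarity measure are all simplifying assumptions; the articles are invented.

```python
from datetime import date, timedelta


def next_day_pairs(articles):
    """Pair each article with the same source's article on the next day.

    `articles` is a list of (date, source, text) tuples; one article per
    source per day is assumed for simplicity.
    """
    by_key = {(day, source): text for day, source, text in articles}
    pairs = []
    for (day, source), text in sorted(by_key.items()):
        following = by_key.get((day + timedelta(days=1), source))
        if following is not None:
            pairs.append((text, following))
    return pairs


def jaccard(a, b):
    """Token-set overlap between two texts, in [0, 1]."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Invented headlines standing in for full articles.
articles = [
    (date(2023, 5, 1), "A", "storm hits coast"),
    (date(2023, 5, 2), "A", "storm damage assessed"),
    (date(2023, 5, 4), "A", "markets calm"),
    (date(2023, 5, 1), "B", "election called"),
]

pairs = next_day_pairs(articles)
print(pairs)
print(jaccard(*pairs[0]))
```

In a full system, the first element of each pair would be the generator's input and the second the real article shown to the discriminator; lexical overlap alone is a weak proxy for quality and would need to be complemented by other measures.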

Master Project Additions:

If the project is taken as a master project, the student will further:

  • Refine the Model: Apply advanced training techniques, and perform a detailed hyperparameter search to optimize the GAN’s performance.
  • Multi-Source Integration: Extend the model to handle and reconcile articles from multiple sources, aiming to generate a more comprehensive next-day article.
  • Long-Term Predictions: Investigate the model’s capabilities and limitations in making longer-term predictions, such as a week or a month in advance.

Requirements: Knowledge of machine learning and deep learning principles, familiarity with GANs and LLMs, proficiency in Python, and experience with a deep learning framework, preferably PyTorch.

Taken

Type: BA (8 ECTS) Semester project
Sections: Data Science, Computer Science, Digital Humanities
Supervisors: Emanuela Boros
Number of students: 1–2

Keywords: Document processing, Natural Language Processing (NLP), Information Extraction, Machine Learning, Deep Learning

Figure: Assessing Climate Change Perceptions and Behaviours in Historical Newspapers (Generated with Midjourney)

With emissions in line with current Paris Agreement commitments, global warming is projected to exceed 1.5°C above pre-industrial levels, even if these commitments are complemented by very challenging increases in the scale and ambition of mitigation after 2030. Although this increase may appear slight, the consequences of global warming are already observable today, with the number and intensity of certain natural hazards continuing to increase (e.g., extreme weather events, floods, forest fires). Near-term warming and increased frequency, severity, and duration of extreme events will place many terrestrial, freshwater, coastal and marine ecosystems at high or very high risks of biodiversity loss. Exploring historical documents can help to address gaps in our understanding of the historical roots of climate change, and possibly uncover evidence of early efforts to address environmental issues, as well as explore how environmentalism has evolved over time. This project aims to fill gaps in our understanding by examining a corpus of historical Swiss and Luxembourgish newspapers spanning 200 years (i.e., the impresso project’s corpus).

Research Questions:

  • How have perceptions of climate change evolved over time, as seen in historical newspapers?
  • What behavioural trends towards climate change can be identified from historical newspapers?
  • Can we track the frequency and intensity of extreme weather events over time based on historical documents?
  • Can we identify any patterns or trends in early efforts to address environmental issues?
  • How has the sentiment towards climate change and environmentalism evolved over time?

Objective: This work explores several NLP techniques (text classification, information extraction, etc.) for providing a comprehensive understanding of the evolution and reporting of extreme weather events in historical documents.

Main Steps:

  • Data Preparation: Identify relevant keywords and phrases related to climate change and environmentalism (e.g. “global warming”, “carbon emissions”, “climate policy”) and to extreme weather (e.g. “hurricane”, “flood”, “drought”, “heat wave”). Compile a training dataset of articles that contain these keywords.
  • Data Analysis: Analyse the data and identify patterns in climate change perceptions and behaviours over time. This includes the identification of changes in the frequency of climate-related terms, changes in sentiment towards climate change, changes in the topics discussed in relation to climate change, the detection of mentions of locations, persons, or events, and the extraction of important keywords in weather forecasting news.
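
As a first look at "changes in the frequency of climate-related terms", the share of articles per year that mention at least one keyword can be computed with a few lines of Python. This is a minimal sketch: the `(year, text)` input format and the function name are assumptions, and the English keywords are taken from the list above purely for illustration (the corpus itself is in French, German and Luxembourgish).

```python
import re
from collections import defaultdict

# Illustrative keywords from the project description; a real list would be
# multilingual and much longer.
KEYWORDS = ["flood", "drought", "heat wave"]


def keyword_rate_by_year(articles, keywords=KEYWORDS):
    """articles: iterable of (year, text) pairs.

    Returns {year: share of that year's articles mentioning any keyword}.
    """
    # Anchor only the start of each keyword so inflected forms such as
    # "floods" or "flooded" also match.
    pattern = re.compile(
        "|".join(r"\b" + re.escape(k) for k in keywords), re.IGNORECASE
    )
    total = defaultdict(int)
    hits = defaultdict(int)
    for year, text in articles:
        total[year] += 1
        if pattern.search(text):
            hits[year] += 1
    return {year: hits[year] / total[year] for year in sorted(total)}
```

Normalising by the number of articles per year matters here, since the volume of digitised material varies strongly over two centuries.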

Requirements: Candidates should have a background in machine learning, data engineering, and data science, with proficiency in NLP techniques such as Named Entity Recognition, Topic Detection, or Sentiment Analysis. A keen interest in climate change, history, and media studies is also beneficial.

Resources:

  1. Analyzing the impact of climate change on critical infrastructure from the scientific literature: A weakly supervised NLP approach
  2. Climate of scepticism: US newspaper coverage of the science of climate change
  3. We provide an nlp-beginner-starter Jupyter notebook.

Type: MSc (12 ECTS) or BA (8 ECTS) Semester project
Sections: Computer Science, Data Science, Digital Humanities
Supervisors: Maud Ehrmann, Emanuela Boros
Number of students: 1–2

Figure: Exploring Large Vision-Language Pre-trained Models for Historical Images Classification and Captioning (Generated with Midjourney)

Context: The impresso project features a dataset of around 90 digitised historical newspapers containing approximately 3 million images. These images have no labels, and only 10% of them have a caption, two aspects that hinder their retrieval.

Two previous student projects focused on the automatic classification of these images, trying to identify 1. the type of image (e.g. photograph, illustration, drawing, graphic, cartoon, map), and 2. in the case of maps, the country or region of the world represented on that map. Good performances on image type classification were achieved by fine-tuning the VGG-16 pre-trained model (see report).

Objective: On the basis of these initial experiments and the annotated dataset compiled on this occasion, the present project will explore recent large-scale language-vision pre-trained models. Specifically, the project will attempt to: 

  1. Evaluate zero-shot image type classification of the available dataset using the CLIP and LLaMA-Adapter-V2 Multi-modal models, and compare with previous performances;
  2. Explore and evaluate image captioning of the same dataset, including trying to identify countries or regions of the world. This part will require adding caption information on the test part of the dataset. In addition to the fluency and accuracy of the generated captions, a specific aspect that might be taken into account is distinctiveness, i.e. whether the image contains details that differentiate it from similar images.
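
To make the zero-shot setting concrete: with a CLIP-style model, an image is assigned the label whose text embedding is most similar to the image embedding, without any training on the target classes. The sketch below shows only that decision rule and an accuracy computation. The two-dimensional vectors in the usage example are made up; in practice the embeddings would come from the model's image and text encoders, and all function names here are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def zero_shot_label(image_embedding, label_embeddings):
    """Return the label whose text embedding is closest to the image embedding."""
    return max(
        label_embeddings,
        key=lambda label: cosine(image_embedding, label_embeddings[label]),
    )


def accuracy(predictions, gold):
    """Share of predictions that match the annotated labels."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Toy usage with invented embeddings (not real model outputs).
labels = {"map": [1.0, 0.0], "photograph": [0.0, 1.0]}
predicted = zero_shot_label([0.9, 0.1], labels)
```

The same accuracy function can be applied to the predictions of the earlier fine-tuned model, which makes the comparison with previous results direct.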

Given the recency of the models, further avenues of exploration may emerge during the course of the project, which is exploratory in nature.

Requirements: Knowledge of machine learning and deep learning principles, familiarity with computer vision, proficiency in Python, experience with a deep learning framework (preferably PyTorch), and interest in historical data.


Spring 2023

Available

Type: MSc Semester project or Master project
Sections: Computer Science, Data Science, Digital humanities, Architecture
Supervisor: Frédéric Kaplan (CDH DHLAB) and Katrin Beyer (ENAC EESD)
Keywords: Historical architectural drawing, Automatic segmentation, Information Extraction. 
 
The goal of this Master or Semester project is to extract information from historical architectural drawings. It is part of a larger project whose goal is to develop a data acquisition and post-processing pipeline for deriving the exterior and interior geometry of historical buildings in terms of 3D point clouds. While images used for photogrammetric modelling contain a lot of information, they do not contain all the geometric information about a structure that is relevant for architectural and structural engineering applications. For historical stone masonry buildings, examples are the embedment length of floor beams in walls, or floor beam orientation, beam size and spacing in the case of suspended ceilings. Such information can sometimes be found in historical architectural drawings. Comparing the as-built model to historical architectural drawings can also point to modifications to the structure and therefore serve as input for 4D geometric representations of the model. Furthermore, floor plans can serve as input when planning the data acquisition of interior spaces. For these reasons, we will develop approaches for automated reading of these features from historical architectural drawings. 
 
To extract information from historical architectural drawings, we will build on methods for automated reading of modern construction floorplans and on the methods developed by the DHLAB for automated vectorisation of historical cadastral maps to develop methods for historical floorplans and historical sections. The goal is to extract information on the floor plan and floor beam geometry, orientation, spacing and embedment length. For this purpose, we will retrieve and where necessary digitize historical architectural drawings of stone masonry buildings with timber floors in Swiss cultural heritage archives and complement this Swiss data with drawings from the many online architectural archives. As a first case study we may investigate the Old Hospital of Sion, which is owned by the city of Sion. The building was first mentioned in 1163 and has been extended and modified over the centuries. For this building, a large number of documents in the form of texts, drawings and photos are available and have recently been reviewed for a seismic safety assessment of the building.
 
 
 

Taken

Type: MSc Semester project
Sections: Digital humanities, Data Science, Computer Science
Supervisor: Rémi Petitpierre
Keywords: Data visualisation, Web design, IIIF, Memetics, History of Cartography
Number of students: 1–2 (min. 12 ECTS in total)

Context: Cultural evolution can be modeled as an ensemble of ideas and conventions that spread from one mind to another. These are referred to as memes, elementary replicators of culture inspired by the concept of a gene. A meme can be a tune, a catch-phrase, or an architectural detail, for instance. If we take the example of maps, a meme can be expressed in the choice of a certain texture, a colour, or a symbol, to represent the environment. Thus, memes spread from one cartographer to another, through time and space, and can be used to study cultural evolution. With the help of computer vision, it is now possible to track those memes, by finding replicated visual elements through large datasets of digitised historical maps.

Below is an example of visual elements extracted from a 17th century map (original IIIF image). By extending the process to tens of thousands of maps and embedding these elements in a feature space, it becomes possible to estimate which elements correspond to the same replicated visual idea, or meme. This opens up new ways to understand how ideas and technologies spread across time and space.

Example extraction of the elementary visual elements from a 17th century French map.

Despite the immense potential that such data holds for better understanding cultural evolution, it remains difficult to interpret, since it involves tens of thousands of memes, corresponding to millions of visual elements. Somewhat like genomics research, it is now becoming essential to develop a microscope to observe memetic interactions more closely.

Objectives: In this project, the student will tackle the challenge of designing and building a prototype interface for the exploration of the memes on the basis of replicated visual elements. The scientific challenge is to create a design that reflects the layered and interconnected nature of the data, composed of visual elements, maps, and larger sets. The student will develop their project by working with digital embeddings of replicated visual elements, and digitised images in the IIIF framework. The interface will make use of the metadata to visualise how time, space, and the development of new technologies influence cartographic figuration, by filtering the visual elements. Finally, to reflect the multi-layered nature of the data, the design must be transparent and provide the ability to switch between visual elements and their original maps.
The project will draw on an exceptional dataset of tens of thousands of American, French, German, Dutch, and Swiss maps published between 1500 and 1950. Depending on the student’s interests, an interesting case study to demonstrate the benefits of the interface could be to investigate the impact of the invention of lithography, a revolutionary technology from the end of the 18th century, on the development of modern cartographic representations.

Prerequisites: Web Programming, basics of JavaScript.

Type: MSc (12 ECTS) or BA (8 ECTS) Semester project
Sections: Digital humanities, Computer Science
Supervisor: Beatrice Vaienti
Keywords: OCR, database
Number of students: 1

Context: An ongoing project at the Lab is focusing on the creation of a 4D database of the urban evolution of Jerusalem between 1840 and 1940. This database will not only depict the city in time as a 3D model, but also embed additional information and metadata about the architectural objects, thus employing the spatial representation of the city as a relational database. However, the scattered, partial and multilingual nature of the available historical sources underlying the construction of such a database makes it necessary to combine and structure them together, extracting from each one of them the information describing the architectural features of the buildings. 

Two architectural catalogues, “Ayyubid Jerusalem” (1187-1250) and “Mamluk Jerusalem: an Architectural Study”, contain respectively 22 and 64 chapters, each one describing a building of the Old City in a thorough way. The information they provide includes for instance the location of the buildings, their history (founder and date), an architectural description, pictures and technical drawings. 

Objectives: Starting from the scanned version of these two books, the objective of the project is to develop a pipeline to extract and structure their content in a relational spatial database. The content of each chapter is structured in sections that systematically cover the various aspects of each building’s architecture and history. Photos and technical drawings accompany this already structured text: the richness of the images and architectural representations in the books should also be considered and integrated in the project outcomes. Particular emphasis will be placed on the extraction of information about the architectural appearance of buildings, which can then be used for procedural modelling tasks. Given these elements, three main sub-objectives are envisioned:

  1. OCR;
  2. Organization of the extracted text in the original sections (or in new ways) and extraction of the information pertaining to the architectural features of the buildings;
  3. Encoding the information in a spatial DB, using the locational information present in each chapter to geolocate the building, and possibly associating its position with the existing geolocated building footprints.

Prerequisites: basic knowledge of database technologies and Python 

Type: MSc Semester project or Master project
Sections: Computer Science, Data Science, Digital humanities
Supervisor: Maud Ehrmann, as well as historians and network specialists from the C2DH Center from Luxembourg University.
Keywords: Document processing, NLP, machine learning

Context: News agencies (e.g. AFP, Reuters) have always played an important role in shaping the news landscape. Created in the 1830s and 1840s by groups of daily newspapers in order to share the costs of news gathering (especially abroad) and transmission, news agencies have gradually become key actors in news production, responsible for providing accurate and factual information in the form of agency releases. While studies exist on the impact of news agency content on contemporary news, little research has been done on the influence of news agency releases over time. During the 19C and 20C, to what extent did journalists rely on agency content to produce their stories? How was agency content used in historical newspapers, as simple verbatims (copy and paste) or with rephrasing? Were they systematically attributed or not? How did news agency releases circulate among newspapers and which ones went viral?

Objective: Based on a corpus of historical Swiss and Luxembourgish newspapers spanning 200 years (i.e., the impresso project’s corpus), the goal of this project is to develop a machine learning-based classifier to distinguish newspaper articles based on agency releases from other articles. In addition to detecting agency releases, the system could also identify the news agency behind the release.

Main steps:

  • Ground preparation with a) exploration of the corpus (via the impresso interface) to become familiar with historical newspaper articles and identify typical features of agency content as well as potential difficulties; b) compilation of a list of news agencies active throughout the 19C and 20C.
  • Construction of a training corpus (in French), based on:
    • sampling and manual annotation;
    • a collection of manually labelled agency releases, which could be automatically extended by using previously computed text reuse clusters (data augmentation).
  • Training and evaluation of two (or more) agency release classifiers:
    • a keyword baseline (i.e. where the article contains the name of the agency);
    • a neural-based classifier
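
The keyword baseline mentioned above can be as small as a regular expression over agency names. The sketch below assumes a short, purely illustrative list of names (the project itself would compile the real list in the ground preparation step), and the function name is hypothetical.

```python
import re

# Illustrative list only; the actual list of agencies active in the 19C and
# 20C is to be compiled as part of the project.
AGENCIES = ["AFP", "Reuters", "Havas"]
AGENCY_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in AGENCIES) + r")\b"
)


def keyword_baseline(article_text):
    """Return the first agency name found in the text, or None if there is none."""
    match = AGENCY_PATTERN.search(article_text)
    return match.group(1) if match else None
```

Exact matching of this kind is likely to miss names distorted by OCR noise and to fire on articles that merely talk about an agency, which is part of what the neural classifier is expected to improve on.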

Master project – If the project is taken as a master project, the student will also work on the following: 

Processing:

  • Multilingual processing (French, German, Luxembourgish);
  • Systematic evaluation and comparison of different algorithms;
  • Identification of the news agency behind the release; 
  • Fine-grained characterisation of text modification;
  • Application of the classifier to the entire corpus. 

Analysis:

  • Study of the distribution of agency content and its key characteristics over time (in a data science fashion).
  • Based on the computed data, study of information flows based on a network representation of news agencies (nodes), news releases (edges) and newspapers (nodes).

Eventually, such work will enable the study of news flows across newspapers, countries, and time.

Requirements: Good knowledge in machine learning, data engineering and data science. Interest in media and history.

Fall 2022

There were a few projects only and we no longer have the capacity to host projects for that period. Check out around December 2022 what will be proposed for Spring 2023!

Spring 2022

Available

Type of project : Master/Semester thesis
Supervisors: Frédéric Kaplan
Semester of project: Spring 2021
Project summary: This project aims at exploring the relevance of Transformer architectures for 3D point clouds. In a first series of experiments, the student will use the large 3D point clouds of models produced at the DHLAB for the city of Venice or the city of Sion and try to predict missing parts. In a second series of experiments, the student will use the SRTM data (Shuttle Radar Topography Mission, a NASA mission conducted in 2000 to obtain elevation data for most of the world) to encode / decode terrain prediction. 

Contact: Prof. Frédéric Kaplan

Type of project: Semester

Supervisors: Didier Dupertuis and Paul Guhennec.

Context: The Federal Office of Topography (Swisstopo) has recently made accessible high-resolution digitizations of historical maps of Switzerland, covering every year between 1844 and 2018. In order to be able to use these assets for geo-historical research and analysis, the information contained in the maps must be converted from its visual representation to an abstract, geometrical form. This conversion from an input raster to an output vector geometry is typically performed well by combining Convolutional Neural Networks for pixelwise classification with standard computer vision techniques, but might prove challenging for datasets with a larger figurative diversity, as is the case for Swiss historical maps, whose style varies over time.


Objective: The ambition of this work is to develop a pipeline capable of transforming the buildings and roads of the Swisstopo set of historical maps into geometries correctly positioned in an appropriate Geographic Coordinate System. The student will have to train a pre-existing semantic segmentation neural network on a manually annotated training set, evaluate its success, and fine-tune it so that it can be applied to the entire dataset. Depending on the interest of the student, some first diachronic analyses of the road network evolution in Switzerland can be considered.

Prerequisites:

  • Good skills with Python.
  • Basics in machine learning and computer vision are a plus.


Supervisors: Paul Guhennec and Federica Pardini.

Context: In the early 19th century, a vast administrative endeavour, made possible by the Napoleonic invasions of Europe, set out to map the geometry of most European cities as faithfully as possible. The resulting so-called Napoleonic cadasters are a very precious testimony of the state of these cities in the past at the European scale. For reasons of practicality and detail, most cities are represented on separate sheets of paper, with margin overlaps between them and indications on how to reconstruct the complete picture.
Recent work at the laboratory has shown that it is possible to make use of the great homogeneity in the visual representations of the parcels of the cadaster to automatically vectorize them and classify them according to a fixed typology (private building, public building, roads, etc). However, the process of aligning these cadasters with the “real” world, by positioning them in a Geographic Coordinate System, and thus allowing large-scale quantitative analyses, remains challenging.


Objective: A first problem to tackle is the combination of the different sheets to make up the full city. Building on a previous student project, the student will develop a process to automatically align neighbouring sheets, while accounting for the imperfections and misregistrations in these historical documents. In a second stage, a pipeline will be developed in order to align the combined sheets obtained at the previous step on contemporary geographic data.
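
As a rough illustration of the sheet-alignment step, the offset between two overlapping sheets can be estimated from pairs of corresponding points. The sketch below makes strong simplifying assumptions: the matches are taken as given (for instance from a feature matcher, which is not shown), only a translation is estimated, and the median is used so that a few wrong matches do not dominate. Real cadastral sheets would most likely also need rotation and scale, and the function name is hypothetical.

```python
from statistics import median


def estimate_translation(matches):
    """Estimate the shift that moves sheet B onto sheet A.

    matches: list of ((x_a, y_a), (x_b, y_b)) pairs of corresponding points
    in the overlapping margins of the two sheets.
    Returns (dx, dy), the per-axis median of the point differences.
    """
    dx = median(xa - xb for (xa, _), (xb, _) in matches)
    dy = median(ya - yb for (_, ya), (_, yb) in matches)
    return dx, dy


# Toy usage: two consistent matches and one outlier.
matches = [((10, 5), (0, 0)), ((12, 7), (2, 2)), ((50, 50), (3, 3))]
shift = estimate_translation(matches)
```

The median keeps the estimate at the offset shared by the two consistent matches despite the outlier, which is the kind of tolerance to misregistration the objective asks for.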

Prerequisites:

  • Good skills in Python.
  • Experience with computer vision libraries

Type of project: Semester project or Master thesis

Supervisors: Didier Dupertuis and Frédéric Kaplan.

Context: In 1798, after a millennium as a republic, Venice was taken over by Napoleonic armies. A new centralized administration was erected in the former city-state. It went on to create many valuable archival documents giving a precise image of the city and its population.

The DHLAB just finished the digitization of two complementary sets of documents: the cadastral maps and their accompanying registers, the “Sommarioni”. The cadastral maps give an accurate picture of the city with clear delineation of numbered parcels. The Sommarioni registers contain information about each parcel, including a one-line description of its owner(s) and type of usage (housing, business, etc.).

The cadastral maps have been vectorized, with precise geometries and numbering for each parcel. The 230’000 records of the Sommarioni have been transcribed. Resulting datasets have been brought together and can be explored in this interactive platform (only available via EPFL intranet or VPN).

Objective: The next challenge is to extract structured data from the Sommarioni owner descriptions, i.e. to recognize and disambiguate the names of people, businesses and institutions. The owner description is a noisy text snippet mentioning several relatives’ names; some records only contain the information that they are identical to the previous one; institution names might have different spellings; and there are many homonyms among personal names.

The ideal output would be a list of disambiguated institutions and people with, for the latter, the network of their family members.

The main steps are as follows:

  • Definition of entity typology (individual, family or collective, business, etc.);
  • Entity extraction in each record, handling the specificities of each type;
  • Entity disambiguation and linking between records;
  • Creation of a confidence score for the linking and disambiguation to quantify uncertainty, and of different scenarios for different degrees of uncertainty;
  • If time permits, analysis and discussion of results in relation to the Venice of 1808;
  • If time permits, integration of the results in the interactive platform.
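
To illustrate the linking and confidence-score steps, the sketch below uses plain string similarity from Python's standard `difflib` module as a crude confidence value and links records greedily above a threshold. This is only one possible starting point: the threshold value, the function names and the example owner strings are all invented for illustration, and string similarity alone cannot separate true homonyms.

```python
from difflib import SequenceMatcher


def normalise(name):
    """Lowercase and collapse whitespace before comparison."""
    return " ".join(name.lower().split())


def link_confidence(name_a, name_b):
    """Similarity in [0, 1] between two owner strings, used as a crude confidence score."""
    return SequenceMatcher(None, normalise(name_a), normalise(name_b)).ratio()


def link_records(names, threshold=0.85):
    """Greedy linking: attach each name to the first earlier cluster it resembles.

    Returns a list of (representative_name, [record indices]) tuples.
    The threshold is an arbitrary illustrative value.
    """
    clusters = []
    for index, name in enumerate(names):
        for representative, members in clusters:
            if link_confidence(name, representative) >= threshold:
                members.append(index)
                break
        else:
            # No sufficiently similar cluster found: start a new one.
            clusters.append((name, [index]))
    return clusters


# Invented example records, not taken from the Sommarioni.
records = ["Zuane Contarini", "zuane  contarini", "Scuola Grande di San Rocco"]
clusters = link_records(records)
```

Running the linking with several thresholds is one simple way to produce the "different scenarios for different degrees of uncertainty" mentioned in the steps.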

Prerequisites:

  • Good knowledge of Python and data-wrangling;
  • No special knowledge of Venetian history is needed;
  • Proficiency in Italian is not necessary but would be a plus.

Taken

Type of project: Semester

Supervisors: Sven Najem-Meyer (DHLAB) and Matteo Romanello (UNIL).

Context: Optical Character Recognition aims at transforming images into machine-readable texts. Though neural networks helped to improve performances, historical documents remain extremely challenging. Commentaries to classical Greek literature epitomize this difficulty, as systems must cope with noisy scans, complex layouts and mixed Greek and Latin scripts.

Objective: The objective of the project is to produce a system that can cope with this highly complex data. Depending on your skills and interests, the project can focus on:

  • Image pre-processing
  • Multitasking: can we improve OCR by processing tasks such as layout analysis or named-entity recognition in parallel?
  • Benchmarking and fine-tuning available frameworks
  • Optimizing post-processing with NLP
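
For the benchmarking option, a common way to compare OCR systems is the character error rate (CER): the edit distance between the OCR output and a ground-truth transcription, divided by the length of the ground truth. The sketch below is a plain-Python version of that metric; the function names are hypothetical, and no particular OCR framework is assumed.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def cer(prediction, reference):
    """Character error rate of an OCR prediction against a ground-truth reference."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(prediction, reference) / len(reference)
```

For polytonic Greek, both strings should be brought to the same Unicode normalisation form before comparison; otherwise an accented letter encoded in two different ways counts as an error.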

Prerequisites:

  • Good skills in Python; experience with libraries such as OpenCV or PyTorch is a plus.
  • Good knowledge of machine learning is advised; basics of computer vision and image processing would be a real plus.
  • No knowledge of ancient Greek/literature is required.

Type of project: Semester

Supervisor: Maud Ehrmann

Objective: Given a large archive of historical newspapers (cf. the impresso project) containing both text and image material, the objective of this project is to:

1/ Learn a model to classify images according to their types (e.g. map, photograph, illustration, comics, ornament, drawing, caricature).
This first step will consist in:

  • the definition of the typology by looking at the material – although the targeted typology will be rather coarse;
  • the annotation of a small training set;
  • the training of a model by fine-tuning an existing visual model;
  • the evaluation of the said model.

2/ Apply this model on a large-scale collection of historical newspapers (inference), and possibly build a statistical profile of the recognized elements through time.

Required skills:

  • basic knowledge in computer vision
  • ideally experience with PyTorch

Type of project: Semester (ideally done in loose collaboration with the other semester project on image classification)

Supervisor: Maud Ehrmann

Objective: Given a large archive of historical newspapers (cf. the impresso project) containing both text and image material, the objective of this project is to:

1/ Learn a model for binary image classification: map vs. non-map.
This first step will consist in:

  • the annotation of a small training set (this step is best done in collaboration with the project on image classification);
  • the training of a model by fine-tuning an existing visual model;
  • the evaluation of the said model.

2/ Learn a model for map classification (which country or region of the world is represented)

  • a first exploration and qualification of map types in the corpus;
  • the building of a training set, probably with external sources;
  • the training of a model by fine-tuning an existing visual model;
  • the evaluation of the said model.

Required skills:

  • basic knowledge in computer vision
  • ideally experience with PyTorch

Type of project : Master or semester project. 
 
Supervisors : Frederic Kaplan (DHLAB), Julien Fargeot (LCAV)
 
Context : Since 1994, the Centre for UNESCO in the French city of Troyes has organised an annual international drawing competition. Each year, a theme is proposed for the competition, which sees the participation of young people from all over the world, aged between 3 and 25 years. The winners’ works are then exhibited and will soon be highlighted by the opening of a dedicated museum, but all the drawings have been preserved and recently digitised. The 115,000 or so works from 150 countries constitute an exceptional collection and a window on the imagination of young people over more than 25 years. Recent machine learning and data analysis techniques will make it possible to explore this unique database, which this project aims to initiate.
 
Objective : The ambition of the project is to mine the drawing database to find patterns linking the drawings with other works from art history. The algorithms and methods developed in the DHLAB Replica project for searching morphological patterns in large-scale databases of artworks will serve as a starting point for this research. The goal will be to explore how the geographical origin and the age of the young artist impact the use of certain kinds of references or drawing techniques. 
 
 

Type of project: Semester

Supervisors: Matteo Romanello (DHLAB), Maud Ehrmann (DHLAB), Andreas Spitz (DLAB)

Context: Digitized newspapers constitute an extraordinary goldmine of information about our past, and historians are among those who can most benefit from it. Impresso, an ongoing, collaborative research project based at the DHLAB, has been building a large-scale corpus of digitized newspapers: it currently contains 76 newspapers from Switzerland and Luxembourg (written in French, German and Luxembourgish) for a total of 12 billion tokens. This corpus was enriched with several layers of semantic information such as topics, text reuse and named entities (persons, locations, dates and organizations). The latter are particularly useful for historians as co-occurrences of named entities often indicate (and help to identify) historical events. The impresso corpus currently contains some 164 million entity mentions, linked to 500 thousand entities from DBpedia (partly mapped to Wikidata).

Yet, making use of such a large knowledge graph in an interactive tool for historians — such as the tool impresso has been developing — requires an underlying document model that facilitates the retrieval of entity relations, contexts, and related events from the documents effectively and efficiently. This is where LOAD comes into play, which is a graph-based document model that supports browsing, extracting and summarizing real world events in large collections of unstructured text based on named entities such as Locations, Organizations, Actors and Dates.

Objective: The student will build a LOAD model of the impresso corpus. In order to do so, an existing Java implementation, which uses MongoDB as its back-end, can be used as a starting point. Also, the student will have access to the MySQL and Solr instances where impresso semantic annotations have already been stored and indexed. Once the LOAD model is built, an already existing front-end called EVELIN will be used to create a first demonstrator of how LOAD can enable entity-driven exploration of the impresso corpus.
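
To give a feel for the kind of structure involved, the sketch below builds a very simplified entity co-occurrence graph: every pair of entities mentioned in the same sentence gets an edge whose weight counts how often they co-occur. This is not the LOAD model or its Java implementation, only a toy illustration of the underlying idea; the input format, the function name and the example entities are assumptions.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_graph(sentences):
    """Build a weighted co-occurrence graph of named entities.

    sentences: iterable of lists of (entity_type, entity_name) tuples, one
    list per sentence.
    Returns a Counter mapping each unordered entity pair (stored in sorted
    order) to the number of sentences in which both entities appear.
    """
    edges = Counter()
    for entities in sentences:
        # Deduplicate within the sentence and sort so each pair has one key.
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges


# Invented toy input using LOAD-style types (LOC, ACT, DAT).
sentences = [
    [("LOC", "Geneva"), ("ACT", "Dufour")],
    [("LOC", "Geneva"), ("ACT", "Dufour"), ("DAT", "1847")],
]
graph = cooccurrence_graph(sentences)
```

At the scale of the impresso corpus such a graph would not fit in a single in-memory counter, which is where the database back-end and big data tooling listed in the required skills come in.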

Required skills:

  • good proficiency in Java or Scala
  • familiarity with graph/network theory
  • experience with big data technologies (Kubernetes, Spark, etc.)
  • experience with PostgreSQL, MySQL or MongoDB

Note for potential candidates: In Spring and Fall 2020, two students already worked on this project, but work and research perspectives are far from being exhausted and many things remain to be explored. We therefore propose to continue it, and the focus of the next edition will be adapted according to the candidate’s background and preferences. Do not hesitate to get in touch!

Type of project: Semester

Supervisors: Rémi Petitpierre (IAGS), Paul Guhennec (DHLAB)

Context: The Bibliothèque Historique de la Ville de Paris digitised more than 700 maps covering in detail the evolution of the city from 1836 (plan Jacoubet) to 1900, including the famous Atlases Alphand. They are of particular interest for the urban studies of Paris, which was at the time heavily transfigured by Haussmannian transformations. For administrative and political reasons, the City of Paris did not benefit from the large cadastration campaigns that occurred throughout Napoleonic Europe at the beginning of the 19th century. Therefore the Atlas’s sheets are the finest source of information available on the city over the 19th century. In order to make use of the great potential of this dataset, the information contained in the maps must be converted from its visual representation to an abstract, geometrical form, on which to base quantitative studies. This conversion from an input raster to an output vector geometry is typically performed well by combining Convolutional Neural Networks for pixelwise classification with standard computer vision techniques. In a second phase, the student will develop quantitative techniques to investigate the transformation of the urban fabric.

Tasks

  • Semantically segment the Atlas de Paris maps as well as the 1900 plan: train a pre-existing semantic segmentation neural network on a manually annotated training set, evaluate its performance, and fine-tune it so that it can be applied to the entire dataset.
  • Develop a semi-automatic pipeline to align the vectorised polygons to a geographic coordinate system.
  • Depending on the student’s interest, tackle some questions such as:
    • analysing the impact of the opening of new streets on mobility within the city;
    • detecting the emergence of locally aligned neighbourhoods (lotissements);
    • investigating the relation between the uniquely Parisian hygienist “îlot urbain” and housing salubrity;
    • studying the morphology of the city’s infrastructure network (e.g. sewers).
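
As an illustration of the alignment task listed above, here is a minimal sketch, assuming the simplest possible setting: an operator clicks three ground control points on a map sheet and gives their coordinates in the target system, and an affine transform is fitted from them. The function names and all coordinates are hypothetical and do not come from the project; a real pipeline would likely use more control points with a least-squares fit and a proper map projection.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve3(m, rhs):
    """Solve a 3x3 linear system with Cramer's rule."""
    d = det3(m)
    if abs(d) < 1e-12:
        raise ValueError("control points are collinear")
    solution = []
    for col in range(3):
        replaced = [row[:] for row in m]
        for r in range(3):
            replaced[r][col] = rhs[r]
        solution.append(det3(replaced) / d)
    return solution


def fit_affine(pixel_pts, geo_pts):
    """Fit X = a*x + b*y + c, Y = d*x + e*y + f from exactly three point pairs."""
    m = [[x, y, 1.0] for x, y in pixel_pts]
    a, b, c = solve3(m, [gx for gx, _ in geo_pts])
    d, e, f = solve3(m, [gy for _, gy in geo_pts])
    return (a, b, c, d, e, f)


def apply_affine(params, polygon):
    """Map every vertex of a polygon from pixel to target coordinates."""
    a, b, c, d, e, f = params
    return [(a * x + b * y + c, d * x + e * y + f) for x, y in polygon]


# Hypothetical control points: pixel positions and made-up target coordinates
# (0.5 units per pixel, with the y axis flipped as is usual for images).
pixel_pts = [(0, 0), (100, 0), (0, 100)]
geo_pts = [(1000.0, 2000.0), (1050.0, 2000.0), (1000.0, 1950.0)]

params = fit_affine(pixel_pts, geo_pts)
block = apply_affine(params, [(20, 40), (60, 40), (60, 80)])
```

Three non-collinear points determine the transform exactly, which is why collinear control points are rejected rather than silently producing a degenerate result.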

Prerequisites
Good Python skills.
Basic knowledge of machine learning and computer vision is a plus.

More explorative projects 

For students who want to explore exciting uncharted territories in an autonomous manner. 

Type of project : Master thesis
Supervisors: Frédéric Kaplan
Project summary: This explorative project studies the properties of text kernels, i.e. sentences that remain invariant after automatic translation back and forth into a foreign language. The goal is to develop a prototype text editor / transformation pipeline that associates any sentence with its invariant form.
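
One way to read the notion of an invariant form is as a fixed point of the round-trip translation. The sketch below is only an illustration under that assumption: `find_text_kernel` and `toy_round_trip` are hypothetical names, and the toy function is a word-substitution stand-in for a real translation service, which the project description does not specify.

```python
from typing import Callable, Tuple


def find_text_kernel(sentence: str,
                     round_trip: Callable[[str], str],
                     max_iters: int = 10) -> Tuple[str, bool]:
    """Apply the round trip repeatedly until the sentence stops changing.

    Returns the last sentence reached and whether a fixed point was found.
    """
    current = sentence
    for _ in range(max_iters):
        candidate = round_trip(current)
        if candidate == current:
            return current, True
        current = candidate
    # No fixed point within the budget (the round trip may also cycle).
    return current, False


def toy_round_trip(sentence: str) -> str:
    """Hypothetical stand-in for translating to a foreign language and back."""
    rewrites = {"huge": "big", "big": "large"}
    return " ".join(rewrites.get(word, word) for word in sentence.split())


kernel, converged = find_text_kernel("a huge house", toy_round_trip)
print(kernel, converged)
```

The iteration cap matters because nothing guarantees that a real round trip converges: it could oscillate between two phrasings, in which case the function reports that no fixed point was found.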

Contact: Prof. Frédéric Kaplan

Type of project : Master thesis
Supervisors: Frédéric Kaplan
Project summary: This project aims at studying the potential of Transformer architectures for new kinds of language games between artificial agents and for the evolution of artificial languages. Artificial agents interact with one another about situations in the “world” and autonomously develop their own language on this basis. This project extends a long series of experiments that started in the 2000s.

Contact: Prof. Frédéric Kaplan

Type of project : Master/Semester thesis
Supervisors: Frédéric Kaplan
Project summary: This explorative project consists in mining a large collection of novels to automatically extract characters and generate a dictionary of novel characters. Each character will be associated with the novels in which they appear, with the characters they interact with, and possibly with specific traits.

Contact: Prof. Frédéric Kaplan

Fall 2022

Project type: Master

Supervisors: Maud Ehrmann and Simon Clematide (UZH)

Context: The impresso project aims at semantically enriching 200 years of newspaper archives by applying a range of NLP techniques (OCR correction, named entity processing, topic modeling, text reuse, etc.). The source material comes from the Swiss and Luxembourg national libraries and corresponds to the facsimiles and OCR outputs of ca. 200 newspaper titles in German and French.

Problems

  • Automatic processing of these sources is greatly hindered by the sometimes low quality of legacy OCR and OLR (when present) processes: articles are incorrectly transcribed and incorrectly segmented. This has consequences for downstream text processing (e.g. a topic made up only of poorly OCRed tokens). One solution is to filter out such elements before they enter the text processing pipeline. This implies recognizing specific parts of newspaper pages known to be particularly noisy, such as weather tables, transport schedules, crosswords, etc.
  • Additionally, besides advertisement recognition, OLR does not provide any section classification (is this segment a title banner, a feuilleton, an editorial, etc.) and it would be useful to provide basic section/rubrique information.

Objectives:  Building on a previous master thesis which explored the interplay between textual and visual features for segment recognition and classification in historical newspapers (see project description, master thesis, and published article), this master project will focus on the development, evaluation, application and release of a documented pipeline for the accurate recognition and fine-grained semantic classification of tables.

Tables present several challenges, among others:

  • as usual, differences across time and sources;
  • visual clues can be confusing:
    • presence of mixed layouts: table-like part + normal text
    • existence of “quasi-tables” (e.g. lists)
  • variety of semantic classes: stock exchanges, TV/radio programmes, transport schedules, sport results, events of the day, weather, etc.

Main objectives will be:

  1. Creation and evaluation of table recognition and classification models.
  2. Application of these models to large-scale newspaper archives by means of a software pipeline, which will be documented and released for further use in academic contexts. This will support the concrete use case of scholars exporting specific datasets.
  3. (bonus) Statistical profile of the large-scale table extraction data (frequencies, proportion in title/pages, comparison through time and titles).
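
The statistical profile mentioned in the bonus objective can start from simple counts. The sketch below is a minimal illustration under assumptions: the record layout (newspaper title, year, table class), the function names and the sample values are all hypothetical and are not taken from the impresso data.

```python
from collections import Counter

# Hypothetical extraction output: (newspaper title, year, table class).
records = [
    ("Title A", 1905, "transport schedule"),
    ("Title A", 1907, "stock exchange"),
    ("Title A", 1912, "stock exchange"),
    ("Title B", 1905, "weather"),
    ("Title B", 1931, "radio programme"),
]


def class_frequencies(rows):
    """Number of extracted tables per semantic class."""
    return Counter(cls for _, _, cls in rows)


def counts_by_decade(rows):
    """Number of tables per (decade, class), for comparison through time."""
    return Counter(((year // 10) * 10, cls) for _, year, cls in rows)


def class_share_per_title(rows):
    """Share of each class among the tables of a given newspaper title."""
    totals = Counter(title for title, _, _ in rows)
    pairs = Counter((title, cls) for title, _, cls in rows)
    return {key: n / totals[key[0]] for key, n in pairs.items()}


print(class_frequencies(records).most_common())
print(sorted(counts_by_decade(records).items()))
print(class_share_per_title(records))
```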

Spring 2021

Type of project: Semester 

Supervisors: Albane Descombes, Frédéric Kaplan.

Description:

Photogrammetry is a 3D modelling technique which enables making highly precise models of our built environment, thus collecting lots of digital data on our architectural heritage – at least, as it remains today.

Over the years, a place may have been recorded through drawings, paintings, photographs or scans, depending on the recording techniques available at the time.

For this reason, one has to mix various media to show the evolution of a building through the centuries. This project proposes to study techniques for overlaying images on 3D photogrammetric models of Venice and of the Louvre museum. The models are provided by the DHLAB and were computed in past years. The images of Venice come from the photo library of the Cini Foundation, and the images of Paris can be collected on Gallica (the French national digital library).

Eventually this project will deal with the issues of perspective, pattern and surface recognition in 2D and 3D, customizing 3D viewers to overlay images, and showcasing the result on a web page.

Example of image and photogrammetric overlay.

Type of project: Master

Supervisors: Albane Descombes, Frédéric Kaplan, Alain Dufaux.

Description:
 
Fréquence Banane is one of the oldest student associations on the UNIL-EPFL campus and has therefore collected plenty of audio content over the years on various media: recording tapes, CDs, hard disks, NAS.

A collection of hundreds of magnetic tapes contains the association’s first radio shows, recorded in the early 90s after its creation. They include interviews and podcasts about Vivapoly, Forum EPFL or Balélec, events which have set the pace of student life on campus for many years.

This project aims first at studying existing methods for digitizing magnetic tapes, then at building a browsable database of all the digitized radio shows. The audio content will be analysed using adapted speech recognition models.

This project is done in collaboration with Alain Dufaux, from the Cultural Heritage & Innovation Center.

Type of project : Master/Semester thesis
Supervisors: Paul Guhennec, Fabrice Berger, Frédéric Kaplan
Semester of project: Spring 2021
Project summary: The project consists in the development of a scriptable pipeline for producing procedural architecture using the Houdini 3D procedural environment. The project will start from existing procedural models of Venice in 1808 developed at the DHLAB and automate a pipeline to script the model from the historical information recorded about each parcel.
Contact: Prof. Frédéric Kaplan

Type of project : Master thesis
Supervisors: Maud Ehrmann, Frédéric Kaplan
Semester of project: Spring 2021
Project summary: Thanks to the digitisation and transcription campaign conducted during the Venice Time Machine, a digital collection of secondary sources offers arguably complete coverage of the historiography concerning Venice and its population in the 19th century. Through a manual and automatic process, the project will identify a series of hypotheses concerning the evolution of Venice’s functions, morphology and ownership network. These hypotheses will be translated into a formal language, keeping a direct link with the books and journals where they are expressed, and systematically tested against the model of the city of Venice established through the integration of the models of the cadastral maps. In some cases, the data of the computational model of the city may contradict the hypotheses of the database, and this will lead either to a revision of the hypotheses or to a revision of the computational models of Venice established so far.

Type of project : Master/Semester thesis
Supervisors: Didier Dupertuis, Paul Guhennec, Frédéric Kaplan
Semester of project: Spring 2021
Project summary: As part of the ScanVan project, the city of Sion has been digitised in 3D. The goal is now to generate images of the city by day and by night and in the different seasons (summer, winter, spring and autumn). A Contrastive Unpaired Translation architecture, like the one used for transforming Venice images, will be used for this project.

Contact: Prof. Frédéric Kaplan

Type of project : Master/Semester thesis
Supervisors: Albane Descombes, Frédéric Kaplan
Semester of project: Spring 2021
Project summary: This project aims at automatically transforming YouTube videos into 3D models using photogrammetry techniques. It extends the work of several Master / Semester projects that have made significant progress in this direction. The goal here is to design a pipeline that georeferences the extracted models and makes it possible to explore them with a 4D navigation interface developed at the DHLAB.

Contact: Prof. Frédéric Kaplan