“impresso. Media Monitoring of the Past” is an interdisciplinary research project in which a team of computational linguists, designers and historians collaborate on the datafication of a multilingual corpus of digitized historical newspapers. The primary goals of the project are to improve text mining tools for historical text, to enrich historical newspapers with automatically generated data and to integrate such data into historical research workflows by means of a newly developed user interface. Beyond the challenges specific to the different research areas underpinning each of these objectives, the question of how best to adapt text mining tools and their use by humanities scholars is at the heart of the impresso enterprise.
The impresso project was funded by the Swiss National Science Foundation (SNSF) under grant number CRSII5_173719 from Sept 2017 until Dec (impresso on the SNSF grant portal).
Applicants – The impresso project draws on the expertise of three leading institutions in digital humanities, computational linguistics and digital history from Luxembourg and Switzerland.
- Digital Humanities Laboratory (DHLAB), EPFL;
- Institute of Computational Linguistics (ICL), University of Zurich;
- Centre for Contemporary and Digital History (C2DH), University of Luxembourg.
Project partners – impresso receives high-quality content from national libraries, archives and newspapers across Europe. A team of associated historians ensures that impresso meets the needs and quality standards of its target audience.
- Swiss National Library (BN)
- National Library of Luxembourg (BNL)
- State Archives of Valais
- Swiss Economic Archives (SWA)
- Le Temps
- Neue Zürcher Zeitung
- History Department, University of Lausanne (UNIL)
- infoclio, the Swiss portal for the historical sciences
In addition to workshops and events around the development of the impresso interface, various other activities took place:
- CLEF-HIPE-2020 evaluation campaign
- Eldorado workshop: Digitized newspapers – a new Eldorado for historians? (De Gruyter volume to appear).
- Tutorial on Named Entities processing for Digital Humanities (DH conference 2019, Utrecht)
- UNIL-EPFL continuing education course (2020 and 2023) – Histoire: perspectives et analyses numériques (History: digital perspectives and analyses)
- EPFL SHS teaching on Press and digital methodologies based on the impresso interface.
- C2DH Forum Z on digitized newspapers
Datasets and resources produced by the project:
- NE-annotated historical newspapers (CLEF-HIPE-2020)
- Datasets and models for historical newspaper semantic segmentation
- Survey of digitized newspaper interfaces (dataset and notebooks)
- HIPE-2022 Shared Task Named Entity Datasets
- Historical newspaper corpora (to be published soon)
- Historical newspaper semantic annotations (to be published soon)
(See the publications page of the project website for a complete list).
Explorer la presse numérisée : le projet Impresso
“Impresso – Media Monitoring of the Past” is an interdisciplinary research project in which a team of historians, computational linguists and designers collaborates on the datafication of a corpus of digitized press archives. The main objectives of the project are to improve information extraction tools for historical texts, to semantically index historical newspapers, and to integrate the resulting enrichments into historians’ research practices by means of a newly developed interface.
Revue Historique Vaudoise. 2021-11-27. Vol. 129/2021, p. 159-173.
Historical Newspaper Content Mining: Revisiting the impresso Project’s Challenges in Text and Image Processing, Design and Historical Scholarship
Long abstract for a presentation at DH2020 (online).
2020. Digital Humanities Conference (DH), Ottawa, Canada, July 20-24, 2020. DOI : 10.5281/zenodo.4641894.
Language Resources for Historical Newspapers: the Impresso Collection
Following decades of massive digitization, an unprecedented amount of historical document facsimiles can now be retrieved and accessed via cultural heritage online portals. If this represents a huge step forward in terms of preservation and accessibility, the next fundamental challenge – and real promise of digitization – is to exploit the contents of these digital assets, and therefore to adapt and develop appropriate language technologies to search and retrieve information from this ‘Big Data of the Past’. Yet, the application of text processing tools on historical documents in general, and historical newspapers in particular, poses new challenges, and crucially requires appropriate language resources. In this context, this paper presents a collection of historical newspaper data sets composed of text and image resources, curated and published within the context of the ‘impresso – Media Monitoring of the Past’ project. With corpora, benchmarks, semantic annotations and language models in French, German and Luxembourgish covering ca. 200 years, the objective of the impresso resource collection is to contribute to historical language resources, and thereby strengthen the robustness of approaches to non-standard inputs and foster efficient processing of historical documents.
2020-05-11. 12th International Conference on Language Resources and Evaluation (LREC), Marseille, France, May 11-16, 2020. p. 958-968. DOI : 10.5281/zenodo.4641902.
Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers
The massive amounts of digitized historical documents acquired over the last decades naturally lend themselves to automatic processing and exploration. Research work seeking to automatically process facsimiles and extract information thereby are multiplying with, as a first essential step, document layout analysis. Although the identification and categorization of segments of interest in document images have seen significant progress over the last years thanks to deep learning techniques, many challenges remain with, among others, the use of more fine-grained segmentation typologies and the consideration of complex, heterogeneous documents such as historical newspapers. Besides, most approaches consider visual features only, ignoring textual signal. We introduce a multimodal neural model for the semantic segmentation of historical newspapers that directly combines visual features at pixel level with text embedding maps derived from, potentially noisy, OCR output. Based on a series of experiments on diachronic Swiss and Luxembourgish newspapers, we investigate the predictive power of visual and textual features and their capacity to generalize across time and sources. Results show consistent improvement of multimodal models in comparison to a strong visual baseline, as well as better robustness to the wide variety of our material.
Journal of Data Mining & Digital Humanities. 2021. Vol. 2021, num. Special Issue on HistoInformatics: Computational Approaches to History, p. 1-26. DOI : 10.5281/zenodo.4065271.
The impresso system architecture in a nutshell
This post describes the impresso application architecture and processing in a nutshell. The text was published in October 2020 in issue number 16 of the EuropeanaTech Insights dedicated to digitized newspapers and edited by Gregory Markus and Clemens Neudecker: https://pro.europeana.eu/page/issue-16-newspapers#the-impresso-system-architecture-in-a-nutshell
Extended Overview of CLEF HIPE 2020: Named Entity Processing on Historical Newspapers
This paper presents an extended overview of the first edition of HIPE (Identifying Historical People, Places and other Entities), a pioneering shared task dedicated to the evaluation of named entity processing on historical newspapers in French, German and English. Since its introduction some twenty years ago, named entity (NE) processing has become an essential component of virtually any text mining application and has undergone major changes. Recently, two main trends characterise its developments: the adoption of deep learning architectures and the consideration of textual material originating from historical and cultural heritage collections. While the former opens up new opportunities, the latter introduces new challenges with heterogeneous, historical and noisy inputs. In this context, the objective of HIPE, run as part of the CLEF 2020 conference, is threefold: strengthening the robustness of existing approaches on non-standard inputs, enabling performance comparison of NE processing on historical texts, and, in the long run, fostering efficient semantic indexing of historical documents. Tasks, corpora, and results of 13 participating teams are presented. Compared to the condensed overview, this paper includes further details about data generation and statistics, additional information on participating systems, and the presentation of complementary results.
2020-10-21. 11th Conference and Labs of the Evaluation Forum (CLEF 2020), [Online event], 22-25 September, 2020. DOI : 10.5281/zenodo.4117566.
Historical Newspaper User Interfaces: A Review
After decades of large-scale digitization, many historical newspaper collections are just one click away via online portals developed and supported by various public or private stakeholders. Initially offering access to full text search and facsimiles visualization only, historic newspaper user interfaces are increasingly integrating advanced exploration features based on the application of text mining tools to digitized sources. As gateways to enriched material, such interfaces are however not neutral and play a fundamental role in how users perceive historical sources, understand potential biases of upstream processes and benefit from the opportunities of datafication. What features can be found in current interfaces, and to what degree do interfaces adopt novel technologies? This paper presents a survey of interfaces for digitized historical newspapers with the aim of mapping the current state of the art and identifying recent trends with regard to content presentation, enrichment and user interaction. We devised 6 interface assessment criteria and reviewed twenty-four interfaces based on ca. 140 predefined features.
2019-09-02. 85th IFLA General Conference and Assembly, Athens, Greece, 24-30 August 2019. p. 1-24. DOI : 10.5281/zenodo.3404155.