Key concepts

Big Data of the past

Big Data is not a new phenomenon. History is punctuated by regimes of data acceleration, characterized by feelings of information overload accompanied by periods of social transformation and the invention of new technologies. During these moments, private organizations, administrative powers, and sometimes isolated individuals have produced large datasets, organized following a logic that was often subsequently superseded but was nevertheless coherent at the time. To be translated into relevant sources of information about our past, these document series need to be redocumented using contemporary paradigms.

Index-driven digitization

The promise of digitizing historical archives lies in indexing their contents. Unfortunately, this kind of indexing does not scale if done manually. Index-driven digitization is a method to bootstrap the deployment of a content-based information system for digitized historical archives, relying on historical indexing tools.

Digitizing books without opening them

It is possible to develop new, faster, and safer ways to digitize manuscripts, without opening them, using X-ray tomography.

Handwritten Text Recognition

Automatic transcription of handwritten texts has made important progress in recent years. This increase in performance, essentially due to new architectures combining convolutional neural networks with recurrent neural networks, opens new avenues for searching in large databases of archival and library records. Our recent work focuses on making millions of digitized Venetian documents searchable, starting with a first subset of 18th-century fiscal documents from the Venetian State Archives. On average, the machine outperforms amateur transcribers on this transcription task.

Visual Pattern Discovery

The digitization of large databases of photographs of works of art opens new avenues for research in art history. For instance, collecting and analyzing painting representations beyond the relatively small number of commonly accessible works was previously extremely challenging. In the coming years, researchers are likely to have easier access not only to representations of paintings from museum archives but also from private collections, fine arts auction houses, and art historians. However, access to large online databases is in itself not sufficient. There is a need for efficient search engines, capable of searching painting representations not only on the basis of textual metadata but also directly through visual queries.
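At its core, a visual query reduces to nearest-neighbour search over feature vectors, assuming each photograph has already been mapped to such a vector (e.g. by a convolutional network). A minimal sketch with invented collection names and toy three-dimensional features:

```python
# Visual similarity search sketch: rank a collection of painting
# photographs by cosine similarity between feature vectors.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def visual_search(query_vec, collection, top_k=3):
    """Return the names of the top_k most visually similar items."""
    ranked = sorted(collection.items(),
                    key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]

# Toy collection: names and feature vectors are illustrative only.
collection = {
    "portrait_a": [0.9, 0.1, 0.0],
    "landscape_b": [0.1, 0.9, 0.2],
    "portrait_c": [0.8, 0.2, 0.1],
}
print(visual_search([1.0, 0.0, 0.0], collection, top_k=2))
# prints ['portrait_a', 'portrait_c']
```

In practice the feature vectors have hundreds of dimensions and the search is served by an approximate nearest-neighbour index rather than an exhaustive sort.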

Cadastral Computing

The cadastres established during the first years of the 19th century cover a large part of Europe. For many cities they provide one of the first geometrical surveys, linking precise parcels with identification numbers. These identification numbers point to registers recording the names of the proprietors. As the Napoleonic cadastres include millions of parcels, they offer a detailed snapshot of a large part of Europe's population at the beginning of the 19th century. As many kinds of computation can be done on such a large object, we use the neologism "cadastral computing" to refer to the operations performed on such datasets. This approach is the first fully automatic pipeline to transform the Napoleonic cadastres into an information system.
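The central join in such a pipeline links each parcel's identification number, extracted from the geometric survey, with the register entry naming its proprietor. A minimal sketch, with invented sample data standing in for transcribed survey and register content:

```python
# Cadastral linking sketch: join parcel geometries with the register
# of proprietors via the shared parcel identification number.
# All names and values below are illustrative placeholders.

parcels = {  # parcel id -> attributes extracted from the survey
    "1024": {"area_m2": 210.0},
    "1025": {"area_m2": 95.5},
}
register = {  # parcel id -> transcribed proprietor name
    "1024": "Giovanni Contarini",
    "1025": "Maria Morosini",
}

def link_parcels(parcels, register):
    """Merge geometric and register data keyed by parcel id."""
    linked = {}
    for pid, attributes in parcels.items():
        linked[pid] = {**attributes, "proprietor": register.get(pid)}
    return linked

print(link_parcels(parcels, register)["1024"]["proprietor"])
# prints "Giovanni Contarini"
```

The real difficulty lies upstream, in reading identification numbers and names automatically from the digitized maps and registers; once both sides are structured, the join itself is straightforward.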

Generic Document Segmentation

In recent years there have been multiple successful attempts to tackle document processing problems separately, by designing task-specific hand-tuned strategies. The diversity of historical document processing tasks makes solving them one at a time impractical and calls for generic approaches able to handle the variability of historical series. Generic document segmentation addresses multiple tasks simultaneously, such as page extraction, baseline extraction, layout analysis, and the extraction of multiple typologies of illustrations and photographs.


Automatic reference extraction and parsing

The advent of large-scale citation indexes has greatly impacted the retrieval of scientific information in several domains of research. The humanities have largely remained outside of this shift, despite their increasing reliance on digital means for information seeking, and despite publications in the humanities having a longer-than-average life-span, mainly due to the importance of monographs for the field. Automatic reference extraction and parsing methods make it possible to select a corpus of reference monographs and to extract the network of publications they refer to.
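Parsing splits a raw reference string into structured fields such as author, title, and year. A minimal sketch using a single fixed regular expression; production pipelines typically rely on trained sequence labellers instead, so the pattern below is only an illustration of the extraction target:

```python
# Reference parsing sketch: split one reference string into fields.
# The pattern assumes a rigid "Author, Title, Place, Year." layout,
# which real-world references rarely follow this cleanly.
import re

PATTERN = re.compile(
    r"^(?P<author>[^,]+), (?P<title>.+?), (?P<place>[^,]+), (?P<year>\d{4})\.$"
)

def parse_reference(ref):
    """Return a dict of named fields, or None if the pattern fails."""
    match = PATTERN.match(ref)
    return match.groupdict() if match else None

ref = "Lane F. C., Venice: A Maritime Republic, Baltimore, 1973."
print(parse_reference(ref)["year"])  # prints "1973"
```

Extraction (locating reference strings in footnotes and bibliographies) precedes parsing; linking the parsed fields to a catalogue record then yields the citation network.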

Record Linkage with Sparse Historical Data

Massive digitization of archival material, coupled with automatic document processing techniques and data visualisation tools, offers great opportunities for reconstructing and exploring the past. An unprecedented wealth of historical data (e.g. names of persons, places, transaction records) can indeed be gathered through the transcription and annotation of digitized documents and thereby foster large-scale studies of past societies. Yet the transformation of handwritten documents into well-represented, structured, and connected data is not straightforward and requires several processing steps. In this regard, a key issue is entity record linkage, a process aiming at linking different mentions in texts which refer to the same entity. Also known as entity disambiguation, record linkage is essential in that it makes it possible to identify genuine individuals, to aggregate multi-source information about single entities, and to reconstruct networks across documents and document series.
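One simple ingredient of record linkage with sparse, noisily transcribed data is spelling-tolerant name matching: two mentions are candidate links when their normalised edit distance falls below a threshold. A minimal sketch; the threshold and the sample Venetian names are illustrative assumptions, and real systems combine many more signals (dates, places, relations):

```python
# Record linkage sketch: link name mentions by normalised edit distance.

def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def same_entity(name_a, name_b, threshold=0.25):
    """Link two mentions when their spellings are close enough."""
    distance = edit_distance(name_a.lower(), name_b.lower())
    return distance / max(len(name_a), len(name_b)) <= threshold

print(same_entity("Zuane Contarini", "Zuanne Contarini"))  # prints True
print(same_entity("Zuane Contarini", "Marco Morosini"))    # prints False
```

String similarity alone over-links common names, which is why contextual evidence and network structure matter so much in the historical setting.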

Metaknowledge encoding

Historical knowledge is fundamentally uncertain. A given account of a historical event is typically based on a series of sources and on sequences of interpretation and reasoning based on these sources. Generally, the product of this historical research takes the form of a synthesis, like a narrative or a map, but does not give a precise account of the intellectual process that led to this result. Our work on metaknowledge consists of developing a methodology, based on semantic web technologies, to encode historical knowledge while documenting, in detail, the intellectual sequences linking the historical sources with a given encoding, also known as paradata. More generally, the aim of this methodology is to build systems capable of representing multiple historical realities and of documenting the underlying processes in the construction of possible knowledge spaces.
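The core idea is that every encoded assertion carries its own paradata. In the semantic web setting this is done with mechanisms such as RDF reification or named graphs; the following plain-Python sketch only illustrates the pairing of statements with their provenance, with an invented archival source and confidence value:

```python
# Metaknowledge sketch: each statement is stored together with the
# paradata (source, method, confidence) that produced it, so queries
# can filter knowledge by how it was established.
from collections import namedtuple

Statement = namedtuple("Statement", "subject predicate obj")
Paradata = namedtuple("Paradata", "source method confidence")

# Sample content below is invented for illustration.
knowledge = [
    (Statement("Palazzo_X", "owned_by", "Contarini_family"),
     Paradata(source="Catastici 1740, f. 12r",
              method="manual transcription + name disambiguation",
              confidence=0.8)),
    (Statement("Palazzo_X", "owned_by", "Morosini_family"),
     Paradata(source="oral tradition",
              method="secondary literature",
              confidence=0.3)),
]

# Filter: keep only well-sourced assertions, making the competing
# "historical realities" explicit rather than silently merged.
well_sourced = [s for s, p in knowledge if p.confidence >= 0.7]
print(well_sourced[0].obj)  # prints "Contarini_family"
```

Keeping both conflicting statements, each with its own paradata, is precisely what allows the system to represent multiple historical realities instead of committing to a single synthesis.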