Prof. Antoine Doucet (Université de la Rochelle)
Thursday 11 October 2018
Talk organised by the Digital Humanities Institute and the Impresso Project
In the age of open and big data, the task of automatically analysing numerous media in various formats and multiple languages is becoming all the more critical. The ability to quickly and efficiently analyse massive amounts of documents, both digitised and born-digital, is crucial. With a history spanning several centuries and a current output of hundreds of thousands of articles published every day, newspapers represent a heterogeneous resource of great importance.
This talk will present an approach that can detect events in news using very limited external resources, notably without requiring any form of linguistic analysis. By relying on the conventions of the journalistic genre rather than on linguistic analysis, it can process text written in any language, and it is robust to noise (e.g., stemming from imperfect OCR). Applied, for instance, to epidemic event detection, it can find which epidemic diseases are active and where, in any language and in real time. Evaluated over 40 languages, the DaNIEL system finds epidemic events faster, on average, than human experts. In this presentation, we will also explain how this work is being extended to further domains, and in particular to the specific case of historical newspapers.
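To make the idea concrete, here is a minimal, hypothetical sketch of language-independent event detection of the kind the abstract describes: instead of tokenising or parsing, it scans only the salient zone of an article (title and opening lines, where journalistic convention concentrates the key facts) for character-level matches against a small disease/location lexicon. All names, the lexicon, and the `head_chars` cutoff are illustrative assumptions, not the actual DaNIEL implementation.

```python
# Hedged sketch: candidate event detection with no linguistic analysis.
# Character-level substring matching avoids tokenisers and stemmers, so it
# works for any language and degrades gracefully on noisy (e.g. OCR'd) text.

def detect_events(article: str, lexicon: dict[str, str], head_chars: int = 300):
    """Return (surface string, label) pairs from `lexicon` that occur in the
    article's salient zone. `lexicon` maps surface strings to labels; all
    names here are illustrative, not the DaNIEL API."""
    zone = article[:head_chars].lower()       # title + lead paragraph only
    hits = []
    for surface, label in lexicon.items():
        if surface.lower() in zone:           # pure substring match, no tokenisation
            hits.append((surface, label))
    return hits

# Toy lexicon and article, purely for illustration.
lexicon = {"cholera": "disease", "influenza": "disease", "Geneva": "location"}
article = "Cholera outbreak reported in Geneva. Authorities confirm dozens of cases."
print(detect_events(article, lexicon))
```

A real system would of course need a curated multilingual lexicon and a way to rank or filter matches, but the sketch shows why restricting attention to the genre-defined salient zone removes the need for per-language linguistic resources.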