Tuesday 19 February 2019 – 16h00 – Room BC420
Abstract of Talk
Mass digitization has provided a mountain of source material for the humanities and social sciences, but its structure is unevenly mapped. Dependencies among documents arise when copying manuscripts, citing scholarly literature, speaking from talking points, reposting social networking content, popularizing scientific papers, or otherwise transforming earlier sources. While some dependencies are observable—e.g., by citations or links—we often need to infer them from textual evidence. In our Viral Texts and Oceanic Exchanges projects, we have built models to trace information flow within and across languages in poorly OCR’d newspapers. Other projects in our group infer and exploit such dependencies to model the writing of legislation, the impact of scientific press releases, and changes in the syntax of language.
I discuss methods for inferring these dependency structures and for exploiting them to improve other tasks. First, I describe a directed spanning tree model of information cascades and a new unsupervised contrastive training procedure that outperforms previous approaches to network inference. I then describe extracting parallel passages from non-parallel multilingual corpora by efficient search in the continuous document-topic simplex of a polylingual topic model; translation systems trained on these extracted passages achieve greater accuracy than systems trained on smaller clean datasets. Finally, I describe methods for detecting multiple transcriptions of the same passage in a large corpus of noisy OCR and for exploiting these multiple witnesses to correct the noisy text. These multi-input attention models provide efficient approximations to intractable multi-sequence alignment for collation and enable 75% reductions in error with unsupervised models.
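To make the cascade-inference setting concrete, here is a minimal illustrative sketch, not the talk's actual model: given pairwise scores estimating how likely each earlier document was the source of each later one, we recover a spanning tree of the cascade. Because documents are time-ordered and edges can only point from earlier to later, choosing the highest-scoring earlier parent for each document yields a maximum-weight spanning arborescence without any cycle handling. The `scores` matrix and function name here are hypothetical.

```python
def infer_cascade(scores):
    """Infer a cascade tree over time-ordered documents.

    scores[i][j] is the (assumed) evidence that earlier document i
    was the source of later document j. Document 0 is the root.
    Returns a dict mapping each document j > 0 to its inferred parent.
    """
    n = len(scores)
    parents = {}
    for j in range(1, n):
        # Candidate parents are strictly earlier documents, so the
        # result is guaranteed to be acyclic (a spanning arborescence).
        parents[j] = max(range(j), key=lambda i: scores[i][j])
    return parents

# Toy example: four documents, scores from some text-reuse model.
scores = [
    [0.0, 0.9, 0.2, 0.1],
    [0.0, 0.0, 0.7, 0.3],
    [0.0, 0.0, 0.0, 0.6],
    [0.0, 0.0, 0.0, 0.0],
]
print(infer_cascade(scores))  # {1: 0, 2: 1, 3: 2}: a chain 0 -> 1 -> 2 -> 3
```

With unconstrained edge directions, maximizing over spanning trees requires a proper maximum-arborescence algorithm (e.g., Chu-Liu/Edmonds); the time-ordering assumption is what makes this greedy version exact.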
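The correction-from-multiple-witnesses idea can likewise be sketched in its simplest form. This is an assumption-laden toy, not the multi-input attention model from the talk: it presumes the witnesses have already been aligned to equal length (the hard part that full collation, and the talk's attention-based approximation, must solve) and takes a character-level majority vote at each position.

```python
from collections import Counter

def consensus(witnesses):
    """Majority-vote consensus over pre-aligned, equal-length witnesses.

    Real collation must handle insertions and deletions; here we
    assume alignment is given and vote character by character.
    """
    return "".join(
        Counter(chars).most_common(1)[0][0]
        for chars in zip(*witnesses)
    )

# Three noisy OCR transcriptions of the same passage, pre-aligned.
print(consensus(["tne quick", "the qu1ck", "the quick"]))  # "the quick"
```

Each witness's OCR errors tend to fall in different places, so even this naive vote recovers the correct text when enough independent reprintings survive.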
David Smith is an associate professor in the College of Computer and Information Science at Northeastern University in Boston. He is also a founding member of the NULab for Texts, Maps, and Networks, Northeastern’s center for digital humanities and computational social sciences. His work on natural language processing focuses on applications to information retrieval, the social sciences, and the humanities; on inferring network structures; and on computational linguistic models of structure learning and historical change.