Group for Experimental Methods in Humanistic Research
at Columbia University

Middle English Manuscript OCR

  • Gianmarco Saretto
  • Jenna Alexis Schoen
updates ↓

09/30/19 Article accepted to Digital Philology pending revisions.
09/01/18 We compiled our training data (transcriptions from MS 198) and trained the OCR system Kraken on it. We then compiled ground-truth test data and ran OCR on other pages from MS 198 and on four more manuscripts. The model reached an accuracy of 90% on the training data, 85% on MS 198, and between 20% and 80% on the four other manuscripts (results varied greatly depending on the script, the layout, and the scribe).
03/11/18 The project received summer grant funding ($9,000) with generous support from the Data Science Institute Scholars Program and the Data, Media, & Society Center.
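
The accuracy figures in the 09/01/18 update compare OCR output against ground-truth transcriptions. The update does not say which metric was used, so the following is only a minimal sketch of one common choice, character accuracy (one minus the character error rate), computed from the edit distance between the two strings. The function names `levenshtein` and `character_accuracy` and the sample strings are hypothetical and are not taken from the project or from Kraken.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn `a` into `b`."""
    # Keep the shorter string in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def character_accuracy(ground_truth: str, ocr_output: str) -> float:
    """Assumed metric: 1 - (edit distance / ground-truth length), floored at 0."""
    if not ground_truth:
        return 1.0 if not ocr_output else 0.0
    errors = levenshtein(ground_truth, ocr_output)
    return max(0.0, 1.0 - errors / len(ground_truth))


if __name__ == "__main__":
    # Illustrative strings only, not data from the project.
    truth = "Whan that Aprille"
    recognized = "Whan that Aprile"
    print(f"{character_accuracy(truth, recognized):.1%}")
```

Under this assumed definition, a score of 85% would mean roughly 15 character-level errors per 100 characters of ground truth.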

In collaboration with an international team of scholars at the Open Islamicate Texts Initiative (OpenITI), our project aims to train an OCR system on a corpus of medieval manuscripts. Kraken, the engine developed by OpenITI, recognizes text line by line rather than character by character, which makes it especially well suited to the density and occasional obscurity of Middle English handwriting.

We will train Kraken on a select set of manuscripts attributed to the same scribe (“Scribe D”), including Corpus Christi MS 198, Plimpton MS 265, and London Library MS V. 88. If this succeeds, we would then train the system on a larger set of manuscripts. Such a tool would have an immense impact on medieval studies: scholars could more easily compare manuscripts across a single textual tradition or create digital editions of lesser-known texts. Above all, it would allow them to work on the massive number of untranscribed texts that currently seem “lost” to scholarship.