Automatic Transcription in Colonial Contexts: OCR for the Primeros Libros

Alpert-Abrams, Hannah; Garrette, Dan

dc.contributor.author	Alpert-Abrams, Hannah
dc.contributor.author	Garrette, Dan
dc.date.accessioned	2016-05-02T21:39:16Z
dc.date.available	2016-05-02T21:39:16Z
dc.date.issued	2015-04-10
dc.identifier.uri	http://hdl.handle.net/10106/25662
dc.description	Poster Presentation	en_US
dc.description.abstract	The PDF images in the Primeros Libros digital collection, an effort to produce digital facsimiles of all books printed before 1601 in the Americas, pose several challenges for Optical Character Recognition (OCR) systems. The Ocular system, designed by Taylor Berg-Kirkpatrick et al., jointly models the physical operation of hand-press printing and the language of the written document, allowing it to ‘learn’ to read early printed books. Ocular cannot, however, handle the orthographic variation and code-switching prevalent in the American context. Working with PDF images of trilingual texts in Spanish, Latin, and Nahuatl, we set out to modify Ocular for use on the Primeros Libros collection. In this paper, we present our OCR tool for the Primeros Libros collection, an extension of Ocular which can handle multilingual documents, and which includes an interface for the incorporation of orthographic idiosyncrasies. At the same time, we argue for a situated analysis of digitization tools which considers Ocular's statistical models within the context of the Primeros Libros collection. As Walter Mignolo has shown, books from early colonial Mexico embody a larger project of language codification which was deeply embedded in the colonization and religious conversion of New Spain. The mathematical simplicity of Ocular's statistical models suggests a neutral engagement with the text that disguises a deep engagement with these colonial processes. Automatic transcription in this context becomes a process with significant implications for the ideological positioning of digitization projects.	en_US
dc.language.iso	en_US	en_US
dc.subject	OCR	en_US
dc.subject	Digital Facsimiles	en_US
dc.subject	Ocular	en_US
dc.subject	Digital Collections	en_US
dc.subject	Scanning	en_US
dc.title	Automatic Transcription in Colonial Contexts: OCR for the Primeros Libros	en_US
dc.type	Presentation	en_US

Files in this item

Name: Abrams-Garrette.jpg
Size: 388.3Kb
Format: JPEG image
Description: JPG

Name: Abrams-Garrette.pdf
Size: 2.097Mb
Format: PDF
Description: PDF

This item appears in the following Collection(s)

TXDHC 2015 Presenter Abstracts