Working with In-Copyright Materials for Digital Humanities Research: Legal, Ethical, and Practical Issues
McLaughlin, Stephen Reid
To date, a significant share of digital humanities research has focused on works in the public domain, virtually all of them published before 1923. Greater access to recent publications would be a boon to the field, and legal access to a large corpus of in-copyright works may in fact be close at hand. A 2014 ruling by the Second Circuit Court of Appeals found that book digitization for the purpose of full-text search falls under fair use protection (Parker, 2014), leading the HathiTrust Research Center (HTRC) to announce that it will make its corpus of in-copyright works available for remote analysis on its own servers (“2014 Mid-Year Review,” 2014). It is so far unclear, however, when this project will go live and to whom access will be granted. In the meantime, several alternatives are available to DH researchers. An unambiguously legal method for working with protected material is the manual digitization of physical books, by either typing or scanning. This practice clearly falls under fair use, provided the resulting copies are not distributed publicly. However, the time and effort required to produce an acceptably clean version of a text make this approach impractical for all but the smallest-scale projects. Stepping into a legal gray area, one can use free tools such as Calibre to strip digital rights management (DRM) protection from ebooks purchased through online stores such as Amazon. The Digital Millennium Copyright Act (1998) prohibits DRM removal, but a recent ruling by a federal judge in New York suggests that the practice may in fact be legally acceptable for personal use (Cote, 2014). In any case, when carried out for the purpose of research, removal of DRM appears clearly justified on ethical grounds. In practical terms, commercially formatted ebooks are ideal for use in digital humanities research. An EPUB file is simply a compressed directory containing XHTML-formatted text and XML-encoded metadata (“EPUB 3 Overview,” 2014).
Unlike in a plain text document, each chapter of an EPUB is clearly delimited, as are a book’s frontmatter and backmatter. And unlike with PDFs, there is no need to correct gaps introduced by page breaks. With a bit of up-front work, then, many if not most recently published books can be quickly reformatted for textual analysis, provided one is willing to purchase a copy. Stretching the limits of propriety, ebook piracy is a convenient (if ethically questionable) alternative available to contemporary DH scholars. Over the past half decade, websites offering illicit copies of ebooks have grown significantly in scope and comprehensiveness. Library Genesis (http://gen.lib.rus.ec) is an ad-free site based in Russia offering nearly two million ebooks and thirty-six million academic articles. AAAAARG (http://aaaaarg.org) hosts a smaller collection centered on critical theory, art history, and philosophy. Finally, Ebook.farm (http://ebook.farm) is a very large private site which, unlike the others listed here, charges its users a small fee for each download.
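
To illustrate the point above about EPUB structure (a compressed directory of XHTML text and XML metadata, with chapters clearly delimited), the following is a minimal Python sketch of how chapter text might be pulled out of a DRM-free EPUB for analysis. It is not taken from the work itself. The names `epub_chapters` and `_TextExtractor` are hypothetical, the sketch assumes UTF-8 content files and a standard `META-INF/container.xml`, and it only loosely approximates "chapters" as the documents listed in the package spine.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import unquote

# XML namespaces used by the EPUB container and package (OPF) files.
NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
}


class _TextExtractor(HTMLParser):
    """Hypothetical helper: collect body text from one XHTML document."""

    SKIP = {"head", "script", "style"}
    BLOCK = {"p", "div", "br", "li", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK and not self._skip_depth:
            # Keep block-level elements from running together.
            self.parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def epub_chapters(path):
    """Return (href, text) pairs in reading (spine) order.

    Assumes a DRM-free EPUB; encrypted content files will not decode.
    """
    with zipfile.ZipFile(path) as zf:
        # container.xml points to the package (OPF) file.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find(".//c:rootfile", NS).get("full-path")
        opf_dir = posixpath.dirname(opf_path)
        package = ET.fromstring(zf.read(opf_path))

        # The manifest maps item ids to files; the spine gives reading order.
        manifest = {
            item.get("id"): item.get("href")
            for item in package.findall("opf:manifest/opf:item", NS)
        }
        chapters = []
        for ref in package.findall("opf:spine/opf:itemref", NS):
            href = manifest[ref.get("idref")]
            member = posixpath.normpath(posixpath.join(opf_dir, unquote(href)))
            parser = _TextExtractor()
            # Assumption: content documents are UTF-8 encoded.
            parser.feed(zf.read(member).decode("utf-8"))
            text = " ".join("".join(parser.parts).split())
            chapters.append((href, text))
        return chapters
```

Calling `epub_chapters("book.epub")` (with `book.epub` standing in for any DRM-free file) returns one entry per spine document, so frontmatter and backmatter can be filtered out by filename before analysis. Real-world ebooks vary, and a production workflow would likely need more careful handling of encodings, nested documents, and navigation files.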