- Jonathan Reeve
(Adapted from the original blog post on jonreeve.com.)
Git-Lit is an initiative to parse, version control, and post each of the approximately 50,000 works in the British Library’s corpus of digital texts. Parsing the texts will transform the machine-readable metadata into human-readable prefatory material; version controlling the texts will allow for collaborative editing and revision of the texts, effectively crowdsourcing the correction of OCR errors; and posting the texts to GitHub will ensure the texts’ visibility to the greater community.
Git-Lit addresses these issues:
- Electronic texts are difficult to edit. There is not yet an efficient, streamlined way to improve the quality of electronic texts. What is needed is an open-source, decentralized model for community-centered editing. This model already exists for software development in the form of Git. By posting a text to GitHub, we can take advantage of the fork/revise/pull-request workflow that programmers have long enjoyed for software collaboration.
- Textual corpora are difficult to assemble. With some exceptions (notably the NLTK corpus module), downloading a text corpus involves compiling texts from many heterogeneous sources. Git provides an easy way to solve this problem. By making texts available through the git protocol on GitHub, anyone who wishes to download a text corpus can simply run `git clone` followed by the repository URL. Parent repositories can then be assembled for collections of texts using git submodules. A parent corpus repository might be created for nineteenth-century Bildungsromane, for instance, and that repository would contain pointers to individual texts that are themselves repositories.
- ALTO XML is not very human-readable. ALTO XML, the OCR output format used by the British Library, the Library of Congress, and others, is verbose. It encodes the location of each OCRed word, and often lists the degree of OCR certainty for each word. This is useful for archival purposes, but isn’t an ideal starting point for the kinds of text analysis typically done in the digital humanities. What is needed is a script to transform texts in this format into a human-readable format like AsciiDoc that maintains as many of the original features of the text as possible.
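To make the third point concrete, here is a minimal sketch, in Python, of pulling plain text out of an ALTO file. It is not Git-Lit's own code. It assumes the usual ALTO layout in which each `TextLine` element holds `String` elements whose `CONTENT` attribute carries the word, and it matches elements by local name because the namespace differs between ALTO versions. The function names and the sample document are invented for illustration.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def alto_to_text(alto_xml):
    """Return the words of an ALTO document, one output line per TextLine."""
    root = ET.fromstring(alto_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) == "TextLine":
            # Keep only the recognized words; positions (HPOS, VPOS) and
            # word confidence (WC) are discarded in this sketch.
            words = [child.get("CONTENT", "")
                     for child in element
                     if local_name(child.tag) == "String"]
            lines.append(" ".join(words))
    return "\n".join(lines)


# A tiny invented sample, not taken from the British Library corpus.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Call" WC="0.98" HPOS="10" VPOS="20"/>
      <SP/>
      <String CONTENT="me" WC="0.91" HPOS="40" VPOS="20"/>
    </TextLine>
    <TextLine>
      <String CONTENT="Ishmael." WC="0.75" HPOS="10" VPOS="40"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

if __name__ == "__main__":
    print(alto_to_text(SAMPLE))
```

A fuller converter would also need to rejoin words hyphenated across lines and decide what to do with low-confidence words; this sketch ignores both.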
A British Library text contains ALTO XML textual data as well as a Library of Congress METS XML metadata file. Git-Lit does the following:
- Reads the metadata file to determine the text’s title, author (if any), and other pertinent information.
- Initializes an empty git repository within the text directory, and makes an initial commit containing the text in its raw state.
- Generates a README file with the metadata, a CONTRIBUTING file explaining how to contribute towards improving the text, and a LICENSE file containing the GNU General Public License.
- Commits these new files to the git repository, effectively creating a new version.
- Creates and pushes a new GitHub repository for the text.
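The first four steps can be sketched in Python roughly as follows. This is an illustration under stated assumptions rather than the project's actual implementation: it assumes the METS file embeds MODS-style `title` and `namePart` elements, it writes only the README (the CONTRIBUTING and LICENSE files would be handled the same way), and every function name here is hypothetical. Creating the GitHub repository requires an authenticated API call and is left out.

```python
import subprocess
import xml.etree.ElementTree as ET
from pathlib import Path


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def read_metadata(mets_xml):
    """Step 1: pull a title and an optional author out of the metadata.

    Assumes MODS-style <title> and <namePart> elements; the first
    occurrence of each is used.
    """
    root = ET.fromstring(mets_xml)
    found = {}
    for element in root.iter():
        name = local_name(element.tag)
        if name in ("title", "namePart") and name not in found:
            found[name] = (element.text or "").strip()
    return {"title": found.get("title", "Untitled"),
            "author": found.get("namePart")}


def render_readme(meta):
    """Step 3: turn the metadata into human-readable prefatory material."""
    lines = [meta["title"], "=" * len(meta["title"]), ""]
    if meta["author"]:
        lines.append("Author: " + meta["author"])
    return "\n".join(lines) + "\n"


def git(text_dir, *args):
    """Run one git command inside text_dir, raising if it fails."""
    subprocess.run(["git", *args], cwd=text_dir, check=True)


def version_text(text_dir, mets_xml):
    """Steps 2 to 4: commit the raw text, then commit the generated README."""
    text_dir = Path(text_dir)
    git(text_dir, "init")
    git(text_dir, "add", "--all")
    git(text_dir, "commit", "-m", "Initial commit of raw text")
    readme = render_readme(read_metadata(mets_xml))
    (text_dir / "README").write_text(readme, encoding="utf-8")
    git(text_dir, "add", "README")
    git(text_dir, "commit", "-m", "Add generated README")
```

Calling `version_text` requires git to be installed and a committer name and email to be configured. Keeping the raw text and the generated files in separate commits means the history always preserves the text exactly as the library supplied it.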
Git-Lit has just been used to parse these four sample texts, generating the four GitHub repositories that can be found at the Git-Lit organization site. You can read, fork, modify, or comment on the code at the project repository at GitHub.
As this project develops, we’ll create indices for the texts in the form of submodule pointers. Category-based parent repositories might include “17th-Century Novels,” “18th-Century Correspondence,” or simply “Poetry,” and the categories need not be mutually exclusive. This will allow a literary scholar interested in a particular category to instantly assemble a corpus by cloning the parent repository with `git clone` and checking out its submodules with `git submodule update --init --recursive`.
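The two sides of that arrangement, building a parent repository and fetching it, come down to a handful of git commands. The sketch below, in Python, only assembles and runs those commands; the function names are hypothetical, and any repository URL passed in is the caller's own, since no parent repositories exist yet.

```python
import subprocess


def index_commands(parent_dir, text_urls):
    """Commands a maintainer would run to add each text as a submodule."""
    return [["git", "-C", parent_dir, "submodule", "add", url]
            for url in text_urls]


def corpus_commands(parent_url, dest):
    """Commands a reader would run to fetch a parent repository and its texts."""
    return [
        ["git", "clone", parent_url, dest],
        ["git", "-C", dest, "submodule", "update", "--init", "--recursive"],
    ]


def run_all(commands):
    """Run the commands in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Because a submodule is only a pointer to a commit in another repository, the same text can appear in several parent repositories without being copied, which is what lets the categories overlap.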
Later, we’ll create scripts to transform the texts into more useful formats, like AsciiDoc and TEI XML. This will produce archival-quality versions of the texts and will allow for rich scholarly markup.
How to Contribute
To contribute, contact the Git-Lit organization on GitHub, or find an issue you can tackle on the project issue tracker. Feel free to add your own features, restructure the code, or make any other improvements.