Macro-Etymological Textual Analysis
- Jonathan Reeve
Word histories are strongly correlated with the tone and genre of a text. When a writer chooses the word “enchantment” instead of “spell,” or “inquire” instead of “ask,” this decision may indicate, to some degree, the speaker’s mode, dialect, or level of formality. These modes can then be measured by quantifying the aggregated etymology of an entire text. The Macro-Etymological Analyzer is a command-line utility, written in Python, that quantifies the etymologies of a text using the Etymological Wordnet.
The figure above shows a macro-etymological analysis of the generic categories of the Brown Corpus, one of the most-studied digital linguistic corpora. Fictionality is shown here to be strongly negatively correlated with the proportion of Latinate words in a text: government and learned documents show the highest proportions, and the generic fiction categories of adventure and romance writing show the lowest. Categories whose fictionality is debatable (lore and religion) fall in the center of this progression. This correlation might therefore be used not only as another possible metric in the computational categorization of fiction and nonfiction, but also as a way to identify segments of fiction that share the most tonal properties with nonfiction, and vice versa.
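The quantification described above can be illustrated with a minimal sketch. This is not the Macro-Etymological Analyzer's own code, and it does not read the Etymological Wordnet: the `TOY_ETYMOLOGIES` table and the `macro_etymology` function are hypothetical names invented here, and the four-word lookup stands in for a real etymological database. It only shows the basic idea of counting each word's language family and reporting the proportions.

```python
import re
from collections import Counter

# Hypothetical toy lookup, standing in for a real etymological database.
# It covers only the example words mentioned in the text.
TOY_ETYMOLOGIES = {
    "enchantment": "Latinate",
    "inquire": "Latinate",
    "spell": "Germanic",
    "ask": "Germanic",
}


def macro_etymology(text, etymologies=TOY_ETYMOLOGIES):
    """Return the percentage of recognized words in each language family.

    Words missing from the lookup are ignored, so the percentages are
    relative to the recognized words only.
    """
    # Lowercase the text and keep runs of ASCII letters as words.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(etymologies[w] for w in words if w in etymologies)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {family: 100 * n / total for family, n in counts.items()}


if __name__ == "__main__":
    print(macro_etymology("Ask about the spell, then inquire."))
```

Comparing the resulting percentages across texts or categories, as in the Brown Corpus analysis, would require a full etymological lookup in place of the toy table.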
See also:
- Reeve, Jonathan. “A Macro-Etymological Analysis of James Joyce’s A Portrait of the Artist as a Young Man.” Reading Modernism with Machines. New York: Palgrave Macmillan, 2016. DOI: 10.1057/978-1-137-59569-0_9
- A Macro-Etymological Analysis of Milton’s Paradise Lost
- A Comparative Macro-Etymology of Whitman Editions