Group for Experimental Methods in Humanistic Research
at Columbia University

Macro-Etymological Textual Analysis

research paper
  • Jonathan Reeve
updates ↓

12/01/16 “A Macro-Etymological Analysis of James Joyce’s A Portrait of the Artist as a Young Man” published in Reading Modernism with Machines, Palgrave Macmillan, 2016

Word histories correlate strongly with the tone and genre of a text. When a writer chooses the word “enchantment” instead of “spell,” or “inquire” instead of “ask,” this decision may indicate, to some degree, the speaker’s mode, dialect, or level of formality. These modes may then be measured by quantifying the aggregated etymology of an entire text. The Macro-Etymological Analyzer is a command-line utility, written in Python, that quantifies the etymologies of a text using the Etymological Wordnet.
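The idea of aggregating etymologies can be illustrated with a minimal sketch. This is not the Macro-Etymological Analyzer's own code: the function name `macro_etymology` and the four-word lookup table are assumptions made for illustration, standing in for a real query against the Etymological Wordnet.

```python
import re
from collections import Counter

# Hypothetical toy lookup: word -> language family of origin.
# Illustrative only; the actual tool draws on the Etymological Wordnet.
TOY_ETYMOLOGIES = {
    "enchantment": "Latinate",
    "inquire": "Latinate",
    "spell": "Germanic",
    "ask": "Germanic",
}


def macro_etymology(text, lookup=TOY_ETYMOLOGIES):
    """Return the percentage of looked-up words that fall in each family.

    Words missing from the lookup are ignored, so the percentages describe
    only the words whose origin is known.
    """
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(lookup[w] for w in words if w in lookup)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {family: 100.0 * n / total for family, n in counts.items()}


formal = macro_etymology("They inquire about the enchantment.")
plain = macro_etymology("They ask about the spell.")
print(formal)
print(plain)
```

In this sketch the two sentences differ only in the word choices discussed above, so the first comes out entirely Latinate and the second entirely Germanic. A real text would yield a mixed profile across many more language families.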

Macro-Etymology of the Brown Corpus

The figure above shows a macro-etymological analysis of the genre categories of the Brown Corpus, one of the most-studied digital linguistic corpora. Fictionality shows a strong negative correlation with the proportion of Latinate words in a text: government and learned documents show the highest proportions, and the fiction categories of adventure and romance show the lowest. Categories of debatable fictionality (lore and religion) fall in the center of this progression. This correlation might therefore be used not only as another possible metric in the computational categorization of fiction and nonfiction, but also as a way to identify segments of fiction that share the most tonal properties with nonfiction, and vice versa.
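A progression of this kind amounts to ordering text samples by their share of Latinate words. The sketch below shows that ordering step only. The lookup table, the function names, and the three sample sentences are all hypothetical; the sentences are made up for illustration and are not drawn from the Brown Corpus.

```python
import re

# Hypothetical toy lookup, illustrative only (not the Etymological Wordnet).
TOY_ETYMOLOGIES = {
    "enchantment": "Latinate",
    "inquire": "Latinate",
    "spell": "Germanic",
    "ask": "Germanic",
}


def latinate_share(text, lookup=TOY_ETYMOLOGIES):
    """Fraction of looked-up words in `text` that are Latinate (0.0 if none are known)."""
    words = re.findall(r"[a-z]+", text.lower())
    known = [lookup[w] for w in words if w in lookup]
    if not known:
        return 0.0
    return known.count("Latinate") / len(known)


def rank_by_latinate(samples, lookup=TOY_ETYMOLOGIES):
    """Return sample names ordered from highest to lowest Latinate share."""
    return sorted(
        samples,
        key=lambda name: latinate_share(samples[name], lookup),
        reverse=True,
    )


# Made-up sample texts; stand-ins for whole corpus categories.
samples = {
    "sample_a": "We inquire and ask.",
    "sample_b": "We ask about the spell.",
    "sample_c": "We inquire about the enchantment.",
}
ranking = rank_by_latinate(samples)
print(ranking)
```

With real corpus categories in place of the toy samples, the same ordering could be compared against each category's degree of fictionality.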

See also: