Group for Experimental Methods in Humanistic Research
at Columbia University

Text Divider: Quick Markup for Chapter and Dialogue Splitting

Python script
  • Moacir P. de Sá Pereira

This Python script breaks a text up into its internal sections. It uses a light markup scheme to signal where chapters and sections begin, and it can also keep track of dialogue by speaker. Given an electronic version of The Great Gatsby, for example, once the text has been marked up, it is possible to extract only Tom Buchanan’s lines.

The markup that breaks out the sections and dialogue was created by David Hoover, though the entirety of Prof. Hoover’s markup scheme has not been implemented here.

Read the README at GitHub.

The current state of the markup is:

<1>    text division level 1 (chapter, say)
<2>    text division level 2 (subchapter, say)
/      new speaker (character)
\      reporting clause (“speech marker”)


As a result, the opening of The Great Gatsby can be marked up as:


<1>The Great Gatsby

Then wear the gold hat, if that will move her;
If you can bounce high, bounce for her too,
Till she cry "Lover, gold-hatted, high-bouncing lover,
I must have you!"

THOMAS PARKE D'INVILLIERS.

<2>Chapter 1

In my younger and more vulnerable years my father gave me some advice
that I've been turning over in my mind ever since.

/Mr. Carraway"Whenever you feel like criticizing any one," 

\he told me, 

/Mr. Carraway"just remember that all the people in this world haven't had the
advantages that you've had."

[...]

<2>Chapter 2


Here, I have arbitrarily designated the novel itself as level 1 of the text division, thereby making each chapter level 2. When Mr. Carraway speaks, his speech is introduced with /Mr. Carraway, and the reporting clause is marked with a backslash. Every aspect of the markup, of course, is optional, so if you want to keep the reporting clause as part of the narration, just don’t use the backslash. If you want to skip dialogue by certain characters or in certain parts, just don’t mark them up. There is a sample.txt in the GitHub project that is a bit longer than the example above.
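To make the four markers concrete, here is a minimal sketch of how a single line could be classified by its leading marker. It is not the script's actual code: the function name classify_line and the (kind, payload) return shape are hypothetical, and the sketch assumes that markers always sit at the very start of a line, as they do in the example above.

```python
import re

# Only <1> and <2> are listed in the markup table, so only those are matched.
DIVISION = re.compile(r'^<([12])>(.*)$')


def classify_line(line):
    """Return a (kind, payload) pair for one marked-up line.

    Hypothetical helper for illustration only.
    """
    match = DIVISION.match(line)
    if match:
        # Payload is (level, title), e.g. (2, 'Chapter 1').
        return ('division', (int(match.group(1)), match.group(2).strip()))
    if line.startswith('/'):
        # Speaker name and speech are still fused together at this point.
        return ('speaker', line[1:])
    if line.startswith('\\'):
        return ('reporting', line[1:].strip())
    # Unmarked lines are left alone, since every marker is optional.
    return ('text', line.strip())
```

Because unmarked lines fall through to 'text', leaving a reporting clause or a speech unmarked simply keeps it as part of the narration.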

The script expects dialogue to be enclosed in ASCII double quotes ("some dialogue"), though it also recognizes curly quotes (“some dialogue”). It treats either “ or " as the boundary that ends the name of the character (Mr. Carraway above). Other dialogue markers require some extra pre-processing.
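The quote-as-boundary rule can be sketched as follows. This is an illustration under stated assumptions rather than the script's own implementation: split_speaker is a hypothetical name, and returning an empty speech when no opening quote is found is my choice, not documented behavior.

```python
import re

# An ASCII double quote or a left curly double quote ends the speaker name.
OPENING_QUOTE = re.compile('["\u201c]')


def split_speaker(line):
    """Split '/Name"speech"' into (name, speech). Hypothetical helper."""
    if not line.startswith('/'):
        raise ValueError('expected a line starting with the / speaker marker')
    body = line[1:]
    match = OPENING_QUOTE.search(body)
    if match is None:
        # Assumption: with no quote on the line, everything is the name.
        return body.strip(), ''
    return body[:match.start()].strip(), body[match.start():].rstrip()


name, speech = split_speaker('/Mr. Carraway"Whenever you feel like criticizing any one," ')
print(name)    # Mr. Carraway
print(speech)  # "Whenever you feel like criticizing any one,"
```

A text that marks dialogue some other way (single quotes or dashes, say) would need to be converted to double quotes before a rule like this could find the boundary.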

There are no closing tags, because the script resets dialogue and reporting clauses on blank lines. As a result, a paragraph gets broken up, as above. Since, presumably, the text will be fed into NLTK or some other processing environment, the lack of paragraph integrity should not be a cause for concern. Similarly, a new <2> treats the previous one as closed, much like HTML does with <li> tags.
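The reset-on-blank-line behavior amounts to a small state machine, sketched below. None of this is taken from the script itself: the names parse, Record, and lines_for are hypothetical, and two details are my assumptions rather than documented behavior (a new <1> also clears the current <2>, and division title lines are not emitted as text).

```python
import re
from collections import namedtuple

DIVISION = re.compile(r'^<([12])>(.*)$')
OPENING_QUOTE = re.compile('["\u201c]')

Record = namedtuple('Record', 'level1 level2 kind speaker text')


def parse(lines):
    """Yield one Record per non-empty line of marked-up text."""
    divisions = {1: None, 2: None}
    kind, speaker = 'narration', None
    for raw in lines:
        line = raw.rstrip('\n')
        if not line.strip():
            # A blank line closes any open dialogue or reporting clause.
            kind, speaker = 'narration', None
            continue
        match = DIVISION.match(line)
        if match:
            level = int(match.group(1))
            # A new division implicitly closes the previous one at that level.
            divisions[level] = match.group(2).strip()
            if level == 1:
                divisions[2] = None  # assumption, see lead-in
            kind, speaker = 'narration', None
            continue
        if line.startswith('/'):
            body = line[1:]
            quote = OPENING_QUOTE.search(body)
            cut = quote.start() if quote else len(body)
            kind, speaker, line = 'dialogue', body[:cut].strip(), body[cut:]
        elif line.startswith('\\'):
            kind, speaker, line = 'reporting', None, line[1:]
        # Unmarked lines inherit the current state, so a speech can wrap
        # across several lines until the next blank line.
        text = line.strip()
        if text:
            yield Record(divisions[1], divisions[2], kind, speaker, text)


def lines_for(records, name):
    """Collect the dialogue text attributed to one speaker."""
    return [r.text for r in records if r.kind == 'dialogue' and r.speaker == name]
```

Calling lines_for(parse(open('sample.txt')), 'Mr. Carraway') would then gather that character's speech, assuming sample.txt follows the markup shown above.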

text_divider was designed to be quick to use, as I simply wanted to pull out dialogue from novels with character attributes, without having to create a whole TEI version of the novel. This markup, together with some useful vim macros (see the README), lets me mark up about 100 pages of text in an hour or so, which is a pretty quick way of building up something to feed the processor for differential analysis.