Big Data: Text Mining in the Digital Humanities

McGill University played host to a symposium and workshop on literary text mining on May 10 and 11, 2012. The event, organized by Stéfan Sinclair and Matthew Milner, welcomed a diverse group of recognized scholars including Tanya Clement (University of Texas at Austin), Mark Algee-Hewitt (McGill University), Andrew Piper (McGill University), Matt Jockers (Stanford University), and Ted Underwood (University of Illinois). Approximately 55 people attended the Thursday morning symposium. A short summary of each presentation is provided below.

Tanya Clement’s presentation “Sounding it Out: Modeling Aurality for Large-Scale Text Collection Analysis” demonstrated her research in representing, visualizing, and comparing the prosody (patterns of rhyme and sound) of the texts of Gertrude Stein. The unique challenge of this project was to find a meaningful way to represent the heard meaning of prosody in computational form.

Mark Algee-Hewitt’s presentation “Mapping and Mining: Experiments in Visual Relationality” entered into a theoretical discussion of the implications of data visualization as a process of meaning creation. He emphasized that the natural process of reading is a process of forgetting individual words in order to build a larger conceptual understanding, and he linked this to the process of producing a visualization. A readable visualization is produced by reducing the dimensionality of the data that the computer reads from a text, which is again an act of forgetting. In addition, Algee-Hewitt briefly summarized the shortcomings of certain visualization techniques before settling on his preferred visualization, the topographical map.

Andrew Piper’s presentation “The Werther Effect” explored the corpus of Goethe in relation to his work Die Leiden des jungen Werthers through the topographical maps introduced by Algee-Hewitt. Where Algee-Hewitt had looked briefly at the use of the lexicon of the sublime in the nineteenth century, Piper examined in more detail how visualization tools like the topographical map could be used to support productive textual analysis.

Matt Jockers’s presentation “Tracking Literary Influence in 19th Century Novel” developed an analysis of influence by comparing 500 themes and 190 stylistic elements in a 35,000-work corpus of English-language (British, American and Irish) authors. The resulting network graphs showed a correlation between textual similarity and publication date, as well as a pattern of self-similarity among texts when grouped by the gender of the author.

Ted Underwood’s presentation “The Modest Promise of ‘Big Data’ in Literary Study” worked toward a theory that emphasizes the limits of the Digital Humanities. Frankly stated, traditional scholarship is better suited to close readings of texts; the problems taken on by the Digital Humanities must therefore be suited to the kinds of tools that computational analysis provides.

The various interests of the presenters introduced many different aspects of text mining in the humanities; however, the presentations were all similarly interested in the methodological questions that exist at the foundation of the Digital Humanities.

These questions appear to be part of the process of establishing the scholarly legitimacy of using large data sets and computing technologies in humanistic research.

Not surprisingly, the focus on research methodology in the presentations was also explicitly articulated as an important aspect of drawing out a scholarly practice for the Digital Humanities. It was emphasized that disclosing the philosophical and technological rationale behind a research methodology is important for developing a kind of academic accountability. These methodological choices are deliberate and meaningfully affect the results of a study.

The rigorous process of explaining and justifying a methodology is in effect a safeguard against the spurious use of computational and statistical tools. “Big Data” will not allow humanistic arguments to be proved statistically. Instead, it is about producing a dialectic between the analytic and the anecdotal, such that computational tools can be assimilated into the process of humanistic scholarship. An important aspect of this is developing meaningful visualizations that render data readable.

A question from the audience brought out the flip side of this: if the purpose is not statistical value, what are the benefits of using the large corpus that text mining allows? Principally, the presenters indicated that a large corpus allows scholars to move past the canonical works and develop more nuanced positions on the larger trends in literary culture.

A lightly annotated transcription of the Twitter stream from the symposium is available at

Following the Thursday morning symposium, an intensive workshop on literary text mining took place on Thursday afternoon and all day Friday. In addition to the symposium presenters, participants included Lindsay Eckert (University of Toronto) and Matt Erlin (Washington University).

“Big Data: Text Mining in the Digital Humanities” was a resounding success, one that bodes well for literary text mining activities at McGill and elsewhere.