Mark Twain Corpus

We’ve compiled a small corpus of five novels by Mark Twain, using the open versions available from Project Gutenberg. There are four files for each novel. The first is the text of the novel without the Gutenberg header information. The other three derive from data gathered through the Edinburgh Geoparser: a comma-delimited file of locations in each novel, bounded by the Mississippi (except for Innocents Abroad), a KML file for Google Earth, and a JSON file of the original output from the Geoparser.

For a list of visualization tools please see:

You’ll notice that the place names don’t necessarily match the places in the novel! Can you clean the data using Google Spreadsheets or Open Refine?
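If you’d rather script a first pass before cleaning by hand, here is a minimal Python sketch that sets aside rows whose coordinates fall outside a rough bounding box, so you can review them separately. The column names (name, lat, lon), the box values, and the sample rows are all assumptions made for illustration; check the header of the actual comma-delimited file and adjust them.

```python
import csv
import io

# Hypothetical box (south, north, west, east) in decimal degrees.
# These numbers are illustrative only, not the bounds used for the corpus.
BOX = (29.0, 48.0, -98.0, -82.0)

def split_by_box(rows, box=BOX, lat_field="lat", lon_field="lon"):
    """Return (keep, suspect): rows inside the box, and rows to check by hand."""
    south, north, west, east = box
    keep, suspect = [], []
    for row in rows:
        try:
            lat, lon = float(row[lat_field]), float(row[lon_field])
        except (KeyError, TypeError, ValueError):
            # Missing or non-numeric coordinates: leave for manual review.
            suspect.append(row)
            continue
        inside = south <= lat <= north and west <= lon <= east
        (keep if inside else suspect).append(row)
    return keep, suspect

# Made-up sample data standing in for a geoparsed file.
sample = io.StringIO(
    "name,lat,lon\n"
    "St. Louis,38.63,-90.20\n"
    "Cairo,30.04,31.24\n"
    "Hannibal,39.71,-91.36\n"
    "Jackson's Island,,\n"
)
keep, suspect = split_by_box(csv.DictReader(sample))
print([r["name"] for r in keep])
print([r["name"] for r in suspect])
```

A box like this only catches places that are far away; a name matched to the wrong town inside the box still needs a human eye, which is where Google Spreadsheets or Open Refine come in.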

Things to do with the Geoparsed Files
  • View the KML in Google Earth, or edit its contents using a simple text editor. KML is an XML format, so be careful to keep the tags intact!
  • Take the comma-delimited file and load it into a Google Spreadsheet, or copy and paste its contents into GPS Visualizer.
  • Combine the files into new ones, and take them to GPS Visualizer.
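Two of the steps above can be sketched in a few lines of Python: checking that a hand-edited KML file is still well-formed XML, and combining several comma-delimited files into one. This is only a sketch under assumptions: the column names, the labels novel_a and novel_b, and the sample rows are made up, and the merge assumes every file shares the same header.

```python
import csv
import io
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """True if the text parses as XML (says nothing about KML validity)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

def combine_csv(named_texts, source_field="novel"):
    """Merge CSV texts that share a header, adding a column naming each source."""
    out = io.StringIO()
    writer = None
    for label, text in named_texts:
        for row in csv.DictReader(io.StringIO(text)):
            if writer is None:
                # Take the column order from the first row we see.
                fields = [source_field] + list(row.keys())
                writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
                writer.writeheader()
            writer.writerow({source_field: label, **row})
    return out.getvalue()

# Made-up KML fragment, then the same fragment with a closing tag deleted.
good = ('<kml xmlns="http://www.opengis.net/kml/2.2">'
        "<Placemark><name>Hannibal</name></Placemark></kml>")
bad = good.replace("</name>", "")

# Made-up CSV contents standing in for two of the location files.
a = "name,lat,lon\nHannibal,39.71,-91.36\n"
b = "name,lat,lon\nCairo,37.01,-89.18\n"
combined = combine_csv([("novel_a", a), ("novel_b", b)])
print(is_well_formed(good), is_well_formed(bad))
print(combined)
```

The extra column lets you colour or filter points by novel once the combined file is in GPS Visualizer or a spreadsheet.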
Things to do with the Texts and the Corpus

We’ve also provided some files that offer statistical information on the corpus, and on the results generated by the Geoparser.

Load these comma-delimited files into Density Design RAW!

There are more frequency files below.
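If you want a frequency file for a text of your own, a rough Python sketch might look like the following. It is not necessarily how the files provided here were generated; the tokenizing rule (lowercase ASCII letters, with straight apostrophes allowed inside a word) and the column names word and count are assumptions made for illustration.

```python
import csv
import io
import re
from collections import Counter

def word_frequencies(text, top=None):
    """Count lowercase word tokens; returns (word, count) pairs, most common first."""
    # Letters, optionally joined by straight apostrophes (e.g. "don't").
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return Counter(words).most_common(top)

def to_csv(pairs):
    """Write (word, count) pairs as comma-delimited text with a header row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["word", "count"])
    writer.writerows(pairs)
    return out.getvalue()

# Made-up sample sentence standing in for a novel's text.
sample = "The river. The raft on the river!"
table = to_csv(word_frequencies(sample, top=2))
print(table)
```

Texts that use curly apostrophes (’) will be split at those characters by this rule, so you may want to normalize them first.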

The Individual Files

Preloaded into Voyant at

Hacking – What can you do?

Option 1 – Go through visualization examples that use the D3 JavaScript library, Visualize, or Many Eyes, and find the three that intrigue you the most. Why are they useful? What kinds of information do they present that would be useful for Humanities scholars? Can you use your browser’s ‘inspect element’ functionality to explore how they’re put together? Can you find the data source they’re using?

Option 2 – Build your own data set to work with using the Open Access resources above. Use Google Spreadsheets, Open Refine, or MS Excel to clean up your data. Explore visualizing your data using GPS Visualizer (if it’s geospatial!), Density Design RAW, TimeMapper, or Google Charts.

Option 3 – Build your own text corpus using the Open Access resources above, or use the Mark Twain Corpus, and…

  • A) Load them into Voyant and explore the visualization tools.
  • B) Load them into Zotero (you’ll need your own laptop for this), and explore them using the Zotero plugin Paper Machines.
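If you build your own corpus from Project Gutenberg, you’ll probably want to drop the header and license text first, as we did for the Twain novels. Here is a minimal Python sketch, assuming the file marks the body with lines beginning “*** START OF THE (or THIS) PROJECT GUTENBERG EBOOK” and “*** END OF …”. Many plain-text files do, but older ones vary, so check the result by eye.

```python
import re

# Marker lines as they appear in many (not all) Project Gutenberg text files.
START = re.compile(
    r"^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)
END = re.compile(
    r"^\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)

def strip_gutenberg(text):
    """Return the text between the START and END marker lines.

    If a marker is missing, fall back to the start or end of the text.
    """
    start = START.search(text)
    end = END.search(text)
    begin = start.end() if start else 0
    finish = end.start() if end else len(text)
    return text[begin:finish].strip()

# Made-up miniature file for illustration.
sample = (
    "Header stuff\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "\n"
    "Chapter I\n"
    "Body text.\n"
    "\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "License stuff\n"
)
body = strip_gutenberg(sample)
print(body)
```

The stripped text can then be saved as a plain .txt file and loaded into Voyant or Zotero.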