Visualization Workshop


We’re running a short series of Workshops on Visualization for Humanities Scholars this week at McGill. This page is our quick go-to-guide for the sessions, but it isn’t exhaustive by any means.

On Tuesday we focused on thinking through Visualization, and what it means in terms of ideas of representation. We also briefly looked over Johanna Drucker’s distinction between Data (given) and Capta (taken), and what it might mean for the uses of visualization as creative interpretation in the humanities. And we spent an hour or so exploring D3, Voyant, and the GPS Visualizer – taking a peek under the hood in some instances, or just getting used to the tools themselves.

On Thursday we went more in depth on using open access tools like Voyant, Density Design RAW, and GPS Visualizer, and on how they can present different kinds of data. Our examples will use a small corpus of novels by Mark Twain (see below!) which have been preloaded into Voyant, and for which we’ve prepared some Geoparsed data. There’s nothing hard and fast here – the geoparsed data is problematic – we’ll explore cleaning it, and how it might be used in GPS Visualizer or Google Earth.

We’ll also look at some of the statistical data Voyant can output using its export functions from various tools.
For things to read, scroll all the way to the bottom.

Open Data

“The Open Definition sets out principles that define ‘openness’ in relation to data and content. It makes precise the meaning of ‘open’ in the terms ‘open data’ and ‘open content’ and thereby ensures interoperability between different pools of open material. It can be summed up in the statement that: ‘A piece of data or content is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and/or share-alike.’” (http://opendefinition.org/)

  1. make your stuff available on the Web (whatever format) under an open license
  2. make it available as structured data (e.g., Excel instead of image scan of a table)
  3. use non-proprietary formats (e.g., CSV instead of Excel)
  4. use URIs to denote things, so that people can point at your stuff
  5. link your data to other data to provide context

(http://5stardata.info/)

What is ‘Data Representation’?

“By data representation is meant, in general, any convention for the arrangement of things in the physical world in such a way as to enable information to be encoded and later decoded by suitable automatic systems.” (Sperberg-McQueen & Dubin, http://guide.dhcuration.org/representation/)

Using any form of data depends on how it was created, and what kinds of structure (even the most unstructured texts have data structures) are involved.

Unstructured vs Structured

Unstructured data has free flowing content where the meaning and nature of the content isn’t inherently described. Structured data contains metadata elements within or around the data contents that describe what the data is, and often its relationship to the whole. Craig Bellamy describes it as “Pre-defined and machine-readable, a locatable, sometimes relational ‘data model’ usually of real-world objects.” (http://www.craigbellamy.net/2013/08/14/resources-data/)

  • Unstructured: Word, Image, Sound, Video documents
  • Structured: Databases, Spreadsheets, Metadata
  • Semi-Structured: Webpages – have metadata elements (tags), but within them are large unstructured contents, often prose, or images etc.
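To make the three categories concrete, here is a minimal Python sketch that holds the same fact in each form. The sample sentence, the column names, and the tiny web page are all invented for illustration; only the standard library is used.

```python
import csv
import io
from html.parser import HTMLParser

# Unstructured: free-flowing prose; nothing labels what each piece means.
unstructured = "Mark Twain published The Innocents Abroad in 1869."

# Structured: every value sits under a named column (a tiny CSV table).
structured = "author,title,year\nMark Twain,The Innocents Abroad,1869\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: tags describe the container, but the prose inside is free text.
semi_structured = (
    "<html><head><title>Twain notes</title></head>"
    "<body><p>" + unstructured + "</p></body></html>"
)


class TitleGrabber(HTMLParser):
    """Collect the text inside <title> and ignore everything else."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


grabber = TitleGrabber()
grabber.feed(semi_structured)

print(rows[0]["year"])         # the structured record can be asked for its year
print(grabber.title)           # the tag tells us which text is the page title
print("1869" in unstructured)  # with prose, all we can do is search the string
```

The point of the contrast: the structured row answers "which year?" by name, the web page answers only the questions its tags anticipate, and the sentence has to be read or searched.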

What is Humanities ‘Data’?

  • Files: Documents, Images, Sound & Video files, Databases, Spreadsheets, etc.
  • Metadata: Data describing other items, or data – Bibliographic tools, file creation information, etc.
  • Others?: Prosopographical & Biographical information, Geographical, Spatial, Chronological, Economic / Commercial, etc.

Any form of humanistic production – any artifact, or information about an artifact. Potentially all encompassing. Can be structured – tabular – data, contained in spreadsheets; can be electronic texts like novels, digital sound, image, and video files; unstructured prose works; semi-structured print…

As humanities scholars we produce data all the time, but we might not think of it as data per se because of how we think about the information we collect and how we structure it. Humanities scholars, outside of using bibliographic tools, aren’t usually bound to data standards.

The internet remains largely a realm of unstructured data. Open Data, Linked Open Data, and The Semantic Web are various routes towards trying to provide some basic structure to the web’s data ecology.

Some Datasets

@McGill

Texts

Thinking through Humanities Visualization

Why Visualize? What does Visualizing Data bring to the Humanities? – new modes of interpretative analysis, new insights, new ways of thinking about what it is we do as scholars, but also how we do it. Our analytical creativity has always rested in prose-based analysis – how do digital media, and the interests they create, offer us new ways of arguing, exploring, and being intellectually creative?

“Information visualization for humanities scholars needs to accommodate a mix of evidence and argumentation. The humanities approach consists not of converging toward a single interpretation that cannot be challenged but rather of examining the objects of study from as many reasonable and original perspectives as possible to develop convincing interpretations.” (Sinclair, Ruecker, & Radzikowska, 2013)

Visualizations allow, in some senses, for quick and easy presentation. They can be live, interactive, and exploratory, and they give humanities scholars a certain amount of creative freedom.

Where does the interpretation lie?

Between Data and Capta

Using techniques of empirical sciences can be misleading as they “pass as unquestioned representations of ‘what is’.” Johanna Drucker (2011) argues they’re “observer-independent” where “data” become “mere descriptions of a priori conditions”, effectively making observation the “same as the phenomena observed”. Does the bar graph, and quantification, actually represent the terrain we, as humanists, need to survey and traverse?

The key premise of Drucker’s article is the distinction between data as ‘given’ – a datum, and what she calls capta, the ‘taken’ – a captum. Everything we think of as data, as humanists, should be reconceived as capta, something that scholars have taken from their sources, acknowledging humanistic inquiry’s “situated, partial, and constitutive character”. Her argument is that what humanities scholars need is not “a set of applications to display humanities ‘data’ but a new approach that uses humanities principles to constitute capta and its display”. (Drucker, 2011)

Statistical & quantitative representations like graphs are useful, but they have a certain blindness in thinking through the “original interpretive framework” upon which they rest. It’s not whether the bar graph does a bad job, it’s whether, for humanists and their analytical needs, it does the right job. Drucker’s first example is a bar graph involving gender – something that statisticians view as fairly stable, but humanists know is much more ambiguous and uncertain. She follows this with another involving publication of new novels in the 19th century – does ‘publication’ mean: “date of appearance, editing, composition, acquisition, review, distribution”?

The simplicity of this inversion is instructive for thinking through visualization as a process of analytical representation – it raises the question, “of what?” Drucker’s point is that when humanists want to represent their work visually, even before getting to the visualizing itself, they need to reflect on whether visualization is even a possible answer for the problem.

  • What will visualization bring that other analytical and rhetorical methods won’t?
  • What information is going to go into the visualization, and what is not? Why?

Visualization and Interactivity

Visualization, then, must have an objective – but the distinction between creative analytics and what we might call Curation or rhetorical presentation of analysis, isn’t clear cut. Which is fine.
Digital Media, moreover, offers new ways of approaching visualizations – interactivity. Unlike print-based schemas, charts, drawings, timelines, etc. with the digital we can interact and reshape the thing we’re seeing. Or not.

We can think about experimentation and creativity as lying at the point of data [capta] creation / management, or in the visualization itself, OR between the two. We can use visualizations themselves to interrogate, show & hide, alter, the very data we’ve initially chosen to work with. This blurs the line between visualization and interface where the representation is itself a hermeneutical tool that calls us to do more interpretation. There’s no clear line.

“A primary index to the quality of visualizations for humanities scholars is the quality and originality of scholarship that the systems support. In each of the projects mentioned here, we have been working with humanities researchers in an effort to produce a useful visual form of the data. Since humanities scholarship is often exploratory, we have also come to believe that interactive formats are in most cases preferable to static ones, since they allow the person using the system to add and subtract elements, experiment with different forms, pursue hunches or insights, and so on. It is therefore important that the expectations of the scholar correspond to the affordances of the visualization. It is important for the scholar to know enough about the visualization tools to understand that the interpretive work is being guided and biased by the data and software.” (Sinclair, Ruecker, & Radzikowska, 2013)

Visualization as Data [Capta?] Curation

A central idea behind humanities visualization is the notion of curation. We create visualizations to display interpretations drawn from our research. But the problem is thinking through what humanities data might be, what kinds of standards there are, and how to best represent that information.

From an Information Science perspective curation is very closely linked to preservation and management of resources. From a humanities perspective, it’s about showcasing and describing – telling the story – of that data [capta] as part of the human story we’re primarily interested in as scholars. We’re using the data [capta] to curate humanity, essentially.

This can involve several well-established practices like

  • Scholarly editions
  • Text corpora
  • Text with markup
  • Thematic research collections
  • Data with accompanying analysis or annotation

You’ll notice that all of these are different kinds of representations of data [capta]. “These activities are directed at creating new knowledge; however, research practices aimed at interpretation and criticism can be considered curatorial in nature, particularly if their interpretive information is brought directly to bear on the material being curated.” (Julia Flanders & Trevor Muñoz, http://guide.dhcuration.org/intro/)

Visualization as Creative Analytical Discovery

Images have to argue. Drucker’s distinction between Data and Capta also hints at how humanities scholars might think about the visualizations they use to curate: as interpretive acts in themselves that creatively argue a case. They have a rhetorical purpose in mind, not unlike the prose scholars write.

Some things to think about
  • Clustering or Grouping – how does your information group or gather around individuals, genre, topics, words, themes, etc.?
  • Change over Time – Can you create different data sets that represent works or themes over time?

Drucker’s “polemical call to humanists to think differently about the graphical expressions in use in digital environments” is essentially a call away from the quantitative and towards the qualitative, and a reminder of the importance of creative exploration as a mode of discovery and interpretation.

What’s out there?

Texts

GIS Historical Places

GIS Data from a Google Spreadsheet

New Forms of Publication

Historical Social Networks – Benjamin Franklin’s Correspondence

US Music by Genre

Google Big Picture – Music

Resources

Visualization Programs

Visualization Online Tools & APIs

Visualization Libraries

Data Preparation

Open Data with Visualizers

More at http://dirt.projectbamboo.org/categories/visualization
For Texts see http://www.tapor.ca/

Mark Twain Corpus

Here’s a corpus of five novels by Mark Twain, using the open versions available from Project Gutenberg. There are four files for each – the novel without the Gutenberg header information, and three files derived from data gathered through the Edinburgh Geoparser: a comma-delimited file of locations in each novel, bounded by the Mississippi (except for Innocents Abroad), a KML file for Google Earth, and a JSON file of the original output from the Geoparser.

You’ll notice that the place names don’t necessarily match the places in the novel! Can you clean the data using Google Spreadsheets or Open Refine?
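If you’d rather script a first cleaning pass than do it by hand, a minimal Python sketch might look like the one below. It assumes columns called name, latitude, and longitude, which are hypothetical and may not match the headers in the workshop files, and the sample rows are invented with approximate coordinates. It only handles mechanical problems (stray whitespace, missing or impossible coordinates, exact repeats); deciding whether a place is the one the novel actually means still takes a human reader.

```python
import csv
import io


def clean_places(rows):
    """Return rows with tidy names, valid coordinates, and no exact repeats."""
    seen = set()
    cleaned = []
    for row in rows:
        name = " ".join((row.get("name") or "").split())  # collapse stray whitespace
        try:
            lat = float(row["latitude"])
            lon = float(row["longitude"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows whose coordinates are missing or not numbers
        if not name or not (-90 <= lat <= 90 and -180 <= lon <= 180):
            continue  # skip empty names and impossible coordinates
        key = (name.lower(), lat, lon)
        if key in seen:
            continue  # skip exact repeats of the same place
        seen.add(key)
        cleaned.append({"name": name, "latitude": lat, "longitude": lon})
    return cleaned


# Invented sample rows standing in for a geoparsed file (hypothetical headers).
sample = """name,latitude,longitude
 St.  Louis ,38.627,-90.199
St. Louis,38.627,-90.199
Cairo,,
Hannibal,39.708,-91.358
"""
places = clean_places(csv.DictReader(io.StringIO(sample)))
for place in places:
    print(place)
```

To run it on a real file, swap the io.StringIO(sample) wrapper for an open file handle and adjust the column names to whatever the file uses.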

Things to do with the Geoparsed Files
  • View the KML in Google Earth – edit its contents using a simple text editor. This is an XML file, so be careful!
  • Take the Comma Delimited file, and load it into a Google Spreadsheet, OR cut and paste its contents into GPS Visualizer.
  • Combine the files into new ones, and take them to GPS Visualizer.
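Because KML is XML, one way to be careful is to edit it with an XML parser instead of by hand, so the tags stay balanced. The sketch below is a minimal illustration using Python’s standard library. The sample KML is invented, move_placemark is a hypothetical helper name, and the coordinates are approximate; it imagines a "Cairo" that was resolved to Egypt and moves it to Cairo, Illinois.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"
ET.register_namespace("", KML_NS)  # keep the default namespace on output

# Invented KML standing in for one of the geoparsed files.
sample_kml = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark>
      <name>Cairo</name>
      <Point><coordinates>31.2357,30.0444,0</coordinates></Point>
    </Placemark>
  </Document>
</kml>"""


def move_placemark(kml_text, name, lon, lat):
    """Return KML text with every placemark called `name` moved to lon, lat."""
    root = ET.fromstring(kml_text)
    ns = {"k": KML_NS}
    for placemark in root.iter(f"{{{KML_NS}}}Placemark"):
        if placemark.findtext("k:name", namespaces=ns) == name:
            coords = placemark.find("k:Point/k:coordinates", ns)
            if coords is not None:
                # KML stores coordinates as longitude,latitude,altitude
                coords.text = f"{lon},{lat},0"
    return ET.tostring(root, encoding="unicode")


fixed = move_placemark(sample_kml, "Cairo", -89.176, 37.005)
print(fixed)
```

Note the order: KML writes longitude first, then latitude, which is the reverse of how coordinates are usually read aloud and an easy thing to get wrong in a text editor.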
Things to do with the Texts and the Corpus

We’ve also provided some files that offer statistical information on the corpus, and on the results generated by the Geoparser.

Load these Comma Delimited Files into Density Design RAW!

There are more frequency files below.
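Frequency files of this general kind can also be made from any plain text. Here is a minimal Python sketch, not the method used to produce the workshop files: the function names and the word,count header are hypothetical, the sample sentence is invented, and the tokenizer only recognizes unaccented letters and apostrophes.

```python
import csv
import io
import re
from collections import Counter


def word_frequencies(text, top=10):
    """Count lowercased word forms and return the `top` most common pairs."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(top)


def frequencies_to_csv(pairs):
    """Serialize (word, count) pairs as comma-delimited text with a header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["word", "count"])
    writer.writerows(pairs)
    return buffer.getvalue()


sample = "The river was wide. The raft drifted down the river."
pairs = word_frequencies(sample, top=3)
print(frequencies_to_csv(pairs))
```

The printed text can be pasted into any tool that accepts comma-delimited input. Every choice in the tokenizer (lowercasing, what counts as a word, whether to drop common words) changes the counts, which is a small reminder of the data versus capta distinction above.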

The Individual Files

Preloaded into Voyant at

Visualization Examples

Hacking – What can you do?

Option 1 – Go through visualization examples that use the D3 Javascript Library, Visualize, or Many Eyes, and find the three that intrigue you the most. Why are they useful? What kinds of information do they present that would be useful for Humanities scholars? Can you use your browser’s ‘inspect element’ functionality to explore how they’re put together? Can you find the data source they’re using?

Option 2 – Build your own data set to work with using the Open Access resources above. Use Google Spreadsheets, Open Refine, or MS Excel to clean up your data. Explore visualizing your data using GPS Visualizer (if it’s geospatial!), Density Design RAW, TimeMapper, or Google Charts.

Option 3 – Build your own text corpus using the Open Access resources above or use the Mark Twain Corpus (http://voyant-tools.org/?corpus=1392745215963.3330) and…

  • A) Load them into Voyant and explore the visualization tools.
  • B) Load them into Zotero (you’ll need your own laptop for this), and explore them using the Zotero Plugin, Paper Machines.

Option 4 – Explore the data sets and visualizations at www.viewshare.org, Many Eyes, or DataMarket.

Things to Read