Text analysis tools go back to the first ad hoc tools that Roberto Busa created for his concordance of the works of Thomas Aquinas and Andrew Booth’s Mechanical Resolution of Linguistic Problems in the 1950s.
Voyant is a suite of analysis and exploration tools for digital texts. Few contributions to knowledge and technology bear no resemblance to what came before, and Voyant is no exception: it builds on more than 50 years of text analysis tool design and methodology in humanities computing research. The following tools have most influenced text analysis tool development in general and Voyant in particular:
- Unix command-line tools (grep, sort, uniq, wc, awk, etc.), since the 1970s. Each Unix tool is designed to do one relatively simple thing very efficiently. The power of these modular tools lies in how they can be combined in endless ways through the piping mechanism (the output of one tool becomes the input of the next in a chain).
- Oxford Concordance Program (OCP), early 1980s. OCP provided one of the first examples of a generalized tool for producing concordances, historically the most prevalent text analysis activity in humanities computing. Although other concordancing programs (such as COCOA) were available earlier, OCP gained wide acceptance. The parallel explosion of personal computing also led to a variant of OCP called MicroOCP (for DOS).
- WordCruncher (mid 1980s and into the 1990s). Whereas OCP was essentially mainframe (shared) computer software, WordCruncher was built for personal computing in DOS, which meant innovative interface solutions needed to be found. Originally called the BYU Concordance Program prior to commercialization, WordCruncher moved from its DOS form to a Windows-based form in the 1990s.
- Textual Analysis Computing Tools (TACT) and TACTWeb, 1990s. TACT was a widely used DOS-based suite of programs that included some of the usual features for building concordances, frequency lists, and collocate lists (frequencies within the context of a keyword), but also some less common features like finding anagrams within a text. Similar to OCP and WordCruncher, TACT required a step of text preparation that enabled fine-grained searching and retrieval (based, for instance, on the presence of specified tags). TACT also provided some navigational features between the different displays that anticipated similar functionality through hypertext (though again, TACT was DOS-based). The Modern Language Association (MLA) published a volume in 1996 entitled “Using TACT with Electronic Texts”, which further extended the reach of this tool and solidified its role as the dominant text analysis tool suite of the 1990s. TACTWeb was an adaptation of TACT to run on the web that was developed by John Bradley and Geoffrey Rockwell.
- HyperPo, late 1990s until present. HyperPo was the first web-based text analysis suite available. It provided much of the same functionality as TACT, but with a greater focus on interlinking between the original text being analyzed and the data results (for instance, a user can click on a word in a concordance to return to that location in the text). Inspired by work on the Oulipo, HyperPo also provides some more experimental and ludic functions (palindromes, text reversal, text entropy, etc.). As a web-based tool, HyperPo was innovative in allowing users to work with texts from a variety of places (pasted into a text box, uploaded from a local drive, retrieved from a URL) and in a variety of formats (plain text, HTML and XML). Unlike most of its predecessors, HyperPo doesn’t require preliminary steps by the user for preparing and indexing a text (a paradigm we call immediate analysis). Finally, HyperPo was designed from the outset to be localized (the interface could be translated into different languages) and to support a variety of character sets (UTF-8, ISO-8859-1, etc.) and languages.
- PhiloLogic (2000s until present). PhiloLogic differs somewhat from the preceding examples in that it is really a back-end framework for ingesting, indexing, and retrieving encoded text; it is less concerned with the end-user interface (what the researcher might use). PhiloLogic is the back-end system used by front-end interfaces like ARTFL. PhiloLogic is noteworthy in that it emphasizes speed for large corpora while supporting more sophisticated operations on encoded texts and providing common analytic features (concordancing, frequency lists, etc.).
- GATE and LingPipe (2000s until present). These are two of the most prevalent examples of text analysis frameworks: they are useful both as stand-alone analytic tools for experts and as software libraries for other text analysis tools. Each framework has its respective strengths and weaknesses, but both provide extensive capabilities for such operations as part-of-speech tagging and entity extraction.
- TAPoRware (mid 2001 to present). Similar to HyperPo, TAPoRware is a suite of web-based tools that allow users to specify their own texts and begin immediate analytic work. TAPoRware provides a model for extensibility and rapid development of experimental text analysis tools: a simple menu provides access to some 50 tools for performing a variety of operations on different text formats.
- TAPoR Portal (mid 2002 to present). The TAPoR Portal is a personalized virtual workbench for text analysis: it provides a persistent web-based space for invoking remote digital tools on remote electronic texts (users are able to define texts and tools of interest that remain accessible between sessions). Although not itself a text analysis tool, the TAPoR Portal served to push notions of tool interoperability, and especially the value of remote tools exposing public APIs and web services. The TAPoR Portal also provides a mechanism for changing its appearance (skinning) depending on user profiles and preferences.
- Monk (present). The Monk project is a notable recent attempt to engage in large-scale data mining activities from the perspective of the humanistic – and especially literary – scholar. Among the challenges confronted are 1) how cleanly and extensively encoded do texts need to be to be useful for literary scholarship? 2) how might a user interface be designed in order to expose the sophisticated aspects of data mining while remaining accessible to literary scholars? 3) what types of literary procedures are enabled by work on very large corpora?
- Google (late 1990s until present). Although it may seem strange to include Google in this list of specialized text analysis software, we do so for three reasons: 1) like the vast majority of search engines, Google is primarily focused on search and retrieval of textual content, which requires text analysis at various stages; anyone using a search engine is also using text analysis; 2) Google set a new standard for simplicity in interface: its default search page is relatively sparse and draws attention to a single search box and a single action button, establishing a paradigm for a simple user interface to text analysis; 3) Google has aggressively pushed embedding its tools in content that’s elsewhere, whether it be the common search box that web authors can include on their pages, web traffic analytics, or even embedded YouTube videos (Google owns YouTube).
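Many of the tools listed above share a small set of core operations: concordances (keywords shown in context), frequency lists, and collocate lists (frequencies within the context of a keyword). The following is only a minimal Python sketch of those three operations, not the implementation used by any of the tools named here. The function names, the naive tokenization rule, and the window size are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately naive rule)."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()


def concordance(tokens, keyword, width=3):
    """Return keyword-in-context lines: up to `width` tokens on each side of every hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


def collocates(tokens, keyword, width=3):
    """Count the words that occur within `width` tokens of each keyword occurrence."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword:
            window = tokens[max(0, i - width):i] + tokens[i + 1:i + 1 + width]
            counts.update(window)
    return counts


# Hypothetical sample text, chosen only to keep the example small.
sample = "The cat sat on the mat. The dog sat on the log."
tokens = tokenize(sample)

print(frequency_list(tokens)[:3])
print(concordance(tokens, "sat", width=2))
print(collocates(tokens, "sat", width=1).most_common())
```

Each function takes the token list produced by the previous step, loosely echoing the modular, chainable design of the Unix tools at the top of the list. Real tools add indexing, markup awareness, and language-specific tokenization that this sketch leaves out.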