The following is a set of notes taken regarding VARD and what it entails.
VARD is a Java program that deals with spelling variation. It works as a pre-processor for other corpus linguistic tools such as keyword analysis, collocations and annotation (POS and semantic tagging), and can help improve their accuracy.
VARD 2 is derived from modern spell checkers. It allows for three different types of processing:
1. Manually: select the replacement from the candidates offered by the system.
2. Automatically: let the system use the best candidate replacement it finds.
3. Semi-automatically: train the tool on a sample of the corpus.
VARD was designed specifically for Early Modern English spelling variations but can be used to deal with possibly any form of spelling variation in any language by “plugging in” your own dictionary and spelling rules.
Running the Tool:
To run the main interface, use run.command on Mac, or open a command prompt, navigate to the VARD directory and type:
java -Xms256M -Xmx512M -jar gui.jar
Here, 256 and 512 set the initial (-Xms) and maximum (-Xmx) Java heap size in megabytes. Both can be increased for larger text files (100,000 tokens and up). The higher figure should be no more than half of the allotted RAM, and the lower figure should be half of the higher one.
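The sizing rule above can be sketched as a small helper. This is only an illustration in Python and not part of VARD: the function name `vard_command` and the parameter `total_ram_mb` are hypothetical, and it assumes the figure passed in is the amount of RAM (in megabytes) you are willing to allot.

```python
def vard_command(total_ram_mb, jar="gui.jar"):
    """Build the java launch command using the rule of thumb from the notes:
    the maximum heap is half the allotted RAM, and the initial heap is
    half the maximum heap."""
    max_heap = total_ram_mb // 2   # value for -Xmx
    min_heap = max_heap // 2       # value for -Xms
    return ["java", f"-Xms{min_heap}M", f"-Xmx{max_heap}M", "-jar", jar]


# With 1024 MB allotted, this reproduces the command shown in the notes.
print(" ".join(vard_command(1024)))
```

For larger texts, passing a larger figure raises both heap values while keeping the same ratio between them.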
VARD can process a variety of formats:
plain text, .rtf, and text containing SGML or XML tags
It cannot process .doc(x) or .pdf files.
There is no restriction on file extensions, and VARD will always attempt to process any input.
Main User Interface:
The main interface is used for manually normalizing single texts, normalizing a randomly selected sample for training, batch processing multiple texts, batch training VARD on previously normalized texts, and editing VARD’s setup options.
It is split into a main text area used during manual normalization, a sidebar on the right containing several panels that give access to various methods for interacting with VARD, and a toolbar.
Processing Single Texts:
Single texts are processed in an interface similar to modern word processing applications.
Spelling variants are highlighted in yellow. You can right-click on an item for options that depend on the type of word (variant, non-variant, normalized).
A text is processed by opening a file or pasting text in. First check that the setup is correct for the files and/or texts.
Spans can be marked as different languages, to be processed separately or not at all.
A span can be marked as any language, as long as that language is already present either in the list provided in setup or in another foreign XML tag in the text.
Select any span of text (click and drag mouse over) and mark it up using “mark as language”.
Methods and Confidence Scores:
Each replacement offered for normalization is given a confidence score based on predicted precisions.
All previous replacements are taken into account, as well as the current replacement being offered.
These scores combine to form an F-Score.
KV- Known Variants
LR- Letter Replacement
PM- Phonetic Matching
ED- Edit Distance
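To make the last of these methods more concrete, here is a minimal sketch of edit distance in Python. It is the generic Levenshtein algorithm, not VARD's own implementation, and it does not reproduce VARD's confidence scoring. The helper `rank_candidates`, the `max_distance` cut-off and the toy word list are all assumptions made for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def rank_candidates(variant, dictionary, max_distance=2):
    """Return dictionary words within max_distance of the variant,
    closest first (ties broken alphabetically)."""
    scored = sorted((edit_distance(variant, word), word) for word in dictionary)
    return [word for distance, word in scored if distance <= max_distance]


# Toy modern-spelling dictionary (hypothetical, for illustration only).
MODERN_WORDS = ["go", "good", "speak"]
print(rank_candidates("goe", MODERN_WORDS))
```

A small distance only suggests that two spellings are close. On its own it cannot tell which nearby word is the right normalization, which is presumably why it is one method among several.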
I contacted the developer of VARD:
“I will be working in conjunction with the Digital Humanities faculty at McGill University to assess whether or not VARD can be used to run variable parameters depending on metadata like date. The mechanics of running the set with different parameters is fairly straightforward with a bit of scripting wizardry, what’s less clear to me is how we can efficiently assess the quality of output for each run. I would also be interested in hearing…about potential strategies to go about creating a sliding scale based on date for spelling variations.”
To which Alistair Baron, faculty research fellow at Lancaster University and the developer of VARD, replied:
“This is an issue we’re actually working on a solution for a future version of VARD. Currently, the somewhat non-ideal way VARD can be used with different dates is to train multiple instances of VARD and then process sub-corpora separately. Currently, there’s no mechanism to do a sliding-window type training as you describe…Note, there is a new version (of VARD) coming soon, but this is unlikely to include the sliding-window work.”
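The workaround described in the reply starts with splitting the corpus into sub-corpora by date, so that each one can be handled by its own trained instance of VARD. A minimal sketch of that grouping step in Python is given here. The file names, years and the 50-year period width are made up for illustration, and the code does not call VARD itself.

```python
from collections import defaultdict


def period_for(year, width=50):
    """Return the (start, end) years of the fixed-width period containing year."""
    start = (year // width) * width
    return (start, start + width - 1)


def group_by_period(texts, width=50):
    """texts is an iterable of (filename, year) pairs.
    Returns a dict mapping each period to the list of files dated within it."""
    groups = defaultdict(list)
    for name, year in texts:
        groups[period_for(year, width)].append(name)
    return dict(groups)


# Hypothetical corpus metadata: (file name, year of composition).
corpus = [("letter_a.txt", 1523), ("letter_b.txt", 1548), ("play_c.txt", 1601)]
for period, files in sorted(group_by_period(corpus).items()):
    print(period, files)
```

Each resulting group would then be trained on and processed separately. Fixed periods like these are a coarse substitute for the sliding scale asked about in the email, which the reply says VARD does not currently support.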
In the following post I will summarize a few articles that I found helpful in understanding how researchers are currently using VARD and how applicable it is to languages other than English.