DeRose, Steven J. 1990.
Stochastic Methods for Resolution of Grammatical Category Ambiguity in Inflected and Uninflected Languages.
Ph.D. Dissertation. Providence, RI: Brown University Department of Cognitive and Linguistic Sciences.
This appendix originally began on page 127.
The programs used for the analyses which I have described total about 15,000 lines of source code written in the C language. They were developed on an Apple Macintosh II™ computer. Five megabytes of RAM are required for the larger analyses; the corpora occupy about 7 megabytes of disk space, and the dictionaries and other data occupy several more megabytes.
Using this configuration, it takes about 30 minutes to tag and disambiguate the entire Brown Corpus. The smaller Greek New Testament is proportionately faster. Using resolved ambiguity sets or a much smaller dictionary slows the system down somewhat.
There are separate packages for the Brown Corpus and for the Greek New Testament, which provide consistent mapping between mnemonic and numeric tags, the latter being used internally. Because the personal computer version of the Brown Corpus encodes all tags numerically, the encoding package is most commonly used to convert numeric tags to mnemonic ones before displaying them to the user. On the other hand, the Greek New Testament corpus keeps all tags in their mnemonic forms, and so the corresponding package is most commonly used to convert tags to numeric form when scanning the GNT for analysis. Both packages use simple array-based bisection searches on predefined lists of tags.
Because Volsunga uses numeric tags for all internal operations, tag conversion is performed only at the periphery of the system, thus enabling the same programs to be used with either corpus.
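A bisection search of the kind described above can be sketched as follows. This is a minimal illustration and not the code of the actual packages: the names tag_entry, tag_table, tag_to_code, and code_to_tag are hypothetical, the list is a seven-tag excerpt of Brown Corpus mnemonics, and the numeric codes are invented.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical excerpt of a predefined tag list, kept sorted by mnemonic
   (strcmp order) so that it can be searched by bisection.
   The numeric codes are invented for illustration. */
struct tag_entry {
    const char *mnemonic;
    int         code;
};

static const struct tag_entry tag_table[] = {
    { "AT",  1 },
    { "IN",  2 },
    { "JJ",  3 },
    { "NN",  4 },
    { "NNS", 5 },
    { "VB",  6 },
    { "VBD", 7 },
};
static const size_t tag_count = sizeof tag_table / sizeof tag_table[0];

/* Return the numeric code for a mnemonic, or -1 if it is not listed. */
int tag_to_code(const char *mnemonic)
{
    size_t lo = 0, hi = tag_count;          /* half-open range [lo, hi) */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int cmp = strcmp(mnemonic, tag_table[mid].mnemonic);

        if (cmp == 0)
            return tag_table[mid].code;
        if (cmp < 0)
            hi = mid;                       /* continue in the lower half */
        else
            lo = mid + 1;                   /* continue in the upper half */
    }
    return -1;
}

/* In this sketch codes are assigned in table order starting at 1,
   so the reverse mapping is a direct array index. */
const char *code_to_tag(int code)
{
    if (code < 1 || (size_t)code > tag_count)
        return NULL;
    return tag_table[code - 1].mnemonic;
}
```

Under these assumptions, tag_to_code("NN") yields 4 and code_to_tag(3) yields "JJ"; an unlisted mnemonic yields -1.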
The Brown Corpus and the Greek New Testament are stored in quite different formats. The Brown Corpus has tags interspersed between words. The GNT keeps tags in separate files, whose line and token structures are isomorphic. That is, the nth word on the mth line of a word file goes with the nth tag on the mth line of a tag file.
The input packages provide means for opening, closing, and reading word-tag pairs from corpus files. As with the tag encoding packages, corpus-specific input processing is handled in isolation at the periphery of the system; Volsunga itself has almost no need to know which corpus it is working on, and that information is obtained from the input package in use.
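The isomorphic line structure of the GNT files suggests a simple pairing step, sketched below under stated assumptions: tokens are taken to be separated by whitespace, and the names word_tag, next_token, and pair_line, as well as the 64-byte token limit, are hypothetical. The sketch pairs one line of a word file with the corresponding line of a tag file and reports an error if the two lines do not have the same number of tokens.

```c
#include <stddef.h>
#include <string.h>

#define MAX_TOKEN 64    /* hypothetical upper bound on token length */

struct word_tag {
    char word[MAX_TOKEN];
    char tag[MAX_TOKEN];
};

/* Copy the next whitespace-delimited token at *cursor into out
   (truncating if necessary) and advance *cursor past it.
   Returns 1 if a token was found, 0 at end of line. */
static int next_token(const char **cursor, char *out, size_t out_size)
{
    const char *p = *cursor + strspn(*cursor, " \t\r\n");
    size_t len = strcspn(p, " \t\r\n");
    size_t copy = len < out_size ? len : out_size - 1;

    *cursor = p + len;
    if (len == 0)
        return 0;
    memcpy(out, p, copy);
    out[copy] = '\0';
    return 1;
}

/* Pair the nth word of word_line with the nth tag of tag_line.
   Returns the number of pairs stored, or -1 if the two lines have
   different token counts or exceed max_pairs. */
int pair_line(const char *word_line, const char *tag_line,
              struct word_tag *pairs, int max_pairs)
{
    int n = 0;

    for (;;) {
        char w[MAX_TOKEN], t[MAX_TOKEN];
        int have_word = next_token(&word_line, w, sizeof w);
        int have_tag  = next_token(&tag_line, t, sizeof t);

        if (have_word != have_tag)
            return -1;          /* the lines are not isomorphic */
        if (!have_word)
            return n;           /* both lines exhausted together */
        if (n == max_pairs)
            return -1;          /* caller's buffer is too small */
        strcpy(pairs[n].word, w);
        strcpy(pairs[n].tag, t);
        n++;
    }
}
```

A caller would read line m from each file (for example with fgets) and pass the two buffers to pair_line; a return value of -1 signals that the word and tag files have drifted out of step.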
The dictionary package stores each word form (not including numerals and a few odd tokens), along with its total frequency and its frequency in each attested category. Storing the total frequency provides redundancy which is useful in verifying that the dictionary is valid. The dictionary is built from a tagged corpus. Internally, a hash table provides fast access, and space is saved by storing words in two different sizes of records, depending on word length and degree of ambiguity. This strategy is particularly effective because a large percentage of words are short and have only a few known tags, and therefore the smaller records are almost always used. There is a human-readable export format, which makes the data more portable should it be needed for other purposes.
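The redundancy check mentioned above can be illustrated with a sketch of the smaller record type. The layout is an assumption made for illustration only: the names small_entry, tag_freq, and entry_is_consistent and the limits of 12 characters and 4 tags are hypothetical, and the hash table itself is not shown.

```c
#include <stddef.h>

#define SMALL_MAX_WORD 12   /* hypothetical limits for the smaller record */
#define SMALL_MAX_TAGS 4

struct tag_freq {
    unsigned short tag;     /* numeric tag code */
    unsigned long  freq;    /* occurrences of the word with this tag */
};

struct small_entry {
    char            word[SMALL_MAX_WORD + 1];
    unsigned long   total;  /* total frequency, stored redundantly */
    unsigned char   ntags;  /* number of attested categories */
    struct tag_freq tags[SMALL_MAX_TAGS];
};

/* Return 1 if the stored total equals the sum of the per-tag
   frequencies, 0 otherwise. */
int entry_is_consistent(const struct small_entry *e)
{
    unsigned long sum = 0;
    size_t i;

    if (e->ntags == 0 || e->ntags > SMALL_MAX_TAGS)
        return 0;
    for (i = 0; i < e->ntags; i++)
        sum += e->tags[i].freq;
    return sum == e->total;
}
```

A word that is too long or too ambiguous for this record would go into the larger record type; the same check applies to both.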
The dictionary program can also manipulate dictionaries in useful ways. For example, it can sort the entries by word, by possible tags, or by frequency; it can sort words as if reversed, in order to facilitate the induction of productive suffixation rules; and it can extract various portions of the data and generate lexicostatistics.
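Sorting words as if reversed brings words with a common ending next to one another. A minimal sketch using the standard qsort function follows; the names compare_reversed and sort_reversed are hypothetical, and byte-wise comparison is assumed.

```c
#include <stdlib.h>
#include <string.h>

/* Compare two strings as if each were written backwards,
   without actually reversing them. */
static int compare_reversed(const void *a, const void *b)
{
    const char *s = *(const char *const *)a;
    const char *t = *(const char *const *)b;
    size_t i = strlen(s), j = strlen(t);

    while (i > 0 && j > 0) {
        unsigned char cs = (unsigned char)s[--i];
        unsigned char ct = (unsigned char)t[--j];

        if (cs != ct)
            return cs < ct ? -1 : 1;
    }
    /* One word is a suffix of the other: the shorter sorts first. */
    if (i == 0 && j == 0)
        return 0;
    return i == 0 ? -1 : 1;
}

/* Sort an array of word pointers in reversed-spelling order. */
void sort_reversed(const char **words, size_t n)
{
    qsort(words, n, sizeof words[0], compare_reversed);
}
```

Applied to the words "walked", "running", "talked", "sing", and "jumping", this ordering groups the two words in "-ed" together, followed by the three in "-ing".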
Transitional probability tables are maintained by another package. Such a table is kept in a simple 2-dimensional matrix, accessed on the basis of the (always numerically-coded) tags whose collocation is to be evaluated. Because the normal Greek tag set has 1180 members, and the resolved ambiguity tag set for English is also very large, this package can be set to store its conditional probabilities in cells of varying sizes. Small tag sets can take advantage of storing real number probabilities; larger sets can drop back to storing integer frequencies for both tags and tag pairs, and calculating conditional probabilities dynamically.
This package, like the dictionary package, gathers its data by scanning a tagged corpus, and can load and save its tables in a human-readable format.
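The integer-frequency variant described above can be sketched as follows. The names trans_table, trans_init, trans_count, trans_prob, and trans_free are hypothetical, and the estimate used, the unsmoothed relative frequency count(prev, next) / count(prev), is an assumption; the actual package may compute its probabilities differently.

```c
#include <stdlib.h>

struct trans_table {
    size_t         ntags;
    unsigned long *tag_freq;    /* tag_freq[i]: occurrences of tag i */
    unsigned long *pair_freq;   /* pair_freq[i * ntags + j]: tag i followed by tag j */
};

/* Allocate zeroed counts for ntags tags. Returns 1 on success, 0 on failure. */
int trans_init(struct trans_table *t, size_t ntags)
{
    t->ntags = ntags;
    t->tag_freq = calloc(ntags, sizeof *t->tag_freq);
    t->pair_freq = calloc(ntags * ntags, sizeof *t->pair_freq);
    if (t->tag_freq == NULL || t->pair_freq == NULL) {
        free(t->tag_freq);
        free(t->pair_freq);
        t->tag_freq = NULL;
        t->pair_freq = NULL;
        return 0;
    }
    return 1;
}

void trans_free(struct trans_table *t)
{
    free(t->tag_freq);
    free(t->pair_freq);
    t->tag_freq = NULL;
    t->pair_freq = NULL;
}

/* Add the tags of one tagged sentence to the counts.
   The caller guarantees 0 <= tags[i] < ntags. */
void trans_count(struct trans_table *t, const int *tags, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        t->tag_freq[tags[i]]++;
        if (i + 1 < n)
            t->pair_freq[(size_t)tags[i] * t->ntags + (size_t)tags[i + 1]]++;
    }
}

/* Conditional probability of next given prev, computed on demand
   from the stored integer frequencies. */
double trans_prob(const struct trans_table *t, int prev, int next)
{
    unsigned long denom = t->tag_freq[prev];

    if (denom == 0)
        return 0.0;
    return (double)t->pair_freq[(size_t)prev * t->ntags + (size_t)next]
         / (double)denom;
}
```

With 1180 tags the pair matrix has 1180 × 1180 = 1,392,400 cells, so the width chosen for each cell dominates the memory required.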
Volsunga per se loads an entire sentence at a time, and assigns all known tags to each word via dictionary lookup. Unknown words are first tested for conformity to known numeric formats (e.g. “$1”, “3.14”, “1,234,567”, “1/2”). This failing, unknown words are given the best several successor tags based on the preceding word. Options are available for trying all tags, or only those considered to represent open classes.
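A test for numeric formats might look like the sketch below. The rule it applies is an assumption inferred from the four examples given, not the program's actual rule: an optional leading dollar sign, then digits, with single commas, periods, or slashes permitted only between digits. The name looks_numeric is hypothetical.

```c
#include <ctype.h>

/* Return 1 if s looks like a number: an optional '$', then digits,
   with ',', '.', or '/' allowed only between two digits. */
int looks_numeric(const char *s)
{
    int prev_digit = 0;

    if (*s == '$')
        s++;
    if (!isdigit((unsigned char)*s))
        return 0;                   /* must start with a digit */
    for (; *s != '\0'; s++) {
        if (isdigit((unsigned char)*s))
            prev_digit = 1;
        else if ((*s == ',' || *s == '.' || *s == '/') && prev_digit)
            prev_digit = 0;         /* separator must be followed by a digit */
        else
            return 0;
    }
    return prev_digit;              /* must also end with a digit */
}
```

This accepts each of the four forms listed above and rejects ordinary words, a bare dollar sign, and tokens that end in a separator.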
Once every word has at least one potential tag, the optimally weighted path is determined by the dynamic programming method described in DeRose (1988). Once this is done, the tags assigned are compared to the actual tags read from the corpus, and accuracy counts are kept. The input text can also be written back out with the newly assigned tags and other information.
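The path search can be illustrated with a generic dynamic programming sketch of the Viterbi type. It is not the program's code, and it is not a restatement of the method in DeRose (1988). It assumes that a path's weight is the product of a lexical weight for each chosen tag and a transition weight for each adjacent pair of tags. All names and limits (best_path, candidates, NTAGS, MAX_WORDS, MAX_CANDS) are hypothetical.

```c
#include <stddef.h>

#define NTAGS     3     /* hypothetical tiny tag set */
#define MAX_WORDS 32    /* hypothetical sentence length limit */
#define MAX_CANDS 4     /* hypothetical limit on tags per word */

struct candidates {
    int    n;                   /* number of candidate tags for this word */
    int    tag[MAX_CANDS];      /* numeric tag codes */
    double weight[MAX_CANDS];   /* lexical weight of each candidate */
};

/* Find the highest-weighted tag sequence for a sentence.
   trans[a * NTAGS + b] is the weight of tag a being followed by tag b.
   Writes the chosen tags to out[0..nwords-1] and returns the path weight,
   or -1.0 if the input violates the limits above. */
double best_path(const struct candidates *sent, int nwords,
                 const double *trans, int *out)
{
    double score[MAX_WORDS][MAX_CANDS];
    int    back[MAX_WORDS][MAX_CANDS];
    double result;
    int    i, j, k, best;

    if (nwords < 1 || nwords > MAX_WORDS)
        return -1.0;
    for (i = 0; i < nwords; i++)
        if (sent[i].n < 1 || sent[i].n > MAX_CANDS)
            return -1.0;

    for (j = 0; j < sent[0].n; j++) {
        score[0][j] = sent[0].weight[j];
        back[0][j] = -1;
    }
    /* For each candidate tag, keep only the best path that ends in it. */
    for (i = 1; i < nwords; i++) {
        for (j = 0; j < sent[i].n; j++) {
            score[i][j] = -1.0;
            back[i][j] = 0;
            for (k = 0; k < sent[i - 1].n; k++) {
                double s = score[i - 1][k]
                         * trans[sent[i - 1].tag[k] * NTAGS + sent[i].tag[j]]
                         * sent[i].weight[j];
                if (s > score[i][j]) {
                    score[i][j] = s;
                    back[i][j] = k;
                }
            }
        }
    }
    /* Pick the best final candidate and follow the back-pointers. */
    best = 0;
    for (j = 1; j < sent[nwords - 1].n; j++)
        if (score[nwords - 1][j] > score[nwords - 1][best])
            best = j;
    result = score[nwords - 1][best];
    for (i = nwords - 1, j = best; i >= 0; i--) {
        out[i] = sent[i].tag[j];
        j = back[i][j];
    }
    return result;
}
```

Because only the best path into each candidate tag is retained, the work grows linearly with sentence length rather than with the number of possible tag sequences. For long sentences, summing logarithms instead of multiplying weights would avoid floating-point underflow.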
Despite the program being implemented on a Macintosh™, it forgoes an event-driven interface in favor of a more conventional command-line interface. One of the (relatively few) advantages of this approach is quite significant: it is easy to write a macro package which allows the program to carry out a lengthy series of operations without human intervention. For example, a macro can be defined which tags the entire Brown Corpus and reports accuracy sub-totals for each genre. This macro may then be invoked repeatedly by another macro, which changes tagging parameters between invocations. Another advantage of the command-line interface is that it is closer to the least common denominator of computer systems, easing the task of moving the programs from one system to another should that need arise. All programs provide instructions for command syntax, on request or when an invalid command is typed.