DeRose, Steven J. 1990.
Stochastic Methods for Resolution of Grammatical Category Ambiguity in Inflected and Uninflected Languages.
Ph.D. Dissertation. Providence, RI: Brown University Department of Cognitive and Linguistic Sciences.
Footnotes are numbered here by the original page on which they began, a dot, and then the original footnote number. Each footnote is linked from its point of origin in the text.
2.1Bradley (1983, p. 2) asserts that most of the vocabulary (of English) is ambiguous as to grammatical category. Indeed, any word form can conceivably be used as a proper noun, in which case all words are categorially ambiguous.
8.1This use of the term neologism is specialized: it refers to any word encountered in text which is to be tagged, but which is not included in the computer’s dictionary.
8.2These approximately correspond to the Brown Corpus tags NN, VB, JJ, RB, PP (and several sub-tags), CS and CC, UH, and IN, respectively.
11.1There are some instances, for example kako:n, which can be either a genitive plural noun (any gender, meaning ‘bad’), or an adverb (meaning ‘badly’).
11.2Smyth is in fact a grammar of Classical Greek; the forms shown do not differ from those of Koiné, except that the vocative was moribund in New Testament times.
19.1Oostdijk (1988) discusses some ways in which corpus linguistics can also contribute to the study of language and sub-language variation.
23.1The frequencies of forms of the verb time are: time/VB - 1; times/VBZ - 1; timed/VBD - 2; timed/VBN - 7; and timing/VBG - 5.
24.1As I discuss later, these dictionaries omit some forms, such as numbers.
30.1These terms were suggested by my colleague John Thomson.
34.1Thomason (1965, pp. 137-139) describes a stochastic automaton which probabilistically accepts input strings; that is, any particular string is only accepted some percentage of the times that it is presented. Thomason suggests uses in stochastic classification experiments, where recognizers may be compared on the basis of their long-run effectiveness. These automata, however, are not useful for my purposes, because one must recognize each sentence every time it is presented.
36.1With natural language it is advisable to raise all probabilities to values above 0, to allow effective processing in cases of novel or erroneous constructions, or cases not discovered during normalization.
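One way to raise every probability above 0 can be sketched as follows. This is a minimal illustration only: the floor value, the renormalization step, the function name floor_probabilities, and the sample table are all assumptions made for the example, not details taken from the dissertation.

```python
def floor_probabilities(probs, floor=1e-6):
    """Raise every probability to at least `floor`, then renormalize.

    Both the floor value and the renormalization are illustrative
    assumptions; the footnote only says that values should be above 0.
    """
    raised = {key: max(p, floor) for key, p in probs.items()}
    total = sum(raised.values())
    return {key: p / total for key, p in raised.items()}


# Hypothetical transition probabilities out of some tag; JJ was never observed.
transitions = {"NN": 0.7, "VB": 0.3, "JJ": 0.0}
smoothed = floor_probabilities(transitions)
print(smoothed)
```

After this step an unobserved transition is merely very unlikely rather than impossible, so a novel or erroneous construction no longer drives the probability of an entire path to zero.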
38.1The numbering of orders follows Shannon (1951, p. 51) and Kučera (1975, p. 128). Miller and Chomsky (1963, pp. 427-429) refer to k-limited Markov sources; their k corresponds to a k+1 order approximation, as they point out (p. 428). Abramson (1963, pp. 22-26) does not consider the zero order case properly a Markov model, but notes that Markov models simplify to it as expected.
39.1If the sentence boundary is not considered a symbol in its own right, an additional start state is needed. On the other hand, many of the internal arcs may have probability zero (more so at progressively higher orders), and may be omitted.
42.1Other units may be defined in terms of logarithms to other bases. Abramson (1963, p. 12) mentions nats, which are base e, and Hartleys, which are base 10. Respectively, these units approximately equal 1.44 and 3.32 bits.
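The two conversion factors follow from a change of logarithm base: one nat is log2(e) bits and one Hartley is log2(10) bits. A short check, in which the constant names are illustrative only:

```python
import math

# One nat is log2(e) bits; one Hartley is log2(10) bits.
BITS_PER_NAT = 1 / math.log(2)      # equals log2(e)
BITS_PER_HARTLEY = math.log2(10)

print(round(BITS_PER_NAT, 2))       # 1.44
print(round(BITS_PER_HARTLEY, 2))   # 3.32
```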
46.1These degrees of ambiguity are based on Francis and Kučera (1983), except that waters is not attested as a verb, though other inflected forms of the same lemma are.
46.2Technically, a span network is a tree. Cf. Liu (1977, pp. 82-161) for a discussion of fundamental graph theory.
51.1Previous writers have seldom addressed the problem of evaluating accuracy when the text being tagged algorithmically has not also been tagged via a consensus of human linguistic intuition. Presumably, accuracy figures in (e.g.) Klein and Simmons (1963), Booth (1985), and Church (1988) are based upon human spot-checking, and thus depend upon the thoroughness and accuracy of the checks.
52.1Kučera (personal communication) has noted that the best results for the Brown Corpus were obtained when a final consistency check was made by a single individual. Similarly, Church (personal communication) found that tag assignments by different native speakers sometimes have a higher error rate than stochastic algorithms.
55.1Johansson (1980) discusses some specific differences which have been discovered through comparison of the Brown and LOB corpora.
62.1For example, this is the stated policy of the American Heritage Dictionary (1982, p. 47), which includes all the parts of speech at one entry word....
63.1Choueka (personal communication) notes that on p. 152 the figure of 219,000 tokens should read 119,000, and that there are other printer’s errors to be watched for. Also, an early printing of the journal entirely omitted figures and tables.
68.1In the more common numbering; Abramson calls them second order.
73.1A collapsed set of 33 tag cover symbols is used instead of the full set of over 100 word level tags (Garside and Leech 1985, p. 168).
75.1I omit here the initial learning of grammatical categories per se. Similar methods may be applied; see also Kučera (1981) and Jelinek (1985) for discussion of systems which can learn grammatical categories.
79.1Newman (p. 246) refers to Robinson (1975, 1980), but these items do not appear in her bibliography. Newman (personal communication) indicates that the references are from a series of SRI Technical Reports available in microfiche.
80.1Early plans for the tagging suite are described in Garside and Leech (1982); an anticipated context-frame rule program similar to TAGGIT (see p. 115) was apparently not needed.
81.1Church (1988) refers to these data as Lexical Probability Estimates.
82.1With 88-100 tags, the number of potential triples is of the same order as the length of the Brown Corpus. A reliable table would thus have to be based on a much larger tagged corpus. Church (1988) has investigated the use of third order probabilities in disambiguation, and Jelinek (1986) has pursued similar approaches.
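The order-of-magnitude comparison can be checked directly. The sketch below assumes that every ordered triple of tags counts as a potential triple, and it takes the Brown Corpus word count from the Francis and Kučera (1983, p. 533) figure quoted in footnote 88.2; the function name potential_triples is illustrative.

```python
BROWN_WORDS = 1_013_644  # word tokens, per Francis and Kučera (1983, p. 533)


def potential_triples(n_tags):
    """Number of ordered tag triples, assuming any tag may fill any position."""
    return n_tags ** 3


for n_tags in (88, 100):
    triples = potential_triples(n_tags)
    # Ratio of table cells to corpus tokens: close to 1 means roughly one
    # observation per cell on average, far too few for reliable estimates.
    print(n_tags, triples, round(triples / BROWN_WORDS, 2))
# 88 681472 0.67
# 100 1000000 0.99
```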
84.1There is an oversight on p. 136 of Church (1988), where he comments in passing that the Brown Corpus tags were assigned laboriously by hand over many years; although careful verification of the tags was indeed a lengthy process, Greene and Rubin’s TAGGIT program (1971) performed most of the tagging correctly.
88.1Calculations of letter, bigram, and trigram frequencies have similarly excluded certain tokens, specifically words containing hyphens, apostrophes, and numbers (Solso and King 1976; Solso, Barbuto, and Juel 1979; Solso and Juel 1980).
88.2The frequency table in Francis and Kučera (1983, pp. 534ff) lists 55,291 sentence-ends, 3,383 hyphens, and 58,029 commas. The total token count reported differs by 2 from that of Francis and Kučera (1983, p. 533), namely 1,013,644 words plus 123,210 punctuation marks, or 1,136,854.
89.3Except for a few small or round numbers, and numbers labelling recent years.
90.4Owen (1987) discusses some probable effects of tag set differences on tagging accuracy.
99.5They are: the, of, and, to, a, in, that, is, was, he, and for (see Kučera and Francis 1967).
99.6See for example Gigley (1982), Morton (1982), and Wood (1978, 1980, 1982). These authors describe various experiments in which they created faults in NLP models such as parsers, and observed specific performance consequences. Cottrell and Small (1984, esp. p. 94) also discuss the lesionability of their parsing model.
103.7The text was drawn from samples kindly provided by the Providence Journal, and published in the last few years, more than 20 years after the texts which comprise the Brown Corpus. This gap should provide a harder test than would contemporaneous text, due to vocabulary and style changes.
103.8Difficulties of course arise at sentence boundaries, where words other than proper nouns are capitalized, and sentence boundaries themselves cannot always be easily assigned. Nevertheless, even a simplistic rule such as “all capitalized words not following periods mark proper nouns” can have a fairly high rate of accuracy.
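The simplistic rule can be sketched in a few lines. The sketch assumes pre-tokenized input in which the period is its own token, and it treats the start of the text as if it followed a period; both choices, the function name guess_proper_nouns, and the sample sentence are illustrative assumptions rather than part of the method described in the text.

```python
def guess_proper_nouns(tokens):
    """Flag capitalized tokens that do not immediately follow a period."""
    flagged = []
    previous = "."  # assumption: the first token starts a sentence
    for token in tokens:
        if token[:1].isupper() and previous != ".":
            flagged.append(token)
        previous = token
    return flagged


tokens = ["The", "paper", "is", "printed", "in", "Providence", ".",
          "It", "sells", "well", "."]
print(guess_proper_nouns(tokens))  # ['Providence']
```

As the footnote notes, the rule fails exactly where sentence boundaries are hard to assign: a proper noun that opens a sentence is missed, and a period that ends an abbreviation is only handled correctly here if it stays attached to its word.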
104.9Examples of words which display particular kinds of ambiguity (and which are therefore subject to particular classes of errors) may be found by consulting Appendix 4, which lists English ambiguity sets with an example word representing each set.
108.10Half of all ambiguity sets are represented in the Brown Corpus by only one word form each. It follows that the RAS table encodes substantial information about specific senses of many specific words.
Thet is a dialect form for that. It occurs in Genre N, Adventure and Western Fiction: Mrs. Roebuck thought Johnson was a sweet bawh t‘lah lahk thet. . . .
111.12The word though, which normally can be CC, CS, or RB, also occurs as VBD (probably an original-source typographical error for thought).
117.13The tags are AP$ (9), CD$ (5), DT$ (5), JJ$ (1), RB$ (9), and RN (9), as indicated via searching the dictionary already described; the table in Francis and Kučera (1983, pp. 534ff) shows the same totals except for one additional CD$ (probably on a number not in my dictionary), and no JJ$. The single occurrence of JJ$ is in a title (so JJ$-TL), on the word
Specifically, "Alexander the Great's".
129.1See Coombs, Renear, and DeRose (1987) for a detailed treatment of why human readable and mnemonically structured files are preferable for a wide range of computing applications. The principles described there greatly influenced the design of external formats for the data used by Volsunga.
132.1The list of tags is based on Kučera and Francis (1967), pp. 23-25. Tags shown in italic type are not shown in that source. Not shown above are “-HL,” a marker suffixed to the tags of words in headlines, and “-TL,” a marker suffixed to the tags of words within titles. Tags for conjunct words are conjoined by “+,” except for the tag “*”, which is affixed without a delimiter.