DeRose, Steven J. 1990.
Stochastic Methods for Resolution of Grammatical Category Ambiguity in Inflected and Uninflected Languages.
Ph.D. Dissertation. Providence, RI: Brown University Department of Cognitive and Linguistic Sciences.

This chapter originally began on page 58.

Chapter 3: Previous research

Previous work is considered here under three headings: first, research concerning grammatical categorization using non-stochastic methods; second, stochastic methods in other areas of natural language processing; third (and most relevant), research with stochastic methods specifically for grammatical categorization.

Grammatical categorization

Klein and Simmons (1963) describe an eclectic method directed towards initial tagging with a small tag vocabulary. A primary goal was avoiding the labor of constructing a very large dictionary of words and their potential tags (p. 335). This saving was a consideration of more import then than now (the program occupied 14K, which was half the total memory of the machine on which it ran). In lieu of a large dictionary, the method used affix analysis and some collocational criteria.

The Klein and Simmons algorithm uses 30 categories, which is rather fewer than in later efforts. The authors claim an accuracy of 90% in tagging. The algorithm proceeds as follows. First, the program looks up each word to be tagged in several dictionaries. The first dictionary contains about 400 unambiguous function words. Other dictionaries contain about 1,500 words which are exceptions to the computational rules used (p. 339).

The program next applies several limited analyses of form. Numerals are tagged ADJECTIVE. Then apparent plural and possessive endings are discovered. Finally, words are matched against a list of suffixes. The rationale for creating the suffix list is not described, but the implication appears to be that it was developed by repeatedly examining tagging errors and repairing them.

Last of all, context frame tests are applied. These work on scopes bounded by unambiguous words, as do later algorithms. However, Klein and Simmons impose an explicit limit of three ambiguous words in a row. For each such span, the pair of unambiguous categories which bound it is used to look up a list of all tag sequences known to occur between those bounding tags. All such sequences of the correct length become candidates. The program then matches the candidate sequences against the ambiguities remaining from earlier steps of the algorithm. In some cases, only one sequence is possible, and it is then used, disambiguating the words within the span.
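To make the span mechanism concrete, the following sketch (mine, not Klein and Simmons' program; the table of attested sequences and the candidate-tag sets are hypothetical) filters the tag sequences recorded for a pair of bounding tags against each ambiguous word's remaining candidates, and accepts a span only when exactly one sequence survives:

```python
# Illustrative sketch only; the table of attested sequences and the
# candidate-tag sets are hypothetical stand-ins for Klein and Simmons' data.

# Tag sequences attested between a pair of unambiguous bounding tags,
# keyed by (left_tag, right_tag); lengths count only the ambiguous words.
ATTESTED = {
    ("DET", "VERB"): [("NOUN",), ("ADJ", "NOUN"), ("NOUN", "NOUN")],
}

def resolve_span(left_tag, right_tag, candidates):
    """candidates: one set of possible tags per ambiguous word in the span.
    Returns the unique attested sequence consistent with the candidates,
    or None if zero or several sequences remain."""
    survivors = [
        seq for seq in ATTESTED.get((left_tag, right_tag), [])
        if len(seq) == len(candidates)
        and all(tag in cands for tag, cands in zip(seq, candidates))
    ]
    return survivors[0] if len(survivors) == 1 else None

# A word ambiguous between ADJ and VERB, then one ambiguous between NOUN
# and VERB, bounded by an unambiguous determiner and verb:
print(resolve_span("DET", "VERB", [{"ADJ", "VERB"}, {"NOUN", "VERB"}]))
# -> ('ADJ', 'NOUN'); only one attested sequence fits these candidates
```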

The samples used for calibration and for testing were quite limited. First, Klein and Simmons performed hand analysis of a sample [size unspecified] of Golden Book Encyclopedia text (p. 342). Later, when the program was run on several pages from that encyclopedia, it correctly and unambiguously tagged slightly over 90% of the words (p. 344). Further tests were run on small samples from Scientific American and Encyclopedia Americana. Klein and Simmons assert that original fears that sequences of four or more unidentified parts of speech would occur with great frequency were not substantiated in fact (p. 3). But the relatively small set of categories reduces the degree of ambiguity. Further, the extremely small and homogeneous test sample would not reveal either low-frequency ambiguities or moderate frequency phenomena such as long spans of ambiguous word forms.

The relationship between span length and frequency is a natural one. The total numbers of spans in the Brown Corpus, for each length from 3 to 19 (including the unambiguous bounding nodes), are: 397,111; 143,447; 60,224; 26,515; 11,409; 5,128; 2,161; 903; 382; 161; 58; 29; 14; 6; 1; 0; 1 (DeRose 1988, p. 33). The counts fall off roughly geometrically, so the relationship is linear on a logarithmic scale.
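A quick check of the figures just quoted (a sketch only, using the counts above) shows that each additional word of span length reduces the count by a roughly constant factor, which is what linearity on a logarithmic scale amounts to:

```python
# Span counts from DeRose (1988, p. 33), for lengths 3 through 19.
counts = [397111, 143447, 60224, 26515, 11409, 5128, 2161,
          903, 382, 161, 58, 29, 14, 6, 1, 0, 1]

# Successive ratios stay in a narrow band (roughly 2.0 to 2.8) until the
# counts become too small to be meaningful.
for length, (a, b) in enumerate(zip(counts, counts[1:]), start=3):
    if a and b:
        print(f"{length} -> {length + 1}: {a / b:.2f}")
```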

Greene and Rubin developed an algorithm to tag the Brown Corpus. The 86 tags which this algorithm uses have, with minor modifications, also been used in subsequent attempts including both CLAWS and Volsunga (see below). The rationale underlying the choice of tags is described on pp. 3-21 of Greene and Rubin (1971). Francis and Kučera (1982, p. 9) report that the algorithm, as implemented in a program called TAGGIT, correctly tagged approximately 77% of the million words of the corpus, with disambiguation of the remaining 23% performed manually. The accuracy figure was determined by hand-checking portions of the Corpus (Francis 1980, pp. 202-203).

This accuracy is substantially lower than that reported by other researchers, but with good reason. Klein and Simmons’ higher accuracy may be attributed to their much smaller set of categorial distinctions. The high accuracy of methods developed after Greene and Rubin’s has been possible in large part because the tagged Brown Corpus has provided a means of discovering lexicostatistical properties useful in tagging.

TAGGIT divides the task of category assignment into initial (potentially ambiguous) tagging, and disambiguation. The initial tagging is carried out as follows: First, the program consults an exception dictionary. Among other items, this contains all known closed-class words.

Then it handles various special cases, such as words with initial $, contractions, special symbols, and capitalized words. Hyphenated words are split up, numerals are tagged, and the prefix ‘UN’ is interpreted. Next, TAGGIT checks the word’s ending against a suffix list, and assigns an appropriate tag or tags if possible. S as an ending is treated specially (because it is both common and ambiguous). If TAGGIT has not assigned some tag(s) after these several steps, the word is tagged NN, VB, or JJ, in order that the disambiguation routine may have something to work with (Greene and Rubin 1971, p. 25).
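The flavor of this initial tagging stage can be sketched as follows (my illustration, not Greene and Rubin's code; the exception dictionary and suffix list shown are hypothetical fragments, and most of the real program's special cases are omitted):

```python
# Hypothetical fragments standing in for TAGGIT's real exception dictionary
# and suffix list; tag names follow the Brown Corpus tag set.
EXCEPTIONS = {"the": ["AT"], "of": ["IN"], "her": ["PP$", "PPO"]}
SUFFIXES = [("tion", ["NN"]), ("ly", ["RB"]), ("ed", ["VBD", "VBN"])]

def initial_tags(word):
    w = word.lower()
    if w in EXCEPTIONS:                  # exception dictionary first
        return EXCEPTIONS[w]
    if w.isdigit():                      # one of several special cases
        return ["CD"]
    for suffix, tags in SUFFIXES:        # then the suffix list
        if w.endswith(suffix):
            return list(tags)
    return ["NN", "VB", "JJ"]            # default candidate set, as described above

print(initial_tags("quickly"), initial_tags("wanted"), initial_tags("zorch"))
# -> ['RB'] ['VBD', 'VBN'] ['NN', 'VB', 'JJ']
```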

The exception dictionary includes 2,860 words. The suffix list includes 446 strings (Francis 1980, p. 201). Some of these entries are special cases, which would receive inappropriate tags given the algorithm just described. The lists were derived on the basis of lexicostatistics of the (then untagged) Brown Corpus.

After tagging, TAGGIT applies a set of 3,300 context frame rules. Each rule, when its context is satisfied, has the effect of deleting one or more candidates from the list of possible tags for one word. If the number of candidates is reduced to one, disambiguation has been successful (albeit not correct in every instance). Each rule can include a scope of up to two unambiguous words on each side of the ambiguous word to which the rule is being applied. This constraint was determined as follows (Greene and Rubin 1972[1971], p. 32):

In order to create the original inventory of Context Frame Tests, a 900-sentence subset of the Brown University Corpus was tagged . . . and its ambiguities were resolved manually; then a program was run which produced and sorted all possible Context Frame Rules which would have been necessary to perform this disambiguation automatically. The rules generated were able to handle up to three consecutive ambiguous words preceded and followed by two non-ambiguous words [a constraint similar to Klein and Simmons’]. However, upon examination of these rules, it was found that a sequence of two or three ambiguities rarely occurred more than once in a given context. Consequently, a decision was made to examine only one ambiguity at a time with up to two unambiguously tagged words on either side. The first rules created were the results of informed intuition.
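The effect of such a rule can be sketched as follows (again my illustration; the rule format is only a guess at the spirit of TAGGIT's notation, not its actual one): a rule whose unambiguous context matches deletes one or more candidate tags from the single ambiguous word under consideration.

```python
# Illustrative only: a context frame rule deletes candidate tags from one
# ambiguous word when up to two unambiguous tags on each side match.
def apply_rule(rule, left_tags, candidates, right_tags):
    """rule = (left_pattern, tags_to_delete, right_pattern), where the
    patterns are tuples of required unambiguous tags (at most two each)."""
    left_pat, deletions, right_pat = rule
    n_left, n_right = len(left_pat), len(right_pat)
    if n_left and tuple(left_tags[-n_left:]) != left_pat:
        return candidates
    if n_right and tuple(right_tags[:n_right]) != right_pat:
        return candidates
    remaining = [t for t in candidates if t not in deletions]
    return remaining or candidates       # never delete every candidate

# After an article (AT), delete the verb reading of an NN/VB ambiguity:
rule = (("AT",), {"VB"}, ())
print(apply_rule(rule, ["AT"], ["NN", "VB"], []))   # -> ['NN']
```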

As already noted, this algorithm tagged somewhat over 3/4 of all words correctly and unambiguously. The output of the program was a file, one word per line, showing each word followed by the tag or tags remaining. Human editors then chose from among the several tags available in cases of residual ambiguity. In some few cases, the correct tag was not in the set at all, and the editors would then specify it explicitly.

Yaacov Choueka and Serge Lusignan (1985) describe a method of context-based lemmatization, implemented via human native speakers rather than computationally. By lemmatization of a word they mean relating it to the corresponding entry in a standard comprehensive dictionary of the language L chosen once and for all in advance (Choueka and Lusignan 1985, p. 148). The original printing of this article contained a number of errors; my discussion here is based on a more accurate printing kindly provided by Yaacov Choueka.

It would seem that lemmatization is a problem quite distinct from that of grammatical categorization. And so it is when viewed from the standpoint of American lexicographic practice, in which a single dictionary entry includes all categorially distinct uses of a word, if they are homographic and (roughly) synonymous. It appears, however, that the rule is quite different in at least the particular French dictionary used by Choueka and Lusignan. For they note that a dictionary entry (p. 148)

will always contain three components: the lemma (DOG), some morphological attributes (noun, singular), and a semantical [sic] denotation (four-legged animal). Two dictionary entries will generally differ by their lemmas: DOG, CAT. They can share, however, the same lemmas and differ only in the morphological attributes; this can happen either with different semantic fields: FALL — verb (act of falling), FALL — noun (season), or similar ones: (to) DRESS — verb, (a) DRESS — noun. Finally, they can differ in their semantic fields. . . .

Given this approach to separating entries, their task involves categorial as well as semantic disambiguation. That is to say, in order to assign a categorially ambiguous graphic word to the correct entry, it must be categorially disambiguated, just as in order to assign a semantically ambiguous word to the correct entry, it must be semantically disambiguated.

The text used was part of the personal diary of Lionel Groulx, a historian and one of the leading figures of Quebec nationalism at the turn of the century . . . (p. 150). It consisted of 215,000 occurrences, representing 17,300 forms. Carroll (1967, p. 417) provides theoretical and observed type counts of 26,218 and 23,655 for a Brown Corpus subset containing 253,538 tokens; thus a form count of 17,300 is surprisingly small, considering the presence of inflections in French.

Investigations were carried out on 31 of the most frequent ambiguous words, selected to be representative.

The distribution of ambiguities (in the sense defined above) is supplied (p. 151), but is of limited reliability due to the small set of words investigated. There were 23 two-way ambiguities, 7 three-way, and 1 four-way. Using the three grammatical categories noun, verb, and other, the authors report that 42% (presumably 13) of the words were noun-verb ambiguities, 26% (8) noun-other, and 16% (5) verb-other, while 16% (5) were (semantic) ambiguities with no categorial distinction. 48% of the ambiguities were within the same semantic field. 55% of the ambiguities were highly skewed, in the sense that a word’s favored meaning accounted for over 90% of its occurrences.

The authors report the figures shown here in Table 7 for the ambiguity and the token coverage of the highest ranking types (p. 152):

Types   % tokens   % amb. types   % amb. tokens
100     55%        25%            23%
500     71%        ?              26%
Table 7: Type coverage (after Choueka & Lusignan 1985)

They predict, on unclear grounds, that a total of 30% of all tokens will be ambiguous (the figure for strictly categorial ambiguity for the Brown Corpus is over 48%; for the Greek New Testament, 47%). They also claim an upper limit of 20% on the proportion of ambiguous types, and suggest 15% is a better estimate (the figure for the Brown Corpus is at least 11%, more likely well over 20% in general; for the Greek New Testament, 9%).

Choueka and Lusignan had native speakers lemmatize words, given at most 2 words of context on each side of the word to be lemmatized. Interestingly, this is the same scope limit which the TAGGIT system used (Greene and Rubin 1971). The speakers were supplied with the set of possible lemmas for each word to be lemmatized, and a one-word context. In a particular instance, the speakers could choose to examine the two-word context as well.

An editor then listed all possible lemmas for each of the 31 graphemic types (their term is L-forms). Next, one or more members of a group of six native speakers was

asked to choose, for every word W assigned to him, the type of context (pre-, post-, or symmetric) that would be most effective for disambiguation; the list (automatically produced by the computer) of all the [instances of] contexts of W of that type [i.e., the type which the speaker considered most effective] was then handed to him.

Speakers chose to see the left context 68%, the right 22%, and both sides 10% of the time. The result was that (Choueka and Lusignan 1985, p. 152)

Of the 2,841 different 1-contexts listed and examined, 77% were disambiguation [sic — probably disambiguated] by the informants [apparently without requesting 2-word contexts]. For the remaining 23% of non-disambiguated 1-contexts, the different 2-context [sic] were examined and 63% of them were disambiguation [sic].

The error analysis is unclear. Lemmatization errors were detected by a post-editor, and classified as essential and unessential (p. 155); the latter were errors not related to the shortness of context. The short-context method as implemented via native speakers resulted in 1.6% (28 of 1,763) total incorrect lemmatizations, and 0.6% (11?) essential errors.

Choueka and Lusignan find it rather paradoxical that there was a .7% error rate for 1-word contexts, but a 3.9% error rate for 2-word contexts. However, the experimental design seems to me to make this unsurprising. The speakers are given the 1-word context first. They only see the 2-word context when they have already (by their own intuition) failed to decide given the 1-word context. Thus, the sample set from which the 2-word context’s error rate is calculated is highly biased toward difficult constructions. Particularly in the case of categorial ambiguities (which constitute 84% of the cases), considering non-adjacent words is only slightly likely to resolve an already bad situation; and the 2-context error rate is the measure of that slightness, not of the effectiveness of 2-contexts per se.

The authors break down error rates by whether a word includes a noun lemma among its possibilities; ambiguities of this sort had an error rate of 2.5%, while others had 0.4%. Verb-verb (semantic) ambiguities were also somewhat difficult cases, at 1.3% (p. 156).

The work of Choueka and Lusignan (1985) is of significant interest because it provides quantitative information on the characteristics of ambiguity in an additional language, in particular one with rather more inflectional morphology than English. Also, the effectiveness with which native speakers can utilize even the shortest of contexts is impressive. It is, however, unfortunate that no breakdown was given of errors with categorial versus semantic ambiguity. Also, the way in which morphology was handled was not discussed at all; was each inflected form considered a different word? If so, it is significant that so high a degree of categorial ambiguity was found; if not, the procedures by which inflected forms were reduced are relevant to evaluating the overall method.

Choueka has also conducted extensive research on the automatic analysis of Hebrew morphology. Hebrew poses particularly complex problems, because the number of forms of a given verbal lemma may be over 20,000 (Choueka 1980, pp. 162-163). The presence of infixes as well as prefixes and suffixes complicates analysis further, as do phonological processes which may in effect conceal similarities between word forms, and a highly productive system of compound formation. The methods used for reducing fully inflected forms (of which Hebrew may have 100 million (ibid.)) to compounds (of which there are a manageable 2.5 million) are described in detail in Attar, Choueka, Dershowitz, and Fraenkel (1978); they mainly involve generating potential divisions of the word form on the basis of prefix and other lists. The morphological analyzer is applied in generating the set of word forms to be used as retrieval keys for a large database. A request for testify, for example, expands to 336 specific forms which are found in the database 53,000 times (Choueka 1987, p. 32). Choueka (1980, p. 158) found that the method generally retrieved 98% of the documents relevant to a given request, though it tended to over-retrieve to varying degrees, with an average of about 86% of returned references being considered relevant.
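The segmentation idea can be illustrated with a toy sketch (mine, not Attar et al.'s system), using transliterated forms and a hypothetical stem list: candidate divisions are generated by repeatedly peeling off known prefixes and checking the remainder against the lexicon.

```python
# Toy sketch (not Attar et al.'s system): candidate divisions of a
# transliterated word form are generated by peeling off known proclitic
# prefixes and checking the remainder against a hypothetical stem list.
PREFIXES = ["ve", "ha", "be", "le", "ke", "mi"]
STEMS = {"bayit", "sefer", "melekh"}

def divisions(form, prefixes=()):
    results = []
    if form in STEMS:
        results.append((prefixes, form))
    for p in PREFIXES:
        if form.startswith(p) and len(form) > len(p):
            results.extend(divisions(form[len(p):], prefixes + (p,)))
    return results

print(divisions("vehabayit"))   # -> [(('ve', 'ha'), 'bayit')]
```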

Stochastic methods in other areas of CL

Stochastic methods have been applied to natural language at various levels. The work discussed in this section is organized very roughly in order of those levels, i.e., working up from letters or sounds toward syntax and semantics.

Shannon (1951) asked human subjects to predict the next character of a partially presented sentence. The accuracy of their predictions increased with the number of characters presented, though as one would expect, errors occur most frequently at the beginning of words and syllables where the line of thought has more possibility of branching out (Shannon 1951, p. 55). Since entropy is a measure of the uncertainty involved in successive symbols of a message, it can be measured directly by measuring native speakers’ uncertainty.

From such data, Shannon calculates the entropy of English characters for preceding contexts of 0 to 15 letters. The upper and lower bounds are monotonically decreasing, and are nearly level from 10 letters on. The 10th order entropy is then 1.0 <= H10 <= 2.1 bits per letter. Shannon also shows 0.6 <= H100 <= 1.3 bits per letter. Burton and Licklider (1955) extended this experimental work to preceding contexts of 16, 32, 64, 128, and 10,000 characters. The entropy found was stable after about 32 characters of context.
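Shannon's bounds come from the guessing experiment itself, but the quantity being bounded can be estimated directly from corpus counts. The following sketch (mine; a raw maximum-likelihood estimate, not Shannon's procedure) computes the k-th order conditional entropy of a text:

```python
from collections import Counter
from math import log2

def conditional_entropy(text, k):
    """Estimate H_k, the entropy of the next letter given the k preceding
    letters, in bits per letter, from raw n-gram frequencies in `text`."""
    contexts = Counter(text[i:i + k] for i in range(len(text) - k))
    ngrams = Counter(text[i:i + k + 1] for i in range(len(text) - k))
    n = sum(ngrams.values())
    return -sum((c / n) * log2(c / contexts[g[:k]]) for g, c in ngrams.items())

sample = "the quick brown fox jumps over the lazy dog " * 50
for k in range(4):
    print(k, round(conditional_entropy(sample, k), 3))
```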

Solso and King (1976), Solso, Barbuto, and Juel (1979), and Solso and Juel (1980) provide tables of the relative frequencies of letters, letter pairs, and letter triples in English, based on the Brown Corpus. Their counts are particularly useful because they include separate totals for various positions within words. Also, they provide versatility figures, where the versatility of a letter or letter pair is the number of different word forms in which it occurs.

Abramson (1963, pp. 33-38) presents text generated by Markov models of several languages at the character level. Third order approximations were determined for the languages in question, and the resulting FSAs generated sample text in accordance with the probabilities found. The results for four different western languages follow (modified by conversion to lower case, and with spaces in place of underscores):

  1. jou mouplas de monnernaissains deme us vreh bre tu de toucheur dimmere lles mar elame re a ver il douvents so

  2. bet ereiner sommeit sinach gan turhatt er aum wie best alliender taussichelle laufurcht er bleindeseit uber konn

  3. rama de lla el guia imo sus condias su e uncondadado dea mare to buerbalia nue y herarsin de se sus suparoceda

  4. et ligercum siteci libemus acerelin te vicaescerum pe non sum minus uterne ut in arion popomin se inquenque ira

These samples, simplistic though the model which generated them may be, are easily identifiable as to language. The first three are French, German, and Spanish. Abramson leaves identification of the fourth as an exercise for the reader; it is almost certainly Latin. Although this method does not apply directly to the task of grammatical categorization, the samples illustrate the effectiveness of Markov models in modelling salient characteristics of natural language.
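A generator of this kind is easily sketched (mine, not Abramson's program). Conditioning on the two preceding characters corresponds to the third order approximations shown above; the training file named here is of course hypothetical.

```python
import random
from collections import Counter, defaultdict

def train(text, k=2):
    """Count, for each k-character context, how often each character follows."""
    model = defaultdict(Counter)
    for i in range(len(text) - k):
        model[text[i:i + k]][text[i + k]] += 1
    return model

def generate(model, k=2, length=120, seed="th"):
    out = seed
    for _ in range(length):
        followers = model.get(out[-k:])
        if not followers:
            break
        chars, weights = zip(*followers.items())
        out += random.choices(chars, weights=weights)[0]
    return out

# "sample.txt" is a hypothetical plain-text file in the target language.
corpus = open("sample.txt", encoding="utf-8").read().lower()
print(generate(train(corpus), seed=corpus[:2]))
```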

Miller and Chomsky (1963) discuss similar approximations to English, drawn from Shannon (1948) and Miller and Selfridge (1950):

  1. Zero order letter approximation (equiprobable):

    xfoml rxkhrjffjuj zlpwcfwkcyj ffjeyvkcqsghyd qpaamkbzaacibzlhjqd.

  2. First order letter approximation:

    ocro hli rgwr nmielwis eu ll nbnesebya th eei alhenhttpa oobttva nah brl.

  3. Second order letter approximation:

    on ie antsoutinys are tinctore st be s deamy achin d ilonasive tucoowe at teasonare fuso tizin andy tobe seace ctisbe.

  4. Third order letter approximation:

    in no ist lat whey cratict froure birs grocid pondenome of demonstures of the reptagin is regoactiona of cre.

  5. First order word approximation:

    representing and speedily is an good apt or came can different natural here he the a in came the to of to expert gray come to furnishes the line message had be these.

  6. Second order word approximation:

    the head and in frontal attack on an english writer that the character of this point is therefore another method for the letters that the time of who ever told the problem for an unexpected.

  7. Third order word approximation:

    family was large dark animal came roaring down the middle of my friends love book passionately every kiss is fine.

  8. Fifth order word approximation:

    road in the country was insane especially in dreary rooms where they have some books to buy for studying greek.

Miller and Chomsky suggest (1963, p. 429) that the sequences produced by k-limited Markov sources cannot converge on the set of grammatical utterances as k increases. This may well strictly be true; but as with Abramson’s examples, the most striking feature of the data is how closely Markov models can approximate actual language, despite being based on entirely localized and asemantic probability tables. And it is certainly the case that stochastic methods model a wide range of observed natural language (i.e., performance) with substantial accuracy. Hockett (1961, p. 220) comments that because hearers have only partial knowledge during sentence processing, for the hearer, then, a grammatical system must be viewed as a stochastic process. He goes on to argue, contra Chomsky, that Markov models can approximate the grammar of a language as accurately as one wishes.

Oshika et al. (1988) apply Markov modelling of a similar kind to sorting proper names by language of origin. They note that current techniques for handling variant spellings, such as SOUNDEX, are relatively ineffective for non-European names. One example they give is that SOUNDEX rules delete most vowels, but many Chinese surnames are distinguished only by vowels.

A separate Markov recognizer for each language is used to assign a probability to a name, expressing how likely the name is as a character string in that language. The most likely language is chosen, and then language-specific rewrite rules are applied, which are intended to model likely spelling variations in the chosen language. Tests improved name retrieval from a database from 69% to 80% on the relatively small sample reported.
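The recognition step can be sketched as follows (my illustration, not Oshika et al.'s system; the training name lists are toy stand-ins): each language's character bigram model scores the name, with boundary symbols and a crude floor for unseen transitions, and the highest-scoring language is chosen.

```python
from collections import Counter, defaultdict
from math import log

def bigram_scorer(names):
    """Character-bigram scorer (log probability, with a crude floor for
    unseen pairs) trained on a list of example surnames for one language."""
    counts, totals = defaultdict(Counter), Counter()
    for name in names:
        chars = "^" + name.lower() + "$"
        for a, b in zip(chars, chars[1:]):
            counts[a][b] += 1
            totals[a] += 1
    def score(name):
        chars = "^" + name.lower() + "$"
        return sum(log((counts[a][b] + 0.5) / (totals[a] + 0.5 * 30))
                   for a, b in zip(chars, chars[1:]))
    return score

# Toy training lists; a real system would use large files of surnames.
scorers = {
    "italian": bigram_scorer(["rossi", "bianchi", "romano", "ricci", "moretti"]),
    "japanese": bigram_scorer(["tanaka", "suzuki", "yamamoto", "kobayashi", "saito"]),
}

name = "fujimoto"
print(max(scorers, key=lambda lang: scorers[lang](name)))
```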

Miller and Selfridge (1950) report investigations similar to Shannon’s, but exploring native speakers’ knowledge of collocational probabilities for words, rather than characters. The same experimental paradigm could not be used, because speakers may take a long time to guess the next word of an utterance successfully, whereas they should never require more than 26 tries for letters. In addition, no adequate estimates of such probabilities were available, which could be used to generate random text on which to test the subjects. Particularly above second order, such a tabulation would be exceedingly long and tedious to compile (p. 180).

Therefore Miller and Selfridge make use of native speaker knowledge to generate various Markovian approximations to natural language text (p. 180):

At the second order, for example, a common word, such as he, it or the, is presented to a person who is instructed to use the word in a sentence. The word he uses directly after the one given him is then noted and later presented to another person who has not heard the sentence given by the first person, and he, in turn, is asked to use that word in a sentence. . . . This procedure is repeated until the total sequence of words is of the desired length.

This method is used to generate sentences with varying lengths, and with varying orders of approximation up to 7. The first order approximation is obtained by randomly choosing words from the set of all tokens generated by speakers while providing the higher approximations. The intent is to generate text with varying degrees of what can loosely be called ‘meaningfulness’ (p. 179). Having given a formal scale for some of the range between nonsense and natural text, they measure people’s ability to recall the sentences. The percentage of words correctly recalled increases as the order of approximation is increased, and decreases as the length of the list is increased (p. 182). For the shortest sentences (10 words) recall is nearly as good for second order text as for natural text. Longer sentences require higher orders of approximation in order to be recalled as well as natural text; with 50 word sentences the recall percentage is nearly the same for fifth and seventh order and for natural text.

An example of text generated by the fifth order model is (p. 183) old New-York was a wonderful place wasn’t it even pleasant to talk about and laugh hard when he tells lies he should not tell me the reason why you are is evident. Such higher order examples are close enough to sounding natural that they might conceivably pass for mildly aphasic speech, or perhaps even for normal speech with (not unusual) pragmatic elements such as interruptions and occasional shifts of thought. Undoubtedly the semblance of normality would decrease in the context of a larger discourse, but even so the model’s effectiveness despite lack of anything which would normally be considered linguistic competence seems to me a strong argument for further investigation of stochastic methods in general. Miller and Selfridge suggest that their results indicate that meaningful material is easy to learn, not because it is meaningful per se, but because it preserves the short range associations that are familiar. . . .

Damerau (1971) reports experiments with generating text using word-based Markov models up to fifth order for English. He obtained grammaticality judgements from native speakers, and found that the degree of grammaticalness, specially defined, did increase with increasing order of approximation. This finding lends additional support to Miller and Selfridge’s (1950) finding that texts generated by models of varying orders vary correspondingly in their acceptability to the human language processing system. Damerau also found that native speakers’ judgements of grammaticality on natural text varied greatly, emphasizing the need for particular care in the use of such judgements.

Jelinek (1985, 1986) discusses the use of statistical methods to select an optimal interpretation of a given acoustic input into words, given knowledge of probabilities of co-occurrence. He also presents some solutions to the difficulty of estimating collocational probabilities from a corpus which, though very large, is inevitably too small when used for deriving third order Markov models for word sequences.

Appendixes deal with the problem of optimally partitioning a vocabulary into equivalence classes: discovering word classes that might prove a better basis of language modeling than conventional parts of speech (Jelinek 1985, p. 41). The word classes are chosen to maximize their predictive power, based upon a sophisticated mathematical model of utterance probabilities.

Beale (1985) and Garside and Leech (1985, 1987) report briefly on the CLAWS categorial disambiguation system (see below), then go on to present a derivative method for assigning clause- and phrase-level hypertags or T-tags. These tags form a labelled bracketing of higher level structures in the corpus.

Table 8 shows the major clause and phrase tags employed in this system. Most of them have finer subdivisions. See Beale (1985, pp. 164-165) for further details.

T-tag Meaning
A As-clause
D Determiner phrase
E Existential THERE
F Finite-verb clause
G Germanic genitive phrase
J Adjective phrase
L Verbless clause
M Number phrase
N Noun phrase
P Prepositional phrase
R Adverbial phrase
S Sentence
T Non-finite-verb clause
U Exclamation or [Grammatical] . . . Isolate
V Verb phrase
W WITH clause
X NOT separate from verb
Y ‘Wild card’
Table 8: Syntactic tags, Beale (1985)

Garside and Leech (1985, p. 166) credit Atwell (cf. Atwell 1983) with formulating the method for applying the techniques of CLAWS to syntactic level tagging. Phrasal and clausal categories and boundaries are assigned on the basis of the likelihood of word tag pairs opening, closing or continuing phrasal and clausal constituencies [sic] (Beale 1985, p. 160). Each tag pair has a particular set of boundaries which it entails. The T-tag table was initially constructed by linguistic intuition (Beale 1985, p. 162).

Extraneous markers can be inserted by this process, and there is no guarantee that open and close brackets will match. Therefore after a text has been annotated via the boundary lookup table, any unclosed bracketings are closed from right to left. The close operation involves (apparently non-deterministic) tree construction. CLAWS-like network probability maximization is then performed to eliminate inappropriate taggings.

Since no clausally tagged corpus was available, about 1,500 (Beale 1985, p. 161) to 2,000 (Garside and Leech 1985, p. 166) sentences were manually parsed according to a Case Law Manual (unpublished manuscript by G. R. Sampson, cited in Beale 1985). Presumably, the lookup table was then derived by statistical analysis of the parsed sentences.

An evaluation of the method’s accuracy would be of interest, but has not been reported as of this writing; also, the statistics of boundary distribution should be analyzed. A clear drawback of the method is that it only inserts individual left or right boundaries, although each constituent in a labelled bracketing has scope, and hence has two related ends. Grammatical categories, on the other hand, are generally though not always predicable of a single point, namely a word. Thus, the problem of introducing unmatched tags is qualitatively new, and may diminish the effectiveness of low order collocational methods. I am not entirely convinced that this method provides significant advantages over other parsing techniques.

Anderson and Murphy (1986), working in the context of neural modelling, present a method by which high-dimensionality vectors can learn concepts. Though not a stochastic method in the usual sense, it is based upon a formal mathematical model which can be construed as probabilistic, and hence is most appropriately presented at this point. In short, once trained with a number of stimuli similar to a particular prototype, the model can recognize the prototype (as well as novel distortions of it). Further, it can make conceptual associations to given stimuli, which are appropriate given the previously observed contexts. See the article for discussion and references on human cognitive analogs to this process.

Although a discussion of the applications of neural modelling techniques in general, or even of Anderson and Murphy’s work in particular, is beyond the scope of this dissertation, I will pause to make a few comments on possible connections, and hope to encourage further research in this area.

Anderson and Murphy (1986, p. 330), and also Kawamoto (1985), have applied parallel associative memory models to the resolution of lexicosemantic ambiguity in English. A memory vector was trained by the presentation of many stimuli, which can be likened to sentences in that they contained co-occurrences of the lexical items. When lexical items were later presented in isolation or in semantically coherent groups, the system correctly retrieved the meanings which were associated with the relevant readings of the lexical items presented.

For any particular sample, of course, other methods may be more effective; one can always make explicit a particular set of ambiguities and generalizations about word relationships. The strengths of the neural modelling approach lie in other areas. First, such a system learns without subjective intervention. Also, it is robust in cases of aberrant or non-typical inputs. And third, it is robust in the face of damage, because information is not localized.

In relation to categorial disambiguation, one can view each sentence heard as a claim about the possible collocations of categories. The notions of adjacency and proximity are highly salient in neural modelling; let us assume the very simple situation in which, as sentences are heard, each successive n-tuple of categories is composed into the memory state vector. Then, if this model is successful, later presentations of categorial contexts with specific gaps or ambiguities can be interpreted.

This problem seems almost identical to the lexicosemantic one just described. Some words can represent several grammatical categories, just as some can represent several meanings. Categories co-occur with very different probabilities, just as do word meanings. Thus it is reasonable to predict that appropriate and inappropriate assignments of grammatical categories can be distinguished by the categories with which they are presented.

Boggess (1988) describes the use of collocational analysis in a rather different domain: text production systems for the handicapped. The method involves predicting a likely set of words to follow the currently available (left-) context of an utterance. This list is then presented as a menu, frequently saving the user keystrokes. Boggess (1988, p. 33) defines as high frequency those types which, taken together, account for half of all tokens in a text. She notes that in the Brown Corpus 100 types account for 47% of all tokens; the corresponding table in Kučera and Francis (1967, pp. 300-307) also shows that 135 types are needed to surpass 50%. Boggess notes that in Thackeray’s novel Vanity Fair only 75 types are needed to account for 50% of all tokens. Moreover, only 50 types account for 50% of the 20,000 tokens of English uttered (by typewriter) by their experimental subject Sherri.

She asserts that providing the user with a menu of the top 20 types would obviate typing 30% of all tokens. However, taking advantage of differences in the distribution of words with respect to serial positions in sentences would yield a success rate of 40 per cent (Boggess 1988, p. 34). It would be interesting to see a graph of predictive accuracy versus sentence position; I would predict a sharp drop after the first few words due to the combinatorial explosion in potential sentential structures. Also, the number of keystrokes saved, rather than words saved, should be considered, because high frequency types tend to be quite short.

A second method, which I find more convincing, dynamically alters the menu of word types not on the basis of sentence position, but on the basis of the preceding word. This is a second order Markov approximation, dynamically updated in accordance with the individual user’s behavior. For each of the 20 most frequent successors to each of the 50 most frequent words, a second list of successors is also kept, as in a third order model. Boggess (1988, p. 36) notes that

This algorithm is related to, but takes less memory and is less powerful than a full-blown second [what I am calling third] order Markov model. . . . For an input vocabulary of [only!] 2000 words, the number of mathematically possible states in a trigram Markov model is 4,000,000, with more than 8 billion arcs interconnecting the states.
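A dynamically updated successor menu of the kind described above can be sketched as follows (my illustration, not Boggess' program): counts of which words have followed each word in the user's own output are kept, and the most frequent successors are offered as the menu.

```python
from collections import Counter, defaultdict

class SuccessorMenu:
    """Second order (bigram) predictive menu, updated from the user's own text."""
    def __init__(self, menu_size=20):
        self.menu_size = menu_size
        self.successors = defaultdict(Counter)

    def update(self, words):
        for prev, nxt in zip(words, words[1:]):
            self.successors[prev][nxt] += 1

    def menu(self, prev_word):
        return [w for w, _ in self.successors[prev_word].most_common(self.menu_size)]

m = SuccessorMenu(menu_size=3)
m.update("i want to go to the store and i want to sleep".split())
print(m.menu("to"))   # e.g. ['go', 'the', 'sleep'] with these counts
```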

Maintaining fourth (i.e., fifth) order Markov probability tables for even a minimal vocabulary of the top 50 words plus other and sentence-end led to 250 new states and 450 new arcs per 1,000 new words of text, after 17,000 words of input. Even after 100,000 words of input, 22,000 states, and 45,000 arcs, growth was still rapid (that is, new transitions were being attested).

Boggess’ examination of these fifth order Markov chains on words is to my knowledge exceeded only by Miller and Selfridge’s work with seventh order approximations (1950, see above). A full fifth order probability table for a more realistic vocabulary of 20,000 words is entirely beyond the reach of current computer technology. It would include values for 20,000^5, or 3.2 x 10^21, potential transitions; virtually all of these transitions would inevitably remain unattested. Indeed, 4 billion speakers uttering 100 novel transitions every second would take 250 years to utter one instance of each (the corresponding seventh order transitions would of course take 400 million times longer). Even the attested subset of the table would probably far surpass the storage limitations of any current computer. As already mentioned, analysis of grammatical categories is vastly more tractable; even a fifth order table for 88 tags has only 5 billion potential cells, most of which might be empty. Some computers can handle this size even now, and obtaining an amount of text of the same order of magnitude is no longer inconceivable.
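The arithmetic behind these figures is worth making explicit (a quick check of the numbers quoted above):

```python
# One cell per ordered k-tuple: a k-th order table over a vocabulary of v
# items has v**k potential transitions.
print(20_000 ** 5)                  # 3.2e21 five-word transitions for 20,000 words
print(88 ** 5)                      # about 5.3e9 cells for a fifth order tag table
print(88 ** 3)                      # 681,472 tag triples (cf. the Church discussion below)
seconds = 20_000 ** 5 / (4e9 * 100)         # 4 billion speakers, 100 novel transitions/second
print(round(seconds / (3600 * 24 * 365)))   # roughly 250 years
```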

Ejerhed (1988) compares the effectiveness of regular expression grammars and the stochastic methods of Church (1988) for identifying basic clauses in unrestricted text. This work applies to detecting large prosodic units for a text-to-speech system; the stochastic methods used are discussed in more detail in Church (1988). In short, the transitional probability system was normalized on a text which included indicators for the beginning and end of each basic clause, and was then applied to find similar boundaries in additional text.

In earlier work (Ejerhed 1987) the same two methods are compared with respect to their ability to extract noun phrases (Ejerhed 1988, p. 220):

The regular expression output had 6 errors in 185 noun phrases, i.e. a 3.3% error rate. The stochastic output had 3 errors in 218 noun phrases, i.e. a 1.4% error rate. Both results must be considered good in the absolute sense of an automatic analysis of unrestricted text, but the stochastic method has a clear advantage over the regular expression method.

It is not clear why the two methods were apparently not applied to precisely the same input text (as shown by the difference in the number of noun phrases). Nevertheless, the accuracy is quite encouraging.

The analogous comparison for clause retrieval shows the regular expression method made 40 mistakes in handling 308 clauses (13% error rate), while the stochastic method made 21 mistakes in handling 304 clauses (6.9% error rate). As with parsing noun phrases, the stochastic method demonstrated considerably greater accuracy.

As Ejerhed notes, it is possible that improvements to the regular expressions, or the addition of new matching mechanisms to the regular expression language, could improve their accuracy. But it remains clear that stochastic methods are a viable alternative to more conventional methods of parsing and structural retrieval.

Newman (1988) views the resolution of word-sense and attachment ambiguities as a combinatorial problem. Her method applies only after categorial disambiguation, and after a parser has identified one or more parses based primarily on syntactic considerations (p. 243). The alternative semantic readings are organized into a sequence delineated by choice points. The options at each point are assigned probabilities, and a best-first search algorithm is used to find the optimal path of choices. This improves the efficiency of resolution of multiple ambiguities.
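The search itself can be sketched generically (my sketch, not Newman's implementation): given a probability for each option at each choice point, partial paths are expanded in best-first order until a complete path is reached.

```python
import heapq
from math import log

def best_first(choice_points):
    """choice_points: one list of (option, probability) pairs per choice point.
    Expands the most probable partial path first (costs are negative log
    probabilities) and returns the jointly most probable complete sequence."""
    heap = [(0.0, [])]                           # (cost so far, options chosen)
    while heap:
        cost, path = heapq.heappop(heap)
        if len(path) == len(choice_points):
            return path
        for option, p in choice_points[len(path)]:
            heapq.heappush(heap, (cost - log(p), path + [option]))

choices = [[("bank: riverside", 0.3), ("bank: institution", 0.7)],
           [("attach to verb", 0.6), ("attach to noun", 0.4)]]
print(best_first(choices))   # -> ['bank: institution', 'attach to verb']
```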

Newman also notes that Robinson (1982) and Bates (1976) use weights predicated of phrase structure rules and ATN arcs to evaluate parsing alternatives. Walker and Paxton (1977) use heuristics based on probabilities of various subtrees to optimize search. And finally, Schubert (1986) controls combinatorial explosion by throwing out all but the two best parses of phrases, based on weights predicated of various attachments in relation to various syntactic and contextual factors. These various methods provide a significant quantitative improvement in parsing, but do not use stochastic or probabilistic methods as their fundamental basis.

Stochastic methods in grammatical categorization

The CLAWS algorithm is an outgrowth of Greene and Rubin’s TAGGIT system, discussed earlier. The acronym stands for Constituent Likelihood Automatic Word-tagging System (Garside, Leech, and Sampson 1987, p. xi). It is designed to perform a similar function: to automatically tag the Lancaster-Oslo/Bergen (or LOB) Corpus of British English. The LOB Corpus is to British English much the same as the Brown Corpus is to American English. Its tag set is slightly larger than that of the Brown Corpus.

Marshall (1983, p. 139) describes CLAWS’ methods as similar to those employed in the TAGGIT program. However, the dictionary used is derived from the tagged rather than the untagged Brown Corpus. It contains (approximately) 7000 rather than 3000 entries, and 700 rather than 450 suffixes. CLAWS treats plural, possessive, and hyphenated words as special cases for purposes of initial tagging.

The LOB researchers began by using TAGGIT on parts of the LOB Corpus. They noticed that (Marshall 1983, p. 141)

While less than 25% of TAGGIT’s context frame rules are concerned with only the immediately preceding or succeeding word . . . these rules were applied in about 80% of all attempts to apply rules. This relative overuse of minimally specified contexts indicated that exploitation of the relationship between successive tags, coupled with a mechanism that would be applied throughout a sequence of ambiguous words, would produce a more accurate and effective method of word disambiguation.

A tagging suite was therefore designed, the central portion of which was a stochastic disambiguator. CLAWS was the first system to use a matrix of conditional tag probabilities. This matrix was derived from a large proportion of the Brown Corpus, specifically 200,000 words (Marshall 1983, p. 150). Using this matrix, CLAWS calculates the probabilities of all paths through the span network for each sequence of ambiguous words, choosing the most probable path. This gains the various advantages of stochastic disambiguation which were described above, but at the cost of exponential time and space complexity.
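The core calculation can be sketched as follows (a simplified illustration, not CLAWS itself, and ignoring the tag-triple, idiom, and RTP machinery discussed below; the transition values are made up): every path of tags through a span is enumerated and scored by the product of its successive transition probabilities.

```python
from itertools import product

def best_path_exhaustive(span, trans):
    """span: one list of candidate tags per word, beginning and ending with
    the unambiguous bounding tags; trans[(a, b)] is an estimate of how likely
    tag b is to follow tag a. Every path is enumerated, so the cost grows
    with the product of the ambiguity degrees (exponential in span length)."""
    best, best_p = None, -1.0
    for path in product(*span):
        p = 1.0
        for a, b in zip(path, path[1:]):
            p *= trans.get((a, b), 1e-6)    # small floor for unseen pairs
        if p > best_p:
            best, best_p = path, p
    return best

# Made-up transition values for "the <flies> <like> honey":
trans = {("AT", "NN"): 0.6, ("AT", "VBZ"): 0.01, ("NN", "VBZ"): 0.2,
         ("NN", "IN"): 0.1, ("VBZ", "NN"): 0.3, ("VBZ", "IN"): 0.1,
         ("IN", "NN"): 0.4}
span = [["AT"], ["NN", "VBZ"], ["VBZ", "IN"], ["NN"]]
print(best_path_exhaustive(span, trans))   # -> ('AT', 'NN', 'VBZ', 'NN')
```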

Marshall states that CLAWS calculates the most probable sequence of tags, and in the majority of cases the correct tag for each individual word corresponds to the associated tag in the most probable sequence of tags (1983, p. 142). As discussed in DeRose (1988), CLAWS has a more complex definition of most probable sequence than one might expect. Its apparent goal is to make each tag’s probability be the summed probability of all paths passing through it. Booth (1985, p. 29) and Atwell, Leech, and Garside (1984, p. 43) note that with various enhancements, CLAWS achieves a tagging accuracy of 96-97%.

In addition to collocational probabilities, CLAWS takes into account one other empirical quantity (Marshall 1983, p. 149):

Tags associated with words . . . can be associated with a marker @ or %; @ indicates that the tag is infrequently the correct tag for the associated word(s) (less than 1 in 10 occasions), % indicates that it is highly improbable . . . (less than 1 in 100 occasions). . . . The word disambiguation program currently uses these markers to devalue transition matrix values when retrieving a value from the matrix, @ results in the value being halved, % in the value being divided by eight.

Thus, the independent probability of each possible tag for a given word influences the choice of an optimal path. I will refer to such probabilities as Relative Tag Probabilities, or RTPs, and will investigate them in some detail.
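The effect of these markers can be sketched in a few lines (my paraphrase of the scheme quoted above, not CLAWS code): when a transition value is retrieved for a candidate tag, it is scaled down if that tag is a rare or very rare reading of the word in question.

```python
# Devaluation factors after Marshall (1983, p. 149): '@' marks a tag that is
# correct for the word less than 1 time in 10, '%' less than 1 time in 100.
DEVALUE = {None: 1.0, "@": 0.5, "%": 1.0 / 8.0}

def weighted_transition(trans, prev_tag, tag, marker=None):
    """Transition value for prev_tag -> tag, scaled by the rarity marker
    attached to `tag` in the current word's dictionary entry."""
    return trans.get((prev_tag, tag), 0.0) * DEVALUE[marker]

trans = {("AT", "NN"): 0.6, ("AT", "VB"): 0.05}
print(weighted_transition(trans, "AT", "VB", marker="%"))   # 0.05 / 8 = 0.00625
```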

Other features have also been added to CLAWS, such as pre- and post-editing steps, special handling of idioms, and inclusion of some carefully chosen tag-triple probabilities. A particularly significant addition to the algorithm is the last-mentioned, in which (Marshall 1983, p. 146)

a number of tag triples associated with a scaling factor have been introduced which may either upgrade or downgrade values in the tree computed from the one-step matrix. For example, the triple [1] ‘be’ [2] adverb [3] past-tense-verb has been assigned a scaling factor which downgrades a sequence containing this triple compared with a competing sequence of [1] ‘be’ [2] adverb [3] past-participle/adjective, on the basis that after a form of ‘be’, past participles and adjectives are more likely than a past tense verb.

A similar move was used near conjunctions, for which the words on either side, though separated, are more closely correlated to each other than either is to the conjunction itself (Marshall 1983, pp. 146-147). For example, a verb/noun ambiguity conjoined to a verb should probably be taken as a verb. Leech, Garside, and Atwell (1983, p. 23) describe IDIOMTAG, which is applied after initial tag assignment and before disambiguation. It was

developed as a means of dealing with idiosyncratic word sequences which would otherwise cause difficulty for the automatic tagging. . . . for example, in order that is tagged as a single conjunction. . . . The Idiom Tagging Program . . . can look at any combination of words and tags, with or without intervening words. It can delete tags, add tags, or change the probability of tags. Although this program might seem to be an ad hoc device, it is worth bearing in mind that any fully automatic language analysis system has to come to terms with problems of lexical idiosyncrasy.

IDIOMTAG also accounts for the fact that the probability of a verb being a past participle and not simply past, is greater when the following word is by, as opposed to other prepositions. Certain cases of this sort may be soluble by making the collocational matrix distinguish classes of ambiguities. Approximately 1% of running text is tagged by IDIOMTAG (letter, G. N. Leech to Henry Kučera, June 7, 1985; letter, E. S. Atwell to Henry Kučera, June 20, 1985).

Marshall notes the possibility of using third order approximations, expressed in a matrix which would map ordered triples of tags into the relative probability of occurrence of each such triple. He points out, rightly I think, that such a table is larger than its usefulness justifies. Also, statistical significance would require a much larger corpus.

CLAWS has been applied to the entire LOB Corpus with an accuracy of between 96% and 97% (Booth 1985, p. 29). Without the idiom list, the algorithm was 94% accurate on a sample of 15,000 words (Marshall 1983, p. 142). Thus, the pre-processor tagging of 1% of all tokens results in a 3% improvement in accuracy.

CLAWS has several drawbacks, although its stochastic method has proven very effective. Specifically, it is time- and storage-inefficient to the extent that a fallback algorithm is sometimes employed to prevent running out of memory. CLAWS calculates the probability of every path, and so operates in time and space proportional to the product of all the degrees of ambiguity of the words in the span. Thus, the time required is an exponential function of the span length. For the longest span in the Brown Corpus, 17 ambiguities in a row, the number of paths examined would be 1,492,992.

DeRose (1985, 1988) presents a stochastic disambiguation algorithm called Volsunga, which is conceptually similar to CLAWS; it is a subset of the system reported in this thesis, using the same general methods already described. It differs from CLAWS in that several non-stochastic portions of the algorithm (such as a pre-processing step to tag idioms) are deleted. Also, dynamic programming methods (see Dano 1975, Dreyfus and Law 1977) are used to find the optimal paths quickly.
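The dynamic programming idea can be sketched as follows (a minimal Viterbi-style illustration of the general approach, not the Volsunga code itself, and again with made-up transition values): for each candidate tag of each word, only the best-scoring path reaching that tag is retained, so the work grows linearly with span length rather than exponentially.

```python
def best_path_dp(span, trans):
    """Same inputs as the exhaustive sketch above, but with dynamic
    programming: for each candidate tag of each word, only the best path
    reaching that tag is kept, so work grows linearly with span length."""
    # best[tag] = (probability of best path ending in tag, that path)
    best = {t: (1.0, [t]) for t in span[0]}
    for cands in span[1:]:
        new = {}
        for t in cands:
            prob, prefix = max(
                ((pp * trans.get((prev, t), 1e-6), pth)
                 for prev, (pp, pth) in best.items()),
                key=lambda x: x[0])
            new[t] = (prob, prefix + [t])
        best = new
    return max(best.values(), key=lambda x: x[0])[1]

trans = {("AT", "NN"): 0.6, ("AT", "VBZ"): 0.01, ("NN", "VBZ"): 0.2,
         ("NN", "IN"): 0.1, ("VBZ", "NN"): 0.3, ("VBZ", "IN"): 0.1,
         ("IN", "NN"): 0.4}
span = [["AT"], ["NN", "VBZ"], ["VBZ", "IN"], ["NN"]]
print(best_path_dp(span, trans))   # -> ['AT', 'NN', 'VBZ', 'NN'], as above
```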

The entire Brown Corpus is also used for normalization, rather than 20% as in CLAWS. Evaluation of accuracy is isometric, i.e. based on re-tagging and comparing the entire Brown Corpus, rather than on manually checking a novel text. Later in this thesis I present a number of control experiments which bear on this method.

Table 9 (from DeRose 1988, p. 38) shows the accuracy of the Volsunga tagging algorithm, broken down by genre within the Brown Corpus. It is more accurate than CLAWS is without idiom-tagging, and slightly less accurate than CLAWS is with idiom tagging. It is this algorithm which I refine and investigate in much more detail in this thesis.

Genre Size % Accuracy
A: Press Reportage 99,165 96.36
B: Press Editorial 60,716 96.09
C: Press Reviews 39,832 96.12
D: Religion 38,631 96.01
E: Skills/Hobbies 81,659 95.34
F: Popular Lore 108,617 95.99
G: Belles Lettres 169,789 96.35
H: Miscellaneous 69,508 96.66
J: Learned 179,927 96.38
K: General Fiction 67,083 95.72
L: Mystery/Detective 56,090 95.47
M: Science Fiction 13,956 95.40
N: Adventure/Western 67,673 95.58
P: Romance/Love Story 68,337 95.54
R: Humor 20,990 95.55
Informative Prose Total 847,844 96.20
Imaginative Prose Total 294,129 95.57
Overall Total 1,141,973 96.04
Table 9: Volsunga tagging accuracy (by genre)

Church (1988) presents a stochastic disambiguator based on third order Markov probabilities derived from the Brown Corpus. He gives a clear presentation of the difficulties inherent in parser-based disambiguation, showing that grammatical category ambiguity is far more widespread than usually thought.

Church’s method combines collocational probabilities with what I have called Relative Tag Probabilities, for which his term is Lexical Probability Estimates (p. 139). Church has also applied the insights of dynamic programming, and obtains solutions in linear time.

Church addresses the drawback of building a lexicon and generating statistics from sample text by consulting a standard dictionary to find categorial ambiguities not otherwise attested. He adds one to the occurrence count used for lexical probability estimates for each word/tag pair which is shown in a dictionary. He also reports that a graph of frequency vs. rank for tag triples shows the expected log-log relationship.
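The dictionary-based adjustment amounts to a simple smoothing step, which can be sketched as follows (my paraphrase of the description above, with made-up counts):

```python
# Made-up counts: "flies" is seen 21 times as a verb in the training corpus,
# and a (hypothetical) dictionary also lists it as a plural noun.
corpus_counts = {("flies", "VBZ"): 21}
dictionary_tags = {"flies": {"NNS", "VBZ"}}

def lexical_probabilities(word):
    """Relative frequency of each tag for `word`, adding one occurrence for
    every word/tag pair listed in the dictionary, so that dictionary-only
    readings receive a small nonzero probability."""
    dict_tags = dictionary_tags.get(word, set())
    corpus_tags = {t for (w, t) in corpus_counts if w == word}
    counts = {t: corpus_counts.get((word, t), 0) + (1 if t in dict_tags else 0)
              for t in dict_tags | corpus_tags}
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

print(lexical_probabilities("flies"))   # VBZ: 22/23 (about 0.96), NNS: 1/23 (about 0.04)
```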

The final accuracy is reported as 95-99% ‘correct’, depending on the definition of ‘correct’, with 99.5% accuracy on a small sample presented in an appendix. This is consonant with other stochastic disambiguation methods, though with a somewhat higher maximum. The higher maximum may be due to the use of third rather than second order approximations, or to inaccuracies of measurement, since Church has not reported tests against a standard tagged target corpus. Also, since a third order table for 88 tags has 681,472 cells, a normalization corpus of only 1 million words is not sufficient to characterize the language as a whole.

Church’s related work with Ejerhead on a stochastic noun phrase retriever has been discussed above.
