DeRose, Steven J. 1990.
Stochastic Methods for Resolution of Grammatical Category Ambiguity in Inflected and Uninflected Languages.
Ph.D. Dissertation. Providence, RI: Brown University Department of Cognitive and Linguistic Sciences.
Stochastic resolution of grammatical category ambiguity is an effective method which can produce better accuracy than past rule-driven methods. It can also operate in linear time, making it practical for a wide range of applications. I have investigated several methodological factors which affect the accuracy achieved by stochastic methods.
First, I considered a simplistic tagger which uses no collocational information, but merely assigns each word its most frequent ("favorite") tag. For isometric tagging of the million-word Brown Corpus this method produced over 93% accuracy, though of course it required a nearly complete dictionary. When the same method was applied to text not used for normalization, accuracy fell to 88%, a figure which is probably more generally applicable. For isometric tagging of the Greek New Testament this method was comparably accurate, at 89%.
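This baseline can be sketched in a few lines of Python. The sketch below assumes a tagged normalization corpus available as (word, tag) pairs; the function names, the lower-casing of word forms, and the "NN" fallback for unknown words are illustrative choices rather than the dissertation's implementation.

```python
from collections import Counter, defaultdict

def train_favorite_tags(tagged_tokens):
    """For each word in the normalization corpus, record its single most
    frequent ("favorite") tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_by_favorite(tokens, favorites, default_tag="NN"):
    """Assign every token its favorite tag; unknown words fall back to a guess."""
    return [(tok, favorites.get(tok.lower(), default_tag)) for tok in tokens]

# Toy example with Brown-style tags:
training = [("the", "AT"), ("flies", "VBZ"), ("flies", "NNS"), ("flies", "VBZ")]
favorites = train_favorite_tags(training)
print(tag_by_favorite(["the", "flies", "buzz"], favorites))
# -> [('the', 'AT'), ('flies', 'VBZ'), ('buzz', 'NN')]
```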
Second, I considered a tagger which used only collocational information. It performed slightly less accurately in the isometric case, but was substantially more robust when applied to novel text.
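A collocation-only tagger of this kind can be sketched as a dynamic program: among the tags the dictionary allows for each word, choose the sequence that maximizes the product of tag-pair probabilities estimated from the normalization corpus. The Python sketch below is illustrative only; the flat floor for unseen tag pairs and the path-per-state bookkeeping are simplifications, not the dissertation's algorithm.

```python
def tag_collocational(words, candidates, trans_logp, start_tag="<s>", floor=-20.0):
    """
    Choose one tag per word so that the summed log collocational probability
    log P(t_i | t_{i-1}) over the sentence is maximal.

    candidates: dict word -> list of tags the dictionary allows for it
    trans_logp: dict (prev_tag, tag) -> log2 P(tag | prev_tag), from normalization
    floor:      log probability assumed for unseen tag pairs (a simplification)
    """
    best = {start_tag: (0.0, [])}          # best[tag] = (path score, path)
    for w in words:
        new_best = {}
        for tag in candidates[w]:
            score, path = max(
                (prev_score + trans_logp.get((prev, tag), floor), prev_path + [tag])
                for prev, (prev_score, prev_path) in best.items()
            )
            new_best[tag] = (score, path)
        best = new_best
    return max(best.values())[1]           # tag sequence of the best-scoring path
```

Because each word is visited once and only the best path into each tag is retained, the work per word is bounded by the square of the number of candidate tags, which is what keeps the overall procedure linear in text length.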
Third, I found that for English, with the normal Brown Corpus tag set, the most effective disambiguation strategy was to combine collocational information with relative tag probabilities for particular lexical items (RTPs), with 30-35% of the weight assigned to the latter. This method achieved about 95% accuracy in the isometric case; the most frequent errors were past-tense verb vs. past participle, verb vs. noun, and adjective vs. noun. The words "that" and "to" were particularly troublesome lexical items to tag.
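One natural reading of the 30-35% weighting is as an interpolation between the collocational score used in the sketch above and the word's relative tag probability. The combination below, linear in log space, is an assumption made for illustration and is not necessarily the exact formula used in the dissertation.

```python
def combined_logscore(prev_tag, tag, word, trans_logp, rtp_logp,
                      rtp_weight=0.33, floor=-20.0):
    """
    Blend collocational evidence with the relative tag probability (RTP) of
    `tag` for `word`.  rtp_weight=0.33 reflects the 30-35% range reported
    above; interpolating in log space is an illustrative choice.  Substituting
    this for the bare transition score in the dynamic program above yields
    the combined tagger.
    """
    colloc = trans_logp.get((prev_tag, tag), floor)
    rtp = rtp_logp.get((word, tag), floor)
    return (1.0 - rtp_weight) * colloc + rtp_weight * rtp
```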
Fourth, I found that classifying words into finer-grained categories, in accordance with the distinct sets of categories each can have, improved accuracy to about 97.14% on the Brown Corpus. At the same time this change entirely obviated the need for RTPs. However, this increase was at least partly a consequence of the expanded tag set being too large to be reliably sampled with 1 million words of normalization text, and hence may not be applicable to unrestricted text.
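The refinement can be pictured as qualifying each tag with the word's ambiguity class, that is, the full set of tags the word can take. The sketch below shows one straightforward way to derive such refined tags from the dictionary; it is meant only to illustrate why the tag set expands so sharply, not to reproduce the exact scheme used in the dissertation.

```python
def refine_by_ambiguity_class(word_tag_counts):
    """
    word_tag_counts: dict word -> dict tag -> count, from the normalization corpus.
    Returns a map (word, tag) -> refined tag such as "VBZ|NNS-VBZ", so that a verb
    which could also have been a noun is kept distinct from an unambiguous verb.
    """
    refined = {}
    for word, tags in word_tag_counts.items():
        ambiguity_class = "-".join(sorted(tags))     # e.g. "NNS-VBZ"
        for tag in tags:
            refined[(word, tag)] = f"{tag}|{ambiguity_class}"
    return refined

# "flies" (NNS or VBZ) and "dogs" (NNS only) now carry different noun tags:
print(refine_by_ambiguity_class({"flies": {"NNS": 10, "VBZ": 12},
                                 "dogs": {"NNS": 30}}))
```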
Fifth, dictionary coverage had a roughly linear effect on total accuracy, down to the point where only 60% of all tokens were known (a dictionary of about 400 word forms), after which accuracy degraded very rapidly. This is of limited practical significance, however, since even modest dictionaries can achieve 90% coverage or more for English. Independent normalizations based on separate halves of the Brown Corpus showed a similar decrease in the effectiveness of tagging across the halves. The loss of accuracy was in all cases largely limited to the unknown words; this has the extremely useful consequence that a tagger can point out its less reliable decisions.
Sixth, I discussed the frequencies with which specific tags were missed, and with which specific tags were mistakenly assigned in their places. The tags most frequently missed were, in general, those whose representative word forms are used far more frequently to represent another category. Special strategies could be developed to improve the accuracy obtained for such tags, but the tags with poor accuracy are all relatively rare, and may not warrant special treatment.
While examining some properties of entropy, I found that the second-order entropy of Greek tags in the New Testament was surprisingly low. I then investigated the effect of corpus size on entropy, and discovered that for the Brown Corpus second-order entropy rises more or less logarithmically with corpus size, to become fairly stable by about 300,000 words. I hypothesized that the size of an adequate corpus is directly related to the number of potential collocations (i.e., the square of the number of tags). If this is even approximately correct, then the Greek New Testament is far too small a corpus to allow reliable conclusions when working with collocations drawn from a set of 1155 distinct tags.
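For reference, the second-order entropy discussed here is presumably the standard digram measure, the average uncertainty of a tag given its predecessor, estimated from digram counts in the corpus:

    H_2 = -\sum_{i,j} p(t_i, t_j)\,\log_2 p(t_j \mid t_i)

Since the sum ranges over all tag pairs, the number of probabilities to estimate grows with the square of the tag-set size; with 1155 distinct tags that is on the order of 1.3 million potential collocations, which makes plain why a corpus the size of the Greek New Testament cannot attest them reliably.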
I found that total ambiguity in the Greek New Testament is comparable to that found for English, though of a quite different sort. Disambiguation was only slightly less effective than for English, achieving peak accuracies of about 93.6%. RTPs were not effective for Greek, probably because of the difference in the kind of ambiguity which is present and the inadequate size of the normalization corpus.
Care must be taken that normalization corpora are of adequate size and that representative probabilities are obtained. If these conditions are met, stochastic tagging methods clearly provide a viable and efficient means of assigning grammatical categories in unrestricted text. Since dictionary coverage is a significant factor, anyone using stochastic tagging methods for unrestricted text would do well to pay particular attention to the production of a good dictionary. Nevertheless, stochastic methods are robust, and degrade only gradually when inadequately normalized. Practical dictionary sizes, say 35,000-75,000 word forms, should provide high enough coverage to allow reliable tagging and disambiguation at an overall accuracy of over 90%. Residual errors are concentrated among unknown words, the handling of which is therefore an appropriate focus for future investigations.
Because of the extremely high degree of grammatical category ambiguity in natural language, natural language processing systems must come to terms with the problem of excessive nondeterminism in parsing. Probabilistic methods provide extremely fast and reasonably accurate means to reduce or eliminate low-probability interpretations, and therefore can make a significant contribution to the realizability and effectiveness of systems which can deal with a large range of natural language phenomena.