Chapter 1: Introduction

Aims and organization

Categorial ambiguity arises when a particular word form can, in different instances, represent different grammatical categories. It is distinct from other sorts of ambiguity, such as semantic and structural. Semantic ambiguity has been much investigated, not only by linguists, but also by philosophers, practitioners of artificial intelligence, and many others. Structural ambiguity has largely been the domain of syntacticians.

Categorial ambiguity is, by comparison, seldom treated. Nevertheless, it arises so frequently that an effective means of resolving it is of paramount importance to the progress of natural language processing systems. Such systems must therefore be able to determine the correct grammatical category for each word instance in the texts they handle.

An idea of how widespread categorial ambiguity is may be obtained by examining the Brown Corpus, a sample of over 1 million running words of American written English (see Kučera and Francis 1967, Francis and Kučera 1979, 1982), in which every word has been assigned a grammatical category. In this sample, approximately 11% of word types (i.e., of the vocabulary) and 48% of the word tokens (i.e., of the text length) are categorially ambiguous. That is, those word types occur in the Brown Corpus with different category labels in different instances. The actual extent of categorial ambiguity in English is certainly much higher. I will discuss several reasons for this later in this thesis, but one worth noting from the beginning is that nearly 45% of the Brown Corpus’s vocabulary consists of hapax legomena, i.e., words occurring only once. Obviously the hapax legomena appear to be categorially unambiguous, even though a larger corpus or native speaker intuitions may show that many of them are in fact ambiguous. If we merely exclude the hapax legomena, about 20% of the remaining word types are ambiguous (as compared with 11% of all types including the hapax legomena).
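The distinction between type-level and token-level ambiguity, and the effect of excluding hapax legomena, can be made concrete with a short sketch. The function name `ambiguity_rates`, the category labels, and the toy sample below are invented for illustration; they are not Brown Corpus data and do not reproduce the figures quoted above.

```python
from collections import Counter, defaultdict

def ambiguity_rates(tagged_tokens, exclude_hapax=False):
    """Return (type_rate, token_rate) for categorially ambiguous word forms.

    tagged_tokens: iterable of (word_form, category) pairs.
    A word form counts as ambiguous if it occurs with more than one category.
    """
    tagged_tokens = list(tagged_tokens)
    freq = Counter(word for word, _ in tagged_tokens)
    cats = defaultdict(set)
    for word, cat in tagged_tokens:
        cats[word].add(cat)

    # Optionally drop word types that occur only once (hapax legomena).
    types = [w for w in freq if not (exclude_hapax and freq[w] == 1)]
    ambiguous = {w for w in types if len(cats[w]) > 1}

    if not types:
        return 0.0, 0.0
    type_rate = len(ambiguous) / len(types)
    # Token rate is measured against the full text length.
    token_rate = sum(freq[w] for w in ambiguous) / len(tagged_tokens)
    return type_rate, token_rate

# Toy sample (hypothetical; for illustration only).
sample = [
    ("the", "DET"), ("run", "NOUN"), ("was", "VERB"), ("long", "ADJ"),
    ("they", "PRON"), ("run", "VERB"), ("the", "DET"), ("race", "NOUN"),
]

print(ambiguity_rates(sample))
print(ambiguity_rates(sample, exclude_hapax=True))
```

In the toy sample only "run" is ambiguous, so the type rate rises once the four single-occurrence forms are removed from the denominator, which is the same effect described for the Brown Corpus vocabulary.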

Because it is so widespread, ambiguity poses a substantial challenge to deterministic natural language processing models. Milne (1986, p. 1) claims that one of the major causes of non-determinism is part-of-speech ambiguity. Rieger and Small (1979, pp. 2ff) suggest that few parsing models even make a serious attempt to deal with the actual pervasiveness of ambiguity in natural language; more attention has been given to the problem recently (e.g., Hirst 1983), but the specific problem of part-of-speech or grammatical category ambiguity has still received minimal attention.

In this thesis I will examine a particular class of strategies by which computer systems may resolve instances of categorial ambiguity. The strategies are stochastic in nature; that is, they incorporate knowledge of probabilities as a basis for generating hypotheses. Two crucial sorts of probabilities are: (a) the contextual probability of a given category, given a particular categorial context, and (b) the relative probability that a given word form (taken in isolation) represents a particular category. Both kinds of probabilities are determined empirically, by examination of large text samples.
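To illustrate how the two sorts of probabilities might interact, the following minimal sketch scores each candidate category sequence by multiplying (b) the relative probability of each category for its word form by (a) the contextual probability of that category given the preceding category. Several points here are assumptions of the sketch rather than statements about the methods examined in this thesis: the use of a one-category left context, the combination by simple product, the exhaustive enumeration of sequences, the names `best_tag_sequence`, `lexical`, and `contextual`, and all of the numbers, which are invented.

```python
from itertools import product

def best_tag_sequence(words, lexical, contextual, start="<s>"):
    """Return the highest-scoring category sequence and its score.

    lexical[word][cat]      : relative probability of cat for word in isolation
    contextual[(prev, cat)] : probability of cat given the preceding category
    The score of a sequence is the product of both probabilities at every
    position (an assumption of this sketch).
    """
    candidates = [list(lexical[w]) for w in words]
    best, best_score = (), -1.0
    # Enumerate every combination of candidate categories (exponential cost).
    for tags in product(*candidates):
        score, prev = 1.0, start
        for word, tag in zip(words, tags):
            score *= lexical[word][tag] * contextual.get((prev, tag), 0.0)
            prev = tag
        if score > best_score:
            best, best_score = tags, score
    return list(best), best_score

# Hypothetical probability tables, invented for illustration.
lexical = {
    "the": {"DET": 1.0},
    "run": {"NOUN": 0.3, "VERB": 0.7},
}
contextual = {
    ("<s>", "DET"): 0.6,
    ("DET", "NOUN"): 0.8,
    ("DET", "VERB"): 0.05,
}

tags, score = best_tag_sequence(["the", "run"], lexical, contextual)
print(tags, score)
```

In this toy case "run" is more often a verb in isolation, yet the contextual probability after a determiner favors the noun reading, so the combined score selects it. Enumeration is used only for clarity; because the number of candidate sequences grows exponentially with sentence length, a practical system would use dynamic programming instead.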

Previous non-stochastic methods have had accuracies ranging from 77% to 90% (Klein and Simmons 1963, Greene and Rubin 1971). Stochastic methods have claimed accuracies ranging from 93% to 99% (Leech, Garside, and Atwell 1983, DeRose 1985, DeRose 1988, Church 1988).

Although other strategies are available for resolving such ambiguity, the stochastic method is unique in combining the virtues of speed and accuracy. It also can be formally defined, and does not require detailed preparatory human intervention.

I will examine these methods as applied to large corpora in American English and Koiné Greek. Various parameters which affect the accuracy and reliability of category assignment will be examined, and I will discuss their respective effects. Specifically, I will consider

(a) the baseline accuracy, i.e. that obtained by a very simplistic algorithm which always assigns the most popular grammatical category for each word form;
(b) the accuracy of a tagger which uses only collocational probabilities;
(c) the effect of combining the information of (a) and (b);
(d) the effect of making a more fine-grained subdivision of the set of grammatical categories for English (e.g., treating nouns which can also be verbs as a class distinct from unambiguous nouns and unambiguous verbs), and predicating probabilities of these more specific categories;
(e) a set of control studies which help to measure the effect of the size and choice of corpus used for normalizing the dictionary and probability tables, on the reliability of stochastic methods;
(f) an analysis of residual tagging errors;
(g) some information-theoretic properties related to (e); and
(h) the relative effectiveness and reliability of stochastic disambiguation methods for English and Greek.
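The baseline of item (a) is simple enough to sketch directly: record the most frequent category for each word form in a tagged sample, assign that category everywhere, and measure the proportion of tokens tagged correctly. The function names, the category labels, the toy data, and the `default` category used for unseen word forms are all hypothetical choices made for this illustration.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_tokens):
    """Map each word form to its most frequent category in the sample."""
    counts = defaultdict(Counter)
    for word, cat in tagged_tokens:
        counts[word][cat] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def accuracy(model, tagged_tokens, default="NOUN"):
    """Proportion of tokens whose assigned category matches the reference.

    Word forms absent from the model receive `default` (an assumption).
    """
    tagged_tokens = list(tagged_tokens)
    correct = sum(model.get(word, default) == cat for word, cat in tagged_tokens)
    return correct / len(tagged_tokens)

# Toy training and test samples (hypothetical; for illustration only).
train = [
    ("the", "DET"), ("run", "VERB"), ("run", "VERB"),
    ("run", "NOUN"), ("race", "NOUN"),
]
test = [("the", "DET"), ("run", "NOUN"), ("the", "DET"), ("race", "NOUN")]

model = train_baseline(train)
print(model["run"], accuracy(model, test))
```

The single error in the toy test sample is the noun use of "run", which the baseline cannot recover because it ignores context; that is the kind of case the collocational information of item (b) is meant to address.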

The following chapter introduces the fundamental linguistic and mathematical concepts which are required. Chapter 3 reviews previous investigations into the phenomena in question. Previous research is divided into

(a) non-stochastic approaches to grammatical categorization,
(b) stochastic approaches to linguistic problems other than grammatical categorization, and
(c) stochastic approaches to the problem of grammatical categorization per se.

Chapter 4 presents the application of particular stochastic tagging methods to English and Greek corpora, along with the control studies and information-theoretic results. These two languages were chosen because large grammatically tagged texts are available in both, and because of their typological differences. English represents a configurational language, and tends toward fixed word order and simple morphology. Koiné Greek, the dialect in which the New Testament was written, is much closer to the synthetic end of the typological spectrum; it has relatively free word order and a rich morphological system.

Chapter 5 reviews the results for English and Greek and presents general conclusions. Since the stochastic methods I am investigating involve collocational phenomena, one might expect that disambiguation would be far more effective in English than in Greek (due to Greek’s freer word order); we will see, however, that this is not the case. The sizes of the corpora used and the natures of the tagging systems will nevertheless prove to be critical factors in the reliability of stochastic tagging methods.

I have included appendixes which present various results in tabular and graphic form; other tables appear throughout the text. Lists of Graphs and Tables are also included for the reader’s convenience in locating data of interest.