DeRose, Steven J. 1990.
Stochastic Methods for Resolution of Grammatical Category Ambiguity in Inflected and Uninflected Languages.
Ph.D. Dissertation. Providence, RI: Brown University Department of Cognitive and Linguistic Sciences.
Copyright 1989, 2013, Steven J. DeRose. See below for details.
Grammatical category ambiguity (distinct from semantic and structural ambiguity) is extremely frequent in natural language. In the Brown Corpus (a million-word grammatically tagged sample of English prose), 11% of all word forms, or 48% of all word instances, occur as members of more than one grammatical category. These figures greatly under-represent actual categorial ambiguity, for instance, because uncommon words may seem unambiguous when they are actually not. Such frequent ambiguity poses extreme problems of non-determinism for parsers. Therefore, means of resolving such ambiguities are important to the progress of natural language processing systems.
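The difference between the two figures above (word forms versus word instances) can be illustrated with a minimal sketch. It assumes a corpus is available as (word form, tag) pairs; the function name `ambiguity_rates`, the toy data, and the tag labels are hypothetical and are not taken from the dissertation.

```python
from collections import Counter, defaultdict


def ambiguity_rates(tagged_tokens):
    """Return (form_rate, instance_rate) of categorial ambiguity.

    tagged_tokens: iterable of (word_form, tag) pairs (assumed input shape).
    form_rate:     share of distinct word forms seen with more than one tag.
    instance_rate: share of running words whose form is ambiguous.
    """
    tags_by_form = defaultdict(set)
    freq = Counter()
    for form, tag in tagged_tokens:
        tags_by_form[form].add(tag)
        freq[form] += 1
    if not freq:
        raise ValueError("empty corpus")
    ambiguous = {f for f, tags in tags_by_form.items() if len(tags) > 1}
    form_rate = len(ambiguous) / len(tags_by_form)
    instance_rate = sum(freq[f] for f in ambiguous) / sum(freq.values())
    return form_rate, instance_rate


# Hypothetical six-word sample: only "can" appears with two tags.
toy = [("the", "DET"), ("can", "MD"), ("can", "NN"),
       ("rusts", "VBZ"), ("the", "DET"), ("can", "MD")]
form_rate, instance_rate = ambiguity_rates(toy)
```

In this toy sample one of three word forms is ambiguous, but it accounts for half of the running words, which is the same kind of gap as between the 11% and 48% figures: ambiguous forms tend to be frequent ones.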
This thesis examines probabilistic strategies for resolving categorial ambiguity. I consider the contextual probability of a given category, given a categorial context, and the relative probability that a given word form represents a particular category. Up to 96% of all words can be assigned the correct category without morphological analysis, special handling of idioms, or other non-probabilistic features. Dynamic programming yields disambiguation time directly proportional to text length. Probabilistic methods are thus both faster and more accurate than previous methods, and overcome the non-determinism which renders many other methods unworkable.
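As a rough illustration of how the two kinds of probability can be combined by dynamic programming, the following sketch scores each tag path by the product of a contextual probability (tag given the preceding tag) and a relative probability (tag given the word form). It is a simplified sketch under stated assumptions, not the dissertation's implementation: it assumes a context of one preceding tag, and the function name `tag_sequence`, the `floor` value for unseen tag pairs, the tag labels, and all probabilities are hypothetical.

```python
import math


def tag_sequence(words, lexical, transition, start="<s>", floor=1e-6):
    """Return the highest-scoring tag path for `words`.

    lexical[word][tag]    : relative probability that `word` has `tag`
    transition[prev][tag] : contextual probability of `tag` after `prev`
    floor                 : assumed small probability for unseen tag pairs
    """
    scores = {start: 0.0}   # best log score of any path ending in each tag
    backpointers = []       # one {tag: best previous tag} dict per word
    for word in words:
        new_scores, pointers = {}, {}
        for tag, p_lex in lexical[word].items():
            best_prev, best_score = None, -math.inf
            for prev, prev_score in scores.items():
                p_ctx = transition.get(prev, {}).get(tag, floor)
                score = prev_score + math.log(p_ctx) + math.log(p_lex)
                if score > best_score:
                    best_prev, best_score = prev, score
            new_scores[tag] = best_score
            pointers[tag] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    if not backpointers:
        return []
    # Trace back from the best final tag to recover the whole path.
    tag = max(scores, key=scores.get)
    path = []
    for pointers in reversed(backpointers):
        path.append(tag)
        tag = pointers[tag]
    path.reverse()
    return path


# Hypothetical probabilities, chosen only to make the example readable.
lexical = {"the": {"DET": 1.0},
           "can": {"MD": 0.7, "NN": 0.3},
           "rusts": {"VBZ": 0.9, "NNS": 0.1}}
transition = {"<s>": {"DET": 0.6},
              "DET": {"NN": 0.7, "MD": 0.01},
              "NN": {"VBZ": 0.4, "NNS": 0.05},
              "MD": {"VBZ": 0.01, "NNS": 0.01}}
tags = tag_sequence(["the", "can", "rusts"], lexical, transition)
```

With these toy numbers, "can" is tagged NN even though MD is its more likely tag in isolation, because the contextual probability after DET outweighs the lexical preference. Each word is processed once against a bounded number of tags and a single backpointer per tag is stored, so the running time grows linearly with text length.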
I apply these methods to the Brown Corpus (English) and to the Greek New Testament (140,000 words of Koiné Greek). I discuss the effects of various parameters on the accuracy of category assignment, and analyze the types and frequencies of residual errors. I report control studies which help to predict the algorithm’s effectiveness for unrestricted text, and investigate the amount of normalization text required to obtain reliable probability estimates. Analyses of related information-theoretic properties of natural language corpora are also included, for example, investigations of the effect of sample size on measurement of entropy.
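One simple way to examine how sample size affects a measured entropy is to compute the estimate over progressively longer prefixes of a corpus. The sketch below uses the plug-in (maximum-likelihood) estimator, which is known to underestimate entropy on small samples; it is an assumption that this resembles the measurements reported in the dissertation, and the function names are hypothetical.

```python
import math
from collections import Counter


def plug_in_entropy(symbols):
    """Entropy in bits of the empirical distribution of `symbols`."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_by_sample_size(symbols, sizes):
    """Return [(n, entropy of the first n symbols)] for each usable n."""
    return [(n, plug_in_entropy(symbols[:n]))
            for n in sizes if 0 < n <= len(symbols)]


# Hypothetical tag stream; real input would be the tags of a corpus in order.
tags = ["DET", "NN", "VBZ", "DET", "NN", "MD", "VB", "DET"]
curve = entropy_by_sample_size(tags, [2, 4, 8])
```

Plotting the resulting pairs for a real corpus shows how quickly the estimate stabilizes as more text is included.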
The original published form of the dissertation is available from University Microfilms, order number 9002217.
This electronic edition was converted from the original word-processor files to XHTML by the author. BBEdit did most of the table conversions automatically. I did the rest via an awful lot of regex changes and hand-editing.
Differences of which the author is aware:
Copyright 1989, 2013, Steven J. DeRose.
This work by Steven J. DeRose is licensed under a Creative Commons Attribution - No Derivatives license. This means you are free to reproduce or distribute it, but must cite the source, and cannot make changes. For further information on this license, see http://creativecommons.org/licenses/by-sa/3.0/.
For the most recent version, see http://www.derose.net.
This work is licensed under a Creative Commons Attribution-NoDerivs 3.0 Unported License.