Abstract

Grammatical category ambiguity (distinct from semantic and structural ambiguity) is extremely frequent in natural language. In the Brown Corpus (a million-word grammatically tagged sample of English prose), 11% of all word forms, or 48% of all word instances, occur as members of more than one grammatical category. These figures greatly under-represent actual categorial ambiguity, for instance because uncommon words may seem unambiguous when they are actually not. Such frequent ambiguity poses extreme problems of non-determinism for parsers. Means of resolving such ambiguities are therefore important to the progress of natural language processing systems.
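The difference between the two figures above (word forms versus word instances) can be illustrated with a minimal sketch. It assumes a corpus supplied as (word form, category) pairs; the function name `ambiguity_rates`, the tag names, and the toy data are hypothetical and are not taken from the Brown Corpus.

```python
from collections import Counter, defaultdict


def ambiguity_rates(tagged_tokens):
    """Return (form_rate, instance_rate) of categorial ambiguity.

    tagged_tokens: iterable of (word_form, category) pairs.
    form_rate:     share of distinct word forms seen with more than one category.
    instance_rate: share of running words that are instances of such forms.
    """
    tagged_tokens = list(tagged_tokens)
    if not tagged_tokens:
        return 0.0, 0.0
    categories = defaultdict(set)  # word form -> categories observed
    counts = Counter()             # word form -> number of instances
    for word, cat in tagged_tokens:
        categories[word].add(cat)
        counts[word] += 1
    ambiguous = {w for w, cats in categories.items() if len(cats) > 1}
    form_rate = len(ambiguous) / len(categories)
    instance_rate = sum(counts[w] for w in ambiguous) / len(tagged_tokens)
    return form_rate, instance_rate


# Toy data, invented for illustration only.
sample = [("the", "DET"), ("can", "NOUN"), ("can", "VERB"),
          ("rust", "VERB"), ("the", "DET"), ("can", "NOUN")]
form_rate, instance_rate = ambiguity_rates(sample)
```

In the toy data one of three word forms is ambiguous, but it accounts for half of the running words. This is the same pattern as in the corpus figures: ambiguous forms tend to be frequent ones. A form that happens to occur with only one category in the sample is counted as unambiguous, which is one way such counts can under-represent real ambiguity.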

This thesis examines probabilistic strategies for resolving categorial ambiguity. I consider the contextual probability of a given category, given a categorial context, and the relative probability that a given word form represents a particular category. Up to 96% of all words can be assigned the correct category without morphological analysis, special handling of idioms, or other non-probabilistic features. Dynamic programming yields disambiguation time directly proportional to text length. Probabilistic methods are thus both faster and more accurate than previous methods, and overcome the non-determinism which renders many other methods unworkable.

I apply these methods to the Brown Corpus (English) and to the Greek New Testament (140,000 words of Koiné Greek). I discuss the effects of various parameters on the accuracy of category assignment, and analyze the types and frequencies of residual errors. I report control studies which help to predict the algorithm’s effectiveness for unrestricted text, and investigate the amount of normalization text required to obtain reliable probability estimates. Analyses of related information-theoretic properties of natural language corpora are also included, for example, investigations of the effect of sample size on measurement of entropy.
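The dependence of measured entropy on sample size can be seen with a short sketch. It assumes the simple maximum-likelihood ("plug-in") estimate, which may differ from the measures used in the dissertation; the function names and the uniform toy distribution are hypothetical.

```python
import math
import random
from collections import Counter


def plugin_entropy(sample):
    """Maximum-likelihood (plug-in) entropy estimate of a sample, in bits."""
    counts = Counter(sample)
    n = len(sample)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_by_sample_size(population, weights, sizes, seed=0):
    """Estimate entropy from random samples of several sizes."""
    rng = random.Random(seed)
    return {n: plugin_entropy(rng.choices(population, weights, k=n))
            for n in sizes}


# Toy source: 64 equally likely symbols, so the true entropy is 6 bits.
symbols = list(range(64))
estimates = entropy_by_sample_size(symbols, [1] * 64, sizes=[8, 64, 4096])
```

A sample of n items can contain at most n distinct symbols, so its plug-in estimate can never exceed log2(n) bits. For the toy source, a sample of 8 items therefore cannot report more than 3 bits, although the true value is 6. Small samples systematically underestimate entropy under this estimator, and the estimates approach the true value only as the sample grows.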

Edition information

The original published form of the dissertation is available from University Microfilms, order number 9002217.

This electronic edition was converted from the original word-processor files to XHTML by the author. BBEdit did most of the table conversions automatically. I did the rest via an awful lot of regex changes and hand-editing.

I have added a few notes and corrections during the conversion process in 2012-2013. They are marked up as paragraphs or spans, with HTML class="annot", and appear in magenta type, either bracketed inline or in blocks like this:

This is how a recently-added note looks.

Differences of which the author is aware:

I have omitted most of the front matter (Preface, CV, Table of Contents).
The text has been made into one HTML file for each chapter and appendix (plus a CSS file to control layout).
I have added the original page numbers at the end of the titles of larger sections, in the same color as notes and corrections, and added hierarchical section numbers at the start of all headings. These should correspond exactly to the sequence and levels evident in the original Table of Contents (although no explicit section-numbering appeared there).
I have added many links. This starting file has the Abstract and links to all the other files, graphs, and tables. Chapters have a "next" link at the end, to the next file. The Bibliography has section dividers and links at the top to get to the start of each letter in the alphabetical list (by first author).
Citations have been linked to the corresponding Bibliography entries. During this process I discovered a few omissions and typos in the Bibliography, whose correction is indicated using the convention described above.
I have gathered the footnotes here, and they are linked from their original attachment points in the text. I have renumbered them using their original page, plus a dot, plus their original superscript symbol.
Equations have been manually converted to MathML; a few are incomplete.
Italics were lost. I've re-tagged some, but probably not all.

This work by Steven J. DeRose is licensed under a Creative Commons Attribution - No Derivatives license. This means you are free to reproduce or distribute it, but must cite the source, and cannot make changes. For further information on this license, see http://creativecommons.org/licenses/by-sa/3.0/.

For the most recent version, see http://www.derose.net.

This work is licensed under a Creative Commons Attribution-NoDerivs 3.0 Unported License.
