An Algorithmic Approach to English Pluralization

(Extended Abstract)

Damian Conway

School of Computer Science and Software Engineering
Monash University
Clayton 3168, Australia

`mailto:damian@csse.monash.edu.au http://www.csse.monash.edu.au/~damian`

Abstract

This paper discusses some of the issues involved in designing robust and comprehensive algorithms which convert singular English nouns, verbs and adjectives to their appropriate plural forms. An overview is given of a full implementation of the various algorithms in the Perl [1] programming language. The full paper is available on-line from:
http://www.cs.monash.edu.au/~damian/papers/HTML/Inflections.html

The problem of English plurals

The English language is overburdened with idiosyncratic grammatical features, a legacy of its eclectic accretion over 1500 years [2,3]. One unfortunate consequence of this otherwise admirable richness is that automatically generating correct English is fraught with difficulty. The use of English plurals in synthetic sentences is a case in point. In computing applications, for example, it is quite common to encounter error messages which jar because they do not correctly inflect for grammatical number:

        Compilation aborted: 1 errors were detected.

Individually, such inelegances are easily overcome (or, more accurately, the inelegance may be transferred from the interface to the code):

        print "Compilation aborted: $count ",
              ($count==1 ? "error was" : "errors were"),
              " detected.\n";

Unfortunately, in attempting to generate more complex text, some less tractable problems arise, notably the diversity of plural forms available in English. Consider the difficulty faced by a text generation system (machine or human) in converting the following to their correct plural forms:

        Her criterion differs from mine.
        The Major General met the Governor General.
        Analysis of this aquarium's fish failed to determine its genus.
        That phalanx suffered a trauma.

The full paper presents algorithms that provide (nearly) automatic plural inflections for such examples.

Categories of English plurals

Universal rules

Although described here first, and encountered most frequently, the universal rules of plural inflection are the "last resort" in an algorithmic sense. That is, these rules only apply when all other more specific rules or special cases are inapplicable.

The rules themselves are well-known and need no elaboration. By default:

Nouns are made plural by appending -s.
Verbs are made plural by removing any trailing -s (and otherwise do not change).
Adjectives and adverbs do not change when made plural.
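As a minimal sketch of these three defaults (the subroutine names are hypothetical and are not those of the implementation described later), the rules might be written in Perl as:

```perl
use strict;
use warnings;

# Hypothetical helpers that apply ONLY the universal default rules.
# Each assumes its argument is a singular form.

sub default_noun_plural {
    my ($word) = @_;
    return $word . "s";                  # cat -> cats
}

sub default_verb_plural {
    my ($word) = @_;
    (my $plural = $word) =~ s/s\z//;     # eats -> eat; "can" is unchanged
    return $plural;
}

sub default_adj_plural {
    my ($word) = @_;
    return $word;                        # red -> red
}

print default_noun_plural("cat"), " ",
      default_verb_plural("eats"), " ",
      default_adj_plural("red"), "\n";
```

Because these rules are the last resort, a real pluralizer would call them only after every more specific rule has declined to handle the word.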

Suffix categories

There are, however, an enormous number of exceptions to these defaults [4]. Most such exceptions are still regular (in the sense that they occur in predictable patterns), but are specific to a particular word suffix.

For example, nouns that end in -ss universally become -sses in the plural (and vice versa for verbs). Likewise, nouns which end in a consonant followed by -y almost always become -ies in the plural.
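The two suffix rules just mentioned can be sketched as pattern matches that are tried before the default. The subroutine names are assumptions made for illustration, the input is assumed to be lower-case and singular, and the occasional exceptions to the -y rule are ignored:

```perl
use strict;
use warnings;

# Hypothetical suffix-aware noun pluralizer covering only the two rules above.
sub suffix_noun_plural {
    my ($word) = @_;
    return $word . "es" if $word =~ /ss\z/;               # class -> classes
    return $1 . "ies"   if $word =~ /\A(.*[^aeiou])y\z/;  # city  -> cities
    return $word . "s";                                    # universal default
}

# The "vice versa" case: a singular verb in -sses loses its -es.
sub suffix_verb_plural {
    my ($word) = @_;
    return $1 if $word =~ /\A(.*ss)es\z/;                 # passes -> pass
    (my $plural = $word) =~ s/s\z//;                      # universal default
    return $plural;
}

print join(" ", map { suffix_noun_plural($_) } qw(class city day)), "\n";
print suffix_verb_plural("passes"), "\n";
```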

Certain types of adjectives also inflect in the plural. For example, possessive adjectives that end in -'s or -' in the singular are made plural by forming the plural of the root word and appending an apostrophe (unless the root's plural does not itself end in -s, in which case -'s is appended). Hence cat's becomes cats', mantis' becomes mantises', whilst child's becomes children's.
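The possessive rule can be sketched as follows. The noun pluralizer here is only a stand-in (a two-entry table plus the -s default), and all names are hypothetical:

```perl
use strict;
use warnings;

# Stand-in noun pluralizer: a tiny illustrative table plus the -s default.
my %noun_plural = (child => "children", mantis => "mantises");

sub noun_plural {
    my ($word) = @_;
    return $noun_plural{$word} // $word . "s";
}

sub possessive_plural {
    my ($word) = @_;
    # Recover the root by stripping a trailing -'s or bare -'.
    return $word unless $word =~ /\A(.+?)'s?\z/;
    my $plural = noun_plural($1);
    # A plural already ending in -s takes a bare apostrophe, otherwise -'s.
    return $plural =~ /s\z/ ? "$plural'" : "$plural's";
}

print join(", ", map { possessive_plural($_) } "cat's", "mantis'", "child's"), "\n";
```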

Other suffix categories arise because words of foreign origin (most commonly Ancient Greek or Latin) have retained a non-anglicized plural inflection. Hence criterion becomes criteria, nucleus becomes nuclei, and matrix becomes matrices.

Dealing with such categories is complicated by the fact that many other imports have been wholly or partially anglicized. Hence although criterion always forms its plural with -a, ganglion may take either -s or -a (ganglions or ganglia), whilst bastion is always inflected with -s. Occasionally the anglicized and "classical" plural forms of a word may both be in common use, but with distinct meanings. Thus a copy-editor might remove appendices, whereas a surgeon would remove appendixes.

The correct inflection of words derived from Latin can be particularly complex, since the same suffix may form different Latinate plurals depending on the declension (or sometimes the part of speech) of the original. Thus the plural of stimulus (second declension) is stimuli, and that of genus (third declension) is genera. Status (fourth declension) is traditionally unchanged in the plural, whilst ignoramus (a first person plural Latin verb) has been wholly anglicized and becomes ignoramuses.

The only practical way to deal with such complexities in an algorithm is to categorize words by both suffix and inflection, and to allow for both anglicized and classical variants. The full paper lists a number of such categories.
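One possible encoding of such categories is sketched below. The data layout and names are assumptions, and the word lists contain only examples mentioned in this abstract rather than the categories of the full paper:

```perl
use strict;
use warnings;

# Each hypothetical category pairs a singular suffix with its anglicized
# ("modern") and classical plural suffixes, plus the words belonging to it.
my @categories = (
    { suffix => "on", modern => "a",    classical => "a",   words => ["criterion"] },
    { suffix => "on", modern => "ons",  classical => "a",   words => ["ganglion"]  },
    { suffix => "us", modern => "uses", classical => "i",   words => ["radius"]    },
    { suffix => "a",  modern => "as",   classical => "ata", words => ["dogma"]     },
);

sub category_plural {
    my ($word, $classical) = @_;
    for my $cat (@categories) {
        next unless grep { $_ eq $word } @{ $cat->{words} };
        # Remove the singular suffix, then attach the requested plural suffix.
        my $stem = substr($word, 0, length($word) - length($cat->{suffix}));
        return $stem . ($classical ? $cat->{classical} : $cat->{modern});
    }
    return $word . "s";    # not in any category: universal default
}

for my $word (qw(criterion ganglion bastion)) {
    print "$word: ", category_plural($word, 0), " / ", category_plural($word, 1), "\n";
}
```

Listing the member words explicitly is what lets criterion, ganglion, and bastion share a suffix yet inflect differently.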

Exceptions

Some categories of words contain only a single example (ox -> oxen, mongoose -> mongooses), and are more appropriately treated as exceptions to more general rules. Such exceptions are most efficiently handled via (hashed) table look-up, rather than pattern matching on suffixes. The full paper provides a comprehensive list of such words.
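A minimal sketch of the look-up-first strategy follows. The table holds just the two examples above, and the -goose suffix rule is an illustrative assumption, included only to show why mongoose needs an entry:

```perl
use strict;
use warnings;

# Hypothetical exceptions table, consulted before any pattern matching.
my %exception = (ox => "oxen", mongoose => "mongooses");

sub plural_with_exceptions {
    my ($word) = @_;
    return $exception{$word} if exists $exception{$word};   # hashed look-up
    return $1 . "geese" if $word =~ /\A(.*)goose\z/;        # illustrative suffix rule
    return $word . "s";                                      # universal default
}

print join(" ", map { plural_with_exceptions($_) } qw(ox goose mongoose)), "\n";
```

Without its table entry, mongoose would fall through to the suffix rule and be inflected incorrectly.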

Pluralizing algorithms for English

The full paper presents algorithms for forming plurals of English nouns, verbs, and adjectives. The algorithms are based on the rules of English inflection described in the Oxford English Dictionary [5] (OED), Fowler's Modern English Usage [6], and A Practical English Grammar [4]. Where these sources disagree, the OED is taken to be definitive.

Each algorithm first considers any user-defined plural transforms (this permits the algorithms to be adapted to local dialects or conventions). An exceptions table is then consulted to handle any singular special cases. The algorithms next test for a range of generic special cases, usually looking for specific suffixes or other standard patterns. Finally, the algorithms use the appropriate universal default rule to inflect unrecognized (and presumably regular) words.
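The four stages can be sketched for nouns as below. This is not the algorithm of the full paper: the tables are tiny, and every name (including the user-defined entry) is a hypothetical chosen for illustration:

```perl
use strict;
use warnings;

my %user_defined = ();                                   # stage 1: user overrides
my %exceptions   = (ox => "oxen", child => "children");  # stage 2: special cases
my @suffix_rules = (                                     # stage 3: [pattern, transform]
    [ qr/ss\z/,        sub { $_[0] . "es" } ],
    [ qr/[^aeiou]y\z/, sub { substr($_[0], 0, -1) . "ies" } ],
);

sub noun_plural {
    my ($word) = @_;
    return $user_defined{$word} if exists $user_defined{$word};
    return $exceptions{$word}   if exists $exceptions{$word};
    for my $rule (@suffix_rules) {
        my ($pattern, $transform) = @$rule;
        return $transform->($word) if $word =~ $pattern;
    }
    return $word . "s";                                  # stage 4: universal default
}

$user_defined{roof} = "rooves";    # a made-up local convention
print join(" ", map { noun_plural($_) } qw(roof ox class city cat)), "\n";
```

Ordering the stages from most to least specific means each later stage only ever sees words that the earlier ones did not claim.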

Having specified an algorithm for each particular part of speech, it is a relatively simple matter to combine them and construct a single algorithm that correctly handles any of these parts of speech. The general approach taken is to treat a word being pluralized as if it were a noun, unless it can be unambiguously recognized as a verb or adjective. Hence the unified pluralization algorithm first honours any user-defined inflections, then seeks to apply a subset of the steps from the verb- and adjective-specific algorithms presented above and, if they fail, finally applies the entire noun-specific algorithm to the word.
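A rough sketch of this noun-by-default strategy is given below. The small look-up tables stand in for the verb- and adjective-specific steps, the noun algorithm is reduced to the -s default, and all names are assumptions:

```perl
use strict;
use warnings;

# Illustrative tables only: words assumed to be unambiguously
# recognizable as adjectives or verbs.
my %user_defined = ();
my %adj_plural   = ("my" => "our",  "this" => "these");
my %verb_plural  = ("is" => "are",  "was"  => "were", "has" => "have");

sub noun_plural {                   # stand-in for the full noun algorithm
    my ($word) = @_;
    return $word . "s";
}

sub unified_plural {
    my ($word) = @_;
    return $user_defined{$word} if exists $user_defined{$word};
    return $adj_plural{$word}   if exists $adj_plural{$word};
    return $verb_plural{$word}  if exists $verb_plural{$word};
    return noun_plural($word);      # otherwise treat the word as a noun
}

print join(" ", map { unified_plural($_) } qw(this cat was)), "\n";
```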

"Number-insensitive" comparisons

One task which is complicated by the irregular inflections of many English plurals is that of indexing or cross-referencing text. Any reliable indexing algorithm will need to be able to identify text containing the various irregular plural forms of (normally) singular definitions, as well as the singular form of any plural definitions. Worse still, if the algorithm is to correctly cross-reference "classical" plurals such as kine, brethren, and stigmata, words which are alternate plural forms of a common singular (for example: cows and kine) must also be identified.

The full paper presents an algorithm for a "number-insensitive" equality test between two words. The algorithm returns true if:

the two words are identical, or
one word is a plural form of the other, or
the two words are distinct plural forms of some other word.

A Perl implementation

This section briefly summarizes a freely available Perl implementation (Lingua::EN::Inflect) of the pluralization algorithms presented above. The module and full supporting documentation are available from the Comprehensive Perl Archive Network (via http://www.perl.com), or directly from the author: http://www.csse.monash.edu.au/~damian/CPAN/Lingua-EN-Inflect.gz.tar

Inflecting plurals - the `PL_...()` subroutines

Lingua::EN::Inflect provides four exportable subroutines (prefixed PL_...) which implement the noun-, verb-, adjective-, and unified pluralization algorithms. All of the PL_...() subroutines take the singular form of the word to be inflected as their first argument and return the corresponding plural form. Each function also takes an optional second parameter, specifying the required grammatical number of the word. The plural inflection is only performed if the specified number is not 1.

The subroutines are:

PL_N($;$): PL_N() takes a singular English noun or pronoun and returns its plural.
PL_V($;$): PL_V() takes the singular form of a conjugated verb (one which is already in the correct grammatical person and mood) and returns the corresponding plural conjugation.
PL_ADJ($;$): PL_ADJ() takes the singular form of certain types of adjectives and returns the corresponding plural form.
PL($;$): PL() takes a singular English noun, pronoun, verb, or adjective and returns its plural form. Where a word has more than one inflection depending on its part of speech, the (singular) noun sense is generally preferred to the (singular) verb sense. Of course, the inherent ambiguity of such cases suggests that, where the part of speech is known, PL_N(), PL_V(), and PL_ADJ() should be used in preference to PL().

These would typically be used as follows:

        $food = "fish";
        print PL_ADJ("My"), PL_N(" child"), PL_V(" eats "), PL($food);

which would produce:

        Our children eat fish
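The effect of the optional second argument can be illustrated with a self-contained stand-in. The subroutine below is hypothetical, not the module's own code, and assumes that an omitted count means "pluralize":

```perl
use strict;
use warnings;

my %irregular = (child => "children");    # illustrative table

# Hypothetical stand-in: pluralize unless the supplied count is exactly 1.
sub plural_noun {
    my ($word, $count) = @_;
    return $word if defined $count && $count == 1;
    return $irregular{$word} // $word . "s";
}

for my $n (0, 1, 2) {
    print "$n ", plural_noun("error", $n), "\n";
}
```

Note that a count of zero takes the plural form, as is conventional in English ("0 errors").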

Modern vs classical inflections

Lingua::EN::Inflect can optionally distinguish between modern and classical plural variants (for example, radiuses versus radii) via the exportable subroutine classical(). If classical() is called with no arguments, it unconditionally invokes classical mode. If it is called with an argument, it invokes classical mode only if that argument evaluates to true. If the argument is false, classical mode is switched off. In classical mode, the non-anglicized plural form of a word (if one exists) is preferred. Hence, whereas dogma is normally inflected to dogmas, if classical mode is active it becomes dogmata.
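The switching behaviour described above can be mimicked in a few lines. The names set_classical() and plural_noun() are hypothetical stand-ins rather than the module's internals, and the table holds only the two examples from this section:

```perl
use strict;
use warnings;

my $classical_mode = 0;

# No argument: switch classical mode on. Otherwise follow the argument's truth.
sub set_classical {
    $classical_mode = @_ ? ($_[0] ? 1 : 0) : 1;
    return;
}

# word => [ modern plural, classical plural ]
my %variants = (
    dogma  => [ "dogmas",   "dogmata" ],
    radius => [ "radiuses", "radii"   ],
);

sub plural_noun {
    my ($word) = @_;
    my $forms = $variants{$word} or return $word . "s";
    return $forms->[$classical_mode];
}

print plural_noun("dogma"), "\n";    # modern form
set_classical();
print plural_noun("dogma"), "\n";    # classical form
set_classical(0);
print plural_noun("radius"), "\n";   # modern form again
```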

Interpolating inflections in strings - The `inflect()` subroutine

By far the commonest use of the inflection subroutines is to produce message strings for various purposes. Unfortunately the need to separate each PL_...() subroutine call often detracts from the readability of the resulting code. To ameliorate this problem, Lingua::EN::Inflect provides an exportable string-interpolating subroutine (inflect($)), which recognizes calls to the various inflection subroutines within a string and interpolates them appropriately.

Using inflect(), plurals can be interpolated directly into a string as follows:

        print inflect "$errs PL_N(error,$errs) PL_V(was,$errs) detected.\n";
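How such interpolation might work can be sketched with a single regular-expression substitution. This is an assumption about the general technique, not the module's actual implementation. The names interpolate_plurals(), pl_n(), and pl_v() are hypothetical, and only numeric counts are recognized:

```perl
use strict;
use warnings;

my %verb = (was => "were", is => "are");    # illustrative table

sub pl_n {
    my ($word, $count) = @_;
    return $word if defined $count && $count == 1;
    return $word . "s";
}

sub pl_v {
    my ($word, $count) = @_;
    return $word if defined $count && $count == 1;
    return $verb{$word} if exists $verb{$word};
    (my $plural = $word) =~ s/s\z//;
    return $plural;
}

my %dispatch = (PL_N => \&pl_n, PL_V => \&pl_v);

# Replace each embedded PL_N(word,count) or PL_V(word,count) by its inflection.
sub interpolate_plurals {
    my ($text) = @_;
    $text =~ s{\b(PL_N|PL_V)\(\s*([^,()]+?)\s*(?:,\s*(\d+)\s*)?\)}
              {$dispatch{$1}->($2, $3)}ge;
    return $text;
}

for my $errs (1, 3) {
    print interpolate_plurals("$errs PL_N(error,$errs) PL_V(was,$errs) detected.\n");
}
```

Because the count is interpolated by Perl before the string is scanned, the substitution sees only literal digits.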

Comparing "number-insensitively" - The `PL_..._eq()` subroutines

Lingua::EN::Inflect also implements the number-insensitive comparison algorithm mentioned above, providing the exportable subroutines PL_eq($$), PL_N_eq($$), PL_V_eq($$), and PL_ADJ_eq($$). Each of these subroutines takes two strings, and compares them using the corresponding plural-inflection subroutine (PL(), PL_N(), PL_V(), and PL_ADJ() respectively).

The actual value returned by the various PL_..._eq() subroutines encodes which of the three equality rules succeeded: "eq" is returned if the strings were identical, "s:p" if the strings were singular and plural respectively, "p:s" for plural and singular, and "p:p" for two distinct plural forms of a word (for example, octopuses and octopodes). Inequality is indicated by returning an empty string.
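The return-value encoding can be illustrated with a self-contained sketch. The table of plural forms and the names plural_forms() and number_insensitive_cmp() are assumptions. The real subroutines derive plural forms from the inflection algorithms rather than from a fixed table:

```perl
use strict;
use warnings;

# Illustrative table: singular => all accepted plural forms.
my %plurals = (octopus => [qw(octopuses octopodes)]);

sub plural_forms {
    my ($singular) = @_;
    return @{ $plurals{$singular} // [ $singular . "s" ] };
}

sub number_insensitive_cmp {
    my ($x, $y) = @_;
    return "eq"  if $x eq $y;
    return "s:p" if grep { $_ eq $y } plural_forms($x);
    return "p:s" if grep { $_ eq $x } plural_forms($y);
    for my $forms (values %plurals) {
        return "p:p" if (grep { $_ eq $x } @$forms) && (grep { $_ eq $y } @$forms);
    }
    return "";    # not equal in any sense
}

for my $pair ([qw(octopus octopuses)], [qw(octopodes octopus)],
              [qw(octopuses octopodes)], [qw(cat dog)]) {
    printf "%-10s %-10s '%s'\n", @$pair, number_insensitive_cmp(@$pair);
}
```

Since every non-empty code is true and the empty string is false in Perl, the result can be used directly in a conditional while still reporting which rule matched.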

Conclusion

Capturing the English plural inflection in reliable algorithms proves to be a feasible, if challenging, task. The robustness of such algorithms depends heavily on encoding general rules (categories of inflection), rather than attempting to enumerate many hundreds of exceptions to the universal defaults.

It is possible to cater for differences in major usage patterns (for example, modern and classical inflections) and for local differences in dialect (via user-defined inflections). It is also possible to make use of the pluralization algorithms to efficiently detect pairs of words which differ only in grammatical number.

References

[1]: Wall, L., Christiansen, T., & Schwartz, R.L., Programming Perl, 2nd Edition, O'Reilly & Associates, 1996.
[2]: McCrum, R., Cran, W., & MacNeil, R., The Story of English, Penguin Books, New York, 1986.
[3]: Bryson, B., The Mother Tongue: English and how it got that way, William Morrow, New York, 1990.
[4]: Thomson, A.J., & Martinet, A.V., A Practical English Grammar, Fourth Edition, Oxford University Press, Oxford, 1986.
[5]: The Oxford English Dictionary, Second Edition, Oxford University Press, Oxford, 1989.
[6]: Fowler, H.W., Modern English Usage, Second Edition, Oxford University Press, Oxford, 1965.