An Algorithmic Approach to English Pluralization
(Extended Abstract)
Damian Conway
School of Computer Science and Software Engineering
Monash University
Clayton 3168, Australia
Abstract
This paper discusses some of the issues involved in designing robust and
comprehensive algorithms which convert singular English nouns, verbs and
adjectives to their appropriate plural forms. An overview is given of a
full implementation of the various algorithms in the
Perl [1]
programming language. The full paper is available on-line from:
http://www.cs.monash.edu.au/~damian/papers/HTML/Inflections.html
The problem of English plurals
The English language is overburdened with idiosyncratic grammatical features,
a legacy of its eclectic accretion over 1500 years [2,3].
One unfortunate consequence of this otherwise admirable richness is that
automatically generating correct English is fraught with difficulty. The
use of English plurals in synthetic sentences is a case in point. In computing
applications, for example, it is quite common to encounter error messages
which jar because they do not correctly inflect for grammatical number:
Compilation aborted: 1 errors were detected.
Individually, such inelegances are easily overcome (or, more accurately,
the inelegance may be transferred from the interface to the code):
print "Compilation aborted: $count ",
($count==1 ? "error was" : "errors were"),
" detected.\n";
Unfortunately, in attempting to generate more complex text, some less tractable
problems arise, notably the diversity of plural forms available in English.
Consider the difficulty faced by a text generation system (machine or human)
in converting the following to their correct plural forms:
Her criterion differs from mine.
The Major General met the Governor General.
Analysis of this aquarium's fish failed to determine its genus.
That phalanx suffered a trauma.
The full paper presents algorithms that provide (nearly) automatic plural
inflections for such examples.
Categories of English plurals
Universal rules
Although described here first, and encountered most frequently, the universal
rules of plural inflection are the "last resort" in an algorithmic sense.
That is, these rules only apply when all other more specific rules or special
cases are inapplicable.
The rules themselves are well-known and need no elaboration. By default:
- Nouns are made plural by appending -s.
- Verbs are made plural by removing any trailing -s
  (and otherwise do not change).
- Adjectives and adverbs do not change when made plural.
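The three defaults can be sketched in Perl as follows. This is a minimal illustration only: the subroutine names are hypothetical, the part of speech is assumed to be known in advance, and no suffix rules or exceptions are applied.

```perl
use strict;
use warnings;

# Hypothetical helpers illustrating only the universal defaults.

sub default_plural_noun {
    my ($word) = @_;
    return $word . "s";                  # nouns: append -s
}

sub default_plural_verb {
    my ($word) = @_;
    (my $plural = $word) =~ s/s\z//;     # verbs: drop one trailing -s, if any
    return $plural;
}

sub default_plural_adj {
    my ($word) = @_;
    return $word;                        # adjectives and adverbs: unchanged
}

print default_plural_noun("cat"), " ", default_plural_verb("eats"), "\n";
```

Because these rules are the last resort, a complete algorithm would reach them only after every more specific rule had declined to handle the word.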
Suffix categories
There are, however, an enormous number of exceptions to these defaults [4].
Most such exceptions are still regular (in the sense that they occur in
predictable patterns), but are specific to a particular word suffix.
For example, nouns that end in -ss universally become
-sses in the plural (and vice versa for verbs). Likewise,
nouns which end in a consonant followed by -y almost always
become -ies in the plural.
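As a rough sketch, the two noun suffix rules just mentioned might be tried before the universal default. The subroutine name is hypothetical, input is assumed to be lower case, and the "almost always" cases that escape the -y rule are deliberately ignored.

```perl
use strict;
use warnings;

# Hypothetical sketch: two suffix rules tried before the universal default.
sub plural_noun_by_suffix {
    my ($word) = @_;
    return $word . "es" if $word =~ /ss\z/;                 # -ss -> -sses
    return $1 . "ies"   if $word =~ /\A(.*[^aeiou])y\z/;    # consonant + -y -> -ies
    return $word . "s";                                     # universal default
}

print join(" ", map { plural_noun_by_suffix($_) } qw(class lady day cat)), "\n";
```

Note that day is left to the default rule, since its -y follows a vowel.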
Certain types of adjectives also inflect in the plural. For example,
possessive adjectives that end in -'s or -'
in the singular, are made plural by forming the plural of the root word
and appending an apostrophe (unless the root's plural does not itself end
in -s, in which case -'s is appended).
Hence cat's becomes cats', mantis'
becomes mantises', whilst child's
becomes children's.
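The possessive rule can be illustrated with a short Perl sketch. The subroutine name is hypothetical, and a three-entry look-up table stands in (as an assumption) for a full noun pluralizer.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a full noun pluralizer: a tiny look-up table
# covering only the roots used in this example.
my %root_plural = (cat => "cats", mantis => "mantises", child => "children");

sub plural_possessive {
    my ($word) = @_;
    # Strip the possessive marker (-'s or a bare -') to recover the root.
    # Words without a marker are returned unchanged.
    (my $root = $word) =~ s/'s?\z// or return $word;
    my $plural = $root_plural{$root} // $root . "s";
    # Plurals ending in -s take a bare apostrophe; all others take -'s.
    return $plural =~ /s\z/ ? $plural . "'" : $plural . "'s";
}

print join(", ", map { plural_possessive($_) } "cat's", "mantis'", "child's"), "\n";
```

The apostrophe is appended by concatenation rather than interpolation, because Perl has historically treated an apostrophe after a variable name inside a double-quoted string as a package separator.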
Other suffix categories arise because words of foreign origin (most
commonly Ancient Greek or Latin) have retained a non-anglicized plural
inflection. Hence criterion becomes criteria,
nucleus becomes nuclei, and matrix
becomes matrices.
Dealing with such categories is complicated by the fact that many other
imports have been wholly or partially anglicized. Hence although criterion
always forms its plural with -a, ganglion
may take either -s or -a (ganglions
or ganglia), whilst bastion is always inflected
with -s. Occasionally the anglicized and "classical" plural
forms of a word may both be in common use, but with distinct meanings.
Thus a copy-editor might remove appendices, whereas a surgeon
would remove appendixes.
The correct inflection of words derived from Latin can be particularly
complex, since the same suffix may form different Latinate plurals depending
on the declension (or sometimes the part of speech) of the original. Thus
the plural of stimulus (second declension) is stimuli,
and that of genus (third declension) is genera.
Status (fourth declension) is traditionally unchanged in
the plural, whilst ignoramus (a first person plural Latin
verb) has been wholly anglicized and becomes ignoramuses.
The only practical way to deal with such complexities in an algorithm
is to categorize words by both suffix and inflection, and to allow
for both anglicized and classical variants. The full paper lists a number
of such categories.
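One possible way to represent such categories is sketched below in Perl. The data layout and names are assumptions made for illustration, and each word list holds only examples drawn from this abstract, not the categories of the full paper.

```perl
use strict;
use warnings;

# Hypothetical category table: a singular suffix, its anglicized ("modern")
# and classical plural suffixes, and the words belonging to the category.
my @categories = (
    { suffix => "on", modern => "a",    classical => "a",   words => { criterion => 1 } },
    { suffix => "on", modern => "ons",  classical => "a",   words => { ganglion  => 1 } },
    { suffix => "us", modern => "uses", classical => "i",   words => { radius    => 1 } },
    { suffix => "a",  modern => "as",   classical => "ata", words => { dogma     => 1 } },
);

sub plural_by_category {
    my ($word, $want_classical) = @_;
    for my $cat (@categories) {
        next unless $cat->{words}{$word};
        # Remove the singular suffix, then attach the chosen plural suffix.
        my $stem = substr($word, 0, length($word) - length($cat->{suffix}));
        return $stem . ($want_classical ? $cat->{classical} : $cat->{modern});
    }
    return $word . "s";    # in no category: universal default
}

for my $w (qw(criterion ganglion radius dogma bastion)) {
    printf "%-10s %-10s %s\n", $w, plural_by_category($w, 0), plural_by_category($w, 1);
}
```

Words such as bastion belong to no category and so fall through to the default, while criterion takes the same form under either variant.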
Exceptions
Some categories of words contain only a single example (ox
-> oxen, mongoose ->
mongooses), and are more appropriately treated as exceptions
to more general rules. Such exceptions are most efficiently handled via
(hashed) table look-up, rather than pattern matching on suffixes. The full
paper provides a comprehensive list of such words.
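A hashed look-up of this kind takes only a few lines of Perl. The sketch below uses a hypothetical three-entry table and falls back directly to the universal default, omitting the suffix rules a complete algorithm would try in between.

```perl
use strict;
use warnings;

# Hypothetical exceptions table: irregular singular => plural.
my %exception = (ox => "oxen", mongoose => "mongooses", child => "children");

sub plural_noun {
    my ($word) = @_;
    return $exception{$word} if exists $exception{$word};   # hashed table look-up
    return $word . "s";                                     # universal default
}

print join(" ", map { plural_noun($_) } qw(ox mongoose cat)), "\n";
```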
Pluralizing algorithms for English
The full paper presents algorithms for forming plurals of English nouns,
verbs, and adjectives. The algorithms are based on the rules of English
inflection described in the Oxford English Dictionary [5]
(OED), Fowler's Modern English Usage [6],
and A Practical English Grammar [4]. Where these
sources disagree, the OED is taken to be definitive.
Each algorithm first considers any user-defined plural transforms (this
permits the algorithms to be
adapted to local dialects or conventions). An exceptions table is then
consulted to handle any singular special cases. The algorithms next test
for a range of generic special cases, usually looking for specific suffixes
or other standard patterns. Finally, the algorithms use the appropriate
universal default rule to inflect unrecognized (and presumably regular)
words.
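The ordering of these four steps can be sketched for nouns as follows. All names here are hypothetical, and the tables and suffix rules are reduced to a handful of cases taken from the examples in this abstract.

```perl
use strict;
use warnings;

my %user_defined;                       # step 1: user-defined transforms
my %exception = (ox => "oxen");         # step 2: exceptions table (tiny sample)

# Hypothetical hook for registering a local dialect or convention.
sub add_user_plural {
    my ($singular, $plural) = @_;
    $user_defined{$singular} = $plural;
    return;
}

sub plural_noun {
    my ($word) = @_;
    return $user_defined{$word} if exists $user_defined{$word};   # step 1
    return $exception{$word}    if exists $exception{$word};      # step 2
    return $word . "es" if $word =~ /ss\z/;                       # step 3: suffix rules
    return $1 . "ies"   if $word =~ /\A(.*[^aeiou])y\z/;
    return $word . "s";                                           # step 4: default
}

add_user_plural(cow => "kine");
print join(" ", map { plural_noun($_) } qw(cow ox class lady cat)), "\n";
```

Checking the user-defined table first means a local convention can override even an entry in the exceptions table.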
Having specified an algorithm for each particular part of speech, it
is a relatively simple matter to combine them and construct a single algorithm
that correctly handles any of these parts of speech. The general approach
taken is to treat a word being pluralized as if it were a noun, unless
it can be unambiguously recognized as a verb or adjective. Hence the unified
pluralization algorithm first honours any user-defined inflections, then
seeks to apply a subset of the steps from the verb- and adjective-specific
algorithms presented above and, if they fail, finally applies the entire
noun-specific algorithm to the word.
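A much-simplified sketch of this noun-by-default strategy is given below. The tables of "unambiguous" verbs and adjectives are hypothetical and tiny, and the noun algorithm is reduced to an exceptions table plus the universal default.

```perl
use strict;
use warnings;

# Tiny hypothetical tables; a real implementation would be far larger.
my %user_defined;
my %verb_plural    = ("is" => "are", "was" => "were", "has" => "have");
my %adj_plural     = ("my" => "our", "this" => "these");
my %noun_exception = ("child" => "children", "fish" => "fish");

sub plural_noun {
    my ($word) = @_;
    return $noun_exception{$word} // $word . "s";
}

sub plural_any {
    my ($word) = @_;
    return $user_defined{$word} if exists $user_defined{$word};   # user-defined first
    return $verb_plural{$word}  if exists $verb_plural{$word};    # unambiguous verbs
    return $adj_plural{$word}   if exists $adj_plural{$word};     # unambiguous adjectives
    return plural_noun($word);                                    # otherwise: a noun
}

print join(" ", map { plural_any($_) } qw(my child was fish)), "\n";
```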
"Number-insensitive" comparisons
One task which is complicated by the irregular inflections of many English
plurals is that of indexing or cross-referencing text. Any reliable indexing
algorithm will need to be able to identify text containing the various
irregular plural forms of (normally) singular definitions, as well as the
singular form of any plural definitions. Worse still, if the algorithm
is to correctly cross-reference "classical" plurals such as kine,
brethren, and stigmata, words which are
alternate plural forms of a common singular (for example: cows
and kine) must also be identified.
The full paper presents an algorithm for a "number-insensitive" equality
test between two words. The algorithm returns true if:
- the two words are identical, or
- one word is a plural form of the other, or
- the two words are distinct plural forms of some other word.
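The three conditions might be tested as in the following Perl sketch. It assumes a hypothetical table listing every plural form of each irregular singular, and falls back to the universal default (appending -s) for all other words.

```perl
use strict;
use warnings;

# Hypothetical table: singular => all known plural forms.
my %plurals_of = (
    cow     => [qw(cows kine)],
    brother => [qw(brothers brethren)],
    stigma  => [qw(stigmas stigmata)],
);

sub all_plurals {
    my ($word) = @_;
    return @{ $plurals_of{$word} // [ $word . "s" ] };
}

sub is_plural_of {
    my ($plural, $singular) = @_;
    return scalar grep { $_ eq $plural } all_plurals($singular);
}

sub number_insensitive_eq {
    my ($x, $y) = @_;
    return 1 if $x eq $y;                                        # identical
    return 1 if is_plural_of($y, $x) || is_plural_of($x, $y);    # singular/plural pair
    for my $singular (keys %plurals_of) {                        # two plurals of one word
        return 1 if is_plural_of($x, $singular) && is_plural_of($y, $singular);
    }
    return 0;
}

print number_insensitive_eq("cows", "kine") ? "match\n" : "no match\n";
```

The third condition is checked only against the table, since two distinct plurals of one word cannot both arise from the default rule.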
A Perl implementation
This section briefly summarizes a freely available Perl implementation
(Lingua::EN::Inflect) of the pluralization algorithms
presented above. The module and full supporting documentation are available
from the Comprehensive Perl Archive Network (via http://www.perl.com),
or directly from the author:
http://www.csse.monash.edu.au/~damian/CPAN/Lingua-EN-Inflect.gz.tar
Inflecting plurals - the PL_...() subroutines
Lingua::EN::Inflect provides four exportable subroutines
(prefixed PL_...) which implement the noun-, verb-, adjective-,
and unified pluralization algorithms. All of the PL_...()
subroutines take the singular form of the word to be inflected as their
first argument and return the corresponding plural form. Each function
also takes an optional second parameter, specifying the required grammatical number of
the word. The plural inflection is only performed if the specified number
is not 1.
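The effect of the optional number argument can be sketched as below. The subroutine is a hypothetical stand-in that applies only the universal noun default and assumes the count, when given, is numeric.

```perl
use strict;
use warnings;

# Hypothetical stand-in: inflect only when the requested number is not 1.
sub plural_noun_for_count {
    my ($word, $count) = @_;
    return $word if defined $count && $count == 1;   # singular requested
    return $word . "s";                              # otherwise: plural
}

for my $n (0, 1, 2) {
    print "$n ", plural_noun_for_count("error", $n), "\n";
}
```

A count of zero therefore yields the plural form, as does omitting the count altogether.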
The subroutines are:
- PL_N($;$)
  PL_N() takes a singular English noun or pronoun and returns
  its plural.
- PL_V($;$)
  PL_V() takes the singular form of a conjugated verb (one
  which is already in the correct grammatical person and mood) and returns
  the corresponding plural conjugation.
- PL_ADJ($;$)
  PL_ADJ() takes the singular form of certain types of adjectives
  and returns the corresponding plural form.
- PL($;$)
  PL() takes a singular English noun, pronoun, verb, or adjective
  and returns its plural form. Where a word has more than one inflection
  depending on its part of speech, the (singular) noun sense is generally
  preferred to the (singular) verb sense. Of course, the inherent ambiguity
  of such cases suggests that, where the part of speech is known, PL_N(),
  PL_V(), and PL_ADJ() should be used in
  preference to PL().
These would typically be used as follows:
$food = "fish";
print PL_ADJ("My"), PL_N(" child"), PL_V(" eats "), PL($food);
which would produce:
Our children eat fish
Modern vs classical inflections
Lingua::EN::Inflect can optionally distinguish between
modern and classical plural variants (for example, radiuses
versus radii) via the exportable subroutine classical().
If classical() is called with no arguments, it unconditionally
invokes classical mode. If it is called with an argument, it invokes classical
mode only if that argument evaluates to true. If the argument is false,
classical mode is switched off. In classical mode, the non-anglicized plural
form of a word (if one exists) is preferred. Hence, whereas dogma
is normally inflected to dogmas, if classical mode is active
it becomes dogmata.
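The switching behaviour described here can be modelled with a single flag, as in the sketch below. The names set_classical and plural_noun are hypothetical stand-ins rather than the module's own code, and the variant table holds only the two examples from the text.

```perl
use strict;
use warnings;

my $classical_mode = 0;    # off by default

# No argument: switch classical mode on.
# One argument: switch it on or off according to the argument's truth.
sub set_classical {
    $classical_mode = @_ ? ($_[0] ? 1 : 0) : 1;
    return;
}

# Hypothetical variant table: singular => [modern, classical].
my %variants = (dogma => ["dogmas", "dogmata"], radius => ["radiuses", "radii"]);

sub plural_noun {
    my ($word) = @_;
    my $forms = $variants{$word} or return $word . "s";
    return $forms->[$classical_mode];
}

print plural_noun("dogma"), "\n";    # classical mode off
set_classical();
print plural_noun("dogma"), "\n";    # classical mode on
set_classical(0);
```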
Interpolating inflections in strings - The inflect() subroutine
By far the commonest use of the inflection subroutines is to produce message
strings for various purposes. Unfortunately the need to separate each PL_...()
subroutine call often detracts from the readability of the resulting code.
To ameliorate this problem, Lingua::EN::Inflect provides
an exportable string-interpolating subroutine (inflect($)),
which recognizes calls to the various inflection subroutines within a string
and interpolates them appropriately.
Using inflect(), plurals can be interpolated directly
into a string as follows:
print inflect "$errs PL_N(error,$errs) PL_V(was,$errs) detected.\n";
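The general idea of recognizing such calls within a string can be sketched with a single regular-expression substitution. The subroutine interpolate_plurals and the toy inflectors below are hypothetical: they handle only the two markers and the words used in this example, and are not the module's implementation.

```perl
use strict;
use warnings;

# Toy inflectors keyed by marker name. Hypothetical stand-ins that cover
# only the words used in this example.
my %verb_plural = (was => "were", is => "are");
my %inflector = (
    PL_N => sub { my ($word, $n) = @_; return $n == 1 ? $word : $word . "s" },
    PL_V => sub { my ($word, $n) = @_; return $n == 1 ? $word : ($verb_plural{$word} // $word) },
);

sub interpolate_plurals {
    my ($text) = @_;
    # Replace each MARKER(word,count) with the word inflected for that count.
    $text =~ s/\b(PL_N|PL_V)\(\s*(\w+)\s*,\s*(\d+)\s*\)/$inflector{$1}->($2, $3)/ge;
    return $text;
}

for my $errs (1, 3) {
    print interpolate_plurals("$errs PL_N(error,$errs) PL_V(was,$errs) detected.\n");
}
```

Perl interpolates $errs into the string before the subroutine is called, so the substitution sees only literal counts.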
Comparing "number-insensitively" - The PL_..._eq() subroutines
Lingua::EN::Inflect also implements the number-insensitive
comparison algorithm mentioned above, providing the exportable subroutines
PL_eq($$), PL_N_eq($$), PL_V_eq($$),
and PL_ADJ_eq($$). Each of these subroutines takes two
strings, and compares them using the corresponding plural-inflection subroutine
(PL(), PL_N(), PL_V(),
and PL_ADJ() respectively).
The actual value returned by the various PL_..._eq()
subroutines encodes which of the three equality rules succeeded:
"eq" is returned if the strings were identical, "s:p"
if the strings were singular and plural respectively, "p:s"
for plural and singular, and "p:p" for two distinct plural
forms of a word (for example, octopuses and octopodes).
Inequality is indicated by returning an empty string.
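The return-value encoding can be illustrated with the following sketch. The subroutine compare_number_insensitive is a hypothetical stand-in built on a one-entry table of plural forms, with the universal default (appending -s) used for all other words.

```perl
use strict;
use warnings;

# Hypothetical table: singular => all known plural forms.
my %plurals_of = (octopus => [qw(octopuses octopodes)]);

sub plurals {
    my ($word) = @_;
    return @{ $plurals_of{$word} // [ $word . "s" ] };
}

sub has_plural {
    my ($singular, $plural) = @_;
    return scalar grep { $_ eq $plural } plurals($singular);
}

sub compare_number_insensitive {
    my ($x, $y) = @_;
    return "eq"  if $x eq $y;                 # identical strings
    return "s:p" if has_plural($x, $y);       # singular, then plural
    return "p:s" if has_plural($y, $x);       # plural, then singular
    for my $singular (keys %plurals_of) {     # two plurals of one word
        return "p:p" if has_plural($singular, $x) && has_plural($singular, $y);
    }
    return "";                                # no match
}

print compare_number_insensitive("octopuses", "octopodes"), "\n";
```

Since every successful code is a non-empty string and failure is the empty string, the result can be used directly as a boolean condition.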
Conclusion
Capturing the English plural inflection in reliable algorithms proves to
be a feasible, if challenging, task. The robustness of such algorithms
depends heavily on encoding general rules (categories of inflection), rather
than attempting to enumerate many hundreds of exceptions to the universal
defaults.
It is possible to cater for differences in major usage patterns (for
example, modern and classical inflections) and for local differences in
dialect (via user-defined inflections). It is also possible to make use
of the pluralization algorithms to efficiently detect pairs of words which
differ only in grammatical number.
References
[1] Wall, L., Christiansen, T., & Schwartz, R.L., Programming Perl,
    2nd Edition, O'Reilly & Associates, 1996.
[2] McCrum, R., Cran, W., & MacNeil, R., The Story of English, Penguin
    Books, New York, 1986.
[3] Bryson, B., The Mother Tongue: English and how it got that way,
    William Morrow, New York, 1990.
[4] Thomson, A.J., & Martinet, A.V., A Practical English Grammar, Fourth
    Edition, Oxford University Press, Oxford, 1986.
[5] The Oxford English Dictionary, Second Edition, Oxford University
    Press, Oxford, 1989.
[6] Fowler, H.W., Modern English Usage, Second Edition, Oxford University
    Press, Oxford, 1965.