To our knowledge, there is only one other publicly accessible frequency dictionary of English that is based on a large mega-corpus—Word Frequencies in Written and Spoken English (Leech, Rayson, and Wilson, 2001). However, our dictionary is quite different in at least three major respects. First, the Longman frequency dictionary represents British, not American, English, and it bases its word-frequency information on the British National Corpus (BNC). Second, most of the texts in the BNC are at least 20 years old, while texts in the Corpus of Contemporary American English (COCA) are current through late 2008. Third, while both corpora are balanced for genre (e.g. spoken, fiction, newspaper, and academic), COCA (385 million words as of 2008, currently 400 million and growing by 20 million per year) is nearly four times as large as the BNC (100 million words), allowing us to have more confidence in determining the words that should "make the list" and in finding their meaningful neighbors.
In addition to the differences in focus, age, and sampling size between the two dictionaries, there are also differences in presentation format. The Longman dictionary is mainly composed of straight frequency lists of words and lemmas, while this dictionary is oriented specifically to language learners, supplementing the frequency listings with the unique features previously mentioned: (a) frequency-ranked collocates (co-occurring words) for each headword in the frequency dictionary, which can help learners and their teachers better understand the meanings and uses of the high-frequency words; and (b) the more than 30 thematically oriented vocabulary lists (call-out boxes) for particular semantic, grammatical, or lexical categories that would be helpful for language training purposes.
The corpus
A frequency dictionary is only as good as the corpus on which it is based. The Corpus of Contemporary American English (COCA) is the largest balanced corpus of American English, and the largest balanced corpus of any language that is publicly available (http://www.americancorpus.org). In addition to being very large (currently over 400 million words; 20 million words each year 1990-2008), the corpus is also evenly balanced between spoken texts (unscripted conversation from 150+ radio and TV shows), fiction (e.g. books, short stories, movie scripts), popular magazines (100+ titles), newspapers (ten US papers), and academic journals (100+ peer-reviewed journals), for a total of more than 150,000 texts.
The more than 150,000 texts come from a variety of sources:
Spoken: (79 million words) transcripts of unscripted conversation from more than 150 different TV and radio programs (e.g. All Things Considered (NPR), Newshour (PBS), Good Morning America (ABC), Today Show (NBC), 60 Minutes (CBS), Hannity and Colmes (Fox), Jerry Springer, etc.). (See notes on the naturalness and authenticity of the language from these transcripts.)
Fiction: (76 million words) short stories and plays from literary magazines, children's magazines, popular magazines, first chapters of first edition books 1990-present, and movie scripts.
Popular magazines: (81 million words) nearly 100 different magazines, with a good mix (overall, and by year) between specific domains (news, health, home and gardening, women, financial, religion, sports, etc.). A few examples are Time, Men's Health, Good Housekeeping, Cosmopolitan, Fortune, Christian Century, Sports Illustrated, etc.
Newspapers: (76 million words) ten newspapers from across the US, including: USA Today, New York Times, Atlanta Journal Constitution, San Francisco Chronicle, etc. In most cases, there is a good mix between different sections of the newspaper, such as local news, opinion, sports, financial, etc.
Academic journals: (76 million words) nearly 100 different peer-reviewed journals. These were selected to cover the entire range of the Library of Congress classification system (e.g. a certain percentage from B (philosophy, psychology, religion), D (world history), K (education), T (technology), etc.), both overall and by number of words per year.
In summary, the corpus is very well balanced at both the "macro" level (e.g. spoken, fiction, newspapers) and the "micro" level (i.e. the types of texts and the distribution of the sources) within each of these macro genres.
Annotating and organizing the data from the corpus
In order to create a frequency dictionary, the words in the corpus must be tagged (for part of speech) and lemmatized. Tagging means that a part of speech is assigned to each word—noun, verb, and so on. Lemmatization means that each word form is assigned to a particular "head word" or "lemma", such as go, goes, going, went, and gone being marked as forms of the lemma go.
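As a rough illustration of what lemmatization does (the lookup table, function name, and token format below are invented for this sketch, not the actual CLAWS/COCA pipeline), each running word form can be thought of as being mapped to a lemma plus a part of speech:

```python
# Hypothetical, hand-built lemma lookup for illustration only; a real system
# derives these mappings from a full morphological lexicon, not a small dict.
LEMMAS = {
    "go": ("go", "verb"), "goes": ("go", "verb"), "going": ("go", "verb"),
    "went": ("go", "verb"), "gone": ("go", "verb"),
    "light": ("light", "noun"),  # ambiguous: also a verb or adjective; context must decide (see below)
}

def lemmatize(word_form):
    """Map a surface form to a (lemma, part of speech) pair, defaulting to itself."""
    return LEMMAS.get(word_form.lower(), (word_form.lower(), "unknown"))

print(lemmatize("went"))   # ('go', 'verb')
print(lemmatize("Going"))  # ('go', 'verb')
```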
The tagging and lemmatization were done with the CLAWS tagger (Version 7), which is the same tagger that was used for the British National Corpus (http://www.natcorp.ox.ac.uk/) and for other important corpora of English as well. One of the most difficult parts of tagging, of course, is to correctly assign the part of speech for words that are potentially ambiguous. Words such as computer, disturb, lazy, or fitfully are unambiguously tagged as noun, verb, adjective, and adverb, respectively. But in a case such as light, the word can be a noun (he turned on the light), a verb (should we light the fire?), or an adjective (there was a light breeze). In these circumstances, the tagger looks at the context in which the word occurs in each instance to determine the correct part of speech. While the CLAWS tagger is very good, it does produce errors. We have tried to correct for most of these, but there are undoubtedly still some that remain.
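CLAWS itself is a separate, stand-alone tool, but the general idea of context-based disambiguation can be sketched with a different, freely available tagger. The snippet below uses NLTK's default part-of-speech tagger purely as a stand-in (the sentences are the ones from the paragraph above; the exact tags it assigns are not guaranteed to match CLAWS):

```python
import nltk
# May require one-time downloads of NLTK's tokenizer and tagger models via nltk.download().

sentences = ["He turned on the light.",
             "Should we light the fire?",
             "There was a light breeze."]

for sentence in sentences:
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    # Show which tag the surrounding context led the tagger to choose for "light"
    print([pair for pair in tagged if pair[0].lower() == "light"])
# Expected, roughly: a noun tag, a verb tag, and an adjective tag, in that order.
```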
It of course makes sense to provide separate entries in the dictionary for words with different parts of speech, such as noun and verb. For example,
the word beat as a noun has collocates such as hear, miss, steady, drum, and rhythm. As a verb, however, it has collocates such as heart, egg, bowl, severely, or Yankees. Even in cases where the word appears as both a noun and an adjective (magic, potential, dark, veteran), the collocates for the two parts of speech are very different, and it would probably be too confusing to conflate them into one entry. Perhaps the most problematic are function words such as since, which can appear up to three times in this dictionary. Since, for example, appears as a preposition (he's been here since 1942), an adverb (several other schools have since been constructed), and a conjunction (since they won't be here until 5 pm, we'll just leave for a minute). In these cases, we have simply followed the output of the tagger. If it says that there are multiple different parts of speech, then the word appears under each of those parts of speech in the dictionary.
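One way to picture the consequence of keeping separate entries is that everything downstream, including collocate counting, is keyed on the (lemma, part of speech) pair rather than on the lemma alone. The sketch below is a simplified, hypothetical version of such counting (the token list, window size, and function name are invented for illustration):

```python
from collections import Counter, defaultdict

def collect_collocates(tagged_tokens, window=4):
    """Count words co-occurring within +/- `window` tokens of each (lemma, POS) key,
    so that beat the noun and beat the verb accumulate separate collocate lists."""
    collocates = defaultdict(Counter)
    for i, (lemma, pos) in enumerate(tagged_tokens):
        lo, hi = max(0, i - window), min(len(tagged_tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                collocates[(lemma, pos)][tagged_tokens[j][0]] += 1
    return collocates

tokens = [("she", "pron"), ("beat", "verb"), ("the", "det"), ("egg", "noun"),
          ("to", "prep"), ("a", "det"), ("steady", "adj"), ("beat", "noun")]
print(collect_collocates(tokens)[("beat", "verb")].most_common(3))
```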
Frequency and dispersion
After the tagging and lemmatization of the 400 million words in the corpus, our final step was to determine exactly which of these words would be included in the final list of the 5,000 most frequent words (or lemmas). One approach would be to simply use frequency counts. For example, all lemmas that occur 5,000 times or more in the corpus might be included in the dictionary. Imagine, however, a case where a particular scientific term was used repeatedly in engineering articles or in sports reporting in newspapers, but it did not appear in any works of fiction or in any of the spoken texts. Alternatively, suppose that a given word is spread throughout an entire register (spoken, fiction, newspaper, or academic), but that it is still limited almost exclusively to that register. Should the word still be included in the frequency dictionary? The argument could be made that we should look at more than just raw frequency counts in cases such as this, and that we ought to look at "dispersion" as well, or how well the word is "spread across" all of the registers in the entire corpus.
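The paragraph above does not spell out a specific dispersion formula, but one widely used measure for this purpose is Juilland's D, which starts from a word's frequencies in n equal-sized corpus parts and penalizes words whose occurrences cluster in only a few of them. The sketch below is a minimal illustration of that measure (the sample counts are invented):

```python
import math
import statistics

def juilland_d(subfrequencies):
    """Juilland's D = 1 - (V / sqrt(n - 1)), where V is the coefficient of
    variation of a word's frequencies across n equal-sized corpus parts.
    Values near 1.0 mean the word is evenly dispersed; values near 0.0 mean
    it is concentrated in a few parts."""
    n = len(subfrequencies)
    mean = statistics.mean(subfrequencies)
    if mean == 0:
        return 0.0
    v = statistics.pstdev(subfrequencies) / mean
    return 1 - v / math.sqrt(n - 1)

# A word spread evenly across four registers vs. one confined to a single register
print(juilland_d([250, 240, 260, 250]))  # close to 1.0: well dispersed
print(juilland_d([5, 0, 10, 985]))       # close to 0.0: register-bound
```

A word with a high raw frequency but a low dispersion score is exactly the kind of register-bound item described above, and combining frequency with dispersion keeps such words from crowding out more generally useful vocabulary.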