The main frequency index

The main index in this dictionary is a rank-ordered listing of the top 5,000 words (lemmas) in English, starting with the most frequent word (the definite article the) and progressing through to parish, rejection, and mutter, which are the last three words in the list. The following information is given for each entry:

rank frequency (1, 2, 3,...), lemma, part of speech

collocates, grouped by part of speech and ordered by frequency

raw frequency, dispersion (0.00-1.00), (indication of register variation)

As a concrete example, let us look at the entry for the verb break:

501 break v

n ·law, heart, news, ·rule, silence, story, ·ground, ·barrier, leg, bone, ·piece, ·neck, arm, ·cycle, voice· misc ·into, ·away, ·free, ·apart, ·loose

up marriage, ·fight, boyfriend, meeting·, girlfriend, union, band, pass, ·demonstration, ·monotony

down ·into, ·barrier, car·, ·cry, ·door, ·tear, talk·, enzyme·, completely, negotiation

out war·, fight, fire·, sweat, fighting·, riot·, violence·, ·laugh, ·hive

off piece, talk, ·engagement, negotiation, branch, abruptly, ·relation

72917 | 0.97

This entry shows that word number 501 in our rank order list is the verb break. The last line of the entry shows the raw frequency for the lemma (72,917 tokens) and the dispersion (0.97 in this case). The collocates are given in the intervening lines. As can be seen, they are partially grouped by part of speech. In the case of verbs, we see the noun collocates and then other parts of speech (miscellaneous).
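For readers who want to process entries programmatically, the first and last lines of an entry can be read with a small parser. The following is only a sketch: the names EntryHeader, EntryStats, parse_header, and parse_stats are hypothetical, and the assumed line layout (rank, lemma, part of speech; then raw frequency, a vertical bar, dispersion, and an optional register letter) is inferred from the printed examples, not from any published file format.

```python
import re
from typing import NamedTuple, Optional


class EntryHeader(NamedTuple):
    rank: int
    lemma: str
    pos: str


class EntryStats(NamedTuple):
    raw_frequency: int
    dispersion: float
    register: Optional[str]  # e.g. "A"; None when no register marker is printed


# Assumed layouts, inferred from the printed examples:
#   header line: "501 break v"
#   stats line:  "72917 | 0.97" or "9282 | 0.82 A"
HEADER_RE = re.compile(r"^(\d+)\s+(\S+)\s+(\S+)$")
STATS_RE = re.compile(r"^(\d+)\s*\|\s*(\d?\.\d+)(?:\s+(\S+))?$")


def parse_header(line: str) -> EntryHeader:
    """Split a header line into rank, lemma, and part-of-speech code."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a header line: {line!r}")
    rank, lemma, pos = match.groups()
    return EntryHeader(int(rank), lemma, pos)


def parse_stats(line: str) -> EntryStats:
    """Split the last line of an entry into frequency, dispersion, register."""
    match = STATS_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a stats line: {line!r}")
    frequency, dispersion, register = match.groups()
    return EntryStats(int(frequency), float(dispersion), register)
```

Applied to the entry above, parse_header("501 break v") yields rank 501, lemma break, and part of speech v, while parse_stats("72917 | 0.97") yields a raw frequency of 72917, a dispersion of 0.97, and no register marker.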

Note also that for some collocates, there is an indication of the placement of the collocate. When the [ · ] is before the collocate, this means that the node word (headword) is typically found before that collocate (break the law, break into pieces). When the [ · ] is after the collocate, this means that the node word is typically found after the collocate (her voice broke, all hell broke loose). This symbol can provide useful information, for example, on whether the collocates are subjects or objects of a given verb, or whether the node word noun acts as a subject or object of the verbal collocate. (Note, however, that with passives and relative clauses, the noun that is object of a verb will occur before the verb, which does confuse things a bit.) In order to display the [ · ] symbol, 80 percent or more of the tokens of a given collocate had to occur either before or after the node word. In the case of ADJ / NOUN and NOUN / ADJ, word order is typically so consistent (blue house, never *house blue) that the [ · ] is not used to show placement.
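The 80 percent rule can be sketched in a few lines of code. This is an illustration under stated assumptions, not the procedure actually used to compile the dictionary: the function name mark_collocate and its parameters are hypothetical, the counts in the usage note are invented, and the ADJ / NOUN exception is not handled.

```python
PLACEMENT_THRESHOLD_PERCENT = 80  # from the text: 80 percent or more of tokens


def mark_collocate(collocate: str, before_node: int, after_node: int) -> str:
    """Attach the [ · ] symbol to a collocate when its placement is consistent.

    before_node: tokens in which the collocate occurs before the node word
    after_node:  tokens in which the collocate occurs after the node word
    """
    total = before_node + after_node
    if total == 0:
        return collocate
    # Integer arithmetic avoids floating-point surprises at exactly 80 percent.
    if 100 * after_node >= PLACEMENT_THRESHOLD_PERCENT * total:
        # The node word comes first, so the symbol precedes the collocate.
        return "·" + collocate
    if 100 * before_node >= PLACEMENT_THRESHOLD_PERCENT * total:
        # The node word comes second, so the symbol follows the collocate.
        return collocate + "·"
    return collocate
```

With invented counts, a collocate such as law that follows break in 95 of 100 tokens would be printed as ·law, one such as voice that precedes it in 90 of 100 tokens would be printed as voice·, and a 40/60 split would leave the collocate unmarked.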

Finally, as is seen above, in the case of some verbs that can act as phrasal verbs (break up, turn down, cut off, etc.), these are listed in bold (with their own collocates) at the end of the regular collocates list for verbs. Phrasal verbs are only listed when they have a frequency of at least 1,000 in the corpus, and when there are at least three collocates with a frequency of at least five occurrences each.
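The two inclusion criteria for phrasal verbs can be expressed as a simple check. The sketch below assumes that the frequency of the phrasal verb and the frequencies of its collocates are already available as counts; the function name and parameters are hypothetical.

```python
from typing import Iterable

MIN_PHRASAL_FREQUENCY = 1000   # minimum corpus frequency of the phrasal verb
MIN_COLLOCATE_FREQUENCY = 5    # minimum occurrences for a collocate to count
MIN_QUALIFYING_COLLOCATES = 3  # number of such collocates required


def is_listed_phrasal_verb(frequency: int,
                           collocate_frequencies: Iterable[int]) -> bool:
    """Return True when a phrasal verb meets both thresholds from the text."""
    qualifying = sum(
        1 for count in collocate_frequencies if count >= MIN_COLLOCATE_FREQUENCY
    )
    return (frequency >= MIN_PHRASAL_FREQUENCY
            and qualifying >= MIN_QUALIFYING_COLLOCATES)
```

Both conditions must hold: a phrasal verb with 5,000 tokens but only two collocates that reach five occurrences would not be listed, and neither would one with many strong collocates but fewer than 1,000 tokens.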

Let us consider one other example:

3404 hypothesis n

j null, following, consistent, alternative, working, general, initial, original, theoretical, competing

n study, support·, result, test, research, testing, evidence, analysis, method, set

v ·predict, suggest, reject, examine, confirm, base, develop, formulate, ·state, ·explain

9282 | 0.82 A

This entry is for hypothesis (word #3404 in our list). As before, the collocates are listed in frequency order and grouped by part of speech. In this case, however, note that there is an [ A ] at the end of the entry. This indicates that the lemma hypothesis occurs at least twice as frequently in the Academic genre as it does overall in the corpus (Spoken, Fiction, Magazines, Newspapers).
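The register test can be sketched as a comparison of relative frequencies. This is an assumption about how "twice as frequently" is measured: the text does not say how the counts are normalized, so the sketch compares tokens per word of text in the genre with tokens per word of text in the whole corpus. The function name, the parameters, and all figures in the usage note other than 9,282 are hypothetical.

```python
def is_register_marked(genre_count: int, genre_size: int,
                       total_count: int, total_size: int,
                       ratio: int = 2) -> bool:
    """Return True when the lemma's rate in one genre is at least `ratio`
    times its rate in the corpus as a whole.

    genre_count / genre_size  is the lemma's rate in the genre
    total_count / total_size  is the lemma's rate overall
    The comparison is cross-multiplied so that only integers are involved.
    """
    if genre_size <= 0 or total_size <= 0:
        raise ValueError("corpus sizes must be positive")
    return genre_count * total_size >= ratio * total_count * genre_size
```

For instance, if an invented 6,000 of the 9,282 tokens of hypothesis fell in an academic section of 80 million words within a corpus of 400 million words, the academic rate would be about 3.2 times the overall rate and the entry would carry the [ A ] marker.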

Thematic vocabulary ("call-out boxes")

Placed throughout the main frequency-based index are 31 "call-out boxes", which serve to display in one list a number of thematically related words. These include thematic lists of words related to the body, food, family, weather, professions, nationalities, colors, emotions, and several other semantic domains. There are also lists of words that are much more common in each of the five main genres (spoken, fiction, popular magazines, newspapers, and academic) than overall, comparisons of American and British vocabulary, and lists of new words in the language. Finally, there are lists related to word formation issues, such as irregular past tense forms and irregular plurals, and common suffixes used to create nouns, adjectives, and verbs. In each case, the entries are, of course, ordered by frequency.

Alphabetical and part of speech indexes

The alphabetical index contains all of the words listed in the frequency index. Each entry includes the following information: 1) lemma, 2) part of speech, and 3) rank order frequency. The part of speech index contains the 5,000 words from the frequency index and the alphabetical index. Within each of the categories (noun, verb, adjective, etc.) the lemmas are listed in order of descending frequency. Because each entry is linked to the other two indexes via the rank frequency number, each of the entries in this index contains only the rank frequency and lemma.
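The relationship among the three indexes can be sketched as follows. The four sample entries use ranks and parts of speech quoted elsewhere in this introduction; the type Entry and the two functions are hypothetical names, and sorting by lowercased lemma is an assumption about alphabetical order.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Entry(NamedTuple):
    rank: int
    lemma: str
    pos: str


# Sample rows from the frequency index (ranks as quoted in this introduction).
FREQUENCY_INDEX = [
    Entry(501, "break", "v"),
    Entry(1605, "lead", "n"),
    Entry(3404, "hypothesis", "n"),
    Entry(4147, "bow", "n"),
]


def build_alphabetical_index(entries: List[Entry]) -> List[Tuple[str, str, int]]:
    """Lemma, part of speech, and rank, sorted alphabetically by lemma."""
    ordered = sorted(entries, key=lambda e: (e.lemma.lower(), e.rank))
    return [(e.lemma, e.pos, e.rank) for e in ordered]


def build_pos_index(entries: List[Entry]) -> Dict[str, List[Tuple[int, str]]]:
    """Rank and lemma only, grouped by part of speech.

    Ascending rank is the same as descending frequency.
    """
    grouped: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
    for e in sorted(entries, key=lambda e: e.rank):
        grouped[e.pos].append((e.rank, e.lemma))
    return dict(grouped)
```

Because the rank number appears in every index, a reader who finds (3404, hypothesis) under nouns can go straight to entry 3404 in the main index for the collocates and frequency information.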

Electronic version

As was noted in the first section, if you find this dictionary valuable and would like to have a similar electronic version (somewhat fewer collocates, but more of other features), feel free to visit http://www.americancorpus.org/dictionary.

Delimitations and Notes

1 Frequency is form-based (lemma), not semantically based (homographs: bank, run; heterophones: lead "metal" vs. lead "be in front", contract vs. contract, etc.). But our approach is an improvement over many similar frequency listings because the collocates give some indication of potential variant meanings. For example, take a look at the entries for lead (n) [entry 1605] and bow (n) [entry 4147]. For lead, there are collocates for the two meanings "metal" and "in front", and for bow there are collocates for bow in the context of "ship, arrow, hair, and violin".

2 Except in the case of high-frequency phrasal verbs, only single-word nodes were included. When a lemma occurs almost exclusively in a given multi-word expression (as far as, in charge of, lots of), that multi-word expression is listed as part of the entry.