In our dictionary, we have used Juilland's "D dispersion index". A score of 1.00 means that the word is perfectly spread across the corpus, so that if we divided the corpus into one hundred equally sized sections (each with 4 million words, in the case of our nearly 400 million word corpus), the word would have exactly the same frequency in each section. A dispersion score of .10, on the other hand, would mean that it occurs a lot in a handful of sections, and perhaps not at all or very little in the other sections.

Table 1 Contrast between frequency and dispersion

| Good dispersion: Frequency | Lemma | PoS | Dispersion | Poor dispersion: Frequency | Lemma | PoS | Dispersion |
|---|---|---|---|---|---|---|---|
| 3134 | convincing | J | 0.96 | 4653 | healthcare | n | 0.56 |
| 3107 | sensible | J | 0.95 | 4282 | electron | n | 0.58 |
| 3041 | honesty | n | 0.96 | 4181 | skier | n | 0.43 |
| 3033 | unusually | r | 0.95 | 4113 | compost | n | 0.31 |
| 3020 | confusing | J | 0.97 | 3685 | watercolor | n | 0.41 |
| 3014 | exaggerate | v | 0.96 | 3769 | ski | v | 0.47 |
| 2950 | distraction | n | 0.95 | 2028 | nebula | n | 0.46 |
| 2922 | resent | v | 0.96 | 2547 | palette | n | 0.57 |
| 2891 | wrestle | v | 0.95 | 2536 | angle | v | 0.55 |
| 2876 | urgency | n | 0.96 | 2479 | algorithm | n | 0.52 |
| 2873 | hint | v | 0.96 | 2437 | pastel | n | 0.25 |
| 2842 | obsessed | J | 0.95 | 2388 | socket | n | 0.60 |
| 2833 | genuinely | r | 0.96 | 2350 | nasal | J | 0.44 |
| 2813 | respected | J | 0.95 | 2281 | cache | n | 0.43 |

As a clear example of the contrast between "frequency" and "dispersion", consider Table 1. All of the words in this table have essentially the same frequency—an average of about 3,000 occurrences in the corpus. The words to the left, however, have a "dispersion" score of at least 0.95, which means that the word has roughly the same frequency in all of the 100 sections of the corpus that we used for the calculation. The words to the right, on the other hand, have a much lower dispersion score. Most would easily agree that the words shown at the left would be more useful in a frequency dictionary, because they represent a wide range of texts and text types in the corpus. Therefore, as we can see, frequency alone is probably not sufficient to determine whether a word should be in the dictionary.
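The text does not spell out how the dispersion index is computed. As a minimal sketch, the function below uses the commonly cited definition of Juilland's D, namely D = 1 - V / sqrt(n - 1), where V is the coefficient of variation (population standard deviation divided by the mean) of the word's frequency across n equal-sized sections. The function name and the sample counts are invented for illustration and are not taken from the dictionary's own software.

```python
import math

def juilland_d(section_counts):
    """Juilland's D for one word, given its frequency in each equal-sized section."""
    n = len(section_counts)
    if n < 2:
        raise ValueError("need at least two sections")
    mean = sum(section_counts) / n
    if mean == 0:
        raise ValueError("word does not occur in the corpus")
    # Population variance of the per-section counts.
    variance = sum((c - mean) ** 2 for c in section_counts) / n
    # Coefficient of variation; dividing by sqrt(n - 1) keeps D between 0 and 1.
    v = math.sqrt(variance) / mean
    return 1 - v / math.sqrt(n - 1)

even = [30] * 100            # same frequency in all 100 sections
clumped = [3000] + [0] * 99  # every occurrence in a single section

print(juilland_d(even))      # perfectly even spread gives 1.0
print(juilland_d(clumped))   # fully clumped spread gives a value at (or extremely close to) 0
```

Both sample words have the same total frequency (3,000), which mirrors the point of Table 1: frequency alone does not distinguish them, while the dispersion score does.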

The final calculation

The calculation to determine which words are included in this frequency dictionary was a fairly straightforward one. The formula was simply:

score = frequency * dispersion

For example, consider the words near rank 3210 in the frequency dictionary (see Table 2). The word furthermore has a higher frequency (9594 tokens) than the other two words, but it has lower dispersion (0.86). Orange, on the other hand, has a lower frequency (8881 tokens) but better dispersion across the corpus (0.93). Taxpayer (frequency of 9140 and dispersion of 0.90) falls between the two. Because the formula takes both frequency and dispersion into account, these three words end up with more or less the same score.

Table 2 Frequency and dispersion

| ID | Lemma | PoS | Frequency | Dispersion | Score |
|---|---|---|---|---|---|
| 3207 | orange | J | 8881 | 0.93 | 8270 |
| 3209 | taxpayer | n | 9140 | 0.90 | 8256 |
| 3213 | furthermore | r | 9594 | 0.86 | 8235 |

The 5,000 lemmas with the top score (frequency * dispersion) are those that appear in this frequency dictionary.
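The selection step can be sketched as follows, using the three rows of Table 2 as input. The function names are hypothetical. Note that the dispersion values printed in the table are rounded to two decimals, so the products computed here (for example, 8881 × 0.93 ≈ 8259) differ slightly from the printed Score column, which was presumably derived from unrounded dispersion values; the ordering of such close scores may differ as well.

```python
# (lemma, part of speech, frequency, dispersion) as printed in Table 2.
entries = [
    ("orange", "J", 8881, 0.93),
    ("taxpayer", "n", 9140, 0.90),
    ("furthermore", "r", 9594, 0.86),
]

def score(frequency, dispersion):
    """The ranking formula from the text: frequency * dispersion."""
    return frequency * dispersion

def top_lemmas(rows, limit):
    """Return the `limit` rows with the highest frequency * dispersion score."""
    return sorted(rows, key=lambda r: score(r[2], r[3]), reverse=True)[:limit]

for lemma, pos, freq, disp in top_lemmas(entries, limit=3):
    print(f"{lemma:12} {pos} {freq:5d} {disp:.2f} {score(freq, disp):8.1f}")
```

For the dictionary itself, the same ranking would be run over every candidate lemma with `limit=5000`.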

Collocates

A unique feature of this frequency dictionary is the listing of the top collocates (nearby words) for each of the 5,000 words in the frequency listing. These collocates provide important and useful insight into the meaning and use of the keyword. To find the collocates, we did the following. First, we decided which parts of speech to group together in order to rank the collocates and show the most frequent ones. In the case of verbs, we grouped noun collocates (subject: the evidence supports what she said, and object: this supports the claim), and all other collocates were grouped as miscellaneous (e.g. with, directly, difficult, and prepare for the verb deal). For nouns, we looked for adjectives (green grass), other nouns (fire station), and verbs (e.g. desire to succeed). For adjectives, we looked for nouns (fast car) and all other collocates were grouped as miscellaneous (completely exhausted, willing to stay, black and white). Finally, for adverbs and other parts of speech, we see collocates from all parts of speech listed together (sharply reduce, fewer than, except for).

To find the collocates for a given word, a computer program searched the entire 385-million-word corpus and looked at each context in which that word occurred. In all cases, the context (or "span") of words was four words to the left and four words to the right of the "node word". The overall frequency of the collocates in each of those contexts was then calculated, and the collocates were examined and rated by at least four native speakers.
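The windowing step described above can be sketched as follows. This is a simplified illustration that works on raw tokens; the actual program would have operated on lemmas with part-of-speech tags, and the function name and sample sentence are invented here.

```python
from collections import Counter

def collocate_counts(tokens, node, span=4):
    """Count words occurring within `span` tokens to the left or right of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        left = tokens[max(0, i - span):i]      # up to `span` words before the node
        right = tokens[i + 1:i + 1 + span]     # up to `span` words after the node
        counts.update(left + right)
    return counts

text = "do not break the rules and do not break her heart".split()
counts = collocate_counts(text, "break")
print(counts.most_common(3))
```

A word that falls inside the windows of two different occurrences of the node word is counted once per window, so the totals are co-occurrence counts rather than counts of distinct positions.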

Obviously, common words such as the, of, to, etc. were usually the most frequent collocates. To filter out these words, we set a Mutual Information (MI) threshold of about 2.5. The MI calculation took into account the overall frequency of each collocate, so that common words were usually eliminated from the list.
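The text does not give the exact MI formula that was used. As a rough sketch, the code below uses the standard pointwise mutual information, the base-2 logarithm of the observed co-occurrence frequency divided by the frequency expected by chance, and keeps only collocates at or above the threshold. Corpus tools differ in details such as whether the window size enters the expected count, so treat this as one plausible variant. All counts below are invented for illustration.

```python
import math

def mutual_information(pair_freq, node_freq, collocate_freq, corpus_size):
    """Pointwise MI: log2 of observed co-occurrence over chance co-occurrence."""
    expected = node_freq * collocate_freq / corpus_size
    return math.log2(pair_freq / expected)

def filter_collocates(candidates, node_freq, corpus_size, threshold=2.5):
    """Keep collocates with MI >= threshold, ranked by co-occurrence frequency.

    `candidates` maps collocate -> (pair_freq, collocate_freq).
    """
    kept = [
        (word, pair_freq)
        for word, (pair_freq, coll_freq) in candidates.items()
        if mutual_information(pair_freq, node_freq, coll_freq, corpus_size) >= threshold
    ]
    return sorted(kept, key=lambda item: item[1], reverse=True)

# Invented counts: "the" co-occurs often but is extremely common overall,
# while "silence" co-occurs less often but far more than chance predicts.
candidates = {
    "the": (5_000, 20_000_000),
    "silence": (900, 30_000),
}
print(filter_collocates(candidates, node_freq=60_000, corpus_size=385_000_000))
```

With these made-up numbers, the very common word falls below the 2.5 threshold despite its high raw co-occurrence count, which is the filtering effect the paragraph above describes.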

Using MI is sometimes more an art than a science. If the MI threshold is set too low, high-frequency "noise words" show up as collocates; if it is set too high, only highly idiomatic collocates are found. As an example, when the MI threshold is set high at 5.5, the most frequent collocates of break as a verb are deadlock, logjam, monotony, and stranglehold. These are quite idiomatic and do not show the "core meaning" of break very well. On the other hand, when the MI threshold is set very low at 1.0, the most frequent collocates are down, into, up, and off, which again do not provide a good sense of its meaning. When we set the MI threshold to 2.5, however, the most frequent collocates are heart, silence, rules, loose, leg, and barriers, which (for native speakers, at least) probably relate more closely to the core meaning and usage of break. Getting the MI threshold just right for each of the 5,000 headwords was a bit daunting, to say the least. We hope that the data found here agree with your intuitions of what these words mean and how they are used.