In our dictionary, we have used Juilland's "D dispersion index". A score of 1.00 means that the word is perfectly spread across the corpus, so that if we divided the corpus into one hundred equally sized sections (each with 4 million words, in the case of our nearly 400 million word corpus), the word would have exactly the same frequency in each section. A dispersion score of .10, on the other hand, would mean that it occurs a lot in a handful of sections, and perhaps not at all or very little in the other sections.

Table 1 Contrast between frequency and dispersion

Good dispersion                            Poor dispersion
Frequency  Lemma        PoS  Dispersion    Frequency  Lemma        PoS  Dispersion
3134       convincing   J    0.96          4653       healthcare   n    0.56
3107       sensible     J    0.95          4282       electron     n    0.58
3041       honesty      n    0.96          4181       skier        n    0.43
3033       unusually    r    0.95          4113       compost      n    0.31
3020       confusing    J    0.97          3685       watercolor   n    0.41
3014       exaggerate   v    0.96          3769       ski          v    0.47
2950       distraction  n    0.95          2028       nebula       n    0.46
2922       resent       v    0.96          2547       palette      n    0.57
2891       wrestle      v    0.95          2536       angle        v    0.55
2876       urgency      n    0.96          2479       algorithm    n    0.52
2873       hint         v    0.96          2437       pastel       n    0.25
2842       obsessed     J    0.95          2388       socket       n    0.60
2833       genuinely    r    0.96          2350       nasal        J    0.44
2813       respected    J    0.95          2281       cache        n    0.43
As a clear example of the contrast between "frequency" and "dispersion", consider Table 1. All of the words in this table have essentially the same frequency: an average of about 3,000 occurrences in the corpus. The words on the left, however, have a dispersion score of at least 0.95, which means that each word has roughly the same frequency in all 100 sections of the corpus that we used for the calculation. The words on the right, on the other hand, have a much lower dispersion score. Most users would readily agree that the words on the left would be more useful in a frequency dictionary, because they represent a wide range of texts and text types in the corpus. Frequency alone, therefore, is probably not sufficient to determine whether a word should be in the dictionary.
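To make the index concrete, here is a minimal sketch of Juilland's D in Python. It is a generic implementation of the standard formula, D = 1 - V / sqrt(n - 1), where V is the coefficient of variation of the word's frequencies across the n sections; it is not the dictionary's actual code.

```python
import math

def juilland_d(section_counts):
    """Juilland's D dispersion index for a word, given its frequency
    in each of n equally sized corpus sections.

    D = 1 - (V / sqrt(n - 1)), where V is the coefficient of
    variation (standard deviation / mean) of the section counts.
    """
    n = len(section_counts)
    mean = sum(section_counts) / n
    if mean == 0:
        return 0.0  # the word never occurs at all
    sd = math.sqrt(sum((c - mean) ** 2 for c in section_counts) / n)
    v = sd / mean
    return 1 - v / math.sqrt(n - 1)

# A word spread perfectly evenly across 100 sections scores 1.0 ...
print(juilland_d([30] * 100))            # -> 1.0
# ... while a word whose occurrences all fall in one section
# scores close to 0.0.
print(juilland_d([3000] + [0] * 99))
```

A word that is merely concentrated in a few sections, rather than confined to one, falls somewhere in between, which is how scores like the .10 mentioned above arise.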
The final calculation
The calculation to determine which words are included in this frequency dictionary was a fairly straightforward one. The formula was simply:
score = frequency * dispersion
For example, consider the words near rank 3210 in the frequency dictionary (see Table 2). The word furthermore has a higher frequency (9594 tokens) than the other two words, but it has lower dispersion (.86). Orange, on the other hand, has a lower frequency (8881 tokens) but better dispersion across the corpus. Taxpayer (frequency 9140, dispersion .90) falls between these two on both counts. With a formula that takes both frequency and dispersion into account, however, all three words end up with more or less the same score.
Table 2 Frequency and dispersion

ID    Lemma        PoS  Frequency  Dispersion  Score
3207  orange       J    8881       0.93        8270
3209  taxpayer     n    9140       0.90        8256
3213  furthermore  r    9594       0.86        8235
The 5,000 lemmas with the highest scores (frequency * dispersion) are the ones that appear in this frequency dictionary.
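The scoring and selection step can be sketched as follows, using the three words from Table 2 as hypothetical input. Note that the dispersion values printed in Table 2 are rounded, so scores recomputed from them differ slightly from the table's Score column, which was presumably computed from unrounded values.

```python
# Hypothetical entries: (lemma, PoS, raw frequency, Juilland's D),
# taken from Table 2 above.
lemmas = [
    ("orange",      "J", 8881, 0.93),
    ("taxpayer",    "n", 9140, 0.90),
    ("furthermore", "r", 9594, 0.86),
]

# Score each lemma as frequency * dispersion, then rank by score.
scored = sorted(
    ((lemma, pos, round(freq * disp)) for lemma, pos, freq, disp in lemmas),
    key=lambda entry: entry[2],
    reverse=True,
)

# The dictionary keeps the 5,000 highest-scoring lemmas.
top = scored[:5000]
for lemma, pos, score in top:
    print(lemma, pos, score)
```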
Collocates
A unique feature of this frequency dictionary is the listing of the top collocates (nearby words) for each of the 5,000 words in the frequency listing. These collocates provide important and useful insight into the meaning and use of the keyword. To find the collocates, we did the following. First, we decided which parts of speech to group together in order to rank the collocates and show the most frequent ones. In the case of verbs, we grouped noun collocates (subject: the evidence supports what she said; object: this supports the claim), and all other collocates were grouped as miscellaneous (e.g. with, directly, difficult, and prepare for the verb deal). For nouns, we looked for adjectives (green grass), other nouns (fire station), and verbs (e.g. desire to succeed). For adjectives, we looked for nouns (fast car), and all other collocates were grouped as miscellaneous (completely exhausted, willing to stay, black and white). Finally, for adverbs and other parts of speech, collocates from all parts of speech are listed together (sharply reduce, fewer than, except for).
To find the collocates for a given word, a computer program searched the entire 385-million-word corpus and looked at each context in which that word occurred. In all cases, the context (or "span") of words was four words to the left and four words to the right of the "node word". The overall frequency of the collocates in each of those contexts was then calculated, and the collocates were examined and rated by at least four native speakers.
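The windowing step just described can be sketched like this. It is a simplified version that assumes an already tokenized corpus and ignores part of speech; the actual program also tracked PoS and worked over the full 385-million-word corpus.

```python
from collections import Counter

def collocate_counts(tokens, node, span=4):
    """Count every token occurring within `span` words to the left
    or right of each occurrence of the node word."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            # Window: up to `span` tokens on each side, excluding the node itself.
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

tokens = "we had to deal with a difficult problem and deal with it directly".split()
print(collocate_counts(tokens, "deal").most_common(3))
```

Run over a whole corpus, these raw counts are what the next step (the Mutual Information filter) then weighs against each collocate's overall frequency.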
Obviously, common words such as the, of, to, etc. were usually the most frequent collocates. To filter out these words, we set a Mutual Information (MI) threshold of about 2.5. The MI calculation took into account the overall frequency of each collocate, so that common words were usually eliminated from the list.
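One common way to compute an MI score for a node-collocate pair is sketched below, using the standard corpus-linguistics formula; the dictionary's exact computation may differ, and all the counts in the example are hypothetical.

```python
import math

def mutual_information(co_freq, node_freq, coll_freq, corpus_size, span=8):
    """MI = log2( observed co-occurrences / expected co-occurrences ),
    where expected = node_freq * coll_freq * span / corpus_size.

    High-frequency function words get low MI because their expected
    co-occurrence with any node word is already huge."""
    expected = node_freq * coll_freq * span / corpus_size
    return math.log2(co_freq / expected)

CORPUS = 385_000_000  # corpus size from the text; counts below are invented

# "break" near "silence": a rare pair, but far more frequent than chance,
# so its MI lands well above a 2.5 threshold.
print(mutual_information(co_freq=420, node_freq=55_000,
                         coll_freq=9_000, corpus_size=CORPUS))

# "break" near "the": huge raw co-occurrence count, but barely more
# frequent than chance, so its MI falls below 2.5 and it is filtered out.
print(mutual_information(co_freq=60_000, node_freq=55_000,
                         coll_freq=22_000_000, corpus_size=CORPUS))
```

The `span=8` default reflects the four-words-left, four-words-right context described above.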
Using MI is sometimes more an art than a science. If the MI threshold is set too low, high-frequency "noise words" show up as collocates, whereas if it is set too high, only highly idiomatic collocates are found. As an example, the most frequent collocates of break as a verb (when the MI threshold is set high, at 5.5) are deadlock, logjam, monotony, and stranglehold. These are quite idiomatic and do not really illustrate the "core meaning" of break. On the other hand, the most frequent collocates when the MI threshold is set very low, at 1.0, are down, into, up, and off, which again do not provide a good sense of its meaning. When we set the MI threshold to 2.5, however, the most frequent collocates are heart, silence, rules, loose, leg, and barriers, which (for native speakers, at least) probably do relate more to the core meaning and usage of break. But getting the MI threshold just right for each of the 5,000 headwords was a bit daunting, to say the least. We hope that the data found here agree with your intuitions about what these words mean and how they are used.