Relative Frequencies of English Phonemes
11.10.2012
I’ve roughly estimated the relative frequencies with which English phonemes are used.
I did so by correlating the Carnegie Mellon University Pronouncing Dictionary with Adam Kilgarriff’s unlemmatized frequency list for the British National Corpus. I used the former as a phonemic lexicon and the latter as a sample set, weighting the phonemes by the totals of the relative frequencies of each of the words they appear in.
My methodology was messy. CMU’s Pronouncing Dictionary conflated schwa with the near-open central vowel, and had several noticeable errors. The BNC list had multiple entries for the different parts of speech of a word, and formatting issues prevented me from using any words with accents or apostrophes, including common contractions (though the counts for these entries were so unreasonably low that Kilgarriff must have already split all of them up, except for the ones in the spoken ‘demog’ component). I believe my manual error checking on the top few hundred words (accounting for 47 million of the 81 million total word usage instances) helped significantly.
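For anyone who wants to repeat the weighting described above in code rather than a spreadsheet, here is a minimal sketch. It is not the original procedure. The tiny word list, the counts, and the function names parse_pronunciations and phoneme_frequencies are all made up for illustration, and it assumes CMUdict-style lines (a word followed by ARPAbet symbols).

```python
from collections import Counter

# Hypothetical miniature inputs standing in for the real files.
# CMUdict-style lines: WORD followed by ARPAbet symbols with stress digits.
CMUDICT_LINES = [
    "THE DH AH0",
    "OF AH1 V",
    "AND AH0 N D",
    "CAT K AE1 T",
]
# Word -> number of occurrences in the corpus (made-up numbers).
WORD_COUNTS = {"the": 60, "of": 30, "and": 25, "cat": 1, "naïve": 2}


def parse_pronunciations(lines):
    """Map lowercase word -> list of ARPAbet symbols (first pronunciation only)."""
    lexicon = {}
    for line in lines:
        word, *phones = line.split()
        if "(" in word:  # alternates look like WHAT(2); this sketch skips them
            continue
        lexicon.setdefault(word.lower(), phones)
    return lexicon


def phoneme_frequencies(lexicon, word_counts):
    """Weight each phoneme by the corpus count of every word it appears in."""
    totals = Counter()
    for word, count in word_counts.items():
        phones = lexicon.get(word)
        if phones is None:  # word not in the lexicon (e.g. accented): skipped
            continue
        for phone in phones:
            totals[phone] += count
    grand_total = sum(totals.values())
    return {phone: n / grand_total for phone, n in totals.items()}


freqs = phoneme_frequencies(parse_pronunciations(CMUDICT_LINES), WORD_COUNTS)
for phone, share in sorted(freqs.items(), key=lambda kv: -kv[1]):
    print(f"{phone}\t{share:.2%}")
```

Words missing from the lexicon are silently dropped here, which mirrors the problem with accented words and contractions mentioned above. A more careful version would at least report how much of the corpus was skipped.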
So here you have it:
ə / 11.49%
n / 7.11%
r / 6.94%
t / 6.91%
ɪ / 6.32%
s / 4.75%
d / 4.21%
l / 3.96%
i / 3.61%
k / 3.18%
ð / 2.95%
ɛ / 2.86%
m / 2.76%
z / 2.76%
p / 2.15%
æ / 2.10%
v / 2.01%
w / 1.95%
u / 1.93%
b / 1.80%
e / 1.79%
ʌ / 1.74%
f / 1.71%
aɪ / 1.50%
ɑ / 1.45%
h / 1.40%
o / 1.25%
ɒ / 1.18%
ŋ / 0.99%
ʃ / 0.97%
j / 0.81%
g / 0.80%
dʒ / 0.59%
tʃ / 0.56%
aʊ / 0.50%
ʊ / 0.43%
θ / 0.41%
ɔɪ / 0.10%
ʒ / 0.07%
Thanks so much. I understand your comments about “messiness”, but a thing doesn’t have to be perfect to be good.
Thanks! I’m delighted that this saved you some effort, was worth something to you, and also that you wondered something I had wondered in the first place!
This is a huge help for a project that I am working on. Does this take into account an American accent? I.e., how in certain settings T and D pronunciations are switched, or instances in which glottal stops are most commonly used, etc.? If not, do you have ideas for how I might take into account American pronunciation variation (assuming that I generate a “standardized” American pronunciation)?
As I recall, the data does use an American accent, because it is an American university’s phonological data (only the sample set of word usage is British).
As for d/t/ʔ, unfortunately I don’t remember the pronouncing dictionary taking into account the glottal stop and its (as far as I understand) unique situation of being poly-phonemic, that is, that both d and t when unstressed can become it. Therefore I have no information on it, and it is quite likely that more accurate examination would pull both t and d down on this chart and situate a ʔ fairly high up!
The links I’ve included to both the corpus and the dictionary appear to be still functioning and both downloads are free. Feel free to take my strategy and use your own approach/tweaks.
Glad this helped!
What were these noticeable errors?
What did you do with the “manual error checking”?
I don’t recall specifically. Something like “what” being written as if it were pronounced like “wet”, etc. My manual checking was just going through the first 100 most common words and correcting any such errors.
At least “what” seems alright:
WHAT W AH1 T
WHAT(2) HH W AH1 T
Does your coding table agree with this?
AA ɑː
AA0 ɑː
AA1 ɑː
AA2 ɑː
AE æ
AE0 æ
AE1 æ
AE2 æ
AH ə
AH0 ə
AH1 ʌ
AH2 ʌ
AO ɔː
AO0 ɔː
AO1 ɔː
AO2 ɔː
AW aʊ
AW0 aʊ
AW1 aʊ
AW2 aʊ
AY aɪ
AY0 aɪ
AY1 aɪ
AY2 aɪ
B b
CH ʧ
D d
DH ð
EH ɛ
EH0 ɛ
EH1 ɛ
EH2 ɛ
ER ɚ
ER0 ɝ
ER1 ɝ
ER2 ɝ
EY eɪ
EY0 eɪ
EY1 eɪ
EY2 eɪ
F f
G ɡ
HH h
IH ɪ
IH0 ɪ
IH1 ɪ
IH2 ɪ
IY iː
IY0 iː
IY1 iː
IY2 iː
JH ʤ
K k
L l
M m
N n
NG ŋ
OW oʊ
OW0 oʊ
OW1 oʊ
OW2 oʊ
OY ɔɪ
OY0 ɔɪ
OY1 ɔɪ
OY2 ɔɪ
P p
R r
S s
SH ʃ
T t
TH θ
UH ʊ
UH0 ʊ
UH1 ʊ
UH2 ʊ
UW uː
UW0 uː
UW1 uː
UW2 uː
V v
W w
Y j
Z z
ZH ʒ
1) I honestly can’t remember or find a record of how I handled ə. It’s possible that I changed all unstressed vowels to ə. Otherwise, I would have done as you have here, changed only AH0 to it. Obviously this could make a huge difference.
2) My copy of the CMU pronouncing dictionary had no vowels without a stress marker (primary, secondary, or none). That is, I only have XX0, XX1, XX2… I don’t have plain XX. Not sure what that would represent.
3) I wrote both oʊ and ɔ as o. I only get ɔ before r’s and don’t distinguish.
4) For ER I did not use ɚ or ɝ ; I used ər. I know it may not be the most accurate.
5) Just in terms of how I labeled stuff above, I didn’t include length (ː)
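To make the conventions in points 1 to 5 concrete, here is a rough sketch of a converter that follows them: only unstressed AH becomes ə, OW and AO both become o, ER becomes ə plus r, and no length marks are used. It is a reconstruction from this reply, not the original spreadsheet logic, so it will not necessarily reproduce every row of the table (it has no rule producing ɒ, for instance). The function name to_ipa is hypothetical.

```python
import re

# Vowels without stress digits. OW and AO are deliberately merged into "o",
# and length marks are left out, per the conventions described above.
VOWELS = {
    "AA": "ɑ", "AE": "æ", "AO": "o", "AW": "aʊ", "AY": "aɪ",
    "EH": "ɛ", "EY": "e", "IH": "ɪ", "IY": "i", "OW": "o",
    "OY": "ɔɪ", "UH": "ʊ", "UW": "u",
}
CONSONANTS = {
    "B": "b", "CH": "tʃ", "D": "d", "DH": "ð", "F": "f", "G": "g",
    "HH": "h", "JH": "dʒ", "K": "k", "L": "l", "M": "m", "N": "n",
    "NG": "ŋ", "P": "p", "R": "r", "S": "s", "SH": "ʃ", "T": "t",
    "TH": "θ", "V": "v", "W": "w", "Y": "j", "Z": "z", "ZH": "ʒ",
}


def to_ipa(symbol):
    """Return the list of IPA phonemes for one ARPAbet symbol such as 'AH0'."""
    match = re.fullmatch(r"([A-Z]+)([012])?", symbol)
    if match is None:
        raise ValueError(f"not an ARPAbet symbol: {symbol!r}")
    base, stress = match.groups()
    if base == "AH":
        # Assumption: only unstressed (or unmarked) AH is treated as schwa.
        return ["ə"] if stress in (None, "0") else ["ʌ"]
    if base == "ER":
        # ER is split into two phonemes, schwa plus r.
        return ["ə", "r"]
    if base in VOWELS:
        return [VOWELS[base]]
    if base in CONSONANTS:
        return [CONSONANTS[base]]
    raise ValueError(f"unknown ARPAbet symbol: {symbol!r}")


def transcribe(arpabet):
    """Convert a space-separated ARPAbet pronunciation into IPA phonemes."""
    return [ipa for symbol in arpabet.split() for ipa in to_ipa(symbol)]


print(transcribe("W AH1 T"))  # the WHAT entry quoted earlier in the thread
```

Switching to the other option from point 1, where every unstressed vowel becomes ə, would only need one extra check on the stress digit before the vowel lookup.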
What I was looking for was a frequency distribution for syllables, onsets, and rhymes respectively (using a syllabified CMU dataset: http://webdocs.cs.ualberta.ca/~kondrak/cmudict.html).
Weighted by frequency of the word’s occurrence in the written language, as mine is?
Yes.
Forgot to add it six years ago, but here it is: https://medium.com/wugs/high-frequency-syllables-in-english-ab75159618a0
Can you point me to an article I can reference for this material?
Pam,
I’m very sorry. I should have been more explicit that this is not an academic project. I never expected it would get this much attention online; I mostly did it out of my own curiosity! It’s only a “rough estimate” using the CMU pronouncing dictionary, Kilgarriff’s freq list, and the BNC per the above links, plus my “messy methodology”. If you want to cite my conclusions for science (I see you’re a speech therapist?), unfortunately, a more rigorous look will need to be taken. Let’s stay in touch here, though, in case either of us does come across something like that!
Best,
Doug
There is a bit of an oversight here depending on the purpose of constructing your list.
This oversight is that you’ve used a dictionary rather than a corpus to determine occurrence.
Here’s an example of what I mean:
If I’m using a dictionary, I’ll find that the word “the” occurs once, so “the” will contribute the sound ð only once. Meanwhile, if I use a spoken corpus, the word “the” will occur about 3.5% of the time, and will thus contribute the sound ð many times.
A better methodology is to assemble a large corpus, perhaps 10,000 words or more, of written or spoken dialogue, then use a phonetics library to convert that to IPA, then tally the occurrences of each phoneme.
Cheers!
James,
Thank you for your comment. I did actually use the British National Corpus for occurrence, for this very reason!
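The distinction in this exchange, counting each dictionary entry once versus weighting by how often the word occurs, can be seen in a toy example. The three words and their counts below are invented purely for illustration.

```python
from collections import Counter

# Made-up toy data: pronunciations and corpus counts are illustrative only.
LEXICON = {"the": ["ð", "ə"], "cat": ["k", "æ", "t"], "sat": ["s", "æ", "t"]}
CORPUS_COUNTS = {"the": 50, "cat": 3, "sat": 2}


def share(counter, phoneme):
    """Fraction of all counted phonemes that are the given phoneme."""
    return counter[phoneme] / sum(counter.values())


# Dictionary-only count: every word contributes its phonemes exactly once.
type_counts = Counter(p for phones in LEXICON.values() for p in phones)

# Corpus-weighted count: every word contributes once per occurrence.
token_counts = Counter()
for word, occurrences in CORPUS_COUNTS.items():
    for p in LEXICON[word]:
        token_counts[p] += occurrences

print(f"ð by dictionary entries: {share(type_counts, 'ð'):.1%}")
print(f"ð by corpus occurrences: {share(token_counts, 'ð'):.1%}")
```

With these invented numbers, ð looks rare when each entry counts once and common once “the” is weighted by its occurrences, which is why the post weights by corpus frequency.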
This is really useful for teachers and learners of English so thanks very much for posting it. There is another list, but it dates from around 1980 (see A. C. Gimson, ‘An Introduction to the Pronunciation of English’). Although the lists are broadly similar in terms of rank, there are a number of differences, some undoubtedly due to changes over the last 30 or 40 years. Also, I think Gimson’s list is for British pronunciation (I don’t have the original book, just the table). Main points:
1. The biggest difference is /r/ (3.51% vs. your 6.94%, an obvious difference between British and American).
2. The schwa used to be less frequent (10.74% according to Gimson), a difference due at least in part to change over time. Actually, I’m surprised the difference isn’t more, but maybe the schwa is less common in American English.
3. /d/ is ranked above /s/ on Gimson’s list. Not sure why.
4. Another notable difference is /ɪ/. Gimson has it at 8.33% compared to your 6.32%. Change over time?
5. /ʊ/ on your list (0.43%) is half of Gimson’s figure (0.86%). Again, not sure why.
6. /w/ is much more frequent according to Gimson (2.81% vs. your 1.95%).
I’m thinking you may well have come across Gimson’s data in the course of thinking about how to compile your own. If so, I’d be interested in any comments you might have.
Finally, I noticed that both lists omit /i/ (sometimes known as the Spanish ‘i’), as at the end of words such as ‘happy’. I’m wondering why…
Thank you for your detailed comment!
I hadn’t come across Gimson’s data, as far as I recall, no. And while I am interested, I don’t have any further insight into the differences in our results than you’ve already surmised.
I’m glad this post is of use to folks. Very little on this blog of mine can claim as much. That said, I did do this just for my own amusement; an academic level of accuracy was far down on my list of priorities.
This is a great project! I am surprised that the voiced interdental fricative is not higher up the list, as ‘the’ and the ‘th’ pronouns make up 10% of the top 100 high frequency words.
That’s word frequency, longer words have more phonemes each.
What is that calligraphy in the background??
That’s the design for my tattoo. It’s roman script, just sideways in a font I designed.
Hello, I am working on a research paper for my senior design class at UCF. I came across the English phonemes by commonness table and found it very useful, and would like to use a screenshot of it in my paper as a reference. Would this be okay? I will be citing appropriately.
No problem. Please just understand that my methodology was not particularly rigorous. I did this mostly to satisfy my own interest.
Hello! Thanks for this, I’ve found it really useful. I ended up needing to go a little bit further, so I ended up mostly-replicating and extending your work a bit. If you’re interested: https://github.com/prendradjaja/phoneme-frequencies
Wow! Thank you for the reference. I think your results are probably superior given your inclusion of ɝ, eɪ, and ɔ. Also, thanks for sharing your code – I took my attempt at this before I learned to code and did it all in spreadsheet software. I’m glad to see that someone else wanted to answer the same question! Ultimately I’d love to see how this varies across different English dialects, or even other languages, perhaps generalized by articulation methods and positions so that we could get a sort of heat map across all human speech! But I doubt I’d ever get to it myself 🙂
I’ve discovered that back in 1950 Rebecca E. Hayden published a similar study. This link was published in Dec. 2015, a few years after my original post: https://www.tandfonline.com/doi/pdf/10.1080/00437956.1950.11659381
For those of you looking for more academic approaches, her paper may fit your needs better than my informal one. Her results are fairly similar to mine.
I was recently writing on the dichotomy of the high frequency of the ‘z’ sound in the English language versus the relatively infrequent use of the actual letter. I.e., “s” between vowels is a z sound [ex. rise], as it is in most plurals [plurals, dogs] and in possessives [boy’s]. My guess is that your study did take into account words which incorporate the z sounds generated by s used within words [rise, his…] but maybe not plurals, which in word frequency analysis are often folded into the count for the non-pluralized form. However, since plurals are very common, this omission would significantly understate the occurrence of the z phoneme in the English language. Do you agree?
Paul,
That is an important observation. It has been too long since I did the work for me to recall confidently, but I do claim to have used the *un*lemmatized frequency list, that is, with plurals, possessives, and other variant forms of words separated. So I believe the representation of the ‘z’ sound should be as accurate as anything else here, however accurate that may be.
Thank you for your interest!
Doug
Hi there. This is fantastic! Can you clarify the difference between /e/ and /ɛ/ as reflected in this data set? Example words would be appreciated!
Thank you Rob! I’m glad you like it.
/ɛ/ is as in “bed”
and I’m pretty sure when I wrote /e/ I meant the long version /eɪ/ as in “day”.
So we can say that the “ɪ” sound is the most used one (besides the schwa, as a kind of zero-like sound), if you add in the count for the “aɪ” sound (analysed as “a” plus “ɪ”).
Thanks for the comment!
I understand your suggestion. You have a reasonable point; you could make the case for [ɪ] being most common.
I’m not sure if a professional linguist would agree; some differ on how they consider constituent phones of diphthongs, i.e. is the [ɪ] in [aɪ] the same [ɪ] as when it stands alone? Maybe, maybe not.
It also depends on what question you’re trying to answer. If you’re trying to answer “which vowel sound divided by word breaks or consonant clusters is most common” then it doesn’t matter that [ɪ] is contained in [aɪ].
You may also want to consider that [ɪ] often occurs in unstressed syllables too. It’s borderline “zero-like”, like the schwa.
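As a quick check of the arithmetic behind this suggestion, here is a small sketch using the percentages from the table in the post. It assumes those percentages are shares of a total of 100, and that aɪ is split into two segments, “a” plus “ɪ”, so that the overall segment total grows by the aɪ share. That split is only one possible analysis, as discussed above.

```python
# Percentages copied from the table in the post.
posted = {"ə": 11.49, "n": 7.11, "r": 6.94, "t": 6.91, "ɪ": 6.32, "aɪ": 1.50}

# Assumption: each aɪ counts as two segments ("a" + "ɪ"), so the total number
# of segments grows by the aɪ share and every percentage must be rescaled.
new_total = 100.0 + posted["aɪ"]

rescaled = {
    phoneme: value / new_total * 100
    for phoneme, value in posted.items()
    if phoneme not in ("ɪ", "aɪ")
}
merged_i = (posted["ɪ"] + posted["aɪ"]) / new_total * 100
rescaled["ɪ (incl. aɪ)"] = merged_i

for phoneme, value in sorted(rescaled.items(), key=lambda kv: -kv[1]):
    print(f"{phoneme}\t{value:.2f}%")
```

Under these assumptions the merged ɪ moves ahead of n, r, and t but stays well behind ə, which matches the suggestion that it would be the most common sound after the schwa.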
Amazing work here even after all these years. I’m wondering if you could explain why there are only 39 phonemes listed when English is generally regarded to have 42-44 (depending on pronunciation). Thanks!
Thanks for your comment Patrick! It has been a long time, so I don’t remember the details of my decision making. This was a super informal project of mine. Per the link above, the CMU Pronouncing Dictionary uses only 39 phonemes: http://www.speech.cs.cmu.edu/cgi-bin/cmudict and I probably didn’t look into the matter any further than that.
[…] Blumeyer, D. 2012. Relative frequencies of English Phonemes. Available online at https://cmloegcmluin.wordpress.com/2012/11/10/relative-frequencies-of-english-phonemes/ […]
You have done a great job. Thanks for making it available. Do you have any information on the commonest spellings for each phoneme?
Interesting question! Well, I just Googled and immediately found this, which looks reasonable enough: http://wp.auburn.edu/rdggenie/home/teaching-ideas/spcat/
This may have been a good thing for me to consider in preparation for this other project of mine: https://cmloegcmluin.wordpress.com/2011/12/01/my-g-a-phonemic-transcription/
[…] to this, the most common consonant sound in the British National Corpus is /n/, closely followed by /r/ […]
Can you run your software to produce the percentage frequencies of phoneme pairs and phoneme triples? I am designing a Shavian keyboard (QWERTY US/AU standard) to place the most frequently occurring single, double, and triple phonemes on the unshifted keys, attempting to alternate the components of doubles and triples on alternate hands. Shifted keys will be for infrequently encountered phonemes, punctuation, simple arithmetic characters, etc.
Regards,
Hedley
hed ley . fin ger AT g mail . c om
Sorry Hedley. This project was 10 years ago and so I don’t know where I’ve put the materials I used to work it out. Actually that was even before I became a software engineer so I probably just did it in some big spreadsheet. My process is described above though and it should be somewhat easily repeatable if you want to take it in a slightly different direction, which I encourage you to do. Good luck with your project. It sounds interesting!
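For anyone picking this up, here is a minimal sketch of how the same word-frequency weighting could be extended to phoneme pairs and triples. The lexicon, the counts, and the function name ngram_frequencies are hypothetical stand-ins for the real dictionary and corpus list, and the sketch only counts sequences inside a word.

```python
from collections import Counter

# Hypothetical toy inputs: word -> phonemes, and word -> corpus count.
LEXICON = {
    "the": ["ð", "ə"],
    "and": ["ə", "n", "d"],
    "hand": ["h", "æ", "n", "d"],
}
WORD_COUNTS = {"the": 60, "and": 25, "hand": 5}


def ngram_frequencies(lexicon, word_counts, n):
    """Relative frequency of each within-word run of n phonemes,
    weighted by how often the containing word occurs."""
    totals = Counter()
    for word, count in word_counts.items():
        phones = lexicon.get(word, [])
        for i in range(len(phones) - n + 1):
            totals[tuple(phones[i:i + n])] += count
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {gram: c / grand_total for gram, c in totals.items()}


pairs = ngram_frequencies(LEXICON, WORD_COUNTS, 2)
triples = ngram_frequencies(LEXICON, WORD_COUNTS, 3)

for gram, freq in sorted(pairs.items(), key=lambda kv: -kv[1]):
    print(" ".join(gram), f"{freq:.1%}")
```

Because a word frequency list carries no information about which words follow which, pairs that span a word boundary are not counted here. Capturing those would need running text rather than a frequency list.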