Relative Frequencies of English Phonemes

11.10.2012 § 43 Comments

I’ve roughly estimated the relative frequencies with which English phonemes are used.

I did so by correlating the Carnegie Mellon University Pronouncing Dictionary with Adam Kilgarriff’s unlemmatized frequency list for the British National Corpus. I used the former as a phonemic lexicon and the latter as a sample set, weighting the phonemes by the totals of the relative frequencies of each of the words they appear in.
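
The weighting described above can be sketched in Python as follows. This is only an illustration, not the original working: the function name `phoneme_frequencies`, the toy lexicon, and the toy word counts are all invented, and real input would come from the CMU dictionary and the BNC list. It assumes that a phoneme occurring twice in a word is counted twice, and that words missing from the lexicon are simply skipped.

```python
from collections import Counter

def phoneme_frequencies(pronunciations, word_counts):
    """Return each phoneme's share of all phoneme tokens, weighted by word frequency."""
    totals = Counter()
    for word, count in word_counts.items():
        phonemes = pronunciations.get(word)
        if phonemes is None:
            continue  # assumption: words absent from the lexicon are skipped
        for phoneme in phonemes:
            totals[phoneme] += count  # every occurrence carries the word's count
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {p: n / grand_total for p, n in totals.items()}

# Toy data, invented for illustration (not taken from CMU or the BNC).
pronunciations = {
    "the": ["ð", "ə"],
    "cat": ["k", "æ", "t"],
    "sat": ["s", "æ", "t"],
}
word_counts = {"the": 6, "cat": 2, "sat": 1, "zebra": 1}

freqs = phoneme_frequencies(pronunciations, word_counts)
for phoneme, share in sorted(freqs.items(), key=lambda kv: -kv[1]):
    print(f"{phoneme} / {share:.2%}")
```

With the toy data, "the" contributes 6 each to ð and ə even though it is a single dictionary entry, which is the whole point of weighting by corpus frequency.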

My methodology was messy. CMU’s Pronouncing Dictionary conflated schwa with the near-open central vowel, and had several noticeable errors. The BNC list had multiple entries for the different parts of speech of a word, and formatting issues prevented me from using any words with accents or apostrophes, including common contractions (though the counts for these entries were so unreasonably low that Kilgarriff must have already split all of them up, except for the ones in the spoken ‘demog’ component). I believe my manual error checking on the top few hundred words (accounting for 47 million of the 81 million total word usage instances) helped significantly.

So here you have it:

ə / 11.49%
n / 7.11%
r / 6.94%
t / 6.91%
ɪ / 6.32%
s / 4.75%
d / 4.21%
l / 3.96%
i / 3.61%
k / 3.18%
ð / 2.95%
ɛ / 2.86%
m / 2.76%
z / 2.76%
p / 2.15%
æ / 2.10%
v / 2.01%
w / 1.95%
u / 1.93%
b / 1.80%
e / 1.79%
ʌ / 1.74%
f / 1.71%
aɪ / 1.50%
ɑ / 1.45%
h / 1.40%
o / 1.25%
ɒ / 1.18%
ŋ / 0.99%
ʃ / 0.97%
j / 0.81%
g / 0.80%
dʒ / 0.59%
tʃ / 0.56%
aʊ / 0.50%
ʊ / 0.43%
θ / 0.41%
ɔɪ / 0.10%
ʒ / 0.07%

§ 43 Responses to Relative Frequencies of English Phonemes

Chris says:

11.14.2013 at 12:11

Thanks so much. I understand your comments about “messiness”, but a thing doesn’t have to be perfect to be good.

- CMLOEGCMLUIN says:
  
  11.14.2013 at 18:56
  
  Thanks! I’m delighted that this saved you some effort, was worth something to you, and also that you wondered something I had wondered in the first place!
  
Kevin says:

01.26.2014 at 12:33

This is a huge help for a project that I am working on. Does this take into account an American accent, i.e. how in certain settings T and D pronunciations are switched, or instances in which glottal stops are most commonly used, etc.? If not, do you have ideas for how I might take into account American pronunciation variation (assuming that I generate a “standardized” American pronunciation)?

- CMLOEGCMLUIN says:
  
  01.29.2014 at 08:21
  
  As I recall, the data does use an American accent, because it is an American university’s phonological data (only the sample set of word usage is British).
  
  As for d/t/ʔ, unfortunately I don’t remember the pronouncing dictionary taking into account the glottal stop and its (as far as I understand) unique situation of being poly-phonemic, that is, that both d and t when unstressed can become it. Therefore I have no information on it, and it is quite likely that more accurate examination would pull both t and d down on this chart and situate a ʔ fairly high up!
  
  The links I’ve included to both the corpus and the dictionary appear to be still functioning and both downloads are free. Feel free to take my strategy and use your own approach/tweaks.
  
  Glad this helped!
  
David Rosson says:

07.14.2014 at 02:36

What were these noticeable errors?
What did you do with the “manual error checking”?

- CMLOEGCMLUIN says:
  
  07.14.2014 at 08:51
  
  I don’t recall specifically. Something like “what” being written as if it were pronounced like “wet”, etc. My manual checking was just going through the first 100 most common words and correcting any such errors.
  
  - David Rosson says:
    
    07.15.2014 at 17:07
    
    At least “what” seems alright:
    WHAT W AH1 T
    WHAT(2) HH W AH1 T
    
    Does your coding table agree with this?
    AA ɑː
    AA0 ɑː
    AA1 ɑː
    AA2 ɑː
    AE æ
    AE0 æ
    AE1 æ
    AE2 æ
    AH ə
    AH0 ə
    AH1 ʌ
    AH2 ʌ
    AO ɔː
    AO0 ɔː
    AO1 ɔː
    AO2 ɔː
    AW aʊ
    AW0 aʊ
    AW1 aʊ
    AW2 aʊ
    AY aɪ
    AY0 aɪ
    AY1 aɪ
    AY2 aɪ
    B b
    CH ʧ
    D d
    DH ð
    EH ɛ
    EH0 ɛ
    EH1 ɛ
    EH2 ɛ
    ER ɚ
    ER0 ɝ
    ER1 ɝ
    ER2 ɝ
    EY eɪ
    EY0 eɪ
    EY1 eɪ
    EY2 eɪ
    F f
    G ɡ
    HH h
    IH ɪ
    IH0 ɪ
    IH1 ɪ
    IH2 ɪ
    IY iː
    IY0 iː
    IY1 iː
    IY2 iː
    JH ʤ
    K k
    L l
    M m
    N n
    NG ŋ
    OW oʊ
    OW0 oʊ
    OW1 oʊ
    OW2 oʊ
    OY ɔɪ
    OY0 ɔɪ
    OY1 ɔɪ
    OY2 ɔɪ
    P p
    R r
    S s
    SH ʃ
    T t
    TH θ
    UH ʊ
    UH0 ʊ
    UH1 ʊ
    UH2 ʊ
    UW uː
    UW0 uː
    UW1 uː
    UW2 uː
    V v
    W w
    Y j
    Z z
    ZH ʒ
  - CMLOEGCMLUIN says:
    
    07.16.2014 at 08:31
    
    1) I honestly can’t remember or find a record of how I handled ə. It’s possible that I changed all unstressed vowels to ə. Otherwise, I would have done as you have here, changed only AH0 to it. Obviously this could make a huge difference.
    2) My copy of the CMU pronunciation dictionary did not have vowels without a stress marker. Every vowel is marked for primary, secondary, or no stress. That is, I only have XX0, XX1, XX2… I don’t have plain XX. Not sure what that would represent.
    3) I wrote both oʊ and ɔ as o. I only get ɔ before r’s and don’t distinguish.
    4) For ER I did not use ɚ or ɝ ; I used ər. I know it may not be the most accurate.
    5) Just in terms of how I labeled stuff above, I didn’t include length (ː)
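
The conventions in the reply above can be sketched as a small ARPAbet-to-IPA mapping in Python. This is a guess at a consistent reading, not the original working: it assumes only AH0 becomes ə (the reply leaves open whether all unstressed vowels did), that ER is counted as ə plus r, that OW and AO both become o, and that EY is written e (as a later reply suggests). The names `ARPABET_TO_IPA`, `to_ipa`, and `transcribe` are invented. The post's table also has an ɒ row that the reply does not explain, so this sketch makes no attempt to produce it.

```python
# Base ARPAbet symbols (stress digit removed) mapped to the post's notation.
# No length marks, per point 5; AO and OW both written "o", per point 3.
ARPABET_TO_IPA = {
    "AA": "ɑ", "AE": "æ", "AO": "o", "AW": "aʊ", "AY": "aɪ",
    "EH": "ɛ", "EY": "e", "IH": "ɪ", "IY": "i", "OW": "o", "OY": "ɔɪ",
    "UH": "ʊ", "UW": "u",
    "B": "b", "CH": "tʃ", "D": "d", "DH": "ð", "F": "f", "G": "g",
    "HH": "h", "JH": "dʒ", "K": "k", "L": "l", "M": "m", "N": "n",
    "NG": "ŋ", "P": "p", "R": "r", "S": "s", "SH": "ʃ", "T": "t",
    "TH": "θ", "V": "v", "W": "w", "Y": "j", "Z": "z", "ZH": "ʒ",
}

def to_ipa(symbol):
    """Convert one ARPAbet symbol (e.g. 'AH0') to a list of phonemes."""
    base = symbol.rstrip("012")      # strip the stress digit, if any
    stress = symbol[len(base):]
    if base == "AH":
        # Assumption: only unstressed AH is schwa; stressed AH is ʌ.
        return ["ə"] if stress == "0" else ["ʌ"]
    if base == "ER":
        # Assumption: ER is counted as two phonemes, ə and r (point 4).
        return ["ə", "r"]
    return [ARPABET_TO_IPA[base]]

def transcribe(entry):
    """Convert a space-separated ARPAbet pronunciation to a phoneme list."""
    phonemes = []
    for symbol in entry.split():
        phonemes.extend(to_ipa(symbol))
    return phonemes

print(transcribe("W AH1 T"))  # the WHAT entry quoted earlier in this thread
```

Switching the AH rule to map every stress-0 vowel to ə would be a one-line change, and as point 1 notes, it could shift the schwa total considerably.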
David Rosson says:

07.14.2014 at 02:41

What I was looking for was a frequency distribution for syllables, onsets, and rhymes respectively (using a syllabified CMU dataset: http://webdocs.cs.ualberta.ca/~kondrak/cmudict.html).

- CMLOEGCMLUIN says:
  
  07.14.2014 at 08:51
  
  weighted by frequency of the word’s occurrence in the written language, as mine is?
  
  - David Rosson says:
    
    07.15.2014 at 17:04
    
    Yes.
  - David Rosson says:
    
    04.25.2020 at 11:09
    
    Forgot to add it six years ago, but here it is: https://medium.com/wugs/high-frequency-syllables-in-english-ab75159618a0
Pam says:

10.10.2014 at 18:55

Can you point me to an article I can reference for this material?

- CMLOEGCMLUIN says:
  
  10.12.2014 at 10:20
  
  Pam,
  
  I’m very sorry — I should have been more explicit that this is not an academic project. I never expected it would get this much attention online — I mostly did it out of my own curiosity! It’s only a “rough estimate” using the CMU pronouncing dictionary, Kilgarriff’s freq list, and the BNC per the above links — plus my “messy methodology”. If you want to cite my conclusions for science (I see you’re a speech therapist?), unfortunately, a more rigorous look will need to be taken. Let’s stay in touch here, though, in case either of us do come across something like that!
  
  Best,
  Doug
  
James says:

04.14.2015 at 21:11

There is a bit of an oversight here depending on the purpose of constructing your list.

This oversight is that you’ve used a dictionary rather than a corpus to determine occurrence.

Here’s an example of what I mean:
If I’m using a dictionary, I’ll find that the word “the” occurs once, so “the” will contribute the sound ð only once. Meanwhile, if I use a spoken corpus, the word “the” will occur about 3.5% of the time, and will thus contribute the sound ð many times.

A better methodology is to assemble a large corpus, perhaps 10,000 words or more, of written or spoken dialogue, then use a phonetics library to convert that to IPA, then tally the occurrence of each phoneme.

Cheers!

- CMLOEGCMLUIN says:
  
  04.14.2015 at 21:32
  
  James,
  Thank you for your comment. I did actually use the British National Corpus for occurrence, for this very reason!
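
A toy Python sketch of why the two approaches agree: tallying phonemes over running text gives the same totals as weighting each word's phonemes by its count in a frequency list, while a dictionary-only count does not. The lexicon and the sample text here are invented for illustration.

```python
from collections import Counter

# Invented toy lexicon and text, not real CMU or BNC data.
lexicon = {"the": ["ð", "ə"], "cat": ["k", "æ", "t"], "sat": ["s", "æ", "t"]}
text = "the cat sat the cat"

# Running-text tally: walk the corpus word by word.
tally_text = Counter()
for word in text.split():
    tally_text.update(lexicon[word])

# Frequency-list tally: count words first, then weight each word's phonemes.
word_counts = Counter(text.split())
tally_list = Counter()
for word, count in word_counts.items():
    for phoneme in lexicon[word]:
        tally_list[phoneme] += count

# Dictionary-only tally: every word counts once, however often it is used.
tally_types = Counter()
for phonemes in lexicon.values():
    tally_types.update(phonemes)

print(tally_text == tally_list)            # the two corpus-based tallies match
print(tally_text["ð"], tally_types["ð"])   # corpus count vs dictionary count
```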
  
pacifictreeinternational says:

04.20.2016 at 18:56

This is really useful for teachers and learners of English so thanks very much for posting it. There is another list, but it dates from around 1980 (see A. C. Gimson, ‘An Introduction to the Pronunciation of English’). Although the lists are broadly similar in terms of rank, there are a number of differences, some undoubtedly due to changes over the last 30 or 40 years. Also, I think Gimson’s list is for British pronunciation (I don’t have the original book, just the table). Main points:

1. The biggest difference is /r/ (3.51% vs. your 6.94%, an obvious difference between British and American).
2. The schwa used to be less frequent (10.74% according to Gimson), a difference due at least in part to change over time. Actually, I’m surprised the difference isn’t more, but maybe the schwa is less common in American English.
3. /d/ is ranked above /s/ on Gimson’s list. Not sure why.
4. Another notable difference is /ɪ/. Gimson has it at 8.33% compared to your 6.32%. Change over time?
5. /ʊ/ on your list (0.43%) is half that of Gimson’s (0.86%). Again, not sure why.
6. /w/ is much more frequent according to Gimson (2.81% vs. your 1.95%).

I’m thinking you may well have come across Gimson’s data in the course of thinking about how to compile your own. If so, I’d be interested in any comments you might have.

Finally, I noticed that both lists omit /i/ (sometimes known as the Spanish ‘i’), as at the end of words such as ‘happy’. I’m wondering why…

- CMLOEGCMLUIN says:
  
  04.20.2016 at 20:48
  
  Thank you for your detailed comment!
  
  I hadn’t come across Gimson’s data, as far as I recall, no. And while I am interested, I don’t have any further insight into the differences in our results than you’ve already surmised.
  
  I’m glad this post is of use to folks. Very little on this blog of mine can claim as much. That said, I did do this just for my own amusement; an academic level of accuracy was far down on my list of priorities.
  
Alison says:

05.03.2016 at 16:28

This is a great project! I am surprised that the voiced interdental fricative is not higher up the list, as ‘the’ and the ‘th’ pronouns make up 10% of the top 100 high frequency words.

- David Rosson says:
  
  05.03.2016 at 18:18
  
  That’s word frequency; longer words have more phonemes each.
  
Danny Cho says:

03.09.2017 at 11:53

What is that calligraphy in the background??

- CMLOEGCMLUIN says:
  
  03.10.2017 at 22:24
  
  That’s the design for my tattoo. It’s roman script, just sideways in a font I designed.
  
Meychele Chesley says:

12.02.2017 at 13:38

Hello, I am working on a research paper for my senior design class at UCF. I came across the English phonemes by commonness table and found it very useful, and would like to use a screenshot of it in my paper as a reference. Would this be okay? I will be citing appropriately.

- CMLOEGCMLUIN says:
  
  12.02.2017 at 16:17
  
  No problem. Please just understand that my methodology was not particularly rigorous. I did this mostly to satisfy my own interest.
  
Pandu Rendradjaja says:

05.29.2018 at 00:05

Hello! Thanks for this, I’ve found it really useful. I ended up needing to go a little bit further, so I ended up mostly-replicating and extending your work a bit. If you’re interested: https://github.com/prendradjaja/phoneme-frequencies

- CMLOEGCMLUIN says:
  
  06.01.2018 at 18:16
  
  Wow! Thank you for the reference. I think your results are probably superior given your inclusion of ɝ, eɪ, and ɔ. Also, thanks for sharing your code – I took my attempt at this before I learned to code and did it all in spreadsheet software. I’m glad to see that someone else wanted to answer the same question! Ultimately I’d love to see how this varies across different English dialects, or even other languages, perhaps generalized by articulation methods and positions so that we could get a sort of heat map across all human speech! But I doubt I’d ever get to it myself 🙂
  
CMLOEGCMLUIN says:

08.13.2018 at 14:18

I’ve discovered that back in 1950 Rebecca E. Hayden published a similar study. This link was published in Dec. 2015, a few years after my original post: https://www.tandfonline.com/doi/pdf/10.1080/00437956.1950.11659381

For those of you looking for more academic approaches, her paper may fit your needs better than my informal one. Her results are fairly similar to mine.

Paul Ledak says:

03.01.2019 at 19:25

I was recently writing on the dichotomy of the high frequency of the ‘z’ sound in the English language versus the relatively infrequent use of the actual letter, i.e. the use of “s” between vowels is a z sound [ex. rise], as well as in most plurals [plurals, cats] and possessives [boy’s]. My guess is that your study did take into account words which incorporate the z sounds generated by s used within words [rise, his…] but maybe not plurals, which in word frequency analysis are often reduced to an up-count for the non-pluralized form. However, since plurals are very common, this omission would significantly understate the occurrence of the z phoneme in the English language. Do you agree?

- CMLOEGCMLUIN says:
  
  03.02.2019 at 08:55
  
  Paul,
  That is an important observation. It has been too long since I did the work for me to recall confidently, but I do claim to have used the *un*lemmatized frequency list, that is, with plurals, possessives, and other variant forms of words separated. So I believe the representation of the ‘z’ sound should be as accurate as anything else here, however accurate that may be.
  Thank you for your interest!
  Doug
  
Rob Sheppard says:

03.30.2019 at 10:37

Hi there. This is fantastic! Can you clarify the difference between /e/ and /ɛ/ as reflected in this data set? Example words would be appreciated!

- CMLOEGCMLUIN says:
  
  03.31.2019 at 15:03
  
  Thank you Rob! I’m glad you like it.
  
  /ɛ/ is as in “bed”
  and I’m pretty sure when I wrote /e/ I meant the long version /eɪ/ as in “day”.
  
The Basic Premise of English Pronunciation: An Introductory Course on “ə (Schwa)” | Run Language says:

06.16.2019 at 23:57

[…] (from Relative Frequencies of English Phonemes) […]

For Katakana Pronunciation, It’s “ə (Schwa)”! Toward Surprisingly Cool Pronunciation! | Run Language says:

09.16.2019 at 08:38

[…] (from Relative Frequencies of English Phonemes) […]

Олександр Басій says:

04.24.2020 at 12:07

So we can say that the “ɪ” sound is the most used one (besides the schwa, as a sort of zero-like sound), if you add in the “aɪ” sounds (analysed as “a” plus “ɪ”).

- CMLOEGCMLUIN says:
  
  04.25.2020 at 10:36
  
  Thanks for the comment!
  
  I understand your suggestion. You have a reasonable point; you could make the case for [ɪ] being most common.
  
  I’m not sure if a professional linguist would agree; some differ on how they consider constituent phones of diphthongs, i.e. is the [ɪ] in [aɪ] the same [ɪ] as when it stands alone? Maybe, maybe not.
  
  It also depends on what question you’re trying to answer. If you’re trying to answer “which vowel sound divided by word breaks or consonant clusters is most common” then it doesn’t matter that [ɪ] is contained in [aɪ].
  
  You may also want to consider that [ɪ] often occurs in unstressed syllables too. It’s borderline “zero-like”, like the schwa.
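
The arithmetic behind that suggestion can be checked against the table in the post. A short Python sketch, using only percentages listed there; treating the table's "e" row as eɪ follows an earlier reply and is an assumption, as is the choice to fold every ɪ-final diphthong into ɪ.

```python
# Percentages copied from the table in the post.
table = {"ə": 11.49, "n": 7.11, "r": 6.94, "t": 6.91, "ɪ": 6.32,
         "e": 1.79, "aɪ": 1.50, "ɔɪ": 0.10}

# ɪ plus the ɪ of aɪ, as the comment proposes.
with_ai = table["ɪ"] + table["aɪ"]

# Also folding in ɔɪ and (assuming "e" means eɪ) the e row.
with_all = with_ai + table["ɔɪ"] + table["e"]

print(f"ɪ + aɪ: {with_ai:.2f}%")             # 7.82%
print(f"ɪ + aɪ + ɔɪ + e: {with_all:.2f}%")   # 9.71%
print(with_ai > table["n"], with_all < table["ə"])
```

On these numbers ɪ would move from fifth place to second, ahead of n, but would stay behind the schwa even with every ɪ-final diphthong included.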
  
Patrick says:

07.21.2020 at 13:50

Amazing work here even after all these years. I’m wondering if you could explain why there are only 39 phonemes listed when English is generally regarded to have 42-44 (depending on pronunciation). Thanks!

- CMLOEGCMLUIN says:
  
  07.26.2020 at 10:34
  
  Thanks for your comment Patrick! It has been a long time, so I don’t remember the details of my decision making. This was a super informal project of mine. Per the link above, the CMU Pronouncing Dictionary uses only 39 phonemes: http://www.speech.cs.cmu.edu/cgi-bin/cmudict and I probably didn’t look into the matter any further than that.
  
Priorities for pronunciation | englishglobalcom says:

10.11.2020 at 16:12

[…] Blumeyer, D. 2012. Relative frequencies of English Phonemes. Available online at https://cmloegcmluin.wordpress.com/2012/11/10/relative-frequencies-of-english-phonemes/ […]

Steven Lytle says:

09.09.2021 at 20:40

You have done a great job. Thanks for making it available. Do you have any information on the commonest spellings for each phoneme?

- CMLOEGCMLUIN says:
  
  09.11.2021 at 11:28
  
  Interesting question! Well, I just Googled and immediately found this, which looks reasonable enough: http://wp.auburn.edu/rdggenie/home/teaching-ideas/spcat/
  
  This may have been a good thing for me to consider in preparation for this other project of mine: https://cmloegcmluin.wordpress.com/2011/12/01/my-g-a-phonemic-transcription/
  
Most common consonant sound (token frequency) - English Vision says:

10.13.2021 at 10:59

[…] to this, the most common consonant sound in the British National Corpus is /n/, closely followed by /r/ […]

Hedley Finger says:

05.25.2022 at 17:53

Can you run your software to produce the percentage frequencies of phoneme pairs and phoneme triples? I am designing a Shavian keyboard (QWERTY US/AU standard) to place the most frequently occurring single, double, and triple phonemes on the unshifted keys, attempting to alternate the components of doubles and triples on alternate hands. Shifted keys will be for infrequently encountered phonemes, punctuation, and simple arithmetic characters, etc.

Regards,
Hedley
hed ley . fin ger AT g mail . c om

- CMLOEGCMLUIN says:
  
  05.26.2022 at 08:45
  
  Sorry Hedley. This project was 10 years ago and so I don’t know where I’ve put the materials I used to work it out. Actually that was even before I became a software engineer so I probably just did it in some big spreadsheet. My process is described above though and it should be somewhat easily repeatable if you want to take it in a slightly different direction, which I encourage you to do. Good luck with your project. It sounds interesting!
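
For anyone repeating the process with pairs and triples in mind, here is a minimal Python sketch of one way it might go. It is not the original working: the function name `phoneme_ngrams` and the toy data are invented, and it only counts sequences inside a word, since a word frequency list carries no information about which words follow which.

```python
from collections import Counter

def phoneme_ngrams(pronunciations, word_counts, n):
    """Share of each within-word phoneme n-gram, weighted by word frequency."""
    totals = Counter()
    for word, count in word_counts.items():
        phonemes = pronunciations.get(word, [])  # unknown words contribute nothing
        for i in range(len(phonemes) - n + 1):
            totals[tuple(phonemes[i:i + n])] += count
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {gram: c / grand_total for gram, c in totals.items()}

# Toy data, invented for illustration.
pronunciations = {
    "the": ["ð", "ə"],
    "cat": ["k", "æ", "t"],
    "sat": ["s", "æ", "t"],
}
word_counts = {"the": 6, "cat": 2, "sat": 1}

pairs = phoneme_ngrams(pronunciations, word_counts, 2)
triples = phoneme_ngrams(pronunciations, word_counts, 3)
for gram, share in sorted(pairs.items(), key=lambda kv: -kv[1]):
    print(" ".join(gram), f"{share:.2%}")
```

Counting pairs that span word boundaries would need running text rather than a frequency list.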
  