Relative Frequencies of English Phonemes

11.10.2012 § 43 Comments

I’ve roughly estimated the relative frequencies with which English phonemes are used.

I did so by correlating the Carnegie Mellon University Pronouncing Dictionary with Adam Kilgarriff’s unlemmatized frequency list for the British National Corpus. I used the former as a phonemic lexicon and the latter as a sample set, weighting the phonemes by the totals of the relative frequencies of each of the words they appear in.
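
The weighting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the spreadsheet I actually used: the toy `lexicon` and `word_counts` data and the function name `phoneme_frequencies` are hypothetical, and I am assuming that a phoneme occurring twice in a word is counted twice.

```python
from collections import Counter

# Hypothetical toy data: a tiny pronouncing lexicon (word -> phonemes)
# and a tiny word-frequency list (word -> number of corpus occurrences).
lexicon = {
    "the": ["ð", "ə"],
    "cat": ["k", "æ", "t"],
    "sat": ["s", "æ", "t"],
}
word_counts = {"the": 6, "cat": 2, "sat": 1, "naïve": 1}


def phoneme_frequencies(lexicon, word_counts):
    """Weight each phoneme by the corpus count of every word it appears in."""
    totals = Counter()
    for word, count in word_counts.items():
        phonemes = lexicon.get(word)
        if phonemes is None:
            continue  # word not in the lexicon (e.g. accented forms): skipped
        for phoneme in phonemes:
            totals[phoneme] += count
    grand_total = sum(totals.values())
    # Convert the weighted tallies to shares of all phoneme occurrences.
    return {p: n / grand_total for p, n in totals.items()}


freqs = phoneme_frequencies(lexicon, word_counts)
for phoneme, share in sorted(freqs.items(), key=lambda kv: -kv[1]):
    print(f"{phoneme} / {share:.2%}")
```

Because each word contributes in proportion to its corpus count, a very common word such as "the" dominates the tallies for its phonemes, which is the point of using a frequency list rather than the dictionary alone.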

My methodology was messy. CMU’s Pronouncing Dictionary conflated schwa with the near-open central vowel and had several noticeable errors. The BNC list had multiple entries for different parts of speech of the same word, and formatting issues prevented me from using any words with accents or apostrophes, including common contractions (though the counts for these entries were so unreasonably low that Kilgarriff must have already split all of them up, except for the ones in the spoken ‘demog’ component). I believe my manual error checking on the top few hundred words (accounting for 47 million of the 81 million total word usage instances) helped significantly.

So here you have it:

English Phonemes by Commonness

ə / 11.49%
n / 7.11%
r / 6.94%
t / 6.91%
ɪ / 6.32%
s / 4.75%
d / 4.21%
l / 3.96%
i / 3.61%
k / 3.18%
ð / 2.95%
ɛ / 2.86%
m / 2.76%
z / 2.76%
p / 2.15%
æ / 2.10%
v / 2.01%
w / 1.95%
u / 1.93%
b / 1.80%
e / 1.79%
ʌ / 1.74%
f / 1.71%
aɪ / 1.50%
ɑ / 1.45%
h / 1.40%
o / 1.25%
ɒ / 1.18%
ŋ / 0.99%
ʃ / 0.97%
j / 0.81%
g / 0.80%
dʒ / 0.59%
tʃ / 0.56%
aʊ / 0.50%
ʊ / 0.43%
θ / 0.41%
ɔɪ / 0.10%
ʒ / 0.07%

§ 43 Responses to Relative Frequencies of English Phonemes

  • Chris says:

    Thanks so much. I understand your comments about “messiness”, but a thing doesn’t have to be perfect to be good.

    • CMLOEGCMLUIN says:

      Thanks! I’m delighted that this saved you some effort, was worth something to you, and also that you wondered something I had wondered in the first place!

  • Kevin says:

    This is a huge help for a project that I am working on. Does this take into account an American accent, i.e. how in certain settings T and D pronunciations are switched, or instances in which glottal stops are most commonly used, etc.? If not, do you have ideas for how I might take into account American pronunciation variation (assuming that I generate a “standardized” American pronunciation)?

    • CMLOEGCMLUIN says:

      As I recall, the data does use an American accent, because it is an American university’s phonological data (only the sample set of word usage is British).

      As for d/t/ʔ, unfortunately I don’t remember the pronouncing dictionary taking into account the glottal stop and its (as far as I understand) unique situation of being poly-phonemic, that is, that both d and t when unstressed can become it. Therefore I have no information on it, and it is quite likely that more accurate examination would pull both t and d down on this chart and situate a ʔ fairly high up!

      The links I’ve included to both the corpus and the dictionary appear to be still functioning and both downloads are free. Feel free to take my strategy and use your own approach/tweaks.

      Glad this helped!

  • David Rosson says:

    What were these noticeable errors?
    What did you do with the “manual error checking”?

    • CMLOEGCMLUIN says:

      I don’t recall specifically. Something like “what” being written as if it were pronounced like “wet”, etc. My manual checking was just going through the first 100 most common words and correcting any such errors.

      • David Rosson says:

        At least “what” seems alright:
        WHAT W AH1 T
        WHAT(2) HH W AH1 T

        Does your coding table agree with this?
        AA ɑː
        AA0 ɑː
        AA1 ɑː
        AA2 ɑː
        AE æ
        AE0 æ
        AE1 æ
        AE2 æ
        AH ə
        AH0 ə
        AH1 ʌ
        AH2 ʌ
        AO ɔː
        AO0 ɔː
        AO1 ɔː
        AO2 ɔː
        AW aʊ
        AW0 aʊ
        AW1 aʊ
        AW2 aʊ
        AY aɪ
        AY0 aɪ
        AY1 aɪ
        AY2 aɪ
        B b
        CH ʧ
        D d
        DH ð
        EH ɛ
        EH0 ɛ
        EH1 ɛ
        EH2 ɛ
        ER ɚ
        ER0 ɝ
        ER1 ɝ
        ER2 ɝ
        EY eɪ
        EY0 eɪ
        EY1 eɪ
        EY2 eɪ
        F f
        G ɡ
        HH h
        IH ɪ
        IH0 ɪ
        IH1 ɪ
        IH2 ɪ
        IY iː
        IY0 iː
        IY1 iː
        IY2 iː
        JH ʤ
        K k
        L l
        M m
        N n
        NG ŋ
        OW oʊ
        OW0 oʊ
        OW1 oʊ
        OW2 oʊ
        OY ɔɪ
        OY0 ɔɪ
        OY1 ɔɪ
        OY2 ɔɪ
        P p
        R r
        S s
        SH ʃ
        T t
        TH θ
        UH ʊ
        UH0 ʊ
        UH1 ʊ
        UH2 ʊ
        UW uː
        UW0 uː
        UW1 uː
        UW2 uː
        V v
        W w
        Y j
        Z z
        ZH ʒ
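
A minimal sketch of how a coding table like the one above could be applied in Python. The names `BASE`, `STRESS_OVERRIDES`, and `to_ipa` are hypothetical; the mappings are copied from the table as posted, with the stress-specific rows (AH1, AH2, ER0, ER1, ER2) handled as overrides and every other symbol looked up after stripping its stress digit.

```python
# Bare ARPAbet symbol -> IPA, copied from the table above.
BASE = {
    "AA": "ɑː", "AE": "æ", "AH": "ə", "AO": "ɔː", "AW": "aʊ", "AY": "aɪ",
    "B": "b", "CH": "ʧ", "D": "d", "DH": "ð", "EH": "ɛ", "ER": "ɚ",
    "EY": "eɪ", "F": "f", "G": "ɡ", "HH": "h", "IH": "ɪ", "IY": "iː",
    "JH": "ʤ", "K": "k", "L": "l", "M": "m", "N": "n", "NG": "ŋ",
    "OW": "oʊ", "OY": "ɔɪ", "P": "p", "R": "r", "S": "s", "SH": "ʃ",
    "T": "t", "TH": "θ", "UH": "ʊ", "UW": "uː", "V": "v", "W": "w",
    "Y": "j", "Z": "z", "ZH": "ʒ",
}

# Rows of the table where the stress digit changes the IPA symbol.
STRESS_OVERRIDES = {
    "AH1": "ʌ", "AH2": "ʌ",
    "ER0": "ɝ", "ER1": "ɝ", "ER2": "ɝ",
}


def to_ipa(arpabet_symbols):
    """Convert a list of ARPAbet symbols (with optional stress digits) to IPA."""
    ipa = []
    for symbol in arpabet_symbols:
        if symbol in STRESS_OVERRIDES:
            ipa.append(STRESS_OVERRIDES[symbol])
        else:
            # Drop a trailing stress digit (0, 1, or 2) before the lookup.
            ipa.append(BASE[symbol.rstrip("012")])
    return ipa


print(to_ipa("W AH1 T".split()))
print(to_ipa("HH W AH1 T".split()))
```

Under this table, both dictionary entries for “what” come out with ʌ rather than ɛ, since AH1 maps to ʌ.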

      • CMLOEGCMLUIN says:

        1) I honestly can’t remember or find a record of how I handled ə. It’s possible that I changed all unstressed vowels to ə. Otherwise, I would have done as you have here, changed only AH0 to it. Obviously this could make a huge difference.
        2) My copy of the CMU Pronouncing Dictionary had no vowels without a stress marker (primary, secondary, or none). That is, I only have XX0, XX1, XX2… I don’t have plain XX. Not sure what that would represent.
        3) I wrote both oʊ and ɔ as o. I only get ɔ before r’s and don’t distinguish.
        4) For ER I did not use ɚ or ɝ; I used ər. I know it may not be the most accurate.
        5) Just in terms of how I labeled stuff above, I didn’t include length (ː)
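
Points 3 to 5 can be read as a post-processing step on IPA symbols, sketched below in Python. The function name `simplify` is hypothetical, and treating ər as two separate tallies (one ə and one r) is an assumption; the reply above does not say how it was counted. If it was counted that way, every ER in the dictionary would raise both the ə and the r totals.

```python
def simplify(ipa_phonemes):
    """Apply the labeling choices from points 3 to 5 to a list of IPA symbols."""
    simplified = []
    for phoneme in ipa_phonemes:
        phoneme = phoneme.replace("ː", "")  # point 5: no length marks
        if phoneme in ("ɚ", "ɝ"):
            # point 4: ER written as ə followed by r
            # (assumption: counted as two separate phonemes)
            simplified.extend(["ə", "r"])
        elif phoneme in ("oʊ", "ɔ"):
            simplified.append("o")  # point 3: oʊ and ɔ both written as o
        else:
            simplified.append(phoneme)
    return simplified


# "worker" as W ER1 K ER0, using the ER rows of the table in the parent comment.
print(simplify(["w", "ɝ", "k", "ɝ"]))
```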

  • David Rosson says:

    What I was looking for was a frequency distribution for syllables, onsets, and rhymes respectively (using a syllabified CMU dataset: http://webdocs.cs.ualberta.ca/~kondrak/cmudict.html).

  • Pam says:

    Can you point me to an article I can reference for this material?

    • CMLOEGCMLUIN says:

      Pam,

      I’m very sorry — I should have been more explicit that this is not an academic project. I never expected it would get this much attention online — I mostly did it out of my own curiosity! It’s only a “rough estimate” using the CMU pronouncing dictionary, Kilgarriff’s freq list, and the BNC per the above links — plus my “messy methodology”. If you want to cite my conclusions for science (I see you’re a speech therapist?), unfortunately, a more rigorous look will need to be taken. Let’s stay in touch here, though, in case either of us do come across something like that!

      Best,
      Doug

  • James says:

    There is a bit of an oversight here depending on the purpose of constructing your list.

    This oversight is that you’ve used a dictionary rather than a corpus to determine occurrence.

    Here’s an example of what I mean:
    If I’m using a dictionary, I’ll find that the word “the” occurs once, so “the” will contribute the sound ð only once. Meanwhile, if I use a spoken corpus, the word “the” will occur about 3.5% of the time, and will thus contribute the sound ð many times.

    A better methodology is to assemble a large corpus, perhaps 10,000 words or more, of written or spoken dialogue, then use a phonetics library to convert that to IPA, then tally the occurrence of each phoneme.

    Cheers!

    • CMLOEGCMLUIN says:

      James,
      Thank you for your comment. I did actually use the British National Corpus for occurrence, for this very reason!

  • This is really useful for teachers and learners of English so thanks very much for posting it. There is another list, but it dates from around 1980 (see A. C. Gimson, ‘An Introduction to the Pronunciation of English’). Although the lists are broadly similar in terms of rank, there are a number of differences, some undoubtedly due to changes over the last 30 or 40 years. Also, I think Gimson’s list is for British pronunciation (I don’t have the original book, just the table). Main points:

    1. The biggest difference is /r/ (3.51% vs. your 6.94%, an obvious difference between British and American).
    2. The schwa used to be less frequent (10.74% according to Gimson), a difference due at least in part to change over time. Actually, I’m surprised the difference isn’t more, but maybe the schwa is less common in American English.
    3. /d/ is ranked above /s/ on Gimson’s list. Not sure why.
    4. Another notable difference is /ɪ/. Gimson has it at 8.33% compared to your 6.32%. Change over time?
    5. /ʊ/ on your list (0.43%) is half that of Gimson’s (0.86%). Again, not sure why.
    6. /w/ is much more frequent according to Gimson (2.81% vs. your 1.95%).

    I’m thinking you may well have come across Gimson’s data in the course of thinking about how to compile your own. If so, I’d be interested in any comments you might have.

    Finally, I noticed that both lists omit /i/ (sometimes known as the Spanish ‘i’), as at the end of words such as ‘happy’. I’m wondering why…

    • CMLOEGCMLUIN says:

      Thank you for your detailed comment!

      I hadn’t come across Gimson’s data, as far as I recall, no. And while I am interested, I don’t have any further insight into the differences in our results than you’ve already surmised.

      I’m glad this post is of use to folks. Very little on this blog of mine can claim as much. That said, I did do this just for my own amusement; an academic level of accuracy was far down on my list of priorities.

  • Alison says:

    This is a great project! I am surprised that the voiced interdental fricative is not higher up the list, as ‘the’ and the ‘th’ pronouns make up 10% of the top 100 high frequency words.

  • Danny Cho says:

    What is that calligraphy in the background??

    • CMLOEGCMLUIN says:

      That’s the design for my tattoo. It’s roman script, just sideways in a font I designed.

  • Meychele Chesley says:

    Hello, I am working on a research paper for my senior design class at UCF. I came across the English Phonemes by Commonness table and found it very useful, and would like to use a screenshot of it in my paper as a reference. Would this be okay? I will be citing appropriately.

    • CMLOEGCMLUIN says:

      No problem. Please just understand that my methodology was not particularly rigorous. I did this mostly to satisfy my own interest.

  • Hello! Thanks for this, I’ve found it really useful. I ended up needing to go a little bit further, so I ended up mostly-replicating and extending your work a bit. If you’re interested: https://github.com/prendradjaja/phoneme-frequencies

    • CMLOEGCMLUIN says:

      Wow! Thank you for the reference. I think your results are probably superior given your inclusion of ɝ, eɪ, and ɔ. Also, thanks for sharing your code – I took my attempt at this before I learned to code and did it all in spreadsheet software. I’m glad to see that someone else wanted to answer the same question! Ultimately I’d love to see how this varies across different English dialects, or even other languages, perhaps generalized by articulation methods and positions so that we could get a sort of heat map across all human speech! But I doubt I’d ever get to it myself 🙂

  • CMLOEGCMLUIN says:

    I’ve discovered that back in 1950 Rebecca E. Hayden published a similar study. This link was published in Dec. 2015, a few years after my original post: https://www.tandfonline.com/doi/pdf/10.1080/00437956.1950.11659381

    For those of you looking for more academic approaches, her paper may fit your needs better than my informal one. Her results are fairly similar to mine.

  • Paul Ledak says:

    I was recently writing on the dichotomy between the high frequency of the ‘z’ sound in the English language and the relatively infrequent use of the actual letter, i.e. “s” between vowels is a z sound [ex. rise], as it is in most plurals [plurals, cats] and possessives [boy’s]. My guess is that your study did take into account the z sounds generated by s used within words [rise, his…], but maybe not those in plurals, which in word frequency analysis are often folded into the count for the non-pluralized form. However, since plurals are very common, this omission would significantly understate the occurrence of the z phoneme in the English language. Do you agree?

    • CMLOEGCMLUIN says:

      Paul,
      That is an important observation. It has been too long since I did the work for me to recall confidently, but I do claim to have used the *un*lemmatized frequency list, that is, with plurals, possessives, and other variant forms of words separated. So I believe the representation of the ‘z’ sound should be as accurate as anything else here, however accurate that may be.
      Thank you for your interest!
      Doug

  • Rob Sheppard says:

    Hi there. This is fantastic! Can you clarify the difference between /e/ and /ɛ/ as reflected in this data set? Example words would be appreciated!

    • CMLOEGCMLUIN says:

      Thank you Rob! I’m glad you like it.

      /ɛ/ is as in “bed”
      and I’m pretty sure when I wrote /e/ I meant the long version /eɪ/ as in “day”.

  • Олександр Басій says:

    So, we can say that the “ɪ” sound is the most used one (besides the schwa sound, as a sort of zero-like one), if you add up the amount of the “aɪ” sounds (analysed as some “a” plus “ɪ”).

    • CMLOEGCMLUIN says:

      Thanks for the comment!

      I understand your suggestion. You have a reasonable point; you could make the case for [ɪ] being most common.

      I’m not sure if a professional linguist would agree; some differ on how they consider constituent phones of diphthongs, i.e. is the [ɪ] in [aɪ] the same [ɪ] as when it stands alone? Maybe, maybe not.

      It also depends on what question you’re trying to answer. If you’re trying to answer “which vowel sound divided by word breaks or consonant clusters is most common” then it doesn’t matter that [ɪ] is contained in [aɪ].

      You may also want to consider that [ɪ] often occurs in unstressed syllables too. It’s borderline “zero-like”, like the schwa.
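
As a rough check on the suggestion above, here is the arithmetic in Python, using the percentages from the table in the post. It assumes, purely for illustration, that the second element of both aɪ and ɔɪ counts as an ɪ. Splitting a diphthong into two segments also adds one segment to the total for each occurrence, so the sketch renormalises the shares as well.

```python
# Percentages copied from the table in the post.
table = {"ə": 11.49, "n": 7.11, "ɪ": 6.32, "aɪ": 1.50, "ɔɪ": 0.10}

# Assumption for illustration: each aɪ and each ɔɪ contributes one ɪ.
combined = round(table["ɪ"] + table["aɪ"] + table["ɔɪ"], 2)

# Each split diphthong becomes two segments, so the total grows
# from 100 to 100 + aɪ + ɔɪ percentage points.
new_total = 100 + table["aɪ"] + table["ɔɪ"]
renormalised = {
    "ɪ": 100 * combined / new_total,
    "ə": 100 * table["ə"] / new_total,
    "n": 100 * table["n"] / new_total,
}

print(combined)
for phoneme, share in renormalised.items():
    print(f"{phoneme} / {share:.2f}%")
```

On these figures the combined ɪ moves ahead of n but stays well behind ə, so it would be the most common sound only if the schwa is set aside.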

  • Patrick says:

    Amazing work here even after all these years. I’m wondering if you could explain why there are only 39 phonemes listed when English is generally regarded to have 42-44 (depending on pronunciation). Thanks!

    • CMLOEGCMLUIN says:

      Thanks for your comment Patrick! It has been a long time, so I don’t remember the details of my decision making. This was a super informal project of mine. Per the link above, the CMU Pronouncing Dictionary uses only 39 phonemes: http://www.speech.cs.cmu.edu/cgi-bin/cmudict and I probably didn’t look into the matter any further than that.

  • Steven Lytle says:

    You have done a great job. Thanks for making it available. Do you have any information on the commonest spellings for each phoneme?

  • […] to this, the most common consonant sound in the British National Corpus is /n/, closely followed by /r/ […]

  • Hedley Finger says:

    Can you run your software to produce the percentage frequencies of phoneme pairs and phoneme triples? I am designing a Shavian keyboard (QWERTY US/AU standard) to place the most frequently occurring single, double, and triple phonemes on the unshifted keys, attempting to alternate the components of doubles and triples on alternate hands. Shifted keys will be for infrequently encountered phonemes, punctuation, and simple arithmetic characters, etc.

    Regards,
    Hedley
    hed ley . fin ger AT g mail . c om

    • CMLOEGCMLUIN says:

      Sorry Hedley. This project was 10 years ago and so I don’t know where I’ve put the materials I used to work it out. Actually that was even before I became a software engineer so I probably just did it in some big spreadsheet. My process is described above though and it should be somewhat easily repeatable if you want to take it in a slightly different direction, which I encourage you to do. Good luck with your project. It sounds interesting!
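
For anyone who wants to repeat the process for phoneme pairs or triples, here is a minimal Python sketch. It is not the original spreadsheet work: the toy `lexicon` and `word_counts` data and the function name `ngram_counts` are hypothetical, and it only counts sequences inside a word, ignoring pairs that span a word boundary.

```python
from collections import Counter

# Hypothetical toy data: word -> phonemes, and word -> corpus count.
lexicon = {
    "the": ["ð", "ə"],
    "cat": ["k", "æ", "t"],
    "sat": ["s", "æ", "t"],
}
word_counts = {"the": 6, "cat": 2, "sat": 1}


def ngram_counts(lexicon, word_counts, n):
    """Tally within-word phoneme n-grams, weighted by each word's corpus count."""
    totals = Counter()
    for word, count in word_counts.items():
        phonemes = lexicon.get(word, [])
        # Slide a window of length n across the word's phonemes.
        for start in range(len(phonemes) - n + 1):
            totals[tuple(phonemes[start:start + n])] += count
    return totals


pairs = ngram_counts(lexicon, word_counts, 2)
triples = ngram_counts(lexicon, word_counts, 3)

pair_total = sum(pairs.values())
for pair, count in pairs.most_common():
    print("".join(pair), f"{count / pair_total:.2%}")
```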


You are currently reading Relative Frequencies of English Phonemes at cmloegcmluin.
