Relative Frequencies of English Phonemes

11.10.2012 § 21 Comments

I’ve roughly estimated the relative frequencies with which English phonemes are used.

I did so by cross-referencing the Carnegie Mellon University Pronouncing Dictionary with Adam Kilgarriff’s unlemmatized frequency list for the British National Corpus. I used the former as a phonemic lexicon and the latter as a sample set, weighting each phoneme by the summed relative frequencies of the words it appears in.
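
The weighting described above can be sketched roughly as follows. This is a minimal illustration, not the script actually used for this post: the function name, the toy lexicon, and the toy counts are all invented, and it assumes a phoneme that occurs twice in a word is counted twice.

```python
from collections import Counter


def phoneme_frequencies(pronunciations, word_counts):
    """Return {phoneme: share of all phoneme tokens}, weighted by word frequency.

    pronunciations: {word: [phoneme, ...]}  (the phonemic lexicon)
    word_counts:    {word: corpus count}    (the sample set)
    """
    totals = Counter()
    for word, count in word_counts.items():
        phones = pronunciations.get(word.lower())
        if phones is None:
            continue  # words missing from the lexicon are simply skipped
        for phone in phones:
            # Each occurrence of a phoneme is weighted by the word's count.
            totals[phone] += count
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {phone: n / grand_total for phone, n in totals.items()}


# Toy data, invented for illustration only.
pronunciations = {
    "the": ["ð", "ə"],
    "cat": ["k", "æ", "t"],
    "sat": ["s", "æ", "t"],
}
word_counts = {"the": 6, "cat": 1, "sat": 1}

freqs = phoneme_frequencies(pronunciations, word_counts)
for phone, share in sorted(freqs.items(), key=lambda item: -item[1]):
    print(f"{phone} / {share:.2%}")
```

Because "the" dominates the toy counts, ð and ə come out on top even though each appears in only one lexicon entry, which is the point of weighting by corpus frequency rather than counting dictionary entries.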

My methodology was messy. CMU’s Pronouncing Dictionary conflated schwa with the near-open central vowel and had several noticeable errors. The BNC list had multiple entries for words that occur as different parts of speech, and formatting issues prevented me from using any words with accents or apostrophes, including common contractions (though the counts for these entries were so unreasonably low that Kilgarriff must have already split all of them up, except for the ones in the spoken ‘demog’ component). I believe my manual error checking on the top few hundred words (accounting for 47 million of the 81 million total word usage instances) helped significantly.

So here you have it:

English Phonemes by Commonness

ə / 11.49%
n / 7.11%
r / 6.94%
t / 6.91%
ɪ / 6.32%
s / 4.75%
d / 4.21%
l / 3.96%
i / 3.61%
k / 3.18%
ð / 2.95%
ɛ / 2.86%
m / 2.76%
z / 2.76%
p / 2.15%
æ / 2.10%
v / 2.01%
w / 1.95%
u / 1.93%
b / 1.80%
e / 1.79%
ʌ / 1.74%
f / 1.71%
aɪ / 1.50%
ɑ / 1.45%
h / 1.40%
o / 1.25%
ɒ / 1.18%
ŋ / 0.99%
ʃ / 0.97%
y / 0.81%
g / 0.80%
dʒ / 0.59%
tʃ / 0.56%
aʊ / 0.50%
ʊ / 0.43%
θ / 0.41%
ɔɪ / 0.10%
ʒ / 0.07%

§ 21 Responses to Relative Frequencies of English Phonemes

  • Chris says:

    Thanks so much. I understand your comments about “messiness”, but a thing doesn’t have to be perfect to be good.

    • CMLOEGCMLUIN says:

      Thanks! I’m delighted that this saved you some effort, was worth something to you, and also that you wondered something I had wondered in the first place!

  • Kevin says:

    This is a huge help for a project that I am working on. Does this take into account an American accent, i.e. how T and D pronunciations are switched in certain settings, or the instances in which glottal stops are most commonly used? If not, do you have ideas for how I might take American pronunciation variation into account (assuming that I generate a “standardized” American pronunciation)?

    • CMLOEGCMLUIN says:

      As I recall, the data does use an American accent, because it is an American university’s phonological data (only the sample set of word usage is British).

      As for d/t/ʔ, unfortunately I don’t remember the pronouncing dictionary taking into account the glottal stop and its (as far as I understand) unique situation of being poly-phonemic, that is, that both d and t when unstressed can become it. Therefore I have no information on it, and it is quite likely that more accurate examination would pull both t and d down on this chart and situate a ʔ fairly high up!

      The links I’ve included to both the corpus and the dictionary appear to be still functioning and both downloads are free. Feel free to take my strategy and use your own approach/tweaks.

      Glad this helped!

  • David Rosson says:

    What were these noticeable errors?
    What did you do with the “manual error checking”?

    • CMLOEGCMLUIN says:

      I don’t recall specifically. Something like “what” being written as if it were pronounced like “wet”, etc. My manual checking was just going through the first 100 most common words and correcting any such errors.

      • David Rosson says:

        At least “what” seems alright:
        WHAT W AH1 T
        WHAT(2) HH W AH1 T

        Does your coding table agree with this?
        AA ɑː
        AA0 ɑː
        AA1 ɑː
        AA2 ɑː
        AE æ
        AE0 æ
        AE1 æ
        AE2 æ
        AH ə
        AH0 ə
        AH1 ʌ
        AH2 ʌ
        AO ɔː
        AO0 ɔː
        AO1 ɔː
        AO2 ɔː
        AW aʊ
        AW0 aʊ
        AW1 aʊ
        AW2 aʊ
        AY aɪ
        AY0 aɪ
        AY1 aɪ
        AY2 aɪ
        B b
        CH ʧ
        D d
        DH ð
        EH ɛ
        EH0 ɛ
        EH1 ɛ
        EH2 ɛ
        ER ɚ
        ER0 ɝ
        ER1 ɝ
        ER2 ɝ
        EY eɪ
        EY0 eɪ
        EY1 eɪ
        EY2 eɪ
        F f
        G ɡ
        HH h
        IH ɪ
        IH0 ɪ
        IH1 ɪ
        IH2 ɪ
        IY iː
        IY0 iː
        IY1 iː
        IY2 iː
        JH ʤ
        K k
        L l
        M m
        N n
        NG ŋ
        OW oʊ
        OW0 oʊ
        OW1 oʊ
        OW2 oʊ
        OY ɔɪ
        OY0 ɔɪ
        OY1 ɔɪ
        OY2 ɔɪ
        P p
        R r
        S s
        SH ʃ
        T t
        TH θ
        UH ʊ
        UH0 ʊ
        UH1 ʊ
        UH2 ʊ
        UW uː
        UW0 uː
        UW1 uː
        UW2 uː
        V v
        W w
        Y j
        Z z
        ZH ʒ

      • CMLOEGCMLUIN says:

        1) I honestly can’t remember or find a record of how I handled ə. It’s possible that I changed all unstressed vowels to ə. Otherwise, I would have done as you have here, changed only AH0 to it. Obviously this could make a huge difference.
        2) My copy of the CMU pronunciation dictionary marks every vowel with a stress digit (primary, secondary, or no stress). That is, I only have XX0, XX1, XX2… I don’t have plain XX. Not sure what that would represent.
        3) I wrote both oʊ and ɔ as o. I only get ɔ before r’s and don’t distinguish.
        4) For ER I did not use ɚ or ɝ; I used ər. I know it may not be the most accurate.
        5) Just in terms of how I labeled stuff above, I didn’t include length (ː)
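
The conventions in points 1 to 5 can be sketched as a small conversion function. This is only an illustration under stated assumptions: the name `to_ipa` and the table are hypothetical, the base symbols are taken from the table in the previous comment with length marks dropped, schwa is assigned only to AH0 (or plain AH), which point 1 says may not match what was actually done, and a few labels in the post's list (such as y for j and e for eɪ) differ from this table.

```python
# Base symbols follow the table in the previous comment, minus length marks.
# AH and ER are handled separately in to_ipa() below.
ARPABET_TO_IPA = {
    "AA": "ɑ", "AE": "æ", "AW": "aʊ", "AY": "aɪ", "EH": "ɛ", "EY": "eɪ",
    "IH": "ɪ", "IY": "i", "OY": "ɔɪ", "UH": "ʊ", "UW": "u",
    "AO": "o", "OW": "o",  # point 3: both written as o
    "B": "b", "CH": "tʃ", "D": "d", "DH": "ð", "F": "f", "G": "ɡ",
    "HH": "h", "JH": "dʒ", "K": "k", "L": "l", "M": "m", "N": "n",
    "NG": "ŋ", "P": "p", "R": "r", "S": "s", "SH": "ʃ", "T": "t",
    "TH": "θ", "V": "v", "W": "w", "Y": "j", "Z": "z", "ZH": "ʒ",
}


def to_ipa(arpabet):
    """Convert a list of CMU symbols (e.g. ["W", "AH1", "T"]) to IPA labels."""
    out = []
    for symbol in arpabet:
        base = symbol.rstrip("012")   # drop the stress digit, if any
        stress = symbol[len(base):]   # "", "0", "1" or "2"
        if base == "AH":
            # Assumption: only unstressed (or unmarked) AH becomes schwa.
            out.append("ə" if stress in ("", "0") else "ʌ")
        elif base == "ER":
            # Point 4: ER is written as schwa plus r, i.e. two phonemes.
            out.extend(["ə", "r"])
        else:
            out.append(ARPABET_TO_IPA[base])
    return out


print(to_ipa(["W", "AH1", "T"]))  # the first WHAT entry quoted above
print(to_ipa(["DH", "AH0"]))
```

Note that splitting ER into two phonemes adds to both the ə and the r totals, so this one choice can move both of them noticeably in the ranking.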

  • David Rosson says:

    What I was looking for was a frequency distribution for syllables, onsets, and rhymes respectively (using a syllabified CMU dataset: http://webdocs.cs.ualberta.ca/~kondrak/cmudict.html).

  • Pam says:

    Can you point me to an article I can reference for this material?

    • CMLOEGCMLUIN says:

      Pam,

      I’m very sorry. I should have been more explicit that this is not an academic project. I never expected it would get this much attention online; I mostly did it out of my own curiosity! It’s only a “rough estimate” using the CMU pronouncing dictionary, Kilgarriff’s freq list, and the BNC per the above links, plus my “messy methodology”. If you want to cite my conclusions for science (I see you’re a speech therapist?), unfortunately, a more rigorous look will need to be taken. Let’s stay in touch here, though, in case either of us does come across something like that!

      Best,
      Doug

  • James says:

    There is a bit of an oversight here depending on the purpose of constructing your list.

    This oversight is that you’ve used a dictionary rather than a corpus to determine occurrence.

    Here’s an example of what I mean:
    If I’m using a dictionary, I’ll find that the word “the” occurs once, so “the” will contribute the sound ð only once. Meanwhile, if I use a spoken corpus, the word “the” will occur about 3.5% of the time, and will thus contribute the sound ð many times.

    A better methodology is to assemble a large corpus, perhaps 10,000 words or more, of written or spoken dialogue, then use a phonetics library to convert that to IPA, then tally the occurrences of each phoneme.

    Cheers!

    • CMLOEGCMLUIN says:

      James,
      Thank you for your comment. I did actually use the British National Corpus for occurrence, for this very reason!

  • This is really useful for teachers and learners of English so thanks very much for posting it. There is another list, but it dates from around 1980 (see A. C. Gimson, ‘An Introduction to the Pronunciation of English’). Although the lists are broadly similar in terms of rank, there are a number of differences, some undoubtedly due to changes over the last 30 or 40 years. Also, I think Gimson’s list is for British pronunciation (I don’t have the original book, just the table). Main points:

    1. The biggest difference is /r/ (3.51% vs. your 6.94%, an obvious difference between British and American).
    2. The schwa used to be less frequent (10.74% according to Gimson), a difference due at least in part to change over time. Actually, I’m surprised the difference isn’t greater, but maybe the schwa is less common in American English.
    3. /d/ is ranked above /s/ on Gimson’s list. Not sure why.
    4. Another notable difference is /ɪ/. Gimson has it at 8.33% compared to your 6.32%. Change over time?
    5. /ʊ/ on your list (0.43%) is half of Gimson’s figure (0.86%). Again, not sure why.
    6. /w/ is much more frequent according to Gimson (2.81% vs. your 1.95%).

    I’m thinking you may well have come across Gimson’s data in the course of thinking about how to compile your own. If so, I’d be interested in any comments you might have.

    Finally, I noticed that both lists omit /i/ (sometimes known as the Spanish ‘i’), as at the end of words such as ‘happy’. I’m wondering why…

    • CMLOEGCMLUIN says:

      Thank you for your detailed comment!

      I hadn’t come across Gimson’s data, as far as I recall. And while I am interested, I don’t have any further insight into the differences in our results beyond what you’ve already surmised.

      I’m glad this post is of use to folks. Very little on this blog of mine can claim as much. That said, I did do this just for my own amusement; an academic level of accuracy was far down on my list of priorities.

  • Alison says:

    This is a great project! I am surprised that the voiced interdental fricative is not higher up the list, as ‘the’ and the ‘th’ pronouns make up 10% of the top 100 high frequency words.

  • Danny Cho says:

    What is that calligraphy in the background??

    • CMLOEGCMLUIN says:

      That’s the design for my tattoo. It’s roman script, just sideways in a font I designed.


You are currently reading Relative Frequencies of English Phonemes at cmloegcmluin.
