UNION: Merge of SUBTL and the CMU Pronunciation Dictionary -- Expanded



Download


UNION: Text and XLSX version [20 mb .zip]

Original SUBTL Word Frequency Database (txt and xls formats) [8 mb .zip]

Original CMU Pronunciation Dictionary v 0.6 Syllabified [1 mb .zip]

Original OLD20 and PLD20 from the English Lexicon Project [0.5 mb .zip]


Description


The UNION database was created by taking the union of all words in the SUBTL word frequency norms with frequencies greater than or equal to 1 (Brysbaert & New, 2009) with all words that have pronunciations in a syllabified version of the CMU Pronunciation Dictionary (Bartlett, Kondrak, & Cherry, 2009).  Because the CMU Pronunciation Dictionary is so exhaustive, very few items (< 1000) had to be removed for lack of a CMU pronunciation.
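As a rough illustration of the merge described above, the following Python sketch keeps every word that meets the frequency threshold and has a pronunciation, and records the words dropped for lack of one. This is one reading of the procedure, not the script used to build UNION; the function name `build_union`, the dictionary-based inputs, and the toy entries are all hypothetical.

```python
def build_union(freqs, prons, min_freq=1):
    """Combine a frequency table with a pronunciation table.

    freqs: dict mapping word -> frequency (hypothetical stand-in for SUBTL)
    prons: dict mapping word -> pronunciation string (stand-in for CMU)
    Returns (kept, dropped): kept maps word -> (frequency, pronunciation);
    dropped lists words that met the threshold but had no pronunciation.
    """
    kept, dropped = {}, []
    for word, freq in freqs.items():
        if freq < min_freq:
            continue  # below the frequency threshold, never considered
        if word in prons:
            kept[word] = (freq, prons[word])
        else:
            dropped.append(word)  # no pronunciation available
    return kept, dropped


# Toy data, invented for illustration only.
freqs = {"cat": 120, "dog": 95, "zzyzx": 2, "rare": 0}
prons = {"cat": "K AE1 T", "dog": "D AO1 G"}
kept, dropped = build_union(freqs, prons)
```

Here `kept` holds "cat" and "dog", `dropped` holds "zzyzx", and "rare" is ignored because its frequency is below the threshold.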

NEW DATA: In addition to the word frequency and pronunciation data, several additional values were calculated for all of the words in UNION, as follows:

  • phonology    phonological coding of word, with stress delimiters
  • PhonOneCharPerPhonemeNoStress    version of the phonology coding above, recoded so that each phoneme is represented by a single character.  See phonology_coding_key for details
  • isIllegal    does the item contain illegal characters
  • isDuplicate    are there duplicates of the item
  • isHomograph    does the item have homographs (see note, below)
  • isHomophone    does the item have homophones (see note, below)
  • nSyll    number of syllables
  • nPhon    number of phonemes
  • delete    Can Delete
  • lowerMostFreq    is the lowercase interpretation the most frequent
  • SUBTLwf    SUBTL word frequency
  • nLet    length in letters
  • isIllegal    (duplicate of above)
  • pld20    phonological Levenshtein distance (PLD20) for all items that were also included in the English Lexicon Project, from Yarkoni's website
  • old20    orthographic Levenshtein distance (OLD20) for all items that were also included in the English Lexicon Project, from Yarkoni's website
  • posBigram    positional bigram frequency, calculated for all types with frequencies greater than or equal to 1.  Counts are length-specific
  • legalBigram    does the item only contain legal bigrams
  • posUni    positional unigram (letter) frequency, calculated for all types with frequencies greater than or equal to 1.  Counts are length-specific
  • legalUni    does the word only contain legal unigrams
  • coltNOrth    Coltheart's N for orthographic neighbours
  • orthNeighbours    list of orthographic neighbours
  • coltNPhon    Coltheart's N for phonological neighbours
  • phonNeighbours    list of phonological neighbours using the one letter per phoneme coding scheme (see above)
  • NphonOnsetNeighbour    the number of phonological onset neighbours that differ only by their first phoneme
  • phonOnsetNeighbours    list of phonological onset neighbours
  • NphonOffsetNeighbours    number of phonological offset neighbours that differ only by their last phoneme
  • PhonOffsetNeighbours    list of phonological offset neighbours
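
The neighbourhood measures listed above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used to build UNION: it takes Coltheart's N in its usual sense (same-length items differing at exactly one position), treats onset and offset neighbours as the special cases where that position is the first or last one, and computes an OLD20/PLD20-style value as the mean Levenshtein distance to the n closest items. The function names and the toy word list are hypothetical. For the phonological measures, the same functions would be applied to the one-character-per-phoneme strings rather than to spellings.

```python
from collections import defaultdict


def find_neighbours(items, where="any"):
    """Map each item to the same-length items that differ at one position.

    where="any"   -> any single position (Coltheart-style neighbours)
    where="first" -> first position only (onset neighbours)
    where="last"  -> last position only (offset neighbours)
    """
    items = set(items)
    buckets = defaultdict(set)
    for w in items:
        if not w:
            continue  # empty strings have no positions to vary
        if where == "first":
            positions = [0]
        elif where == "last":
            positions = [len(w) - 1]
        else:
            positions = range(len(w))
        for i in positions:
            # Items sharing this key are identical except at position i.
            buckets[(i, w[:i], w[i + 1:])].add(w)
    result = {w: set() for w in items}
    for group in buckets.values():
        for w in group:
            result[w] |= group - {w}
    return result


def levenshtein(a, b):
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def mean_closest_distance(word, lexicon, n=20):
    """Mean Levenshtein distance from word to its n closest other items."""
    dists = sorted(levenshtein(word, other) for other in lexicon if other != word)
    closest = dists[:n]
    return sum(closest) / len(closest) if closest else float("nan")


# Toy lexicon, invented for illustration only.
words = ["cat", "bat", "cot", "cab", "cast", "dog"]
orth = find_neighbours(words)              # neighbour lists
onset = find_neighbours(words, "first")    # differ only in first position
offset = find_neighbours(words, "last")    # differ only in last position
colt_n_cat = len(orth["cat"])              # neighbour count for "cat"
old_cat = mean_closest_distance("cat", words, n=5)
```

In this toy lexicon "cat" has three neighbours ("bat", "cot", "cab"), of which "bat" is an onset neighbour and "cab" an offset neighbour; "cast" is one edit away but is not counted because it differs in length. Bucketing by position and remainder avoids comparing every pair of items, which matters for a lexicon of this size.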

 

Additional notes: 

The full lists of English homophones and homonyms are available on the main tools webpage.

CMU Pronunciation Dictionary: The database was cleaned to remove the pronunciations of punctuation characters (e.g., !Exclamation-point).
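
A cleaning step of this kind might look like the Python sketch below. The rule used here is a heuristic assumption, not the rule applied to UNION: an entry is dropped when its headword begins with a character that is neither a letter nor an apostrophe. Apostrophe-initial entries are kept because some are ordinary words, so any punctuation names that begin with an apostrophe would still need manual review. The function name and the sample lines are illustrative, and the syllabified file may be laid out differently.

```python
def drop_punctuation_entries(lines):
    """Keep dictionary lines whose headword starts with a letter or apostrophe."""
    kept = []
    for line in lines:
        if not line.strip() or line.startswith(";;;"):
            continue  # skip blank lines and ';;;' comment lines
        headword = line.split()[0]
        if headword[0].isalpha() or headword[0] == "'":
            kept.append(line)
    return kept


# Illustrative lines in the general "HEADWORD  PHONEMES" layout.
sample = [
    ";;; comment line",
    "!EXCLAMATION-POINT  EH2 K S K L AH0 M EY1 SH AH0 N P OY2 N T",
    "CAT  K AE1 T",
    "",
    "'BOUT  B AW1 T",
]
cleaned = drop_punctuation_entries(sample)
```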

The original databases are available as follows:

SUBTL:

http://subtlexus.lexique.org/

CMU Pronunciation Dictionary v 0.6 Syllabified:

http://webdocs.cs.ualberta.ca/~kondrak/cmudict.html

CMU Pronunciation Dictionary:

http://www.speech.cs.cmu.edu/cgi-bin/cmudict


These databases are associated with the following articles:

Brysbaert, M., & New, B. (2009).  Moving beyond Kucera and Francis: A critical evaluation of current frequency norms and the introduction of a new and improved word frequency measure for American English.  Behavior Research Methods, 41(4), 977-990.

Bartlett, S., Kondrak, G., & Cherry, C.  (2009).  On the syllabification of phonemes.  NAACL-HLT 2009.



The information included here was compiled on May 20th, 2013. 



Copyright Notice:

The information provided here is intended to ensure the timely dissemination of the EsPal data in an alternative format that may be useful for non-commercial academic research.  Copyright of all of this material is maintained by the original authors or other copyright holders, and it is assumed that all users of these data will adhere to these copyrights. 



Blair Armstrong, 2011-