UNION: Merge of SUBTL and the CMU Pronouncing Dictionary -- Expanded
Download
UNION: Text and XLSX version [20 MB .zip]
Original SUBTL Word Frequency Database (txt and xls formats) [8 MB .zip]
Original CMU Pronouncing Dictionary v 0.6 Syllabified [1 MB .zip]
Original OLD20 and PLD20 from the English Lexicon Project [0.5 MB .zip]
Description
The UNION database was created by taking the union of all of the words in the SUBTL word frequency norms with frequencies greater than or equal to 1 (Brysbaert & New, 2009) with all of the words that have pronunciations in a syllabified version of the CMU Pronouncing Dictionary (Bartlett, Kondrak, & Cherry, 2009). Because the CMU Pronouncing Dictionary is so exhaustive, very few items (< 1000) had to be removed for lack of a CMU pronunciation.
NEW DATA: In addition to the word frequency and pronunciation data, several additional values were calculated for all of the words in UNION, as follows:
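As a rough illustration of the merge described above, the sketch below keeps every word that meets the frequency cut-off and has a pronunciation, and records the words dropped for lack of one. It is only one reading of the description, not the script used to build UNION: the function name merge_databases, the dictionary layout, the pronunciation strings, and the toy frequency values are all assumptions made for the example.

```python
# Toy stand-ins for the two source databases (values are illustrative only,
# not taken from SUBTL or the CMU dictionary).
subtl_freq = {"cat": 66.33, "dog": 192.84, "zzyzx": 1.0, "rare": 0.5}
cmu_pron = {"cat": "K AE1 T", "dog": "D AO1 G", "rare": "R EH1 R"}


def merge_databases(freq, pron, min_freq=1.0):
    """Keep words that meet the frequency cut-off and have a pronunciation.

    Returns the merged records and the list of words that were dropped
    because no pronunciation was available (hypothetical helper).
    """
    merged, dropped = {}, []
    for word, wf in freq.items():
        if wf < min_freq:
            continue  # below the frequency cut-off, never considered
        if word in pron:
            merged[word] = {"SUBTLwf": wf, "phonology": pron[word]}
        else:
            dropped.append(word)  # frequent enough, but no pronunciation
    return merged, dropped


merged, dropped = merge_databases(subtl_freq, cmu_pron)
print(sorted(merged))  # ['cat', 'dog']
print(dropped)         # ['zzyzx']
```

In this toy run "rare" is excluded by the frequency cut-off and "zzyzx" is the only item removed for lacking a pronunciation, which mirrors the small number of removals reported above.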
- phonology: phonological coding of the word, with stress delimiters
- PhonOneCharPerPhonemeNoStress: modified version of the phonology coding, above, using a modified coding scheme so that each phoneme is coded by one character. See phonology_coding_key for details
- isIllegal: does the item contain illegal characters
- isDuplicate: are there duplicates of the item
- isHomograph: does the item have homographs (see note, below)
- isHomophone: does the item have homophones (see note, below)
- nSyll: number of syllables
- nPhon: number of phonemes
- delete: can the item be deleted
- lowerMostFreq: is the lowercase interpretation the most frequent
- SUBTLwf: SUBTL word frequency
- nLet: length in letters
- isIllegal: (duplicate of above)
- pld20: phonological Levenshtein distance (PLD20) for all items that were also included in the English Lexicon Project, from Yarkoni's website
- old20: orthographic Levenshtein distance (OLD20) for all items that were also included in the English Lexicon Project, from Yarkoni's website
- posBigram: positional bigram frequency, calculated for all types with frequencies greater than or equal to 1. Counts are length-specific
- legalBigram: does the item contain only legal bigrams
- posUni: positional unigram (letter) frequency, calculated for all types with frequencies greater than or equal to 1. Counts are length-specific
- legalUni: does the word contain only legal unigrams
- coltNOrth: Coltheart's N for orthographic neighbours
- orthNeighbours: list of orthographic neighbours
- coltNPhon: Coltheart's N for phonological neighbours
- phonNeighbours: list of phonological neighbours, using the one-letter-per-phoneme coding scheme (see above)
- NphonOnsetNeighbour: number of phonological onset neighbours, i.e., neighbours that differ only in their first phoneme
- phonOnsetNeighbours: list of phonological onset neighbours
- NphonOffsetNeighbours: number of phonological offset neighbours, i.e., neighbours that differ only in their last phoneme
- PhonOffsetNeighbours: list of phonological offset neighbours
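To make the neighbourhood fields above more concrete, here is a minimal sketch of how such counts could be computed. It assumes the usual definition of Coltheart's N (items of the same length that differ by exactly one substituted symbol) and assumes that onset and offset neighbours are likewise same-length substitution neighbours; the exact procedure used for UNION is not stated here. The function names and the toy lexicon are hypothetical. The same functions would apply to phonological neighbours if the strings are one-character-per-phoneme codes rather than spellings.

```python
def is_substitution_neighbour(a, b):
    """True if a and b have equal length and differ at exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def coltheart_neighbours(item, lexicon):
    """All lexicon entries that are one substitution away from item."""
    return sorted(w for w in lexicon if is_substitution_neighbour(item, w))


def onset_neighbours(item, lexicon):
    """Entries of the same length that differ from item only in the first symbol."""
    return sorted(
        w for w in lexicon
        if len(w) == len(item) and w != item and w[1:] == item[1:]
    )


def offset_neighbours(item, lexicon):
    """Entries of the same length that differ from item only in the last symbol."""
    return sorted(
        w for w in lexicon
        if len(w) == len(item) and w != item and w[:-1] == item[:-1]
    )


# Toy lexicon of spellings; with one-character-per-phoneme codes the same
# calls would give phonological neighbours instead.
lexicon = {"cat", "bat", "hat", "cot", "cab", "can", "dog"}

neighbours = coltheart_neighbours("cat", lexicon)
print(neighbours)                         # ['bat', 'cab', 'can', 'cot', 'hat']
print(len(neighbours))                    # 5, the Coltheart's N of "cat" here
print(onset_neighbours("cat", lexicon))   # ['bat', 'hat']
print(offset_neighbours("cat", lexicon))  # ['cab', 'can']
```

Under these assumptions the neighbour count is simply the length of the corresponding neighbour list, which is how the paired fields (for example coltNOrth and orthNeighbours) relate in this sketch.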
Additional notes:
The full lists of English homophones and homonyms are available on the main tools webpage.
CMU Pronouncing Dictionary: The database was cleaned to remove the pronunciations of punctuation characters (e.g., !Exclamation-point).
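A cleaning step of this kind might look like the sketch below, which drops any entry whose headword does not begin with a letter. This rule is an assumption chosen for illustration, not the rule actually used for UNION; note that it would also drop genuine words that begin with an apostrophe, so a real cleaning pass would likely need an explicit list of punctuation entries. The function name and the toy entries are hypothetical.

```python
def remove_punctuation_entries(entries):
    """Drop entries whose headword does not start with a letter.

    Assumed rule for illustration only; entries is a dict mapping
    headword -> pronunciation string.
    """
    return {
        word: pron
        for word, pron in entries.items()
        if word[:1].isalpha()
    }


# Toy entries in a headword -> pronunciation layout (illustrative only).
raw = {
    "!EXCLAMATION-POINT": "EH2 K S K L AH0 M EY1 SH AH0 N P OY2 N T",
    "CAT": "K AE1 T",
    "DOG": "D AO1 G",
}

cleaned = remove_punctuation_entries(raw)
print(sorted(cleaned))  # ['CAT', 'DOG']
```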
The original databases are available as follows:
SUBTL:
CMU Pronouncing Dictionary v 0.6 Syllabified:
http://webdocs.cs.ualberta.ca/~kondrak/cmudict.html
CMU Pronouncing Dictionary:
http://www.speech.cs.cmu.edu/cgi-bin/cmudict
These databases are associated with the following articles:
Brysbaert, M., & New, B. (2009). Moving beyond Kucera and Francis: A critical evaluation of current frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41(4), 977-990.
Bartlett, S., Kondrak, G., & Cherry, C. (2009). On the syllabification of phonemes. NAACL-HLT 2009.
The information included here was compiled on May 20th, 2013.
Copyright Notice:
The information provided here is intended to ensure the timely dissemination of the EsPal data in an alternative format that may be useful for non-commercial academic research. Copyright of all of this material is maintained by the original authors or other copyright holders, and it is assumed that all users of these data will adhere to these copyrights.