Tools --- UNION

UNION: Merge of SUBTL and the CMU Pronunciation Dictionary -- Expanded


Download

UNION: Text and XLSX version [20 mb .zip]

Original SUBTL Word Frequency Database (txt and xls formats) [8 mb .zip]

Original CMU Pronunciation Dictionary v 0.6 Syllabified [1 mb .zip]

Original OLD20 and PLD20 from the English Lexicon Project [0.5 mb .zip]

Description

The UNION database was created by taking the union of all of the words in the SUBTL word frequency norms with frequencies greater than or equal to 1 (Brysbaert & New, 2009) with all of the words that have pronunciations in a syllabified version of the CMU Pronunciation Dictionary (Bartlett, Kondrak, & Cherry, 2009). Because the CMU Pronunciation Dictionary is so exhaustive, very few items (< 1000) were removed for not appearing in it.
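The merge described above can be sketched roughly as follows. This is a minimal illustration, not the procedure actually used to build UNION: it assumes, from the note about removed items, that a word is kept when it passes the frequency cut-off in SUBTL and also has a CMU pronunciation. The dictionaries `subtl_freq` and `cmu_pron`, their toy values, and the function `build_union` are all hypothetical names introduced for the example.

```python
# Toy stand-ins for the two source databases (hypothetical values,
# not taken from SUBTL or the CMU dictionary files).
subtl_freq = {"cat": 66.3, "dog": 192.8, "zzyzx": 0.4, "blorft": 2.0}
cmu_pron = {"cat": "K AE1 T", "dog": "D AO1 G", "zebra": "Z IY1 B R AH0"}

MIN_FREQ = 1  # frequency threshold stated in the description


def build_union(freq, pron, min_freq=MIN_FREQ):
    """Keep words that pass the frequency cut-off and have a pronunciation.

    Returns the merged entries plus the list of words that were dropped
    because no pronunciation was available (an assumed reading of the text).
    """
    kept, dropped = {}, []
    for word, f in freq.items():
        if f < min_freq:
            continue  # below the frequency threshold, never considered
        if word in pron:
            kept[word] = {"SUBTLwf": f, "phonology": pron[word]}
        else:
            dropped.append(word)  # frequent enough, but no pronunciation
    return kept, dropped


union, removed = build_union(subtl_freq, cmu_pron)
print(sorted(union), removed)
```

Returning the dropped words separately makes it easy to check how many items were lost at the merge step.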

NEW DATA: In addition to the word frequency and pronunciation data, several additional values were calculated for all of the words in UNION, as follows:

phonology phonological coding of word, with stress delimiters
PhonOneCharPerPhonemeNoStress modified version of the phonology coding, above, using a modified coding scheme so that each phoneme is coded by one character. See phonology_coding_key for details
isIllegal does the item contain illegal characters
isDuplicate are there duplicates of the item
isHomograph does the item have homographs (see note, below)
isHomophone does the item have homophones (see note, below)
nSyll number of syllables
nPhon number of phonemes
delete Can Delete
lowerMostFreq is the lowercase interpretation the most frequent
SUBTLwf SUBTL word frequency
nLet length in letters
isIllegal (duplicate of above)
pld20 phonological Levenshtein distance for all items that were also included in the English Lexicon Project, from Yarkoni's website
old20 orthographic Levenshtein distance for all items that were also included in the English Lexicon Project, from Yarkoni's website
posBigram positional bigram frequency, calculated for all types with frequencies greater than or equal to 1. Counts are length-specific
legalBigram does the item only contain legal bigrams
posUni positional unigram (letter) frequency, calculated for all types with frequencies greater than or equal to 1. Counts are length-specific
legalUni does the word only contain legal unigrams
coltNOrth Coltheart's N for orthographic neighbours
orthNeighbours list of orthographic neighbours
coltNPhon Coltheart's N for phonological neighbours
phonNeighbours list of phonological neighbours using the one letter per phoneme coding scheme (see above)
NphonOnsetNeighbour the number of phonological onset neighbours that differ only by their first phoneme
phonOnsetNeighbours list of phonological onset neighbours
NphonOffsetNeighbours number of phonological offset neighbours that differ only by their last phoneme
PhonOffsetNeighbours list of phonological offset neighbours
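To make the neighbourhood and positional-frequency fields above more concrete, here is a small sketch of how such values could be computed. It is an illustration under stated assumptions, not the code used to build UNION: neighbours are taken to be same-length items that differ by a single substitution (the usual definition of Coltheart's N), and the positional bigram value of a word is assumed to be the sum, across positions, of length-specific type counts. All function names and the toy lexicon are hypothetical. The same neighbour functions would apply to phonological codes, provided each phoneme is a single character as in PhonOneCharPerPhonemeNoStress.

```python
from collections import Counter


def substitution_neighbours(item, lexicon):
    """Same-length items that differ from `item` at exactly one position."""
    return sorted(
        other
        for other in lexicon
        if len(other) == len(item)
        and other != item
        and sum(a != b for a, b in zip(item, other)) == 1
    )


def onset_neighbours(item, lexicon):
    """Neighbours that differ only in their first symbol."""
    return [n for n in substitution_neighbours(item, lexicon) if n[1:] == item[1:]]


def offset_neighbours(item, lexicon):
    """Neighbours that differ only in their last symbol."""
    return [n for n in substitution_neighbours(item, lexicon) if n[:-1] == item[:-1]]


def positional_bigram_counts(words):
    """Type counts keyed by (word length, position, bigram)."""
    counts = Counter()
    for w in set(words):  # each type is counted once
        for i in range(len(w) - 1):
            counts[(len(w), i, w[i:i + 2])] += 1
    return counts


def pos_bigram(word, counts):
    """Assumed summary: sum of length-specific positional bigram counts."""
    return sum(counts[(len(word), i, word[i:i + 2])] for i in range(len(word) - 1))


# Toy lexicon (hypothetical); "cart" is ignored for "cat" because it differs in length.
lexicon = ["cat", "cot", "hat", "bat", "can", "cart"]

neighbours = substitution_neighbours("cat", lexicon)
colt_n = len(neighbours)  # Coltheart's N is the size of the neighbour list
counts = positional_bigram_counts(lexicon)

print(neighbours, colt_n)
print(onset_neighbours("cat", lexicon), offset_neighbours("cat", lexicon))
print(pos_bigram("cat", counts))
```

In this toy lexicon "cat" has four neighbours, two of which are onset neighbours and one an offset neighbour.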

Additional notes:

The full lists of English homophones and homonyms are available on the main tools webpage.

CMU Pronunciation Dictionary: The database was cleaned to remove the pronunciations of punctuation characters (e.g., !Exclamation-point).
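A cleaning step of this kind could look like the sketch below. It is only a heuristic illustration, not the procedure that was actually applied: it assumes dictionary lines consist of a headword followed by whitespace and a phoneme string, and it treats any headword that starts with a non-letter as a punctuation entry. The sample lines and the function name `is_punctuation_entry` are introduced for the example.

```python
def is_punctuation_entry(headword):
    """Heuristic (assumption): punctuation entries start with a non-letter."""
    return not headword[0].isalpha()


# Illustrative lines in a "HEADWORD  PHONEMES" layout.
sample_lines = [
    "!EXCLAMATION-POINT  EH2 K S K L AH0 M EY1 SH AH0 N P OY2 N T",
    "CAT  K AE1 T",
    "DOG  D AO1 G",
]

cleaned = []
for line in sample_lines:
    # Split once on whitespace: headword first, pronunciation second.
    headword, pron = line.split(None, 1)
    if not is_punctuation_entry(headword):
        cleaned.append((headword, pron))

print(cleaned)
```

Note that this heuristic would also drop legitimate words that begin with an apostrophe, so a real cleaning pass would need an explicit list of punctuation entries or a manual check.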

The original databases are available as follows:

SUBTL:

http://subtlexus.lexique.org/

CMU Pronunciation Dictionary v 0.6 Syllabified:

http://webdocs.cs.ualberta.ca/~kondrak/cmudict.html

CMU Pronunciation Dictionary:

http://www.speech.cs.cmu.edu/cgi-bin/cmudict

These databases are associated with the following articles:

Brysbaert, M., & New, B. (2009). Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41(4), 977-990.

Bartlett, S., Kondrak, G., & Cherry, C. (2009). On the syllabification of phonemes. NAACL-HLT 2009.

The information included here was compiled on May 20th, 2013.

The information provided here is intended to ensure the timely dissemination of the EsPal data in an alternative format that may be useful for non-commercial academic research. Copyright of all of this material is maintained by the original authors or other copyright holders, and it is assumed that all users of these data will adhere to these copyrights.

Blair Armstrong, 2011-