Tools
This page collects tools that I have found useful in my research, including both tools that I have created myself and tools shared with me by other researchers, gathered in one place for rapid dissemination.
Unless otherwise noted, all of the materials that I have created from the ground up are released under the GPL v3 and Creative Commons (CC BY 2.0) licences. The rest remain the property of their respective owners.
List of Programs, Scripts, and Norms
PUBLISHED SOFTWARE AND NORMS:
- Chronset: An Automated Tool for Detecting Speech Onset [WEBSITE LINK]
- SOS: Software and Algorithm for the Stochastic Optimization of Stimuli [WEBSITE LINK]
- eDom: Software and Norms for 443 English Homonyms [WEBSITE LINK]
Norms
ENGLISH:
- UNION: (English, USA). Merger of SUBTL and the CMU Pronunciation Dictionary -- Expanded
- Homophones and Homographs from UNION (Merged SUBTL and CMU Pronunciation Dictionary)
- Number of Meanings, Number of Senses, and Part-of-speech frequencies from the Wordsmyth Online Dictionary
- Wall Street Journal Word Frequency Counts
- Archive of University of Alberta Norms of Relative Meaning Frequency for 566 Homographs by Twilley, Dixon, Taylor, & Clark (1994)
OTHER LANGUAGES:
Norm Processing Programs
- Calculating Orthographic/Phonological Neighbourhoods (Coltheart's N)
- Calculating Orthographic/Phonological Onset and Offset Neighbourhoods (restricted Coltheart's N)
- Calculating Summed Token (or Type) Length-Specific Positional Bigram Frequency, Letter (Unigram) Frequency, and Bigram and Letter Legality
- Calculating Clustering Coefficients
Audio Processing
Norms
ENGLISH:
UNION: (English, USA). Merger of SUBTL and the CMU Pronunciation Dictionary -- Expanded
The UNION database was created by taking the union of all of the words in the SUBTL word frequency norms with frequencies greater than or equal to 1 (Brysbaert & New, 2009) with all of the words that have pronunciations in a syllabified version of the CMU Pronunciation Dictionary (Bartlett, Kondrak, & Cherry, 2009). Several additional properties have also been added for all of the words in the database.
Homophones and Homographs from UNION (Merged SUBTL and CMU Pronunciation Dictionary, English, USA)
Homophones and homographs (with and without stress information) for all of the items with word frequencies equal to or greater than 1 in the union of the SUBTL and CMU Pronunciation dictionaries (see above). Note that a portion of these entries, in particular those from the version that is sensitive to stress information, reflect dialect differences. [download 40 kb .zip]
Number of Meanings, Number of Senses, and Part-of-speech frequencies from the Wordsmyth Online Dictionary
Data from all 47,909 single-word and phrase entries in the Wordsmyth online dictionary (Excel version). Additional details regarding the parsing and extraction process are available from the eDom website. [download 1 mb .zip]
Wall Street Journal Word Frequency Database (English, USA)
Word frequencies, in words per million, as extracted from a corpus derived from the Wall Street Journal, in Excel and .txt formats. [download 3 mb .zip]
Archive of University of Alberta Norms of Relative Meaning Frequency for 566 Homographs by Twilley, Dixon, Taylor, & Clark (1994)
OTHER LANGUAGES:
EsPal: (Spanish, Spain) -- Expanded
Text and Excel versions of an expanded version of the EsPal word property database, as originally reported in Duchon, Perea, Sebastian-Galles, Marti, & Carreiras (2013). EsPal: One-stop shopping for Spanish word properties. Behavior Research Methods. DOI: 10.3758/s13428-013-0326-1
Lexique 3 (French, France) -- Expanded
Text and Excel versions of an expanded Lexique 3 database, as originally reported by New, B. (2006). Lexique 3: Une nouvelle base de données lexicales. Actes de la Conférence Traitement Automatique des Langues Naturelles (TALN 2006), Avril 2006, Louvain, Belgique.
NORM PROCESSING PROGRAMS
Calculating Orthographic/Phonological Neighbourhoods (Coltheart's N)
coltheartN.py [download 2kb .zip]
This program calculates the number of words in <wordCorpus> that are only one letter different from each word in <targetList>. Unicode support has been added so that the program works on non-English characters such as French accents, in which case accented letters are considered to be different letters. Assuming a phonological coding scheme in which each phoneme is coded as a single character, this program can also be used to calculate phonological neighbourhoods.
Requirements: Python 2.7.x
Usage:
python coltheartN.py <wordCorpus> <targetList> <cast lowercase: 0/1> <outfile>
Output:
<outfile> will contain the following information for each word in targetList, with each entry placed on a separate line:
<targetWord> <N> <If neighbours exist, pipe-delimited list of neighbours>
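For readers who want to see the logic at a glance, here is a minimal sketch of the neighbourhood computation (one-substitution neighbours among same-length corpus words). It is an illustration written for this page, not the distributed coltheartN.py, and it assumes plain one-word-per-line input files.

import io
import sys

def neighbours(word, corpus):
    # Corpus words of the same length that differ at exactly one position.
    return [c for c in corpus
            if len(c) == len(word) and c != word
            and sum(a != b for a, b in zip(word, c)) == 1]

def main(word_corpus, target_list, lowercase, outfile):
    with io.open(word_corpus, encoding="utf-8") as f:
        corpus = [w.strip() for w in f if w.strip()]
    with io.open(target_list, encoding="utf-8") as f:
        targets = [w.strip() for w in f if w.strip()]
    if lowercase:
        corpus = [w.lower() for w in corpus]
        targets = [w.lower() for w in targets]
    with io.open(outfile, "w", encoding="utf-8") as out:
        for t in targets:
            ns = neighbours(t, corpus)
            out.write(u"%s %d %s\n" % (t, len(ns), u"|".join(ns)))

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2], sys.argv[3] == "1", sys.argv[4])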
Calculating Orthographic/Phonological Onset and Offset Neighbourhoods (restricted Coltheart's N)
onoffColtN.py [download 1kb .zip]
This program calculates the number of words in <wordCorpus> that differ from each word in <targetList> by only their first (onset) letter or only their last (offset) letter. Unicode support has been added so that the program works on non-English characters such as French accents; accented letters are considered to be different letters. Assuming a phonological coding scheme in which each phoneme is coded as a single character, this program can also be used to calculate phonological neighbourhoods.
Requirements: Python 2.7.x
Usage:
python onoffColtN.py <wordCorpus> <targetList> <cast lowercase: 0/1> <outfile>
Output:
<outfile> will contain the following information for each word in targetList, with each entry placed on a separate line:
<targetWord> <Non> <If neighbours, pipe delimited list of onset neighbours> <Noff> <If neighbours, pipe delimited list of offset neighbours>
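As with the plain Coltheart's N script, a small sketch of the restriction may help; this is illustrative code written for this page (not onoffColtN.py), comparing words position by position.

def restricted_neighbours(word, corpus):
    # Onset neighbours differ from the target only at the first position;
    # offset neighbours differ only at the last position.
    onset, offset = [], []
    for cand in corpus:
        if len(cand) != len(word) or cand == word:
            continue
        diffs = [i for i in range(len(word)) if word[i] != cand[i]]
        if diffs == [0]:
            onset.append(cand)
        elif diffs == [len(word) - 1]:
            offset.append(cand)
    return onset, offset

# Example: for "cat" with corpus ["bat", "hat", "cap", "car", "cot"],
# the onset neighbours are ["bat", "hat"] and the offset neighbours ["cap", "car"].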
Calculating Summed Token (or Type) Length-Specific Positional Bigram Frequency, Letter (Unigram) Frequency, and Bigram and Letter Legality
posBigramUniLegalBigramUni.py [download 2kb .zip]
This program calculates token-based summed positional bigram frequencies and unigram (letter) frequencies, and also reports whether a given string contains only bigrams and unigrams that are represented in <wordCorpus> (i.e., whether they are legal). Note that if a list of unique words is used, such as the entries of a word frequency corpus, then this program will effectively calculate type as opposed to token statistics. This program supports the use of non-English characters such as French accents, in which case these additional characters are treated in the same way as different letters.
Requirements: Python 2.7.x
Usage:
python posBigramUniLegalBigramUni.py <wordCorpus> <targetList> <cast lowercase: 0/1> <outfile>
Output:
<outfile> will contain the following information for each word in targetList, with each entry placed on a separate line:
<targetWord> <sumPosBi> <legalBi 1/0> <sumUni> <legalUni 1/0>
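A schematic sketch of how such statistics can be computed follows; it was written for this page rather than taken from posBigramUniLegalBigramUni.py. It assumes counts are length- and position-specific and counts each corpus word once (so a plain word list yields type statistics; weighting each word by its token frequency would yield token statistics).

from collections import defaultdict

def build_tables(corpus_words):
    # Counts are specific to word length and position within the word.
    bigram_freq = defaultdict(int)   # key: (length, position, bigram)
    unigram_freq = defaultdict(int)  # key: (length, position, letter)
    for w in corpus_words:
        L = len(w)
        for i in range(L):
            unigram_freq[(L, i, w[i])] += 1
        for i in range(L - 1):
            bigram_freq[(L, i, w[i:i+2])] += 1
    return bigram_freq, unigram_freq

def score(word, bigram_freq, unigram_freq):
    L = len(word)
    uni = [unigram_freq[(L, i, word[i])] for i in range(L)]
    bi = [bigram_freq[(L, i, word[i:i+2])] for i in range(L - 1)]
    # Summed frequencies, plus 1/0 legality flags (every unit attested at least once).
    return sum(bi), int(all(bi)), sum(uni), int(all(uni))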
Calculating Clustering Coefficients
clusCoeff.py [download 500kb .zip]
This program calculates the clustering coefficient (cc) among the items in a target item's set of neighbours, as well as the number of connections and the theoretical maximum number of connections between items in the cluster. The source file is bundled with an example for calculating nonword neighbours in English. In principle, this script could be used to calculate neighbourhood clustering coefficients for any type of items, be they orthographic neighbours, phonological neighbours, or otherwise.
Requirements: Python 3.2.x
Usage:
python clusCoeff.py
User parameters to set:
sfn # Sample items to calculate the cc for, and their neighbours. The target item should be separated from the neighbour list by a tab, and the neighbour items should be separated by semi-colons.
pfn # Population of items used to look up the neighbours of the neighbours in the sfn. This file follows the same format as the sample items file.
ofn # File to write the output to.
Output:
<ofn> will contain the following data for each sample item, one item per line:
<item> <number of connections> <max possible connections> <cc>
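The core calculation can be sketched as follows; this is an illustration written for this page (not clusCoeff.py), and it assumes the sample and population files have already been read into a dictionary mapping each item to the set of its neighbours.

def clustering_coefficient(target, neighbours_of):
    # neighbours_of: dict mapping an item to the set of its neighbours.
    nbrs = list(neighbours_of.get(target, set()))
    k = len(nbrs)
    max_links = k * (k - 1) // 2      # theoretical maximum number of connections
    if max_links == 0:
        return 0, 0, 0.0
    links = sum(1
                for i in range(k)
                for j in range(i + 1, k)
                if nbrs[j] in neighbours_of.get(nbrs[i], set()))
    return links, max_links, float(links) / max_links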
AUDIO PROCESSING
CheckVocal_mod.zip [download 5mb .zip]
This modified version of CheckVocal (Protopapas, 2007, BRM) looks for a .wav file called 'tmp-aud.wav' in ./tmp-aud and prints the time, in milliseconds, at which speech onset was detected in the file, or 0 if no onset was found (or, technically, if the onset was exactly zero, but for experimental purposes, starting the recording on stimulus onset precludes this). All of the default parameters from the CheckVocal software are used. This is the same software used in DMDX to detect onsets, but is suitable for command-line batch processing of wav files.
Full credit for this software should be attributed to
Protopapas, A. (2007). CheckVocal: A program to facilitate checking the accuracy and response time of vocal responses from DMDX. Behavior Research Methods, 39(4), 859-862.
This modification is just a rough hack of that software to automate the detection of an onset for a single file without needing to work through the GUI, so that it can be used for automated batch processing from other software.
An easy way to interface this code with any other piece of software is to move the target audio file to ./tmp-aud/tmp-aud.wav, externally execute the CheckVocal.bat file, and capture the output.
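For example, a rough sketch of that interface in Python might look as follows; the paths mirror those above, and the assumption that the batch file prints nothing but the onset time is mine.

import shutil
import subprocess

def detect_onset(wav_path):
    # Copy the recording to the location the batch file expects.
    shutil.copy(wav_path, "./tmp-aud/tmp-aud.wav")
    # Run the modified CheckVocal and capture whatever it prints.
    out = subprocess.check_output("CheckVocal.bat", shell=True)
    # Assumes the only output is the onset time in ms (0 if none detected).
    return int(out.decode("ascii").strip())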
Including start-up time, the program takes roughly as long to run as the length of the recording for short files in the 1-2 second range (it can be faster, but longer run times occur if the onset is not found until later in the file). Thus, this modified code is suitable for automated on-line speech detection in many psychological experiments (e.g., word naming).
If you want to do more sophisticated analyses, the actual CheckVocal software (available from http://users.uoa.gr/~aprotopapas/CV/checkvocal.html, and archived along with this modified version in both source and binary forms; see ./CheckVocal_exe_src_orig) may be more suitable for your purposes.
Requirements: Python 2.x and the snack libraries (confirmed working with snack2210-py). Follow the instructions in the snack_x.zip file in ./lib to install them.
Usage:
CheckVocal.bat
Input:
<Audio file in ./tmp-aud, named tmp-aud.wav>
Output:
Time, in ms, at which speech onset was detected in the file, or 0 if no onset was detected (technically, it will also return zero if onset was detected at the very start of the file, but starting a recording at stimulus onset in an experiment precludes this from being a valid response).