Tools


This page contains a broad set of tools that I have found useful in my research. Both tools that I have created myself and tools that have been shared with me by other researchers are included here for the purpose of rapid dissemination from a central location.

Unless otherwise noted, all of the materials that I have created from the ground up are released under the GPL v3 and a Creative Commons licence (CC BY 2.0). The rest remain the property of their respective owners.

List of Programs, Scripts, and Norms

PUBLISHED SOFTWARE AND NORMS:

Norms

ENGLISH:

UNION: (English, USA). Merger of SUBTL and the CMU Pronouncing Dictionary -- Expanded
Homophones and Homographs from UNION (Merged SUBTL and CMU Pronouncing Dictionary)
Number of Meanings, Number of Senses, and Part-of-speech frequencies from the Wordsmyth Online Dictionary
Wall Street Journal Word Frequency Counts
Archive of University of Alberta Norms of Relative Meaning Frequency for 566 Homographs by Twilley, Dixon, Taylor, & Clark (1994)

OTHER LANGUAGES:

Norm Processing Programs

Calculating Orthographic/Phonological Neighbourhoods (Coltheart's N)
Calculating Orthographic/Phonological Onset and Offset Neighbourhoods (restricted Coltheart's N)
Calculating Summed Token (or Type) Length-Specific Positional Bigram Frequency, Letter (Unigram) Frequency, and Bigram and Letter Legality
Calculating Clustering Coefficients

Audio Processing

CheckVocal_mod: Automatic voice onset detection

Norms

ENGLISH:

UNION: (English, USA). Merger of SUBTL and the CMU Pronouncing Dictionary -- Expanded

The UNION database was created by taking the union of all of the words in the SUBTL word frequency norms with frequencies greater than or equal to 1 (Brysbaert & New, 2009) and all of the words with pronunciations in a syllabified version of the CMU Pronouncing Dictionary (Bartlett, Kondrak, & Cherry, 2009). Several additional properties have also been added for all of the words in the database.

Homophones and Homographs from UNION (Merged SUBTL and CMU Pronouncing Dictionary, English, USA)

Homophones and homonyms (with and without stress information) for all of the items with word frequencies equal to or greater than 1 in the union of the SUBTL norms and the CMU Pronouncing Dictionary (see above). Note that a portion of these entries, in particular those from the version that is sensitive to stress information, reflect dialect differences. [download 40 kb .zip]
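As a rough illustration of how such lists can be derived, the sketch below groups entries by pronunciation (to find homophones) and by spelling (to find homographs), optionally ignoring stress. It is not the code used to build the download. It assumes CMU-style phoneme strings in which stress is marked by a trailing digit on the vowel, and the function name and the toy entries are hypothetical.

```python
import re
from collections import defaultdict


def find_homophones_and_homographs(entries, use_stress=True):
    """Group (spelling, pronunciation) pairs into homophones and homographs.

    Assumes CMU-style pronunciations such as "P EH1 R", where the digit
    on a vowel marks stress. This is an illustrative assumption.
    """
    by_pron = defaultdict(set)      # pronunciation -> spellings
    by_spelling = defaultdict(set)  # spelling -> pronunciations
    for spelling, pron in entries:
        if not use_stress:
            pron = re.sub(r"\d", "", pron)  # "EH1" -> "EH"
        by_pron[pron].add(spelling)
        by_spelling[spelling].add(pron)
    # Homophones: one pronunciation shared by several spellings.
    homophones = {p: sorted(s) for p, s in by_pron.items() if len(s) > 1}
    # Homographs: one spelling with several pronunciations.
    homographs = {s: sorted(p) for s, p in by_spelling.items() if len(p) > 1}
    return homophones, homographs


# Toy entries for illustration only (not taken from the UNION database).
toy_entries = [
    ("pair", "P EH1 R"),
    ("pear", "P EH1 R"),
    ("record", "R EH1 K ER0 D"),
    ("record", "R IH0 K AO1 R D"),
    ("insight", "IH1 N S AY2 T"),
    ("incite", "IH2 N S AY1 T"),
]
with_stress, _ = find_homophones_and_homographs(toy_entries, use_stress=True)
without_stress, _ = find_homophones_and_homographs(toy_entries, use_stress=False)
print(with_stress)
print(without_stress)
```

In this toy example, "insight" and "incite" only count as homophones once stress is ignored, which shows why the stress-sensitive and stress-insensitive lists can differ.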

Number of Meanings, Number of Senses, and Part-of-speech frequencies from the Wordsmyth Online Dictionary

Data from all 47909 single-word and phrase entries in the Wordsmyth online dictionary (Excel version). Additional details regarding the parsing and extraction process are available from the eDom website. [download 1 mb .zip]

Wall Street Journal Word Frequency Database (English, USA)

Word frequencies, in words per million, as extracted from a corpus derived from the Wall Street Journal, in Excel and .txt formats. [download 3 mb .zip]

Archive of University of Alberta Norms of Relative Meaning Frequency for 566 Homographs by Twilley, Dixon, Taylor, & Clark (1994)

[Download 61 kb .xlsx]

OTHER LANGUAGES:

EsPal: (Spanish, Spain) -- Expanded

Text and Excel versions of an expanded version of the EsPal word property database, as originally reported in Duchon, Perea, Sebastian-Galles, Marti, & Carreiras (2013). EsPal: One-stop shopping for Spanish word properties. Behavior Research Methods. DOI: 10.3758/s13428-013-0326-1

Lexique 3 (French, France) -- Expanded

Text and Excel versions of an expanded Lexique 3 database, as originally reported by New, B. (2006). Lexique 3: Une nouvelle base de données lexicales. Actes de la Conférence Traitement Automatique des Langues Naturelles (TALN 2006), Avril 2006, Louvain, Belgique.

NORM PROCESSING PROGRAMS

Calculating Orthographic/Phonological Neighbourhoods (Coltheart's N)

coltheartN.py [download 2kb .zip]

This program calculates, for each word in <targetList>, the number of words in <wordCorpus> that differ from it by exactly one letter. Unicode support has been added so that the program works on non-English characters such as French accents, in which case accented letters are considered to be different letters. Assuming a phonological coding scheme where each phoneme is coded as a single character, this program can also be used to calculate phonological neighbourhoods.
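The counting logic described above can be sketched as follows. This is a minimal illustration written for Python 3, not the coltheartN.py script itself. It assumes that neighbours have the same length as the target and differ by a single substitution, and that the target is not counted as its own neighbour. The function name is hypothetical.

```python
from collections import defaultdict


def coltheart_n(corpus_words, targets, lowercase=True):
    """Return {target: number of corpus words differing in exactly one position}."""
    if lowercase:
        corpus_words = [w.lower() for w in corpus_words]
        targets = [t.lower() for t in targets]

    # Index every distinct corpus word under each pattern obtained by
    # deleting one position, keyed by (length, position, remaining letters).
    # Two different words sharing a key differ only at that position.
    patterns = defaultdict(set)
    for word in set(corpus_words):
        for i in range(len(word)):
            patterns[(len(word), i, word[:i] + word[i + 1:])].add(word)

    counts = {}
    for target in targets:
        neighbours = set()
        for i in range(len(target)):
            key = (len(target), i, target[:i] + target[i + 1:])
            neighbours |= patterns.get(key, set())
        neighbours.discard(target)  # a word is not its own neighbour
        counts[target] = len(neighbours)
    return counts


corpus = ["cat", "cot", "cut", "car", "dog", "cart"]
print(coltheart_n(corpus, ["cat", "dog"]))  # {'cat': 3, 'dog': 0}
```

Because Python 3 strings are sequences of Unicode characters, an accented letter such as "é" is compared as a letter distinct from "e", provided the text uses precomposed characters. The same function applies to phonological codes in which each phoneme is a single character.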

Requirements: Python 2.7.x

Usage:
python coltheartN.py <wordCorpus> <targetList> <cast lowercase: 0/1> <outfile>

Output:
<outfile> will contain the following information for each word in targetList, with each entry placed on a separate line:

Calculating Orthographic/Phonological Onset and Offset Neighbourhoods (restricted Coltheart's N)

onoffColtN.py [download 1kb .zip]

This program calculates, for each word in <targetList>, the number of words in <wordCorpus> that differ from it by exactly one letter, where that letter is either the first (onset) or the last (offset) letter. Unicode support has been added so that the program works on non-English characters such as French accents. Accented letters are considered to be different letters. Assuming a phonological coding scheme where each phoneme is coded as a single character, this program can also be used to calculate phonological neighbourhoods.
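A minimal sketch of the restricted count, written for Python 3 and not taken from onoffColtN.py, is given below. It assumes that an onset neighbour shares everything except the first letter, that an offset neighbour shares everything except the last letter, and that the target does not count as its own neighbour. The function name is hypothetical.

```python
from collections import Counter


def onset_offset_n(corpus_words, targets):
    """Return {target: (onset neighbours, offset neighbours)}."""
    corpus = {w for w in corpus_words if w}
    # Words with the same tail (all but the first letter) differ at most in
    # their onset; words with the same head differ at most in their offset.
    by_tail = Counter(w[1:] for w in corpus)
    by_head = Counter(w[:-1] for w in corpus)

    result = {}
    for target in targets:
        if not target:
            result[target] = (0, 0)
            continue
        # Remove the target itself from the tally if it is in the corpus.
        self_count = 1 if target in corpus else 0
        onset = by_tail[target[1:]] - self_count
        offset = by_head[target[:-1]] - self_count
        result[target] = (onset, offset)
    return result


corpus = ["cat", "bat", "hat", "cab", "can", "dog"]
print(onset_offset_n(corpus, ["cat", "rat"]))  # {'cat': (2, 2), 'rat': (3, 0)}
```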

Requirements: Python 2.7.x

Usage:
python onoffColtN.py <wordCorpus> <targetList> <cast lowercase: 0/1> <outfile>

Output:
<outfile> will contain the following information for each word in targetList, with each entry placed on a separate line:

Calculating Summed Token (or Type) Length-Specific Positional Bigram Frequency, Letter (Unigram) Frequency, and Bigram and Letter Legality

posBigramUniLegalBigramUni.py [download 2kb .zip]

This program calculates token-based summed positional bigram frequencies and unigram (letter) frequencies, and reports whether a given string contains only unigrams and bigrams that are represented in <wordCorpus>. Note that if <wordCorpus> is a list of unique words, such as those from a word frequency corpus, then this program will effectively calculate type as opposed to token statistics. This program supports the use of non-English characters such as French accents, in which case these additional characters are treated in the same way as different letters.
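The statistics described above can be sketched as follows. This is an illustration written for Python 3 rather than the posBigramUniLegalBigramUni.py script, and it makes several assumptions that the description leaves open: bigram counts are specific to word length and position, letter counts ignore position and length, and a letter or bigram is "legal" if it occurs anywhere in the corpus. All function and key names are hypothetical.

```python
from collections import Counter


def build_counts(corpus_words):
    """Tally bigram and letter counts; each corpus entry counts once."""
    pos_bigrams = Counter()  # (word length, position, bigram) -> count
    letters = Counter()      # letter -> count (position-independent, assumed)
    bigrams = Counter()      # bigram -> count (used only for legality)
    for word in corpus_words:
        letters.update(word)
        for i in range(len(word) - 1):
            bigram = word[i:i + 2]
            pos_bigrams[(len(word), i, bigram)] += 1
            bigrams[bigram] += 1
    return pos_bigrams, letters, bigrams


def describe(target, pos_bigrams, letters, bigrams):
    """Summed frequencies and legality flags for one target string."""
    n = len(target)
    target_bigrams = [target[i:i + 2] for i in range(n - 1)]
    return {
        "summed_pos_bigram_freq": sum(
            pos_bigrams[(n, i, bg)] for i, bg in enumerate(target_bigrams)
        ),
        "summed_letter_freq": sum(letters[ch] for ch in target),
        "legal_letters": all(letters[ch] > 0 for ch in target),
        "legal_bigrams": all(bigrams[bg] > 0 for bg in target_bigrams),
    }


# A corpus with repeated entries gives token counts; a list of unique
# words gives type counts.
counts = build_counts(["cat", "cat", "cot", "cart"])
print(describe("cat", *counts))
print(describe("cta", *counts))
```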

Requirements: Python 2.7.x

Usage:

python posBigramUniLegalBigramUni.py <wordCorpus> <targetList> <cast lowercase: 0/1> <outfile>

Output:

<outfile> will contain the following information for each word in targetList, with each entry placed on a separate line:

Calculating Clustering Coefficients

clusCoeff.py [download 500kb .zip]

This program calculates the clustering coefficient (cc) among the neighbours of a target item, that is, the proportion of possible connections between those neighbours that actually exist. It also reports the number of connections and the theoretical maximum number of connections between items in the cluster. The source file is bundled with an example for calculating nonword neighbours in English. In principle, this script could be used to calculate neighbourhood clustering coefficients for any type of items, be they orthographic neighbours, phonological neighbours, or otherwise.

Requirements: Python 3.2.x

Usage:
    python clusCoeff.py

User parameters to set:
   sfn # Sample items to calculate the cc for, and their neighbours. The target
       # item should be separated from the neighbour list by a tab, and the
       # neighbour items should be separated by semi-colons.
   pfn # Population of items used to look up the neighbours of the neighbours
       # in the sfn. File follows the same format as for the sample items.
   ofn # File to write the output to.

Output:
   <ofn> will contain the following data for each sample item, one item per line:
       <item> <number of connections> <max possible connections> <cc>
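The calculation can be sketched as follows. This is a minimal illustration, not the clusCoeff.py source. It assumes that connections are undirected, that the maximum number of connections among k neighbours is k(k-1)/2, and that the cc is reported as 0 when an item has fewer than two neighbours. The function names are hypothetical, and the line format follows the sfn/pfn description above.

```python
from itertools import combinations


def parse_line(line):
    """Split 'item<TAB>n1;n2;n3' into (item, [n1, n2, n3])."""
    item, _, neighbours = line.rstrip("\n").partition("\t")
    return item, [n for n in neighbours.split(";") if n]


def clustering_coefficient(neighbours, population):
    """Return (connections, max possible connections, cc).

    neighbours: the neighbours of one sample item.
    population: {item: set of its neighbours}, used to check whether
                two neighbours are themselves neighbours of each other.
    """
    unique = sorted(set(neighbours))
    k = len(unique)
    max_connections = k * (k - 1) // 2
    connections = sum(
        1
        for a, b in combinations(unique, 2)
        if b in population.get(a, ()) or a in population.get(b, ())
    )
    # Assumption: define cc as 0 when no connection is possible.
    cc = connections / max_connections if max_connections else 0.0
    return connections, max_connections, cc


item, neighbours = parse_line("cat\tbat;hat;cot\n")
population = {
    "bat": {"cat", "hat"},
    "hat": {"cat", "bat"},
    "cot": {"cat", "cut"},
}
print(item, *clustering_coefficient(neighbours, population))
```

In this toy example, only "bat" and "hat" are neighbours of each other, so one of the three possible connections exists and the cc is one third.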

CheckVocal_mod

CheckVocal_mod.zip [download 5mb .zip]

This modified version of CheckVocal (Protopapas, 2007, BRM) looks for a .wav file called 'tmp-aud.wav' in ./tmp-audio and prints out the offset, in milliseconds, when speech was detected in the file, or 0 if no offset was found (or technically, if the offset was exactly zero, but for experimental purposes, starting the recording on stimulus onset precludes this). All of the default parameters from the CheckVocal software are used. This is the same software used in DMDX to detect onsets, but suitable for command line batch processing of wav files.

Full credit for this software should be attributed to

Protopapas, A. (2007). CheckVocal: A program to facilitate checking the accuracy and response time of vocal responses from DMDX. Behavior Research Methods, 39(4), 859-862.

This modification is just a rough hack of that software to automate the detection of an onset for a single file without needing to work through the GUI, and thus be able to use it for automated batch processing in other software.

An easy way to interface this code with any other piece of software is to move the target audio file to ./tmp-aud/tmp-aud.wav, then externally execute the CheckVocal.bat file and capture its output.
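One way this could look from Python 3 is sketched below. It assumes that the batch file prints the detected onset as a number on standard output; the exact output format is not documented here, so the parsing is a guess. The function name, the example file name, and the directory variable are hypothetical.

```python
import re
import subprocess


def detect_onset_ms(command, cwd=None):
    """Run an external command and return the first number it prints.

    Assumption: the command writes the onset time in ms to standard output.
    """
    completed = subprocess.run(
        command, cwd=cwd, capture_output=True, text=True, check=True
    )
    match = re.search(r"-?\d+(?:\.\d+)?", completed.stdout)
    if match is None:
        raise ValueError("no number found in output: %r" % completed.stdout)
    return float(match.group())


# Hypothetical usage on Windows, with checkvocal_dir pointing at the
# unzipped CheckVocal_mod folder and trial01.wav as the recording:
#
#   import os, shutil
#   shutil.copyfile("trial01.wav",
#                   os.path.join(checkvocal_dir, "tmp-aud", "tmp-aud.wav"))
#   onset = detect_onset_ms(["cmd", "/c", "CheckVocal.bat"], cwd=checkvocal_dir)
#   if onset == 0:
#       print("no speech onset detected")
```

Passing the command as a parameter keeps the parsing logic testable without the batch file and makes it easy to adapt if the installed layout differs.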

Including start-up time, the program takes roughly as long to run as the recording itself for short files in the 1-2 second range (it can be faster, but longer run times occur if the offset is not found until later in the file). Thus, this modified code is suitable for automated on-line speech detection in many psychological experiments (e.g., word naming).

If you want to do more sophisticated analyses, the actual CheckVocal software (available from http://users.uoa.gr/~aprotopapas/CV/checkvocal.html, and archived along with this modified version in both src and binary forms, see ./CheckVocal_exe_src_orig) may be more suitable for your purposes.

Requirements: Python 2.x and the snack libraries (confirmed working with snack2210-py). Follow the instructions in the snack_x.zip file in ./lib to install them.

Usage:

CheckVocal.bat

Input:

Output:

Time, in ms, at which speech onset was detected in the file, or 0, if no onset was detected (technically, it will also return zero if onset was detected at the start of the file, but starting a recording at stimulus onset in an experiment precludes this from being a valid response).

Blair Armstrong, 2011-