Usually word lists are one file that contains everything, but are there separately downloadable noun lists, verb lists, adjective lists, etc.?
I need them for English specifically.
As others have suggested, the WordNet database files are a great source for parts of speech. That said, the examples used to extract the words aren't entirely correct. Each line is actually a "synonym set" consisting of multiple synonyms and their definition. Around 30% of words only appear as synonyms, so simply extracting the first word misses a large amount of data.
The line format is pretty simple to parse (search.c, function parse_synset), but if all you're interested in are the words, the relevant part of the line is formatted as:
These correspond to:
For example, from data.adj:
s, corresponding to adjective (wnutil.c, function getpos)
cut with lexical ID 0
shortened with lexical ID 0
A short Perl script to simply dump the words from the data.* files:
A gist of the above script can be found here.
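For anyone who would rather not use Perl, here is a rough Python sketch of the same idea: read every word of a synonym set, not just the first one. It assumes the data.* layout documented in WordNet's wndb(5WN) man page (offset, file number, part of speech, a two-digit hexadecimal word count, then word / lexical ID pairs). The function names and the sample line are made up for illustration; the offset and gloss in the sample are placeholders, not real WordNet data.

```python
import re

# Adjectives may carry a syntactic marker such as (p), (a) or (ip) on the word.
SYNTACTIC_MARKER = re.compile(r"\((?:p|a|ip)\)$")


def parse_synset_words(line):
    """Return (part_of_speech, [(word, lexical_id), ...]) for one data.* line.

    Returns None for the indented license header and for blank lines.
    """
    if line.startswith("  "):  # license header lines begin with two spaces
        return None
    fields = line.rstrip("\n").split(" ")
    if len(fields) < 6:
        return None
    part_of_speech = fields[2]
    word_count = int(fields[3], 16)  # the word count is hex encoded
    words = []
    for i in range(word_count):
        raw_word = fields[4 + 2 * i]
        lexical_id = int(fields[5 + 2 * i], 16)
        word = SYNTACTIC_MARKER.sub("", raw_word).replace("_", " ")
        words.append((word, lexical_id))
    return part_of_speech, words


def dump_words(path):
    """Yield every word found in a data.* file (hypothetical helper)."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parsed = parse_synset_words(line)
            if parsed is not None:
                for word, _lexical_id in parsed[1]:
                    yield word


if __name__ == "__main__":
    # Illustrative line only: the offset and gloss are placeholders.
    sample = "00000000 00 s 02 cut 0 shortened 0 000 | placeholder gloss"
    print(parse_synset_words(sample))
```

Calling dump_words on data.adj, data.adv, data.noun and data.verb in turn would give one word stream per part of speech.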
A more robust parser which stays true to the original source can be found here.
Both scripts are used in a similar fashion: ./wordnet_parser.pl DATA_FILE.

This is a highly ranked Google result, so I'm digging up this 2 year old question to provide a far better answer than the existing one.
The "Kevin's Word Lists" page provides old lists from the year 2000, based on WordNet 1.6.
You are far better off going to https://wordnet.princeton.edu/download/current-version and downloading WordNet 3.0 (the Database-only version) or whatever the latest version is when you're reading this.
Parsing it is very simple; just apply a regex of "/^(\S+?)[\s%]/" to grab every word, and then replace all "_" (underscores) in the results with spaces. Finally, dump your results to whatever storage format you want. You'll be given separate lists of adjectives, adverbs, nouns, verbs and even a special (very useless/useful depending on what you're doing) list called "senses". Despite the name, that one is an index of word senses (the distinct meanings of each word) rather than anything to do with smell, sight, hearing, etc.

Enjoy! Remember to include their copyright notice if you're using it in a project.
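A minimal Python sketch of that regex approach, in case it helps. It assumes you run it over the index.* files from the database download, where each entry line starts with the word and the license header lines start with spaces (so the regex skips them by itself). The function names are made up, and the de-duplication is my own addition, since a word can appear on several lines.

```python
import re

# Same pattern as above: the leading run of non-space characters,
# stopping at the first whitespace or "%" character.
LEMMA = re.compile(r"^(\S+?)[\s%]")


def extract_lemmas(lines):
    """Return the distinct words found at the start of each line, in order."""
    seen = set()
    words = []
    for line in lines:
        match = LEMMA.match(line)
        if match is None:  # header lines start with spaces and never match
            continue
        word = match.group(1).replace("_", " ")
        if word not in seen:
            seen.add(word)
            words.append(word)
    return words


def extract_from_file(path):
    """Read one index file and return its words (hypothetical helper)."""
    with open(path, encoding="utf-8") as handle:
        return extract_lemmas(handle)
```

Something like extract_from_file("index.noun") would then give the noun list, and likewise for the other index files.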
See Kevin's word lists, particularly the "Part Of Speech Database." You'll have to do some minimal text-processing on your own, in order to get the database into multiple files for yourself, but that can be done very easily with a few grep commands.

The license terms are available on the "readme" page.
If you download just the database files from wordnet.princeton.edu/download/current-version you can extract the words by running these commands:
Or if you only want single words (no underscores)
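If shell one-liners aren't your thing, here is a rough Python sketch that does the same job in one pass: it writes one word list per part of speech and can optionally keep only single words. It assumes the four data.* files sit in one directory and follow the documented layout (two-digit hex word count in the fourth field, followed by word / lexical ID pairs). The function names and output file names are my own invention.

```python
from pathlib import Path

# Database file for each part of speech, as shipped in the WordNet download.
POS_FILES = {
    "noun": "data.noun",
    "verb": "data.verb",
    "adj": "data.adj",
    "adv": "data.adv",
}


def synset_words(line):
    """Return all words of one data.* line, with underscores left in place."""
    if line.startswith("  "):  # indented license header
        return []
    fields = line.split()
    if len(fields) < 6:
        return []
    count = int(fields[3], 16)  # hex encoded number of words
    # Words sit at every second field; drop markers such as "(p)".
    # Assumption: no word contains "(" apart from such a marker.
    return [word.split("(")[0] for word in fields[4:4 + 2 * count:2]]


def write_lists(dict_dir, out_dir, single_only=False):
    """Write noun.txt, verb.txt, adj.txt and adv.txt; return word counts."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = {}
    for pos, file_name in POS_FILES.items():
        source = Path(dict_dir) / file_name
        if not source.exists():
            continue
        words = set()
        with source.open(encoding="utf-8") as handle:
            for line in handle:
                words.update(synset_words(line))
        if single_only:  # multi-word entries contain underscores
            words = {word for word in words if "_" not in word}
        target = out_path / f"{pos}.txt"
        target.write_text("\n".join(sorted(words)) + "\n", encoding="utf-8")
        written[pos] = len(words)
    return written
```

Passing single_only=True corresponds to the "no underscores" variant.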
http://icon.shef.ac.uk/Moby/mpos.html
Each part-of-speech vocabulary entry consists of a word or phrase field followed by a field delimiter of × (ASCII 215) and the part-of-speech field that is coded using the following ASCII symbols (case is significant):
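To split a file in that format into one list per part of speech, something along these lines should work. This is only a sketch: it assumes the file is read as Latin-1 (so that byte 215 decodes to "×"), that every character in the part-of-speech field is a separate code, and that a word can carry several codes. The function names are hypothetical, and the code groups words by the raw code character without interpreting it.

```python
from collections import defaultdict

DELIMITER = "\u00d7"  # character 215, the field delimiter described above


def parse_entry(line):
    """Split one entry into (word, codes); return None for malformed lines."""
    word, separator, codes = line.rstrip("\r\n").partition(DELIMITER)
    if not separator or not word:
        return None
    return word, codes


def group_by_code(lines):
    """Map each part-of-speech code character to the words that carry it."""
    groups = defaultdict(list)
    for line in lines:
        entry = parse_entry(line)
        if entry is None:
            continue
        word, codes = entry
        for code in codes:  # assumption: one character per code
            groups[code].append(word)
    return dict(groups)


def load_groups(path):
    """Read a part-of-speech file from disk (hypothetical helper)."""
    with open(path, encoding="latin-1") as handle:
        return group_by_code(handle)
```

Each key of the resulting dictionary can then be written out as its own word list.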