-->

What is the difference between corpus and lexicon

2019-05-04 10:45发布

问题:

Can someone tell me the difference between a Corpora ,corpus and lexicon in NLTK ?

What is the movie data set ?

what is Wordnet ?

回答1:

Corpora is the plural for corpus.

Corpus basically means a body, and in the context of Natural Language Processing (NLP), it means a body of text.

(source: https://www.google.com.sg/search?q=corpora)


Lexicon is a vocabulary, a list of words, a dictionary (source: https://www.google.com.sg/search?q=lexicon)

In NLTK, any lexicon is considered a corpus since a list of words is also a body of text. E.g. a list of stopwords can be found in NLTK corpus API:

>>> from nltk.corpus import stopwords
>>> print stopwords.words('english')
[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'himself', u'she', u'her', u'hers', u'herself', u'it', u'its', u'itself', u'they', u'them', u'their', u'theirs', u'themselves', u'what', u'which', u'who', u'whom', u'this', u'that', u'these', u'those', u'am', u'is', u'are', u'was', u'were', u'be', u'been', u'being', u'have', u'has', u'had', u'having', u'do', u'does', u'did', u'doing', u'a', u'an', u'the', u'and', u'but', u'if', u'or', u'because', u'as', u'until', u'while', u'of', u'at', u'by', u'for', u'with', u'about', u'against', u'between', u'into', u'through', u'during', u'before', u'after', u'above', u'below', u'to', u'from', u'up', u'down', u'in', u'out', u'on', u'off', u'over', u'under', u'again', u'further', u'then', u'once', u'here', u'there', u'when', u'where', u'why', u'how', u'all', u'any', u'both', u'each', u'few', u'more', u'most', u'other', u'some', u'such', u'no', u'nor', u'not', u'only', u'own', u'same', u'so', u'than', u'too', u'very', u's', u't', u'can', u'will', u'just', u'don', u'should', u'now']

The movie review dataset in NLTK (canonically known as Movie Reviews Corpus) is a text dataset of 2k movie reviews with sentiment polarity classification (source: http://www.nltk.org/book/ch02.html)

And it is often used for tutorial purposes for introduction to NLP and sentiment analysis, see http://www.nltk.org/book/ch06.html and nltk NaiveBayesClassifier training for sentiment analysis


WordNet is lexical database for the English language (it's like a lexicon/dictionary with word-to-word relations) (source: https://wordnet.princeton.edu/).

In NLTK, it incorporates the Open Multilingual WordNet (http://compling.hss.ntu.edu.sg/omw/) that allows you to query the words in other languages.

Since it is also a list of words (in this case with many other things included, relations, lemmas, POS, etc.), it's also invoked using nltk.corpus in NLTK.

The canonical idiom to use the wordnet in NLTK is as such:

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('dog')
[Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01'), Synset('chase.v.01')]

The easiest way to understand/learn the NLP jargons and the basics is to go through these tutorial in the NLTK book: http://www.nltk.org/book/