I am trying to process user-entered text by removing stopwords with the NLTK toolkit, but stopword removal also removes words like 'and', 'or' and 'not'. I want these words to remain after stopword removal, because they are operators that are required later when the text is processed as a query. I don't know which words can act as operators in a text query, and I also want to remove unnecessary words from my text.
There is a built-in stopword list in NLTK made up of 2,400 stopwords for 11 languages (Porter et al), see http://nltk.org/book/ch02.html
I recommend looking at using tf-idf to remove stopwords, see Effects of Stemming on the term frequency?
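One possible reading of the tf-idf suggestion is to treat terms that occur in (almost) every document as stopwords, since their inverse document frequency is close to zero. The sketch below is only an illustration under that assumption: the function name low_idf_terms, the threshold and the sample documents are all hypothetical, and it uses plain whitespace splitting instead of an NLTK tokenizer.

```python
import math
from collections import Counter

def low_idf_terms(documents, threshold):
    """Return the terms whose inverse document frequency is <= threshold."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each term once per document (document frequency).
        doc_freq.update(set(doc.lower().split()))
    idf = {term: math.log(n_docs / count) for term, count in doc_freq.items()}
    return {term for term, value in idf.items() if value <= threshold}

# Hypothetical toy corpus: only "the" occurs in every document.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
corpus_stopwords = low_idf_terms(docs, threshold=0.0)
```

Operator words such as 'and', 'or' and 'not' could still be removed by this kind of filter if they are frequent, so they would need to be excluded explicitly.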
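@alvas's answer does the job, but it can be done much faster. Assume that you have documents, a list of strings, and that stop_words is a set built from the stopword list; each document is tokenized and a token is kept only if it is not in stop_words.
Notice that because the lookup here is in a set (not in a list), it would theoretically be about len(stop_words)/2 times faster: a list scan needs len(stop_words)/2 comparisons on average to find a word, while a set lookup takes roughly constant time. This is significant if you need to process many documents. For 5000 documents of approximately 300 words each, the difference is 1.8 seconds for my example versus 20 seconds for @alvas's.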
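The set-based filtering described above can be sketched as follows. The original snippet is not preserved here, so this is a reconstruction under assumptions: the small stop_words set is a hypothetical stand-in for set(nltk.corpus.stopwords.words('english')), and the regular expression mimics the pattern used by NLTK's wordpunct_tokenize so that the example runs without downloading any NLTK data.

```python
import re

# Hypothetical stand-in for set(nltk.corpus.stopwords.words('english')),
# kept tiny so the sketch runs without the NLTK corpus.
stop_words = {"the", "a", "is", "on", "and", "or", "not", "of", "to"}

# Same pattern as NLTK's wordpunct_tokenize: runs of word characters,
# or runs of non-space punctuation.
TOKEN_RE = re.compile(r"\w+|[^\w\s]+")

def remove_stopwords(doc, stop_words):
    """Keep only the tokens whose lowercase form is not a stopword."""
    # Membership tests on a set take roughly constant time on average.
    return [tok for tok in TOKEN_RE.findall(doc) if tok.lower() not in stop_words]

documents = ["The cat is on the mat", "Dogs and cats"]
filtered = [remove_stopwords(doc, stop_words) for doc in documents]
```

The important detail is that stop_words is converted to a set once, outside the loop over documents.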
P.S. In most cases you need to split the text into words anyway, to perform other classification tasks for which tf-idf is used. So it would most probably be better to use a stemmer as well (for example a Porter stemmer bound to the name porter), and to use
[porter.stem(i.lower()) for i in wordpunct_tokenize(doc) if i.lower() not in stop_words]
inside of a loop.
I suggest you create your own list of operator words and take them out of the stopword list. Sets can be conveniently subtracted, so removing the operators from the stopword set is a single operation.
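A minimal sketch of that subtraction, assuming the operators from the question ('and', 'or', 'not'). The stop_words set is again a small hypothetical stand-in for the NLTK list, and clean_query is an illustrative helper name, not part of NLTK.

```python
# Hypothetical stand-in for set(nltk.corpus.stopwords.words('english')).
stop_words = {"the", "a", "is", "and", "or", "not", "of", "in"}

# Words that act as query operators and must survive stopword removal.
operators = {"and", "or", "not"}

# Set difference: every stopword except the operators.
stop = stop_words - operators

def clean_query(text):
    """Drop stopwords from a whitespace-split query but keep the operators."""
    return [word for word in text.lower().split() if word not in stop]

result = clean_query("the cat and not the dog")
```

Because operators is a separate set, it can be extended (for example with 'near' or 'xor') without touching the stopword list.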
Then you can simply test whether a word is in or not in the set, without relying on whether your operators are part of the stopword list. You can later switch to another stopword list or add an operator.
You can combine string.punctuation with the built-in NLTK stopword list.
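A sketch of combining string.punctuation with a stopword set. As an assumption for illustration, a short hand-written stop_words set replaces the NLTK list and a simple regular expression replaces an NLTK tokenizer; it emits each punctuation character as its own token so that it can be matched against string.punctuation.

```python
import re
import string

# Hypothetical stand-in for set(nltk.corpus.stopwords.words('english')).
stop_words = {"this", "is", "a", "the", "of"}

# Union of stopwords and single punctuation characters.
stop = stop_words | set(string.punctuation)

# Words, or single punctuation characters (one token per character).
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def strip_stopwords_and_punctuation(text):
    """Lowercase, tokenize, and drop stopwords and punctuation tokens."""
    return [tok for tok in TOKEN_RE.findall(text.lower()) if tok not in stop]

tokens = strip_stopwords_and_punctuation("Hello, world! This is a test.")
```

Note that string.punctuation only covers ASCII punctuation, so characters such as typographic quotes would pass through this filter.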
NLTK stopwords complete list
@alvas has a good answer. But again, it depends on the nature of the task. For example, if in your application you want to treat all conjunctions (e.g. and, or, but, if, while) and all determiners (e.g. the, a, some, most, every, no) as stop words, and consider all other parts of speech legitimate, then you might want to look into this solution, which uses a part-of-speech tagset to discard words (check Table 5.1 of the tagset reference).
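The part-of-speech approach can be sketched as a filter over (word, tag) pairs. Everything here is an assumption for illustration: the tag names CONJ and DET follow the universal tagset that nltk.pos_tag(tokens, tagset='universal') produces, the sample sentence is tagged by hand so the example runs without a tagger model, and keep_words reflects the operators from the question.

```python
# Tags to discard; names assume the universal tagset (CONJ, DET).
DISCARD_TAGS = {"CONJ", "DET"}

# Query operators that must be kept even if their tag is discarded.
KEEP_WORDS = {"and", "or", "not"}

def filter_by_pos(tagged, discard_tags=DISCARD_TAGS, keep_words=KEEP_WORDS):
    """Keep a word if it is an operator or its tag is not discarded."""
    return [
        word
        for word, tag in tagged
        if word.lower() in keep_words or tag not in discard_tags
    ]

# Hand-tagged sample; in practice the pairs would come from a POS tagger.
tagged = [
    ("the", "DET"), ("cat", "NOUN"), ("and", "CONJ"), ("every", "DET"),
    ("dog", "NOUN"), ("but", "CONJ"), ("sleeps", "VERB"),
]
kept = filter_by_pos(tagged)
```

A real tagger may label words such as 'if' or 'while' differently from the table the answer refers to, so the discard set should be checked against the tagger's actual output.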