I'm struggling with NLTK stopwords.
Here's my bit of code; could someone tell me what's wrong?
from nltk.corpus import stopwords

def removeStopwords(palabras):
    return [word for word in palabras if word not in stopwords.words('spanish')]

palabras = ''' my text is here '''
If you run a tokenizer first, you compare a list of tokens (words) against the stoplist, so you don't need the re module. I added an extra argument so you can switch between languages.
Let me know if this was useful to you ;)
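A sketch of that approach: in practice you would tokenize with nltk.word_tokenize and fetch the real stoplist with stopwords.words(lang), but here a plain split() and tiny inline stoplists stand in so the example is self-contained (the stoplist contents and sample sentences are my own illustration):

```python
# Tiny inline stoplists standing in for nltk.corpus.stopwords.words(lang)
STOPLISTS = {
    "spanish": {"de", "la", "que", "el", "es", "una"},
    "english": {"the", "a", "an", "is", "of"},
}

def remove_stopwords(texto, lang="spanish"):
    # Tokenize first (nltk.word_tokenize in practice, split() here),
    # then filter each token against the stoplist for the chosen language
    return [t for t in texto.lower().split() if t not in STOPLISTS[lang]]

print(remove_stopwords("esto es una prueba"))         # → ['esto', 'prueba']
print(remove_stopwords("this is a test", "english"))  # → ['this', 'test']
```

The extra lang parameter is what lets the same function serve Spanish, English, or any other language NLTK ships a stoplist for.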
Your problem is that the iterator for a string returns each character, not each word.
For example:
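The character-by-character behavior can be seen directly:

```python
palabras = "mi texto"
# Iterating over a string yields single characters, not words
print([w for w in palabras])
# → ['m', 'i', ' ', 't', 'e', 'x', 't', 'o']
```

So the comprehension in the question checks whether each letter is a Spanish stopword, which is never true.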
You need to iterate over and check each word; fortunately, Python strings already have a split method in the standard library. However, since you are dealing with natural language, which includes punctuation, you should look here for a more robust answer that uses the re module. Once you have a list of words, you should lowercase them all before comparison, and then compare them in the manner you have already shown.
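A minimal sketch of that preprocessing, using re.findall so punctuation is dropped before the comparison (the sample text and pattern here are my own illustration):

```python
import re

texto = "Hola, mundo! Esto es una prueba."
# \w+ matches runs of word characters, so commas and periods are dropped;
# lowercasing makes the later stoplist comparison case-insensitive
tokens = [t.lower() for t in re.findall(r"\w+", texto)]
print(tokens)
# → ['hola', 'mundo', 'esto', 'es', 'una', 'prueba']
```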
Good luck.
EDIT 1
Okay, try this code; it should work for you. It shows two ways to do it. They are essentially identical, but the first is a bit clearer, while the second is more pythonic.
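The code from this edit did not survive in the text, so here is a sketch of the two approaches described, with a small inline stoplist standing in for stopwords.words('spanish') so it runs without the NLTK data download (the stoplist contents are my own illustration):

```python
# Inline stand-in for nltk.corpus.stopwords.words('spanish')
STOPLIST = {"de", "la", "que", "el", "es", "una"}

def remove_stopwords_loop(texto):
    # Way 1: an explicit loop, a bit clearer to read
    resultado = []
    for word in texto.split():
        if word.lower() not in STOPLIST:
            resultado.append(word.lower())
    return resultado

def remove_stopwords_comp(texto):
    # Way 2: a list comprehension, more pythonic but equivalent
    return [w.lower() for w in texto.split() if w.lower() not in STOPLIST]

frase = "Esto es una prueba de la frase"
print(remove_stopwords_loop(frase))  # → ['esto', 'prueba', 'frase']
print(remove_stopwords_comp(frase))  # → ['esto', 'prueba', 'frase']
```

Both lowercase each word before the comparison, which is the step the question's original code was missing along with the split.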
I hope this helps you.