Removing non-English words from text using Python

Posted 2019-04-05 16:45

Question:

I am doing a data-cleaning exercise in Python, and the text I am cleaning contains Italian words that I would like to remove. I have been searching online for whether I could do this in Python using a toolkit like NLTK.

For example, given the text:

"Io andiamo to the beach with my amico."

I would like to be left with :

"to the beach with my" 

Does anyone know how this could be done? Any help would be much appreciated.

Answer 1:

You can use the words corpus from NLTK:

import nltk
nltk.download('words')  # fetch the word list on first use
words = set(nltk.corpus.words.words())

sent = "Io andiamo to the beach with my amico."
" ".join(w for w in nltk.wordpunct_tokenize(sent)
         if w.lower() in words or not w.isalpha())
# 'Io to the beach with my .'

Unfortunately, Io happens to be an English word. In general, it may be hard to decide whether a word is English or not.
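One way around such cross-language homographs is to consult a second word list and drop any token that is also valid in the other language. Here is a minimal, self-contained sketch of the idea, using tiny hand-made stand-ins for real English and Italian word lists (the names `english`, `italian`, and `keep_english_only` are made up for illustration):

```python
# Toy stand-ins for real word lists (e.g. NLTK's words corpus for
# English, a frequency list for Italian). "io" appears in both.
english = {"io", "to", "the", "beach", "with", "my"}
italian = {"io", "andiamo", "amico"}

def keep_english_only(sentence):
    kept = []
    for token in sentence.split():
        word = token.strip(".,!?").lower()  # drop attached punctuation
        # Keep only words that are English and NOT also valid Italian.
        if word in english and word not in italian:
            kept.append(token)
    return " ".join(kept)

print(keep_english_only("Io andiamo to the beach with my amico."))
# to the beach with my
```

The trade-off is that any word shared by both lists is discarded even when it is being used as English; with real word lists, short function words overlap heavily between languages, so this heuristic tends to over-delete.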



Answer 2:

There's a good Python library called PyEnchant (bindings for the Enchant spell checker). It can check whether a word is English.

From their homepage:

>>> import enchant
>>> d = enchant.Dict("en_US")
>>> d.check("Hello")
True
>>> d.check("Helo")
False
>>> d.suggest("Helo")
['He lo', 'He-lo', 'Hello', 'Helot', 'Help', 'Halo', 'Hell', 'Held', 'Helm', 'Hero', "He'll"]

So you could do something like:

import enchant

d = enchant.Dict("en_US")
string = "Io andiamo to the beach with my amico."
english_words = []
for word in string.split():
    if d.check(word):
        english_words.append(word)
print(" ".join(english_words))

NOTE: the language of short words is hard to determine, since many of them occur in several languages, so the result of the code above is:

Io to the beach with my

where you would have wanted Io to be excluded.
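A separate subtlety in the loop above: string.split() leaves punctuation attached, so the final token checked is "amico." rather than "amico", and a genuinely English word like "beach." would be dropped for the same reason. Below is a sketch that strips non-letter characters first and accepts any checker callable; `d.check` from PyEnchant would slot in directly, while the plain set lookup used here is a stand-in so the example runs without Enchant installed (`filter_words` and `demo_vocab` are made-up names):

```python
import re

def filter_words(text, is_english):
    # is_english is any callable returning True for English words,
    # e.g. enchant.Dict("en_US").check. A set lookup is passed
    # below so the sketch is self-contained.
    kept = []
    for token in text.split():
        core = re.sub(r"[^A-Za-z']", "", token)  # "amico." -> "amico"
        if core and is_english(core):
            kept.append(core)
    return " ".join(kept)

demo_vocab = {"Io", "to", "the", "beach", "with", "my"}  # toy word list
print(filter_words("Io andiamo to the beach with my amico.",
                   demo_vocab.__contains__))
# Io to the beach with my
```

Making the checker a parameter keeps the tokenization/cleanup logic independent of which dictionary backend you choose.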