NLTK word_tokenize on French text is not working properly

Posted 2019-04-10 08:33

Question:

I'm trying to use NLTK word_tokenize on a text in French by using:

txt = ["Le télétravail n'aura pas d'effet sur ma vie"]
print(word_tokenize(txt,language='french'))

it should print:

['Le', 'télétravail', "n'", 'aura', 'pas', "d'", 'effet', 'sur', 'ma', 'vie', '.']

But I get:

['Le', 'télétravail', "n'aura", 'pas', "d'effet", 'sur', 'ma', 'vie','.']

Does anyone know why it's not splitting tokens properly in French and how to overcome this (and other potential issues) when doing NLP in French?

Answer 1:

I don't think there's an explicit French model for word_tokenize (which is the modified Treebank tokenizer used for the English Penn Treebank).

The word_tokenize function performs sentence tokenization using the sent_tokenize function before the actual word tokenization. The language argument in word_tokenize is only used for the sent_tokenize part.
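A quick way to see this (a minimal sketch under Python 3, assuming the Punkt sentence models have been downloaded with nltk.download('punkt')):

>>> from nltk.tokenize import sent_tokenize, word_tokenize
>>> text = "Le télétravail n'aura pas d'effet sur ma vie. C'est certain."
>>> # language='french' only selects the French Punkt model for sentence splitting ...
>>> sent_tokenize(text, language='french')
["Le télétravail n'aura pas d'effet sur ma vie.", "C'est certain."]
>>> # ... the word-level step is still the (English) Treebank tokenizer,
>>> # so n' and d' stay attached to the following word.
>>> word_tokenize(text, language='french')
['Le', 'télétravail', "n'aura", 'pas', "d'effet", 'sur', 'ma', 'vie', '.', "C'est", 'certain', '.']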

Alternatively, you can use the MosesTokenizer, which has some language-dependent regexes (and it does support French):

>>> from nltk.tokenize.moses import MosesTokenizer
>>> moses = MosesTokenizer(lang='fr')
>>> sent = u"Le télétravail n'aura pas d'effet sur ma vie"
>>> moses.tokenize(sent)
[u'Le', u't\xe9l\xe9travail', u"n'", u'aura', u'pas', u"d'", u'effet', u'sur', u'ma', u'vie']

If you don't want Moses to escape special XML characters, you can do:

>>> moses.tokenize(sent, escape=False)
[u'Le', u't\xe9l\xe9travail', u"n'", u'aura', u'pas', u"d'", u'effet', u'sur', u'ma', u'vie']

To explain why splitting n' and d' is useful in French NLP:

Linguistically, separating the n' and d' makes sense because they are clitics that have their own syntactic and semantic properties but are bound to the root/host.

In French, ne ... pas is a discontinuous constituent denoting negation. The clitic ne becomes n' because of the vowel onset in the word that follows it, so splitting n' from aura makes it easier to identify the ne ... pas construction.

In the case of d', the same phonetic motivation applies: the vowel onset in the following word turns de effet into d'effet.
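As a toy illustration (not part of the original answer; has_ne_pas is a made-up helper), once the clitic is split off, spotting the discontinuous negation is a simple scan over the tokens:

def has_ne_pas(tokens):
    """Return True if a 'ne'/"n'" token is later followed by 'pas'."""
    for i, tok in enumerate(tokens):
        if tok.lower() in ("ne", "n'"):
            if any(t.lower() == "pas" for t in tokens[i + 1:]):
                return True
    return False

print(has_ne_pas(['Le', 'télétravail', "n'", 'aura', 'pas', "d'", 'effet']))  # True
print(has_ne_pas(['Le', 'télétravail', "n'aura", 'pas']))  # False: the clitic is hidden inside "n'aura"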



Answer 2:

Looking at the source of word_tokenize reveals that the language argument is only used to determine how to split the input into sentences. For tokenization at the word level, a (slightly modified) TreebankWordTokenizer is used, which works best for English input and contractions like don't. From nltk/tokenize/__init__.py:

_treebank_word_tokenizer = TreebankWordTokenizer()
# ... some modifications done
def word_tokenize(text, language='english', preserve_line=False):
    # ...
    sentences = [text] if preserve_line else sent_tokenize(text, language)
    return [token for sent in sentences
            for token in _treebank_word_tokenizer.tokenize(sent)]
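To see the English bias of that word-level step, you can run the Treebank tokenizer directly (a quick sketch):

from nltk.tokenize import TreebankWordTokenizer

tb = TreebankWordTokenizer()
tb.tokenize("I don't know")
# ['I', 'do', "n't", 'know']   <- English contraction rules apply
tb.tokenize("n'aura pas d'effet")
# ["n'aura", 'pas', "d'effet"] <- French elisions are left attached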

To get your desired output, you might want to consider using a different tokenizer like RegexpTokenizer, as follows:

txt = "Le télétravail n'aura pas d'effet sur ma vie"
pattern = r"[dnl]['´`]|\w+|\$[\d\.]+|\S+"
tokenizer = RegexpTokenizer(pattern)
tokenizer.tokenize(txt)
# ['Le', 'télétravail', "n'", 'aura', 'pas', "d'", 'effet', 'sur', 'ma', 'vie']

My knowledge of French is limited and this only solves the stated problem. For other cases you will have to adapt the pattern. You can also look at the implementation of TreebankWordTokenizer for ideas for a more complex solution. Also keep in mind that this way you will need to split sentences beforehand, if necessary (see the sketch below).
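For completeness, a sketch of combining the French Punkt sentence splitter with the custom word pattern (assuming the punkt data is installed):

from nltk.tokenize import sent_tokenize, RegexpTokenizer

pattern = r"[dnl]['´`]|\w+|\$[\d\.]+|\S+"
tokenizer = RegexpTokenizer(pattern)

text = "Le télétravail n'aura pas d'effet sur ma vie. Il n'y a pas de doute."
tokens = [tok for sent in sent_tokenize(text, language='french')
          for tok in tokenizer.tokenize(sent)]
# ['Le', 'télétravail', "n'", 'aura', 'pas', "d'", 'effet', 'sur', 'ma', 'vie', '.',
#  'Il', "n'", 'y', 'a', 'pas', 'de', 'doute', '.']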