I'm trying to use NLTK's word_tokenize on a text in French:
from nltk.tokenize import word_tokenize
txt = "Le télétravail n'aura pas d'effet sur ma vie."
print(word_tokenize(txt, language='french'))
It should print:
['Le', 'télétravail', "n'", 'aura', 'pas', "d'", 'effet', 'sur', 'ma', 'vie', '.']
But I get:
['Le', 'télétravail', "n'aura", 'pas', "d'effet", 'sur', 'ma', 'vie','.']
Does anyone know why it's not splitting tokens properly in French, and how to overcome this (and other potential issues) when doing NLP in French?
I don't think there's an explicit French model for word_tokenize (which is the modified Treebank tokenizer used for the English Penn Treebank).
The word_tokenize function performs sentence tokenization using the sent_tokenize function before the actual word tokenization. The language argument in word_tokenize is only used for the sent_tokenize part.
Alternatively, you can use the MosesTokenizer, which has some language-dependent regexes (and it does support French):
>>> from nltk.tokenize.moses import MosesTokenizer
>>> moses = MosesTokenizer(lang='fr')
>>> sent = u"Le télétravail n'aura pas d'effet sur ma vie"
>>> moses.tokenize(sent)
[u'Le', u't\xe9l\xe9travail', u'n&apos;', u'aura', u'pas', u'd&apos;', u'effet', u'sur', u'ma', u'vie']
If you don't like that Moses escapes special XML characters, you can do:
>>> moses.tokenize(sent, escape=False)
[u'Le', u't\xe9l\xe9travail', u"n'", u'aura', u'pas', u"d'", u'effet', u'sur', u'ma', u'vie']
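If you already have a list of escaped tokens and cannot re-run the tokenizer, the entities can probably be reversed after the fact. This is a minimal Python 3 sketch using the standard-library html.unescape; the escaped list is just the example sentence written out by hand with &apos; in place of the apostrophe, not captured tokenizer output.

```python
from html import unescape

# Hand-written example of tokens whose apostrophes were escaped as XML entities.
escaped = ['Le', 'télétravail', 'n&apos;', 'aura', 'pas',
           'd&apos;', 'effet', 'sur', 'ma', 'vie']

# Turn each entity back into the character it stands for.
tokens = [unescape(tok) for tok in escaped]
print(tokens)
# ['Le', 'télétravail', "n'", 'aura', 'pas', "d'", 'effet', 'sur', 'ma', 'vie']
```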
To explain why splitting n' and d' is useful in French NLP:
Linguistically, separating the n' and d' does make sense because they're clitics that have their own syntactic and semantic properties but are bound to the root/host.
In French, ne ... pas is a discontinuous constituent that denotes negation. The clitic ne becomes n' because of the vowel onset of the word that follows it, so splitting the n' from aura makes it easier to identify ne ... pas.
In the case of d', it's the same phonetic motivation: the vowel onset of the following word turns de effet into d'effet.
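To make the ne ... pas point concrete, here is a small sketch of how a downstream step might look for negation once the clitic is its own token. The function name find_ne_pas and the simple "most recent ne, then next pas" rule are my own illustration, not anything from NLTK or Moses, and real French negation (ne ... plus, ne ... jamais, dropped ne in speech) needs more than this.

```python
def find_ne_pas(tokens):
    """Return (ne_index, pas_index) pairs for ne ... pas in a token list."""
    pairs = []
    ne_index = None
    for i, tok in enumerate(tokens):
        low = tok.lower()
        if low in ("ne", "n'"):
            ne_index = i  # remember the most recent negation clitic
        elif low == "pas" and ne_index is not None:
            pairs.append((ne_index, i))
            ne_index = None  # this ne has been paired, start over
    return pairs

split_tokens = ['Le', 'télétravail', "n'", 'aura', 'pas',
                "d'", 'effet', 'sur', 'ma', 'vie']
fused_tokens = ['Le', 'télétravail', "n'aura", 'pas',
                "d'effet", 'sur', 'ma', 'vie']

print(find_ne_pas(split_tokens))  # [(2, 4)]
print(find_ne_pas(fused_tokens))  # []
```

With the fused token "n'aura", the clitic is hidden inside another word and a lookup like this misses the negation entirely.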
Looking at the source of word_tokenize reveals that the language argument is only used to determine how to split the input into sentences. For tokenization at the word level, a (slightly modified) TreebankWordTokenizer is used, which works best for English input and contractions like don't.
From nltk/tokenize/__init__.py:
_treebank_word_tokenizer = TreebankWordTokenizer()
# ... some modifications done
def word_tokenize(text, language='english', preserve_line=False):
    # ...
    sentences = [text] if preserve_line else sent_tokenize(text, language)
    return [token for sent in sentences
            for token in _treebank_word_tokenizer.tokenize(sent)]
To get your desired output, you might want to consider using a different tokenizer such as RegexpTokenizer, as follows:
from nltk.tokenize import RegexpTokenizer
txt = "Le télétravail n'aura pas d'effet sur ma vie"
pattern = r"[dnl]['´`]|\w+|\$[\d\.]+|\S+"
tokenizer = RegexpTokenizer(pattern)
tokenizer.tokenize(txt)
# ['Le', 'télétravail', "n'", 'aura', 'pas', "d'", 'effet', 'sur', 'ma', 'vie']
My knowledge of French is limited and this only solves the stated problem. For other cases you will have to adapt the pattern.
You can also look at the implementation of the TreebankWordTokenizer for ideas for a more complex solution.
Also keep in mind that this way you will need to split sentences beforehand, if necessary.
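As a rough idea of what "adapting the pattern" and "splitting sentences beforehand" could look like together, here is a sketch that uses only Python's standard re module rather than NLTK's RegexpTokenizer. The names ELISION, TOKEN, split_sentences and tokenize are made up for this example, the list of elided forms is my own guess at the common ones and is certainly not exhaustive, and the sentence splitter is deliberately naive (it will break on abbreviations).

```python
import re

# Elided forms that end in an apostrophe (illustrative list, not exhaustive).
ELISION = r"(?:jusqu|lorsqu|puisqu|qu|[cdjlmnst])['’´`]"
# Try the elision first, then whole words, then single punctuation marks.
TOKEN = re.compile(ELISION + r"|\w+|[^\w\s]", re.IGNORECASE)

def split_sentences(text):
    # Naive splitter: break on whitespace that follows ., ! or ?
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(text):
    # One token list per sentence.
    return [TOKEN.findall(sentence) for sentence in split_sentences(text)]

txt = "Le télétravail n'aura pas d'effet sur ma vie. Je pense qu'il a raison."
for sentence_tokens in tokenize(txt):
    print(sentence_tokens)
# ['Le', 'télétravail', "n'", 'aura', 'pas', "d'", 'effet', 'sur', 'ma', 'vie', '.']
# ['Je', 'pense', "qu'", 'il', 'a', 'raison', '.']
```

Note that words with an internal apostrophe such as aujourd'hui get split into three tokens by this pattern, so you would need an exception list for those.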