How to translate words in NLTK swadesh corpus regardless of case

Posted 2020-07-24 07:51

Question:

I'm new to Python and natural language processing, and I'm trying to learn using the NLTK book. I'm doing the exercises at the end of Chapter 2, and there is a question I'm stuck on. "In the discussion of comparative wordlists, we created an object called translate which you could look up using words in both German and Italian in order to get corresponding words in English. What problem might arise with this approach? Can you suggest a way to avoid this problem?"

The book had me use the swadesh corpus to create a 'translator', as follows:

`from nltk.corpus import swadesh
fr2en = swadesh.entries(['fr', 'en'])
de2en = swadesh.entries(['de', 'en'])
es2en = swadesh.entries(['es', 'en'])
translate = dict(fr2en)
translate.update(dict(de2en))
translate.update(dict(es2en))`

One problem I saw was that when you translate the German word for dog (Hund) to English, the lookup only works with the capitalized form: translate['Hund'] returns 'dog', while translate['hund'] raises KeyError: 'hund'

Is there a way to make the translator look up words regardless of case? I've been playing around with it, trying things like translate.update(dict(de2en.lower)), to no avail. I feel like I'm missing something obvious. Could anyone help me?

Thanks!

Answer 1:

Ah, the infamous capitalization of nouns in German (see http://german.about.com/library/weekly/aa020919a.htm).

You could use a list comprehension to lowercase each token from the swadesh corpus:

>>> from nltk.corpus import swadesh
>>> de2en = [(i.lower(),j.lower()) for i,j in swadesh.entries(['de','en'])]
>>> translate = dict(de2en)
>>> translate['hund']
u'dog'
>>> translate['Hund']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'Hund'

But now the capitalized form is no longer a key, so translate['Hund'] fails. To resolve that, you can update the translate dictionary again with the original swadesh entries, so that both the lowercased and the original forms are present as keys:

>>> from nltk.corpus import swadesh
>>> de2en = [(i.lower(),j.lower()) for i,j in swadesh.entries(['de','en'])]
>>> translate = dict(de2en)
>>> translate.update(swadesh.entries(['de','en']))
>>> translate['hund']
u'dog'
>>> translate['Hund']
u'dog'
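
Storing both forms still misses inputs such as 'HUND'. An alternative, sketched below, is to keep only normalized keys and normalize the query at lookup time. This is a minimal sketch, not the book's solution: `sample_de2en` is a small hand-written stand-in for `swadesh.entries(['de', 'en'])` so the snippet runs without the corpus, and `build_translator` and `lookup` are hypothetical helper names. It uses Python 3's `str.casefold()`, which is a more aggressive form of `lower()` (for example, it maps 'ß' to 'ss').

```python
# Hypothetical sample standing in for swadesh.entries(['de', 'en']),
# which returns a list of (German, English) tuples like these.
sample_de2en = [("Hund", "dog"), ("ich", "I"), ("Fisch", "fish")]


def build_translator(pairs):
    """Map each case-folded source word to its English translation."""
    # Only the key is normalized; the English value keeps its original case.
    return {source.casefold(): english for source, english in pairs}


def lookup(translator, word):
    """Look up a word regardless of how the caller capitalized it."""
    return translator[word.casefold()]


translate = build_translator(sample_de2en)
print(lookup(translate, "hund"))  # dog
print(lookup(translate, "Hund"))  # dog
print(lookup(translate, "HUND"))  # dog
print(lookup(translate, "ich"))   # I
```

With the real corpus, you would pass `swadesh.entries(['de', 'en'])` to `build_translator` instead of the sample list. Unlike lowercasing both sides of each pair, this leaves English values such as 'I' untouched.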