Has anyone parsed Wiktionary? [closed]

Posted 2019-01-10 04:01

Wiktionary is a wiki dictionary that covers many languages. It even has translations. I would be interested in parsing it and playing with the data. Has anyone done anything like this before? Is there a library I can use (preferably Python)?

11 answers
chillily
#2 · 2019-01-10 04:30

Wordnik has done a good job of parsing out definitions and the like, and it has a great API.

As others have mentioned, Wiktionary is a formatting disaster and was not built to be machine-readable.

该账号已被封号
#3 · 2019-01-10 04:36

You may be interested in the DBnary project. It is not Python, but it is interesting: it claims to support parsing for 21 languages, and it powers WikDict.

乱世女痞
#4 · 2019-01-10 04:39

It depends on how thoroughly you need to parse it. If you just need to get all the content for a word in one language (definition, etymology, pronunciation, conjugation, etc.), then it's pretty easy. I have done this before, although in Java using jsoup.

However, if you need to break the content down into its components (e.g. getting just the definitions of a word), it will be much more challenging. A Wiktionary entry for a word in a language has no predefined template, so a header can be anything from <h3> to <h6>, the sections may appear in any order, the same section can occur more than once, etc.
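
Since the question asked for Python, here is a minimal sketch of the heading problem described above, using only the standard library's html.parser rather than jsoup. It walks the rendered HTML and groups text under whatever heading precedes it, without assuming any order or uniqueness of sections. The names SectionCollector and split_sections are made up for this illustration, and real Wiktionary pages carry extra markup (edit links, tables) that would need more filtering than this.

```python
from html.parser import HTMLParser

# Heading tags that can open a section in a rendered entry.
HEADING_TAGS = {"h2", "h3", "h4", "h5", "h6"}


class SectionCollector(HTMLParser):
    """Collect [level, title, body] entries in document order."""

    def __init__(self):
        super().__init__()
        self.sections = []
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            # Every heading starts a new section, even if its title repeats.
            self._in_heading = True
            self.sections.append([int(tag[1]), "", ""])

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self._in_heading = False

    def handle_data(self, data):
        if not self.sections:
            return  # text before the first heading is ignored
        # Index 1 is the heading title, index 2 is the section body.
        index = 1 if self._in_heading else 2
        self.sections[-1][index] += data


def split_sections(html):
    """Return a list of (heading level, title, body text) tuples."""
    parser = SectionCollector()
    parser.feed(html)
    parser.close()
    return [(level, title.strip(), body.strip())
            for level, title, body in parser.sections]
```

Filtering the result for a title such as "Noun" then gives every matching section, at whatever heading level and in whatever order it appears, which is about as much structure as the page layout guarantees.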

Melony?
#5 · 2019-01-10 04:39

I wrote a primitive, dependency-free parser in Java for the German Wiktionary dump. It extracts only nouns and their articles, plus their Arabic translations. Be warned that it takes a long time to run. If there's interest in parsing more or other data, please tell me; I might look into it as time permits.

祖国的老花朵
#6 · 2019-01-10 04:41

I just made a word list from the German dump like this:

bzcat pages-articles.xml.bz2 | grep '<title>[^[:space:][:punct:]]*</title>' | sed 's:.*<title>\(.*\)</title>.*:\1:' > words
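
For anyone who would rather stay in Python, a rough equivalent of the pipeline above might look like the sketch below. It streams the compressed dump line by line and keeps only titles that contain no whitespace and no ASCII punctuation. The function names are invented for this example, and the character class only approximates grep's [:space:] and [:punct:] (Python's \s also matches Unicode whitespace).

```python
import bz2
import re
import string

# A <title> element whose text has no whitespace and no ASCII punctuation.
TITLE_RE = re.compile(
    r"<title>([^\s" + re.escape(string.punctuation) + r"]*)</title>"
)


def iter_titles(lines):
    """Yield the title text from every line that holds a matching <title>."""
    for line in lines:
        match = TITLE_RE.search(line)
        if match:
            yield match.group(1)


def write_word_list(dump_path, out_path):
    """Stream a .xml.bz2 dump and write one title per line to out_path."""
    with bz2.open(dump_path, "rt", encoding="utf-8") as dump, \
            open(out_path, "w", encoding="utf-8") as out:
        for title in iter_titles(dump):
            out.write(title + "\n")
```

Calling write_word_list("pages-articles.xml.bz2", "words") should then produce roughly the same file as the shell command. Because ":" counts as punctuation, namespaced pages such as "Wiktionary:..." are skipped, and so are multi-word entries.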
Deceive 欺骗
#7 · 2019-01-10 04:42

Wiktionary runs on MediaWiki, which has an API.

One of the subpages for the API documentation is Client code, which lists some Python libraries.
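
The API can also be called without a client library. The sketch below uses only the Python standard library and requests the raw wikitext of one page through action=parse. The helper names and the User-Agent string are placeholders invented here, and the response layout assumed in extract_wikitext (a "parse" object with a "wikitext" string when formatversion=2 is set) is worth checking against the API documentation before relying on it.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wiktionary.org/w/api.php"


def build_parse_url(title, api_url=API_URL):
    """Build an action=parse URL that asks for a page's wikitext as JSON."""
    params = {
        "action": "parse",
        "page": title,
        "prop": "wikitext",
        "format": "json",
        "formatversion": "2",
    }
    return api_url + "?" + urllib.parse.urlencode(params)


def extract_wikitext(response_text):
    """Pull the wikitext out of a JSON response (assumed formatversion=2 shape)."""
    data = json.loads(response_text)
    if "error" in data:
        raise LookupError(data["error"].get("info", "unknown API error"))
    return data["parse"]["wikitext"]


def fetch_wikitext(title):
    """Fetch one page over the network; the User-Agent is a placeholder."""
    request = urllib.request.Request(
        build_parse_url(title),
        headers={"User-Agent": "wiktionary-example/0.1 (placeholder contact)"},
    )
    with urllib.request.urlopen(request) as response:
        return extract_wikitext(response.read().decode("utf-8"))
```

This returns wiki markup, not structured data, so the formatting problems mentioned in the other answers still apply. For anything beyond a handful of words, the dumps are probably a better fit than hitting the API page by page.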
