Has anyone parsed Wiktionary? [closed]

Wiktionary is a wiki dictionary that covers many languages. It even has translations. I would be interested in parsing it and playing with the data. Has anyone done anything like this before? Is there any library I can use? (Preferably Python.)

Tags: python web-services dictionary wiktionary

11 answers

chillily

#2 · 2019-01-10 04:30

Wordnik has done a good job of parsing out definitions and so on, and they have a great API.

As others have mentioned, Wiktionary is a formatting disaster and was not built to be computer-readable.


(This account has been banned)

#3 · 2019-01-10 04:36

You may be interested in the dbnary project. It is not Python, but it is interesting: it claims to support parsing for 21 languages, and it powers wikdict.


乱世女痞

#4 · 2019-01-10 04:39

It depends on how thoroughly you need to parse it. If you just need to get all the content for a word in a language (definition, etymology, pronunciation, conjugation, etc.), then it's pretty easy. I have done this before, although in Java using jsoup.

However, if you need to parse it down to different components of the content (e.g. just getting the definitions of a word), then it will be much more challenging. A Wiktionary entry for a word in a language has no pre-defined template, so a header can be anything from <h3> to <h6>, the order of the sections may be jumbled, they can be repetitive, etc.
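To make the heading problem concrete, here is a minimal sketch that lists every `<h3>` to `<h6>` heading in a rendered page, in document order, using only the Python standard library. The class name `HeadingCollector`, the helper `list_headings`, and the `SAMPLE` markup are made up for illustration; real Wiktionary HTML is more deeply nested and may need extra handling.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect (level, text) pairs for every <h3>..<h6> heading, in order."""

    LEVELS = {"h3": 3, "h4": 4, "h5": 5, "h6": 6}

    def __init__(self):
        super().__init__()
        self.headings = []   # finished (level, text) pairs
        self._level = None   # level of the heading currently open, if any
        self._text = []      # text fragments seen inside that heading

    def handle_starttag(self, tag, attrs):
        if tag in self.LEVELS:
            self._level = self.LEVELS[tag]
            self._text = []

    def handle_data(self, data):
        # Text inside nested tags (e.g. <span>) is also delivered here.
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and self.LEVELS.get(tag) == self._level:
            self.headings.append((self._level, "".join(self._text).strip()))
            self._level = None


def list_headings(html):
    """Return the headings found in an HTML string."""
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    return parser.headings


# Invented markup that imitates the irregular nesting described above.
SAMPLE = """
<h3><span class="mw-headline">Etymology</span></h3>
<p>...</p>
<h4>Noun</h4>
<ol><li>a definition</li></ol>
<h3>Pronunciation</h3>
<h5>Translations</h5>
"""

if __name__ == "__main__":
    for level, text in list_headings(SAMPLE):
        print(level, text)
```

Listing the headings first shows how irregular a given entry is before you commit to rules such as "definitions live under the first h4".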


Melony?

#5 · 2019-01-10 04:39

I wrote a primitive parser for the German Wiktionary dump in Java, without any dependencies, that only extracts nouns and their articles, plus their Arabic translations. Execution takes a long time, so be warned. If there’s interest in or a need for parsing more or other data, please tell me; I might look into it as time permits.


祖国的老花朵

#6 · 2019-01-10 04:41

I just made a word list from the German dump like this:

bzcat pages-articles.xml.bz2 | grep '<title>[^[:space:][:punct:]]*</title>' | sed 's:.*<title>\(.*\)</title>.*:\1:' > words
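Since the question asks for Python, here is a rough equivalent of the pipeline above, as a sketch rather than an exact port. It streams the compressed dump line by line and keeps titles that contain no whitespace or ASCII punctuation. The function names are made up for illustration. `string.punctuation` stands in for `[:punct:]`, and Python's `\s` also matches Unicode whitespace, so results may differ slightly from `grep` depending on your locale.

```python
import bz2
import re
import string

# A <title> element whose text has no whitespace and no ASCII punctuation.
# This also drops namespaced pages such as "Benutzer:Foo" because of the colon.
TITLE_RE = re.compile(
    r"<title>([^\s" + re.escape(string.punctuation) + r"]*)</title>"
)


def iter_titles(lines):
    """Yield the title text from each line that holds a matching <title>."""
    for line in lines:
        match = TITLE_RE.search(line)
        if match:
            yield match.group(1)


def words_from_dump(path):
    """Stream titles out of a bz2-compressed dump without unpacking it to disk."""
    with bz2.open(path, mode="rt", encoding="utf-8") as dump:
        yield from iter_titles(dump)


if __name__ == "__main__":
    # The file names are the ones used in the shell command above.
    with open("words", "w", encoding="utf-8") as out:
        for word in words_from_dump("pages-articles.xml.bz2"):
            out.write(word + "\n")
```

Like the shell version, this treats the XML as plain lines of text, which only works because the dump puts each `<title>` element on its own line.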


Deceive 欺骗

#7 · 2019-01-10 04:42

Wiktionary runs on MediaWiki, which has an API.

One of the subpages for the API documentation is Client code, which lists some Python libraries.
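For a quick look at the API without any third-party library, here is a minimal sketch using only the standard library. It asks the `action=parse` module for the raw wikitext of one page. The English Wiktionary endpoint, the parameter choices, the function names, and the `User-Agent` string are assumptions made for this example, not something taken from the answer above, so check the API documentation before relying on them.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed endpoint: the English Wiktionary. Other language editions use
# their own subdomain.
API_URL = "https://en.wiktionary.org/w/api.php"


def build_parse_url(title, api_url=API_URL):
    """Build a URL that asks the API for the wikitext of one page."""
    params = {
        "action": "parse",
        "page": title,
        "prop": "wikitext",
        "format": "json",
        "formatversion": "2",  # with version 2 the wikitext is a plain string
    }
    return api_url + "?" + urlencode(params)


def extract_wikitext(response):
    """Pull the wikitext out of a decoded JSON response, or raise on an API error."""
    if "error" in response:
        raise LookupError(response["error"].get("info", "unknown API error"))
    return response["parse"]["wikitext"]


def fetch_wikitext(title):
    """Perform the request. Needs network access."""
    # Placeholder User-Agent; replace it with something that identifies you.
    request = Request(
        build_parse_url(title),
        headers={"User-Agent": "wiktionary-example/0.1 (placeholder contact)"},
    )
    with urlopen(request, timeout=30) as reply:
        return extract_wikitext(json.load(reply))


if __name__ == "__main__":
    print(fetch_wikitext("dictionary")[:300])
```

This returns unparsed wiki markup, so the formatting problems mentioned in the other answers still apply. The API only saves you from downloading and scanning a full dump.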

