Wiktionary is a wiki dictionary that covers many languages. It even has translations. I would be interested in parsing it and playing with the data. Has anyone done anything like this before? Is there any library I can use? (Preferably Python.)
Wordnik has done a good job of parsing out definitions etc., and they have a great API.
As others have mentioned, Wiktionary is a formatting disaster and was not built to be computer-readable.
You may be interested in the DBnary project. It is not Python, but it is interesting: it claims to support parsing for 21 languages, and it powers WikDict.
It depends on how thoroughly you need to parse it. If you just need to get all the content for a word in a language (definition, etymology, pronunciation, conjugation, etc.), then it's pretty easy. I have done this before, although in Java using jsoup.
However, if you need to parse it down to different components of the content (e.g. just getting the definitions of a word), then it will be much more challenging. A Wiktionary entry for a word in a language has no pre-defined template, so a header can be anything from <h3> to <h6>, the order of the sections may be jumbled, they can be repetitive, etc.

I wrote a primitive parser for the German Wiktionary dump in Java that only extracts nouns and their articles, plus their Arabic translation, without any dependencies. Execution takes a long time, so be warned. If there’s interest/need to parse more or other data, please tell me, I might look into it as time permits.
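On the heading problem described above (a section header anywhere from <h3> to <h6>, sections out of order or repeated), here is a minimal Python sketch of one way to cope: treat every h3 to h6 as a section boundary, collect sections in document order, and look them up by title rather than by position or level. It uses only the standard library. The names `SectionCollector`, `split_sections`, `find_sections` and the `SAMPLE` markup are made up for illustration; real Wiktionary HTML is messier, so take this as a starting point, not a finished parser.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h3", "h4", "h5", "h6"}


class SectionCollector(HTMLParser):
    """Group text under the nearest preceding h3-h6 heading."""

    def __init__(self):
        super().__init__()
        self.sections = []        # (level, title, text) in document order
        self._level = None        # level of the section being collected
        self._in_heading = False
        self._title = []
        self._body = []

    def _flush(self):
        # Store the section collected so far, if there is one.
        if self._level is not None:
            title = "".join(self._title).strip()
            # Joining with spaces is crude: it also separates text that
            # was only split by inline tags such as <i> or <b>.
            text = " ".join(" ".join(self._body).split())
            self.sections.append((self._level, title, text))

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._flush()
            self._level = int(tag[1])
            self._title, self._body = [], []
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self._in_heading = False

    def handle_data(self, data):
        if self._level is None:
            return  # text before the first heading is ignored
        (self._title if self._in_heading else self._body).append(data)

    def close(self):
        super().close()
        self._flush()
        self._level = None  # avoid flushing the last section twice


def split_sections(html):
    """Return a list of (level, title, text) tuples."""
    parser = SectionCollector()
    parser.feed(html)
    parser.close()
    return parser.sections


def find_sections(sections, title):
    """Return the text of every section with this title, at any level."""
    wanted = title.lower()
    return [text for _level, t, text in sections if t.lower() == wanted]


# Hypothetical markup, only to show mixed levels and a repeated section.
SAMPLE = """
<h3>Etymology</h3><p>From Old High German.</p>
<h4>Noun</h4><p>Hund <i>m</i></p><ol><li>dog</li></ol>
<h5>Declension</h5><p>table here</p>
<h4>Noun</h4><ol><li>hound</li></ol>
"""

if __name__ == "__main__":
    for level, title, text in split_sections(SAMPLE):
        print(level, title, "->", text)
```

Because `find_sections` returns a list, a repeated section such as two "Noun" blocks comes back as two items instead of one silently overwriting the other.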
I just made a word list from the German dump like that:
Wiktionary runs on MediaWiki, which has an API.
One of the subpages for the API documentation is Client code, which lists some Python libraries.
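If you would rather not pull in one of those client libraries, the API can also be called directly. Below is a minimal sketch, using only the Python standard library, that asks `action=parse` for the raw wikitext of one entry. The helper names (`build_url`, `extract_wikitext`, `fetch_wikitext`) and the User-Agent string are my own placeholders, and the response handling assumes `formatversion=2`, where the wikitext comes back as a plain string.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wiktionary.org/w/api.php"


def build_url(title):
    """Build an action=parse request URL for the wikitext of one page."""
    params = {
        "action": "parse",
        "page": title,
        "prop": "wikitext",
        "format": "json",
        "formatversion": "2",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def extract_wikitext(payload):
    """Pull the wikitext out of a decoded JSON response."""
    if "error" in payload:
        # The API reports problems (e.g. a missing page) in an "error" object.
        raise LookupError(payload["error"].get("info", "unknown API error"))
    return payload["parse"]["wikitext"]


def fetch_wikitext(title):
    """Fetch the wikitext of `title` over the network."""
    request = urllib.request.Request(
        build_url(title),
        # Placeholder value; use something that identifies your own tool.
        headers={"User-Agent": "wiktionary-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return extract_wikitext(json.load(response))
```

Calling `fetch_wikitext("Hund")` should give you the raw markup of that entry, which you still have to parse yourself. For anything beyond a handful of words, the dumps are kinder to the servers than one request per entry.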