How to retrieve Wiktionary word content?

Explosion°爆炸

Reply #2 · 2020-01-27 09:45

You might want to try JWKTL (Java Wiktionary Library). I just found out about it ;)

http://en.wikipedia.org/wiki/Ubiquitous_Knowledge_Processing_Lab#Wiktionary_API

http://www.ukp.tu-darmstadt.de/software/jwktl/

Lonely孤独者°

Reply #3 · 2020-01-27 09:46

Here's a start to parsing etymology and pronunciation data:

// Example lines:
//   {{a|GA}} {{IPA|/ˈhæpi/|lang=en}}                         -> { val: 'ˈhæpi', type: 'ga' }
//   * {{a|GA}} {{enPR|plēz}}, {{IPA|/pliz/|[pʰliz]|lang=en}} -> { val: 'pliz', type: 'ga' }
//   * {{a|RP}} {{IPA|/pliːz/|lang=en}}                       -> undefined (RP is not mapped)
function parsePronunciationLine(line) {
  // Accent label such as {{a|UK}}, {{a|US}} or {{a|GA}}.
  // The "|" must be escaped, otherwise it acts as regex alternation.
  const accent = line.match(/\{\{\s*a\s*\|\s*(UK|US|GA)\s*\}\}/)
  // First IPA argument with optional surrounding slashes; any further
  // arguments (e.g. |[pʰliz]) before |lang=en are skipped.
  const ipa = line.match(/\{\{IPA\|\/?([^\/|}]+)\/?(?:\|[^|}]*)*?\|lang=en\}\}/)
  if (!ipa) return

  let type
  if (accent) {
    type = accent[1].toLowerCase()
  } else if (/\{\{enPR\|/.test(line)) {
    // As in the original version, a line with only {{enPR|...}} is tagged 'us'.
    type = 'us'
  }
  if (!type) return

  return { val: ipa[1], type }
}

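// Example pieces:
//   {{inh|en|enm|poisoun}}
//   {{m|enm|poyson}}
//   {{der|en|la|pōtio|pōtio, pōtiōnis|t=drink, a draught, a poisonous draught, a potion}}
//   {{m|la|pōtō|t=I drink}}
//   {{der|en|enm|happy||fortunate, happy}}
//   {{cog|is|heppinn||lucky}}
//
// `langs` is not defined in this snippet; it is assumed to be a lookup object
// keyed by language code (en, enm, la, ...).
function parseEtymologyPiece(piece) {
  // Strip the surrounding braces, if present, then split into template arguments.
  const parts = piece.replace(/^\{\{|\}\}$/g, '').split('|')
  parts.shift() // drop the template name (inh, m, der, cog, ...)
  // A template carries one or two leading language codes.
  const codes = []
  while (codes.length < 2 && langs[parts[0]]) {
    codes.push(parts.shift())
  }
  const lang = codes.pop() // the last code is the language of the term
  const term = parts.shift()
  return [lang, term]
}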

Update: Here is a gist with it more fleshed out.

Animai°情兽

Reply #4 · 2020-01-27 09:47

To keep it really simple, extract the words from the dump like this:

bzcat pages-articles.xml.bz2 | grep '<title>[^[:space:][:punct:]]*</title>' | sed 's:.*<title>\(.*\)</title>.*:\1:' > words
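If you would rather not depend on shell tools, roughly the same filter can be sketched in Python. This is only an approximation of the pipeline above: the function name `iter_simple_titles` is made up for this example, and the character class excludes whitespace and ASCII punctuation only, which may differ from what `[:punct:]` matches in your locale.

```python
import bz2
import re
import string

# Matches a <title> element whose text contains no whitespace and no ASCII
# punctuation, approximating [^[:space:][:punct:]]* from the grep command.
_TITLE_RE = re.compile(
    r"<title>([^\s" + re.escape(string.punctuation) + r"]*)</title>"
)


def iter_simple_titles(dump_path):
    """Yield matching page titles from a bz2-compressed dump, line by line."""
    # Text mode decompresses on the fly, so the dump is never fully in memory.
    with bz2.open(dump_path, mode="rt", encoding="utf-8") as dump:
        for line in dump:
            match = _TITLE_RE.search(line)
            if match:
                yield match.group(1)
```

Iterating over `iter_simple_titles("pages-articles.xml.bz2")` and writing each title to a file gives the equivalent of the `words` file. Titles containing a colon or a space, such as namespace pages and multi-word entries, are skipped, just as with the grep pattern.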

Bombasti

Reply #5 · 2020-01-27 10:04

As mentioned earlier, the problem with this approach is that Wiktionary covers words from all languages. Checking whether a page exists through the MediaWiki API therefore isn't enough, because many pages exist only for non-English words. To overcome this, you need to parse each page and check whether it has a section describing an English word. Parsing wikitext isn't a trivial task, though in your case it's not that bad: to cover almost all cases, you just need to check whether the wikitext contains an English heading (==English==). Depending on the programming language you use, you can find tools that build an AST from wikitext. This covers most cases, but not all of them, because Wiktionary also includes some common misspellings.
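The heading check can be sketched with a single regular expression. This is a minimal version that assumes the English section is marked by a level-2 `==English==` heading on its own line; the function name is made up for this example, and it does not try to detect misspelling entries.

```python
import re

# A level-2 heading on its own line, e.g. "==English==" or "== English ==".
# Deeper headings such as "===English===" do not match.
_ENGLISH_HEADING_RE = re.compile(r"^==[ \t]*English[ \t]*==[ \t]*$", re.MULTILINE)


def has_english_section(wikitext):
    """Return True if the wikitext contains a level-2 'English' heading."""
    return _ENGLISH_HEADING_RE.search(wikitext) is not None
```

For example, `has_english_section("==Italian==\n===Verb===")` is false, while a page whose wikitext contains a line reading `==English==` passes the check.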

As an alternative, you could try Lingua Robot or something similar. Lingua Robot parses the Wiktionary content and provides it as a REST API; a non-empty response means that the word exists. Note that, unlike Wiktionary, the API doesn't include any misspellings (at least at the time of writing this answer). Also note that Wiktionary contains not only single words but also multi-word expressions.

一纸荒年 Trace。

Reply #6 · 2020-01-27 10:05

If you are using Python, you can use WiktionaryParser by Suyash Behera.

You can install it with:

pip install wiktionaryparser

Example usage:

>>> from wiktionaryparser import WiktionaryParser
>>> parser = WiktionaryParser()
>>> word = parser.fetch('test')
>>> another_word = parser.fetch('test', 'french')
>>> parser.set_default_language('french')

走好不送

Reply #7 · 2020-01-27 10:06

There are a few caveats in just checking that Wiktionary has a page with the name you are looking for:

Caveat #1: All Wiktionaries, including the English Wiktionary, aim to include every word in every language. If you simply use the API call above, you will know that the term is a word in at least one language, but not necessarily English: http://en.wiktionary.org/w/api.php?action=query&titles=dicare

Caveat #2: A redirect may exist from one word to another. It might come from an alternative spelling, but it might also come from an error of some kind. The API call above does not distinguish between a redirect and an article: http://en.wiktionary.org/w/api.php?action=query&titles=profilemetry

Caveat #3: Some Wiktionaries including the English Wiktionary include "common misspellings": http://en.wiktionary.org/w/api.php?action=query&titles=fourty

Caveat #4: Some Wiktionaries allow stub entries, which have little or no information about the term. This used to be common on several Wiktionaries but not on the English Wiktionary, though it now seems to have spread there as well: https://en.wiktionary.org/wiki/%E6%99%B6%E7%90%83 (permalink, so you can still see what a stub looks like once the entry is filled in: https://en.wiktionary.org/w/index.php?title=%E6%99%B6%E7%90%83&oldid=39757161)

If these are not included in what you want, you will have to load and parse the wikitext itself, which is not a trivial task.
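Of the caveats above, only the redirect case can be told apart from the API response alone. The sketch below assumes you add `prop=info` and `formatversion=2` to the query, in which case missing pages and redirects are expected to carry `missing` and `redirect` flags. The helper names are made up for this example, and the other caveats still require reading the wikitext.

```python
from urllib.parse import urlencode

API_URL = "https://en.wiktionary.org/w/api.php"


def build_query_url(title):
    """Build a query URL that also requests basic page info."""
    params = {
        "action": "query",
        "titles": title,
        "prop": "info",        # assumed to add the redirect flag to each page
        "format": "json",
        "formatversion": "2",  # assumed to report flags as JSON booleans
    }
    return API_URL + "?" + urlencode(params)


def classify_page(page):
    """Classify one entry of response["query"]["pages"] for the URL above."""
    if page.get("missing"):
        return "missing"
    if page.get("redirect"):
        return "redirect"
    return "exists"
```

Fetch the URL with any HTTP client, decode the JSON, and pass each element of `response["query"]["pages"]` to `classify_page`. A result of `"exists"` still says nothing about the language of the entry, or about whether it is a misspelling or a stub.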
