There is this fancy infobox in <some Wikipedia article>. How do I get the value of <this field and that>?
The wrong way: trying to parse HTML
This is actually a really bad idea most of the time. Wikipedia's HTML code is not particularly parsing-friendly (especially infoboxes, which are built from a system of hand-written templates), the exact structure changes from infobox to infobox, and the structure of an infobox might change over time. You might also miss out on some features that would otherwise be available, such as internationalization.
The other wrong way: trying to parse wikitext
At a glance, the wikitext of some articles looks like it's a pretty straightforward representation of the infobox:
In reality, that's not the case. Templates are "recursive" so you might run into stuff like param1 = {{convert|10|km|mi}}; template parameters might contain complex wikitext or HTML markup; some parameters might be missing from the article wikitext and fetched by the template from a subpage or other data repository. Just finding out where a parameter starts and ends might not be a simple business if it contains other templates which have their own parameters.
The ideal way: using a structured data source
There are various projects to provide the information contained in Wikipedia infoboxes in a structured form; the two large ones are Wikidata and DBpedia.
Wikidata is a project to build a knowledge base containing structured data; it is maintained by the same global movement that built Wikipedia, so information is in the process of being moved over. This is a manual process, so not all information in Wikipedia is available via Wikidata; on the other hand, there is a lot of information that's in Wikidata but not in Wikipedia. You can find the Wikidata page of an article and see what information it contains by following the Wikidata item link in the left-hand toolbar on the article page. Programmatically, you can access Wikidata information using the wbgetentities API module (sandbox, explanation of concepts), e.g. wikidata.org/w/api.php?action=wbgetentities&sites=enwiki&titles=Albert_Einstein. There is also a SPARQL endpoint, database dumps, and clients in PHP, Java and Python.
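As a rough illustration of the wbgetentities call above, here is a minimal Python sketch using only the standard library. The helper names (build_entity_url, claim_values, fetch_entity) and the User-Agent string are made up for this example, and the response layout (entities, then claims, then mainsnak, then datavalue) is the Wikibase JSON format as I understand it; check the API sandbox before relying on it.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://www.wikidata.org/w/api.php"


def build_entity_url(title, site="enwiki"):
    """Build a wbgetentities URL that looks up an item by Wikipedia title."""
    params = {
        "action": "wbgetentities",
        "sites": site,
        "titles": title,
        "props": "claims",
        "format": "json",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def claim_values(response, property_id):
    """Collect the raw values of one property from a wbgetentities response."""
    values = []
    for entity in response.get("entities", {}).values():
        for claim in entity.get("claims", {}).get(property_id, []):
            snak = claim.get("mainsnak", {})
            # "novalue" and "somevalue" snaks carry no datavalue, so skip them.
            if snak.get("snaktype") == "value":
                values.append(snak["datavalue"]["value"])
    return values


def fetch_entity(title):
    """Perform the HTTP request (hypothetical helper; needs network access)."""
    request = urllib.request.Request(
        build_entity_url(title),
        # Placeholder User-Agent; replace it with one identifying your tool.
        headers={"User-Agent": "infobox-example/0.1 (placeholder contact)"},
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)
```

With network access, claim_values(fetch_entity("Albert_Einstein"), "P569") should give the date-of-birth values, P569 being the Wikidata property ID for date of birth. The values come back as structured objects (a time value is a dict, not a string), so some post-processing is usually needed.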
DBpedia is a project to harvest Wikipedia infobox information by automated means and publish it in a structured form. You can find the DBpedia page for a Wikipedia article by going to http://dbpedia.org/page/<Wikipedia article name>, e.g. http://dbpedia.org/page/Albert_Einstein. It has many data formats, dumps, a SPARQL endpoint and various other things.
The wrong ways done right
If the information you need is not available via Wikidata or DBpedia, there are still semi-structured ways of extracting data from infoboxes. For HTML-based extraction you can use Wikipedia's REST content API (e.g. https://en.wikipedia.org/api/rest_v1/page/html/Albert_Einstein) which returns a richer, more semantic HTML than the one used on normal article pages, and preserves in it some information about template structure.
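To give an idea of what "preserves some information about template structure" means in practice: in the HTML returned by that endpoint, elements generated by a template are marked with typeof="mw:Transclusion" and carry a data-mw attribute holding the template name and parameters as JSON. The sketch below pulls those out with the standard library only. The class and function names are invented for this example, and the data-mw layout (parts, then template, then target.wt and params) is my reading of the Parsoid output format, so treat it as an assumption to verify against a real response.

```python
import json
from html.parser import HTMLParser


class TransclusionCollector(HTMLParser):
    """Collect (template name, parameters) pairs from Parsoid-style HTML."""

    def __init__(self):
        super().__init__()
        self.templates = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # Only elements produced by a template transclusion are of interest.
        if "mw:Transclusion" not in (attributes.get("typeof") or ""):
            return
        try:
            data = json.loads(attributes.get("data-mw") or "{}")
        except json.JSONDecodeError:
            return
        for part in data.get("parts", []):
            # Parts may also be plain strings (wikitext between templates).
            if isinstance(part, dict) and "template" in part:
                template = part["template"]
                name = template.get("target", {}).get("wt", "").strip()
                params = {
                    key: value.get("wt", "")
                    for key, value in template.get("params", {}).items()
                }
                self.templates.append((name, params))


def extract_templates(html_text):
    """Return all template transclusions found in the given HTML string."""
    collector = TransclusionCollector()
    collector.feed(html_text)
    return collector.templates
```

You would feed it the body of the REST response and then keep the entries whose name starts with "Infobox". Note that the parameter values are still raw wikitext, so nested templates inside a value are not expanded.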
Alternatively, you might start from wikitext and parse it into a syntax tree using the simpler, client-side mwparserfromhell Python module (docs) or the more powerful Parsoid JS API which interacts with the Wikipedia REST content service.
A higher-level Python library which tries to extract infobox contents from wikitext is wptools.