How to get all URLs in a Wikipedia page

It seems that the Wikipedia API's definition of a link is different from a URL. I'm trying to use the API to return all the URLs in a specific wiki page.

I have been playing around with this query that I found on this page under generators and redirects.

Tags: wikipedia-api

2 Answers

一夜七次

Reply #2 · 2020-07-18 09:32

I'm not sure exactly why you are confused (it would help if you explained that), but I'm quite sure that query is not what you want. It lists links (prop=links) on pages that are linked (generator=links) from the page “Title” (titles=Title). It also returns only the first batch of links for the first batch of linked pages, because the batch size defaults to the tiny value of 10.

If you want to get all the links on the page “Title”:

1. Use just prop=links; you don't want the generator.
2. Increase the limit to the maximum possible by adding pllimit=max (pl is the “prefix” for links).
3. Use the value given in the query-continue element (named continue in current API versions) to request the second and following pages of results.

So, the query for the first page would be:

http://en.wikipedia.org/w/api.php?action=query&titles=Title&prop=links&pllimit=max

And the second (and in this case, final) page:

http://en.wikipedia.org/w/api.php?action=query&titles=Title&prop=links&pllimit=max&plcontinue=226160|0|Lieutenant_General
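If you build these URLs in code rather than by hand, note that the plcontinue value contains pipe characters that should be percent-encoded. Here is a minimal sketch using only the Python standard library; the helper name build_links_url is made up for illustration, and the continuation value is simply the one from the example above.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"


def build_links_url(title, plcontinue=None):
    """Build a prop=links query URL (hypothetical helper, not part of any library)."""
    params = {
        "action": "query",
        "titles": title,
        "prop": "links",
        "pllimit": "max",
    }
    # Only follow-up requests carry a continuation value.
    if plcontinue is not None:
        params["plcontinue"] = plcontinue
    # urlencode percent-encodes the "|" characters in the continuation value.
    return API_URL + "?" + urlencode(params)


first_url = build_links_url("Title")
second_url = build_links_url("Title", "226160|0|Lieutenant_General")
print(first_url)
print(second_url)
```

In practice the continuation value should be read from each response rather than hard-coded, since it changes as the page is edited.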

Another thing that might be confusing you is that links returns only internal links (to other Wikipedia pages). To get external links, use prop=extlinks. You can also combine the two into one query:

http://en.wikipedia.org/w/api.php?action=query&titles=Title&prop=links|extlinks
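Since the question is about URLs, here is a rough sketch of collecting every external link with prop=extlinks while following continuation. It makes a few assumptions: collect_external_urls and the fetch argument are names invented for this example, fetch is any callable that takes a params dictionary and returns the decoded JSON response, and each extlinks entry holds the URL under the "*" key (default JSON format) or the "url" key (formatversion=2).

```python
def collect_external_urls(fetch, title):
    """Return all external URLs on a page (hypothetical helper).

    fetch: callable taking a params dict and returning the decoded JSON.
    """
    params = {
        "action": "query",
        "format": "json",
        "titles": title,
        "prop": "extlinks",
        "ellimit": "max",
    }
    urls = []
    while True:
        # Pass a copy so the caller cannot mutate our running params.
        data = fetch(dict(params))
        pages = data.get("query", {}).get("pages", {})
        for page in pages.values():
            # Missing pages or pages without external links have no "extlinks" key.
            for entry in page.get("extlinks", []):
                # Assumption: URL is under "*" (default format) or "url" (formatversion=2).
                url = entry.get("url") or entry.get("*")
                if url:
                    urls.append(url)
        if "continue" not in data:
            break
        # Copy every continuation parameter into the next request.
        params.update(data["continue"])
    return urls


# Possible usage with the requests library (performs network access):
#   import requests
#   api = "https://en.wikipedia.org/w/api.php"
#   urls = collect_external_urls(lambda p: requests.get(api, params=p).json(), "Title")
```

Merging the whole continue object into the next request, instead of picking out one key, means the loop does not depend on the exact name of the continuation parameter.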


一夜七次

Reply #3 · 2020-07-18 09:49

Here's a Python solution that gets (and prints) all the pages linked to from a particular page. It gets the maximum number of links in the first request, then looks to see if the returned JSON object has a "continue" property. If it does, it adds the "plcontinue" value to the params dictionary and makes another request. (The last page of results returned will not have this property.)

import requests

session = requests.Session()

url = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "format": "json",
    "titles": "Albert Einstein",
    "prop": "links",
    "pllimit": "max"
}

# First request: returns up to the maximum number of links per batch.
response = session.get(url=url, params=params)
data = response.json()
pages = data["query"]["pages"]

pg_count = 1
page_titles = []

print("Page %d" % pg_count)
for key, val in pages.items():
    # A missing page, or a page without links, has no "links" key.
    for link in val.get("links", []):
        print(link["title"])
        page_titles.append(link["title"])

# Keep requesting batches while the response carries a "continue" object.
while "continue" in data:
    plcontinue = data["continue"]["plcontinue"]
    params["plcontinue"] = plcontinue

    response = session.get(url=url, params=params)
    data = response.json()
    pages = data["query"]["pages"]

    pg_count += 1

    print("\nPage %d" % pg_count)
    for key, val in pages.items():
        for link in val.get("links", []):
            print(link["title"])
            page_titles.append(link["title"])

print("%d titles found." % len(page_titles))

This code was adapted from the code in the MediaWiki API:Links example.
