Use BeautifulSoup to loop through and retrieve specific URLs

Posted 2019-08-12 00:51

I want to use BeautifulSoup to retrieve a specific URL at a specific position, repeatedly. Imagine there are 4 different URL lists, each containing 100 different URL links.

I need to get and print the 3rd URL on every list, and each retrieved URL leads to the next list (e.g. the 3rd URL on the first list leads to the 2nd list, whose 3rd URL leads to the 3rd list, and so on, for 4 retrievals in total).

Yet my loop only produces the first result (the 3rd URL on list 1), and I don't know how to feed the new URL back into the while loop to continue the process.

Here is my code:

import urllib.request
import json
import ssl
from bs4 import BeautifulSoup


num=int(input('enter count times: ' ))
position=int(input('enter position: ' ))

url='https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html'
print (url)

count=0
order=0
while count<num:
    context = ssl._create_unverified_context()
    htm=urllib.request.urlopen(url, context=context).read()
    soup=BeautifulSoup(htm)
    for i in soup.find_all('a'):
        order+=1
        if order ==position:
            x=i.get('href')
            print (x)
    count+=1
    url=x        
print ('done')

2 Answers
女痞 · Answered 2019-08-12 01:20

Just get the link from find_all() by index. Note that your position input is 1-based (your order counter starts at 1), while Python list indexing is 0-based, so subtract 1:

while count < num:
    context = ssl._create_unverified_context()
    htm = urllib.request.urlopen(url, context=context).read()

    soup = BeautifulSoup(htm, 'html.parser')
    url = soup.find_all('a')[position - 1].get('href')
    print(url)

    count += 1
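The index-selection idea above can be shown without any network access. This is a minimal, dependency-free sketch that uses the standard library's html.parser in place of BeautifulSoup's find_all('a'); the LinkCollector class, nth_link helper, and the sample page string are illustrative names, not part of the original answer:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect every <a href=...> value, mimicking soup.find_all('a')."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.links.append(dict(attrs).get('href'))

def nth_link(html, position):
    """Return the 1-based position-th link, like find_all('a')[position - 1]."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links[position - 1]

page = '<a href="u1">A</a><a href="u2">B</a><a href="u3">C</a>'
print(nth_link(page, 3))  # -> u3
```

The same position - 1 adjustment applies whichever parser you use, because the question counts positions from 1.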
爱情/是我丢掉的垃圾 · Answered 2019-08-12 01:42

This is a good problem for recursion. Try calling a recursive function to do this:

import requests
from bs4 import BeautifulSoup

def retrieve_urls_recur(url, position, index, deepness):
    if index >= deepness:
        return True
    plain_text = requests.get(url).text
    soup = BeautifulSoup(plain_text, 'html.parser')
    links = soup.find_all('a')
    desired_link = links[position].get('href')
    print(desired_link)
    return retrieve_urls_recur(desired_link, position, index + 1, deepness)

and then call it with the desired parameters, in your case:

retrieve_urls_recur(url, 2, 0, 4)

2 is the (0-based) index of the URL in the list of links, 0 is the starting counter, and 4 is how deep you want to recurse.

PS: I am using requests instead of urllib, and I didn't test this, although I recently used a very similar function with success.
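The recursion itself can be checked offline by replacing the fetch-and-parse step with a plain dictionary of pages. This is a sketch under that assumption; follow_links and the pages mapping are hypothetical stand-ins for requests + BeautifulSoup, not part of the answer above:

```python
def follow_links(pages, start, position, deepness, index=0):
    """Recursively follow the position-th (0-based) link on each page.

    pages maps a URL to its list of outgoing links, standing in for the
    fetch-and-parse step. Returns the list of links visited, in order.
    """
    if index >= deepness:
        return []
    next_url = pages[start][position]
    return [next_url] + follow_links(pages, next_url, position, deepness, index + 1)

# Four toy pages, each with its "3rd URL" (index 2) pointing to the next page.
pages = {
    'p0': ['a', 'b', 'p1'],
    'p1': ['c', 'd', 'p2'],
    'p2': ['e', 'f', 'p3'],
    'p3': ['g', 'h', 'p4'],
}
print(follow_links(pages, 'p0', 2, 4))  # -> ['p1', 'p2', 'p3', 'p4']
```

Each recursive call passes the newly found link as the next start, which is exactly the step the question's while loop was missing.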
