Use BeautifulSoup to loop through and retrieve specific URLs

Posted 2019-08-12 00:51

I want to use BeautifulSoup to retrieve a specific URL at a specific position, repeatedly. Imagine there are 4 different URL lists, each containing 100 different URL links.

I need to get and print the 3rd URL on every list, and each retrieved URL leads to the next list (e.g. the 3rd URL on the first list leads to the 2nd list, whose 3rd URL leads to the 3rd list, and so on, for 4 retrievals in total).

Yet my loop only produces the first result (the 3rd URL on list 1), and I don't know how to feed the new URL back into the while loop to continue the process.

Here is my code:

import urllib.request
import json
import ssl
from bs4 import BeautifulSoup


num=int(input('enter count times: ' ))
position=int(input('enter position: ' ))

url='https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html'
print (url)

count=0
order=0
while count<num:
    context = ssl._create_unverified_context()
    htm=urllib.request.urlopen(url, context=context).read()
    soup=BeautifulSoup(htm)
    for i in soup.find_all('a'):
        order+=1
        if order ==position:
            x=i.get('href')
            print (x)
    count+=1
    url=x        
print ('done')

2 Answers
女痞 · Answered 2019-08-12 01:20

Just get the link from find_all() by index. Note that your position input is 1-based (your order counter starts at 1), while Python list indexing is 0-based, so subtract 1:

while count < num:
    context = ssl._create_unverified_context()
    htm = urllib.request.urlopen(url, context=context).read()

    soup = BeautifulSoup(htm, 'html.parser')
    url = soup.find_all('a')[position - 1].get('href')
    print(url)

    count += 1
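The index-selection idea above can be shown without any network access. This is a minimal, dependency-free sketch that uses the standard library's html.parser in place of BeautifulSoup's find_all('a'); the LinkCollector class, nth_link helper, and the sample page string are illustrative names, not part of the original answer:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect every <a href=...> value, mimicking soup.find_all('a')."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.links.append(dict(attrs).get('href'))

def nth_link(html, position):
    """Return the 1-based position-th link, like find_all('a')[position - 1]."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links[position - 1]

page = '<a href="u1">A</a><a href="u2">B</a><a href="u3">C</a>'
print(nth_link(page, 3))  # -> u3
```

The same position - 1 adjustment applies whichever parser you use, because the question counts positions from 1.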
爱情/是我丢掉的垃圾 · Answered 2019-08-12 01:42

This is a good problem for recursion. Try calling a recursive function to do this:

import requests
from bs4 import BeautifulSoup

def retrieve_urls_recur(url, position, index, deepness):
    if index >= deepness:
        return True
    plain_text = requests.get(url).text
    soup = BeautifulSoup(plain_text, 'html.parser')
    links = soup.find_all('a')
    desired_link = links[position].get('href')
    print(desired_link)
    return retrieve_urls_recur(desired_link, position, index + 1, deepness)

and then call it with the desired parameters, in your case:

retrieve_urls_recur(url, 2, 0, 4)

2 is the (0-based) index of the URL in the list of links, 0 is the starting counter, and 4 is how deep you want to recurse.

PS: I am using requests instead of urllib, and I didn't test this, although I recently used a very similar function with success.
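The recursion itself can be checked offline by replacing the fetch-and-parse step with a plain dictionary of pages. This is a sketch under that assumption; follow_links and the pages mapping are hypothetical stand-ins for requests + BeautifulSoup, not part of the answer above:

```python
def follow_links(pages, start, position, deepness, index=0):
    """Recursively follow the position-th (0-based) link on each page.

    pages maps a URL to its list of outgoing links, standing in for the
    fetch-and-parse step. Returns the list of links visited, in order.
    """
    if index >= deepness:
        return []
    next_url = pages[start][position]
    return [next_url] + follow_links(pages, next_url, position, deepness, index + 1)

# Four toy pages, each with its "3rd URL" (index 2) pointing to the next page.
pages = {
    'p0': ['a', 'b', 'p1'],
    'p1': ['c', 'd', 'p2'],
    'p2': ['e', 'f', 'p3'],
    'p3': ['g', 'h', 'p4'],
}
print(follow_links(pages, 'p0', 2, 4))  # -> ['p1', 'p2', 'p3', 'p4']
```

Each recursive call passes the newly found link as the next start, which is exactly the step the question's while loop was missing.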
