How to scrape multiple pages from one site

Posted 2019-08-30 22:48

Question:

I want to scrape multiple pages from one site. The URLs follow this pattern:

https://www.example.com/S1-3-1.html
https://www.example.com/S1-3-2.html
https://www.example.com/S1-3-3.html
https://www.example.com/S1-3-4.html
https://www.example.com/S1-3-5.html

I tried three methods to scrape all of these pages in one run, but each method only scrapes the first page. My code is below; I would really appreciate it if anyone could check it and tell me what the problem is.

Method 1:

    import requests  
    for i in range(5):      # Number of pages plus one 
        url = "https://www.example.com/S1-3-{}.html".format(i)
        r = requests.get(url)
    from bs4 import BeautifulSoup  
    soup = BeautifulSoup(r.text, 'html.parser')  
    results = soup.find_all('div', attrs={'class':'product-item item-template-0 alternative'})

Method 2:

    import urllib2,sys
    from bs4 import BeautifulSoup
    for numb in ('1', '5'):
        address = ('https://www.example.com/S1-3-' + numb + '.html')
    html = urllib2.urlopen(address).read()
    soup = BeautifulSoup(html,'html.parser')
    results = soup.find_all('div', attrs={'class':'product-item item-template-0 alternative'})

Method 3:

    import requests 
    from bs4 import BeautifulSoup  
    url = 'https://www.example.com/S1-3-1.html'
    for round in range(5):
        res = requests.get(url)
        soup = BeautifulSoup(res.text,'html.parser')
        results = soup.find_all('div', attrs={'class':'product-item item-template-0 alternative'})
        paging = soup.select('div.paging a')
        next_url = 'https://www.example.com/'+paging[-1]['href'] # paging[-1]['href'] is next page button on the page 
        url = next_url

I have read several existing answers, and I do not think this is a loop problem. As the screenshots show, I only get the results from the first page. This has been frustrating me for several days. (Screenshots: only first page results; results picture 2.)

Answer 1:

Your indentation is wrong: the parsing code needs to be inside the loop.

Try this (Method 1):

from bs4 import BeautifulSoup
import requests

for i in range(1, 6):      # pages 1 to 5; range() excludes the stop value
    url = "https://www.example.com/S1-3-{}.html".format(i)
    r = requests.get(url)
    # Parse each page inside the loop, before the next request replaces r
    soup = BeautifulSoup(r.text, 'html.parser')
    results = soup.find_all('div', attrs={'class':'product-item item-template-0 alternative'})
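
The change from range(5) to range(1, 6) matters because range(5) yields 0 to 4, which would request a page 0 and never reach page 5. Here is a minimal sketch of building the page URLs on their own, so they can be checked before any request is sent. The names BASE and page_urls are made up for this illustration and are not part of the original code:

```python
# URL pattern taken from the question; {} is replaced by the page number.
BASE = "https://www.example.com/S1-3-{}.html"


def page_urls(first_page, last_page):
    """Return the URL of every page from first_page to last_page, inclusive."""
    # range() excludes its stop value, hence last_page + 1.
    return [BASE.format(n) for n in range(first_page, last_page + 1)]


urls = page_urls(1, 5)
for url in urls:
    print(url)
```

Printing the list first is a cheap way to confirm that the loop covers exactly the pages you expect.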


Answer 2:

Your page analysis should be inside the loop, like this; otherwise, it will only process one page:

.......
    from bs4 import BeautifulSoup

    for i in range(1, 6):      # pages 1 to 5
        url = "https://www.example.com/S1-3-{}.html".format(i)
        r = requests.get(url)
        # Everything below runs once per page because it is inside the loop
        soup = BeautifulSoup(r.text, 'html.parser')
        results = soup.find_all('div', attrs={'class':'product-item item-template-0 alternative'})
........
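
To see the effect of the indentation without touching the network, the following sketch compares both versions. It assumes a hypothetical fake_fetch function that stands in for requests.get(url).text; the list names are also just for illustration:

```python
def fake_fetch(url):
    # Hypothetical stand-in for requests.get(url).text, so the sketch runs offline.
    return "<html>content of {}</html>".format(url)


urls = ["https://www.example.com/S1-3-{}.html".format(i) for i in range(1, 6)]

# Processing outside the loop: by the time this line runs, `html` holds
# only the response from the final iteration.
outside = []
for url in urls:
    html = fake_fetch(url)
outside.append(html)

# Processing inside the loop: each response is handled before the next
# request overwrites `html`.
inside = []
for url in urls:
    html = fake_fetch(url)
    inside.append(html)

print(len(outside), len(inside))
```

Collecting into a list, as `inside` does, also keeps the results of every page available after the loop instead of only the most recent one.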


Answer 3:

First, all of the statements have to be inside the loop; otherwise, they only run once, on the response from the last iteration.

Second, you could try closing the response at the end of each iteration:

import requests
from bs4 import BeautifulSoup

for i in range(1, 6):      # pages 1 to 5
    url = "https://www.example.com/S1-3-{}.html".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    results = soup.find_all('div', attrs={'class':'product-item item-template-0 alternative'})
    r.close()  # release the connection before the next request
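
If you do want each response closed, one possible variation is to let contextlib.closing from the standard library call close() for you, so it still happens when the parsing step raises an exception. The sketch below runs offline: FakeResponse and fake_get are hypothetical stand-ins for the response object and for requests.get, and whether closing helps with the original problem at all is untested:

```python
from contextlib import closing


class FakeResponse:
    """Hypothetical stand-in for a response object, so the sketch runs offline."""

    def __init__(self, url):
        self.text = "<html>{}</html>".format(url)
        self.closed = False

    def close(self):
        self.closed = True


def fake_get(url):
    # In real code this would be requests.get(url).
    return FakeResponse(url)


pages = []
responses = []
for i in range(1, 6):
    url = "https://www.example.com/S1-3-{}.html".format(i)
    # closing() calls r.close() when the with-block exits.
    with closing(fake_get(url)) as r:
        pages.append(r.text)
    responses.append(r)

print(all(r.closed for r in responses))
```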


Tags: python scrape