API capture all paginated data? (python)

Posted 2020-02-26 02:53

Question:

I'm using the requests package to hit an API (greenhouse.io). The API is paginated, so I need to loop through the pages to get all the data I want, using something like:

results = []
for i in range(1, 326 + 1):  # 326 pages, found by hand (see below)
    response = requests.get(url,
                            auth=(username, password),
                            params={'page': i, 'per_page': 100})
    if response.status_code == 200:
        results += response.json()

I know there are 326 pages by inspecting the headers attribute:

In [8]:
response.headers['link']
Out[8]:
'<https://harvest.greenhouse.io/v1/applications?page=3&per_page=100>; rel="next",<https://harvest.greenhouse.io/v1/applications?page=1&per_page=100>; rel="prev",<https://harvest.greenhouse.io/v1/applications?page=326&per_page=100>; rel="last"'

Is there any way to extract this number automatically? Using the requests package? Or do I need to use regex or something?
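
For what it's worth, requests already parses that Link header into response.links, a dict keyed by each link's rel, so no regex is needed. A minimal sketch of pulling the last page number out of the header shown above (response here is the object from the snippet up top):

from urllib.parse import urlparse, parse_qs

# requests builds response.links from the Link header;
# each entry maps a rel value to {'url': ..., 'rel': ...}
last_url = response.links['last']['url']

# read the page parameter out of the last page's query string -> 326
last_page = int(parse_qs(urlparse(last_url).query)['page'][0])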

Alternatively, should I somehow use a while loop to get all this data? What is the best way? Any thoughts?

Answer 1:

The python requests library (http://docs.python-requests.org/en/latest/) can help here. The basic steps are: (1) make the request and grab the links from the header (you'll use these to know where the last page is), then (2) loop through the pages until you reach that last page.

import requests

results = []

# first request: page 1 of the endpoint
response = requests.get('https://harvest.greenhouse.io/v1/applications',
                        auth=('APIKEY', ''))
results.extend(response.json())

# requests exposes the parsed Link header as response.links;
# keep following the rel="next" URL until the last page,
# which carries no "next" link
while 'next' in response.links:
    response = requests.get(response.links['next']['url'],
                            auth=('APIKEY', ''))
    results.extend(response.json())
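
If you'll be doing this for more than one endpoint, the same loop factors nicely into a generator. A minimal sketch: paginate is a made-up helper name, and the raise_for_status() calls are an added safeguard, not something the Greenhouse API requires.

import requests

def paginate(url, auth, **params):
    # yield every item from a Link-header-paginated endpoint
    response = requests.get(url, auth=auth, params=params)
    response.raise_for_status()
    yield from response.json()
    while 'next' in response.links:
        response = requests.get(response.links['next']['url'], auth=auth)
        response.raise_for_status()
        yield from response.json()

results = list(paginate('https://harvest.greenhouse.io/v1/applications',
                        auth=('APIKEY', ''), per_page=100))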