How to use the Load More option with a non-headless web scraper

Posted 2019-08-25 02:19

I am trying to download location details from Instagram by scraping the URL, but I am not able to use the Load More option to scrape more locations from the URL.

I would appreciate suggestions on how to modify the code, or what new code I need, to get all the locations available at that particular URL.

Code:

import re
import json

import pandas as pd
import requests
from geopy.geocoders import Nominatim

def Location_city(F_name):
    path = "D:\\Everyday_around_world\\instagram\\"
    filename = path + F_name
    url1 = "https://www.instagram.com/explore/locations/c1027234/hyderabad-india/"
    r = requests.get(url1)
    # Instagram embeds the page data as a JSON blob inside a <script> tag.
    match = re.search(r'window\._sharedData = (.*);</script>', r.text)
    a = json.loads(match.group(1))
    b = a['entry_data']['LocationsDirectoryPage'][0]['location_list']
    geolocator = Nominatim(user_agent="location_scraper")  # geopy requires a user_agent
    rows = []
    for z in b:
        if all(ord(char) < 128 for char in z['name']):  # keep ASCII-only names
            x = str(z['name'])
            print(x)
            location = geolocator.geocode(x, timeout=10000)
            if location is not None:
                rows.append({'name': z['name'], 'id': z['id'],
                             'latitude': location.latitude,
                             'longitude': location.longitude})
    pd.DataFrame(rows).to_csv(filename, header=True, index=False)

Location_city("Hyderabad_locations.csv")

Thanks in advance for the help.

1 Answer
Juvenile、少年° · 2019-08-25 02:26

The Instagram "see more" button I think you are describing adds a page number to the URL you are scraping, like so: https://www.instagram.com/explore/locations/c1027234/hyderabad-india/?page=2
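
For illustration, here is a minimal sketch of probing that pattern; fetch_location_list is a hypothetical helper name, and the sketch assumes each page still embeds its data in the window._sharedData JSON blob that your code already parses:

import re
import json
import requests

BASE_URL = "https://www.instagram.com/explore/locations/c1027234/hyderabad-india/?page="

def fetch_location_list(page_number):
    # Fetch one page and pull location_list out of the embedded JSON.
    # Returns None when the page no longer carries results.
    r = requests.get(BASE_URL + str(page_number))
    match = re.search(r'window\._sharedData = (.*);</script>', r.text)
    if match is None:
        return None
    shared_data = json.loads(match.group(1))
    try:
        return shared_data['entry_data']['LocationsDirectoryPage'][0]['location_list']
    except (KeyError, IndexError):
        return None

# Walk pages until the site stops returning locations.
page = 1
while True:
    locations = fetch_location_list(page)
    if not locations:
        break
    print("page %d: %d locations" % (page, len(locations)))
    page += 1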

To use this in your function, you can add a counter that increments the page number and loop for as long as you continue to receive results. I added a try/except to watch for the KeyError thrown when there are no more results, then set conditions to exit the loop and write the dataframe to CSV.

Modified code:

import re
import json

import pandas as pd
import requests
from geopy.geocoders import Nominatim

def Location_city(F_name):
    path = "D:\\Everyday_around_world\\instagram\\"
    filename = path + F_name
    url1 = "https://www.instagram.com/explore/locations/c1027234/hyderabad-india/?page="
    pageNumber = 1
    r = requests.get(url1 + str(pageNumber))  # grabs page 1
    geolocator = Nominatim(user_agent="location_scraper")  # geopy requires a user_agent
    rows = []
    searching = True
    while searching:
        match = re.search(r'window\._sharedData = (.*);</script>', r.text)
        a = json.loads(match.group(1))
        try:
            b = a['entry_data']['LocationsDirectoryPage'][0]['location_list']
        except KeyError:
            print("No more locations returned")
            searching = False  # will exit the while loop
            b = []  # avoids duplicating the previous page's results
        for z in b:  # skipped when there are no results
            if all(ord(char) < 128 for char in z['name']):  # keep ASCII-only names
                x = str(z['name'])
                print(x)
                location = geolocator.geocode(x, timeout=10000)
                if location is not None:
                    rows.append({'name': z['name'], 'id': z['id'],
                                 'latitude': location.latitude,
                                 'longitude': location.longitude})
        pageNumber += 1
        next_url = url1 + str(pageNumber)  # increment the page number in the url
        r = requests.get(next_url)  # fetch the next page
    # When finished looping through pages, write the collected rows to csv.
    pd.DataFrame(rows).to_csv(filename, header=True, index=False)

Location_city("Hyderabad_locations.csv")
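
Two practical notes on this approach: constructing the Nominatim geolocator once, outside the loop, avoids re-creating it for every name, and Nominatim's usage policy allows roughly one request per second, so adding a time.sleep(1) between geocode calls (and a short delay between page fetches) makes it less likely the scraper gets throttled or blocked.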