This is related to this question. I was trying to query the Glassdoor public API using the documented parameters, but kept getting a 403 Forbidden response. To make sure that the query parameters were being used to build the URL correctly, I took the composed query URL and tried it in my browser, where it worked.
Working backwards from the query that my browser was making, I managed to figure out that the user agent needs to not only be a parameter in the URL, but also needs to be passed in the header.
So putting this all together, here is code that will query the Glassdoor public API successfully:
import urllib.request as request
import requests
import json
from collections import OrderedDict

# the same user agent string is sent both as a URL parameter and as a header
user_agent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36"

# authentication information & other request parameters
params_gd = OrderedDict({
    "v": "1",
    "format": "json",
    "t.p": "xxxxxx",
    "t.k": "yyyyyyyy",
    "action": "employers",
    "employerID": "11111",
    # programmatically get the IP of the machine
    "userip": json.loads(request.urlopen("http://ip.jsontest.com/").read().decode('utf-8'))['ip'],
    "useragent": user_agent
})

# base URL of the API; requests builds the query string from params_gd
basepath_gd = 'http://api.glassdoor.com/api/api.htm'

# request the API
response_gd = requests.get(basepath_gd,
                           params=params_gd,
                           headers={"User-Agent": user_agent})

# check the response code (should be 200) & the content
print(response_gd.status_code)
print(response_gd.content)
My question is: why does the User-Agent need to be specified in the request header when it is already part of the URL parameters? Shouldn't the query work without the User-Agent header?
fg,
Some providers don't like serving data to automated tools that may simply be scraping their data. One of the ways they "can tell" that they're serving a "person" and not some sort of wacky Python script is by checking the User-Agent header that a browser normally sends.
In this specific instance, Glassdoor has published their API Terms here, and from the top of page three they state "We reserve the right to limit or block applications that make a large number of calls to the Glassdoor API that are not primarily in response to the direct actions of individual end users."
I'm inclined to think that this is enforced by looking at the User-Agent header, but most companies will not explicitly state how they enforce this. They also require that you display their logo and link to their home page on the approved webpage/site on which you display their data.
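To see why the useragent URL parameter alone may not be enough, here is a minimal sketch that builds the request without sending it and shows which User-Agent header Python would actually put on the wire. It uses only the standard library's urllib rather than requests, and the helper names build_request and effective_user_agent are made up for this illustration. It says nothing about how Glassdoor really filters requests; it only shows that the query string and the header are two separate parts of the request.

```python
import urllib.parse
import urllib.request

BROWSER_UA = ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36")


def build_request(send_header):
    """Build (but do not send) a request; hypothetical helper for illustration."""
    # The useragent value here travels in the query string only.
    query = urllib.parse.urlencode(
        {"v": "1", "format": "json", "useragent": BROWSER_UA})
    url = "http://api.glassdoor.com/api/api.htm?" + query
    headers = {"User-Agent": BROWSER_UA} if send_header else {}
    return urllib.request.Request(url, headers=headers)


def effective_user_agent(req):
    """Return the User-Agent header urllib would send for this request."""
    # urllib stores header names capitalized as "User-agent".
    explicit = req.get_header("User-agent")
    if explicit is not None:
        return explicit
    # With no explicit header, the opener's default
    # ("Python-urllib/<version>") is used instead.
    return dict(urllib.request.build_opener().addheaders)["User-agent"]


without_header = build_request(send_header=False)
with_header = build_request(send_header=True)

# The URL parameter is present in both cases...
print("useragent=" in without_header.full_url)
# ...but the header still identifies the client as a Python script
# unless it is set explicitly.
print(effective_user_agent(without_header))
print(effective_user_agent(with_header))
```

The requests library behaves analogously: unless a User-Agent header is passed, it sends its own default that identifies the client as python-requests, regardless of what appears in the query string.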
Hope this helps.