I need to implement a data scraping task and extract data from a dynamic graph. The graph is updated over time, similar to the chart of a company's stock price. I am using the requests and beautifulsoup4 libraries in Python, but I have only figured out how to scrape text and link data. I can't figure out how to get the values of the graph into a CSV file.
The graph in question can be found at - http://www.apptrace.com/app/instagram/id389801252/ranks/topfreeapplications/36
@Oliver W. has already provided a good answer, but using requests avoids having to note the network call and is overall a much nicer package than urllib.
If you want your code to be a bit more flexible, you can write a function that takes the country name and the start and end dates.
import requests
import pandas as pd
import json

def load_data(country='', start_date='2014-08-09', end_date='2014-11-1'):
    base = "http://www.apptrace.com/api/app/389801252/rankings/country/"
    extra = "?country={0}&start_date={1}&end_date={2}&device=iphone&list_type=normal&chart_subtype=iphone"
    addr = base + extra.format(country, start_date, end_date)
    page = requests.get(addr)
    json_data = page.json()  # gets the json data from the page
    ranks = json_data['rankings'][0]['ranks']
    ranks = json.dumps(ranks)  # Ensures it has valid json format
    df = pd.read_json(ranks, orient='records')
    return df
Change the settings on the webpage to see which other values the country parameter accepts (Canada is 'CAN', for example). The empty string is for the USA.
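If you want to try several country codes without editing the format string by hand, the query string can also be assembled from a dict. This is only a minimal sketch: build_url is a hypothetical helper name, its defaults mirror load_data above, and the parameter names are simply the ones visible in the URL in this thread.

```python
from urllib.parse import urlencode

BASE = "http://www.apptrace.com/api/app/389801252/rankings/country/"

def build_url(country='', start_date='2014-08-09', end_date='2014-11-1'):
    """Hypothetical helper: build the rankings URL from its query parameters."""
    params = {
        'country': country,        # '' is the USA, 'CAN' is Canada
        'start_date': start_date,
        'end_date': end_date,      # may be left empty, as in the observed call
        'device': 'iphone',
        'list_type': 'normal',
        'chart_subtype': 'iphone',
    }
    # urlencode keeps the dict order and escapes values where needed
    return BASE + "?" + urlencode(params)

print(build_url('CAN', '2014-08-08', ''))
```

The resulting string can be passed straight to requests.get in place of addr.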
The df looks like this:
         date  rank
0  2014-08-09    10
1  2014-08-10    10
2  2014-08-11     9
3  2014-08-12     8
4  2014-08-13     8
5  2014-08-14     7
6  2014-08-15     6
7  2014-08-16     8
With the pandas DataFrame in hand, you can export it to CSV, or combine many DataFrames before you export:
df = load_data()
df.to_csv("file_name.csv")
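Combining several frames before exporting could look roughly like the sketch below. The helper name combine_rankings and the 'USA' label are made up for illustration, and the two small frames are stand-ins built from the sample values quoted in this thread; in practice each frame would come from load_data(country=...).

```python
import pandas as pd

def combine_rankings(frames_by_country):
    """Stack per-country rank frames into one frame with a 'country' column."""
    labelled = [
        df.assign(country=code) for code, df in frames_by_country.items()
    ]
    return pd.concat(labelled, ignore_index=True)

# Stand-in frames using sample values from this thread (not live data)
usa = pd.DataFrame({"date": ["2014-08-09", "2014-08-10"], "rank": [10, 10]})
can = pd.DataFrame({"date": ["2014-08-08", "2014-08-09"], "rank": [10, 14]})

combined = combine_rankings({"USA": usa, "CAN": can})
combined.to_csv("all_countries.csv", index=False)  # index=False drops the row numbers
```

Keeping the country as a column rather than writing one file per country makes it easier to filter or pivot the data later.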
The data from the graph can be obtained easily if you have the correct URL. You can find this address with, for example, the "developer tools" in Firefox (check the "Network" tab for the XHR requests).
You'll see calls are being made to e.g.:
src = 'http://www.apptrace.com/api/app/389801252/rankings/country/?country=CAN&start_date=2014-08-08&end_date=&device=iphone&list_type=normal&chart_subtype=iphone'
If you call it, you'll be served a JSON reply, which you can easily load into Python:
>>> import json
>>> import urllib.request
>>> data = urllib.request.urlopen(src).read()
>>> reply = json.loads(data)
>>> ranks = reply['rankings'][0]['ranks']
>>> res = {'date': [], 'rank': []}
>>> for d in ranks:
...     res['date'].append(d['date'])
...     res['rank'].append(d['rank'])
...
>>> res['date'][:3]
['2014-08-08', '2014-08-09', '2014-08-10']
>>> res['rank'][:3]
[10, 14, 13]
You can then store the data in a CSV file using Python's built-in csv module.
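As a rough sketch of that last step, assuming res has the shape built above (two parallel lists under 'date' and 'rank'); the function name write_ranks and the file name ranks.csv are arbitrary choices:

```python
import csv

# Same structure as the res dict above, filled with the first three values shown
res = {"date": ["2014-08-08", "2014-08-09", "2014-08-10"], "rank": [10, 14, 13]}

def write_ranks(res, path):
    """Write the parallel 'date' and 'rank' lists as two CSV columns."""
    # newline="" lets the csv module control line endings itself
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["date", "rank"])                # header row
        writer.writerows(zip(res["date"], res["rank"]))  # one row per day

write_ranks(res, "ranks.csv")
```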
Could you provide a link for reference? It depends on how the graph is stored and displayed. Judging by it being dynamic like a stock ticker, there should be some text between some tags that you can grab somewhere. I have looked at examples of obtaining images and other content from websites using Beautiful Soup, so it's not impossible.
Yesterday I was working on formatting data into CSV format and got some really useful responses very quickly.
Check it out: "How can I format every other line to be merged with the line before it? (In Python)"
Something else I learnt here: if you need to harvest that data often, cron jobs are a good way to run scripts automatically.