scraping data from a dynamic graph using python+be

Posted 2020-06-29 01:23

Question:

I need to implement a data scraping task that extracts data from a dynamic graph. The graph updates over time, similar to a chart of a company's stock price. I am using the requests and beautifulsoup4 libraries in Python, but I have only figured out how to scrape text and link data. I can't figure out how to get the graph's values into a CSV file.

The graph in question can be found at http://www.apptrace.com/app/instagram/id389801252/ranks/topfreeapplications/36

Answer 1:

@Oliver W. has already provided a good answer, but using requests avoids having to note the network call and is overall a much nicer package than urllib.

If you want your code to be a bit more flexible, you can write a function that takes the country and the start and end dates.

import requests
import pandas as pd

def load_data(country='', start_date='2014-08-09', end_date='2014-11-01'):
    """Fetch the daily ranks for one country as a DataFrame."""
    base = "http://www.apptrace.com/api/app/389801252/rankings/country/"
    extra = "?country={0}&start_date={1}&end_date={2}&device=iphone&list_type=normal&chart_subtype=iphone"
    addr = base + extra.format(country, start_date, end_date)

    page = requests.get(addr)
    page.raise_for_status()  # stop early on an HTTP error
    json_data = page.json()  # decode the JSON body of the response
    ranks = json_data['rankings'][0]['ranks']  # list of {'date': ..., 'rank': ...} records
    df = pd.DataFrame(ranks)  # one row per record
    df['date'] = pd.to_datetime(df['date'])  # parse the date strings
    return df

Change the selections on the webpage to see what other values country can take (Canada is 'CAN', for example). The empty string is for the USA.
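
To try several country codes without retyping the query string, the request URL can be assembled with the standard library. This is only a minimal sketch, assuming the endpoint accepts the same parameters as in load_data above; build_url and the 'USA' label for the empty code are hypothetical names, not part of the site's API.

```python
from urllib.parse import urlencode

BASE = "http://www.apptrace.com/api/app/389801252/rankings/country/"

def build_url(country="", start_date="2014-08-09", end_date="2014-11-01"):
    """Return the rankings URL for one country code (hypothetical helper)."""
    params = {
        "country": country,  # '' is the USA, 'CAN' is Canada
        "start_date": start_date,
        "end_date": end_date,
        "device": "iphone",
        "list_type": "normal",
        "chart_subtype": "iphone",
    }
    # urlencode joins the pairs with '&' and escapes unsafe characters
    return BASE + "?" + urlencode(params)

# Label the empty code as 'USA' so every dictionary key is readable
urls = {code or "USA": build_url(code) for code in ["", "CAN"]}
```

Each URL in urls can then be passed to requests.get in the same way as addr in load_data.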

The df returned by load_data() looks like this:

    date        rank
0   2014-08-09  10
1   2014-08-10  10
2   2014-08-11  9
3   2014-08-12  8
4   2014-08-13  8
5   2014-08-14  7
6   2014-08-15  6
7   2014-08-16  8

With the pandas DataFrame in hand, you can export it to CSV, or combine several DataFrames before exporting:

df = load_data()
df.to_csv("file_name.csv")
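
Combining several DataFrames before the export could look roughly like the sketch below. It assumes each frame has the date and rank columns shown above; combine_rankings is a hypothetical helper, and the CAN ranks are made-up placeholder values standing in for real load_data('CAN') output.

```python
import pandas as pd

def combine_rankings(frames_by_country):
    """Stack per-country rank tables into one long DataFrame."""
    labelled = []
    for country, frame in frames_by_country.items():
        # assign() returns a copy, so the caller's frames stay untouched
        labelled.append(frame.assign(country=country))
    # ignore_index renumbers the rows 0..n-1 across all countries
    return pd.concat(labelled, ignore_index=True)

# Stand-in frames shaped like the output of load_data()
usa = pd.DataFrame({"date": ["2014-08-09", "2014-08-10"], "rank": [10, 10]})
can = pd.DataFrame({"date": ["2014-08-09", "2014-08-10"], "rank": [12, 11]})  # placeholder ranks

combined = combine_rankings({"USA": usa, "CAN": can})
```

Calling combined.to_csv("file_name.csv", index=False) then writes every country to a single file, with the country column telling the rows apart.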


Answer 2:

The data from the graph can be obtained easily if you have the correct URL. You can find this address with, for example, the developer tools in Firefox (check the "Network" tab for the XHR requests).

You'll see that calls are being made to, for example:

src = 'http://www.apptrace.com/api/app/389801252/rankings/country/?country=CAN&start_date=2014-08-08&end_date=&device=iphone&list_type=normal&chart_subtype=iphone'

If you call it, you'll be served a JSON reply, which you can easily load into Python:

>>> import json
>>> from urllib.request import urlopen  # Python 3 location of urlopen
>>> data = urlopen(src).read()
>>> reply = json.loads(data)
>>> ranks = reply['rankings'][0]['ranks']
>>> res = {'date': [], 'rank': []}
>>> for d in ranks:
...     res['date'].append(d['date'])
...     res['rank'].append(d['rank'])
... 
>>> res['date'][:3]
['2014-08-08', '2014-08-09', '2014-08-10']
>>> res['rank'][:3]
[10, 14, 13]

You can then store the data in a CSV file using Python's built-in csv module.
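
As a rough sketch of that last step, the two lists in res can be written out as two columns. The helper name write_ranks_csv is an assumption for illustration, and the sample res holds only the first three values printed in the session above.

```python
import csv
import io

def write_ranks_csv(res, out_file):
    """Write parallel 'date' and 'rank' lists as two CSV columns."""
    writer = csv.writer(out_file)
    writer.writerow(["date", "rank"])  # header row
    # zip pairs each date with the rank at the same position
    writer.writerows(zip(res["date"], res["rank"]))

# First three values from the interactive session
res = {"date": ["2014-08-08", "2014-08-09", "2014-08-10"], "rank": [10, 14, 13]}

# An in-memory buffer keeps the example free of side effects
buffer = io.StringIO()
write_ranks_csv(res, buffer)
csv_text = buffer.getvalue()
```

To write to disk instead, pass a file opened with open("ranks.csv", "w", newline="") in place of the buffer; the csv module documentation recommends newline="" so that line endings are not translated twice.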



Answer 3:

Could you provide a link for reference? It depends on how the graph is stored and displayed. Since it is dynamic like a stock ticker, there should be some text between tags somewhere that you can grab. I have looked at examples of obtaining images and other content from websites using Beautiful Soup, so it's not impossible.

Yesterday I was working on formatting the data into CSV format and got some really useful responses pronto.

Check it out: "How can I format every other line to be merged with the line before it? (In Python)"

Also, something I learnt here: if you need to harvest that data often, cron jobs are a good way to run scripts automatically.