I want to scrape the data from a page that shows a graph using highcharts.js.
I have finished parsing all the pages that lead up to the final one. However, that last page, the one that displays the dataset, uses highcharts.js
to draw the graph, and the raw data seems nearly impossible to access.
I use Python 3.5 with BeautifulSoup.
Is it still possible to parse it? If so, how can I scrape it?
The data is in a script tag. You can get the script tag using bs4 and a regex. You could also extract the data using a regex, but I like using js2xml to parse JS functions into an XML tree:
That gives you:
So to get all the data:
Like I said, you could just use a regex, but I find js2xml more reliable, as stray whitespace and the like won't break it.
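For comparison, a rough sketch of the regex-only route, assuming the same hypothetical script text and a flat `data` array of numbers. The `SCRIPT` string and the pattern are illustrative, not taken from the actual page.

```python
import json
import re

# Hypothetical script body; on a real page this would be script.string.
SCRIPT = """
Highcharts.chart('container', {
    series: [{ name: 'Sales', data: [1, 2.5, 3] }]
});
"""

# Capture the bracketed list that follows "data:".
# This only handles a flat array with no nested brackets.
match = re.search(r"data\s*:\s*(\[[^\]]*\])", SCRIPT)

# A flat list of numbers is also valid JSON, so json.loads can read it.
values = json.loads(match.group(1)) if match else []
print(values)
```

The `\s*` parts tolerate some spacing differences, but nested arrays (such as `[x, y]` pairs), trailing commas or single-quoted strings would defeat either the pattern or `json.loads`, which is where parsing the JavaScript properly pays off.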