Downloading Morningstar webpages for screenscraping

Posted 2019-04-13 05:21

Question:

I'd like to be able to screenscrape Morningstar webpages. Morningstar provides information about a mutual fund that I routinely look up but haven't been able to find elsewhere, namely:

  1. total return compared against benchmark
  2. total return compared against peers
  3. percentile ranking

Here's an example: morningstar example

As a prelude to screenscraping, I need to be able to download the webpage with the desired content. Unfortunately, when I use Java SE6 or wget to retrieve the above example link, I get only part of the HTML (the tables displaying the total return figures are absent). I get the same result if I use my browser (Chrome) to save the page as HTML only. However, if I use my browser to save the complete page (HTML, JS, CSS, and everything else), the downloaded HTML does contain the interesting information.

I have two questions:

  1. How can I programmatically download the entire html file? Though I'm writing this program in Java, I don't mind invoking an external tool.
  2. Why were my aforementioned attempts not yielding the HTML that I was expecting?

Thanks.

As a side note, I looked at Yahoo Finance and YQL/datatables as alternatives, but Yahoo Finance doesn't provide percentile rankings. If you look up the performance of a mutual fund, you'll see N/A values for the rankings: Yahoo Finance example. Unfortunately, this precludes using YQL/datatables.

Regarding Morningstar's copyright, I'm screenscraping for personal, non-commercial use, which their copyright notice allows in the last sentence of the second paragraph:

You are entitled to use the Information it contains for your private, non-commercial use only. Morningstar Copyright.

Answer 1:

To download the Morningstar webpage, I needed a tool that would download and interpret the JavaScript code associated with the webpage. Many such tools, for different programming languages and browsers, are mentioned on StackOverflow. Here are the ones that I wound up using:

  • HtmlUnit - a GUI-less browser for Java programs
  • htmlunitscripter - a Firefox add-on that autogenerates HtmlUnit code
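
The underlying idea (let something with a JavaScript engine build the page, then save the resulting HTML) can also be sketched without HtmlUnit, by invoking an external headless browser from Java. The following is not the HtmlUnit code referred to above; it is a rough illustration that assumes a Chrome or Chromium binary supporting the --headless and --dump-dom flags is installed. The class name, method names, and binary name are made up for the example. Data that arrives late via XMLHttpRequest may still be missing from the dump.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: asks a headless browser to render a page and returns the DOM as HTML.
final class RenderedPageDownloader {

    /** Builds the command line. browserBinary is whatever the local Chrome/Chromium executable is called. */
    static List<String> buildCommand(String browserBinary, String url) {
        return Arrays.asList(browserBinary, "--headless", "--dump-dom", url);
    }

    /** Runs the browser and captures what it prints on standard output (the serialized DOM). */
    static String download(String browserBinary, String url)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(browserBinary, url));
        // Let browser diagnostics go to our stderr so they do not mix with the HTML.
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = pb.start();

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = process.getInputStream()) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        }
        int exit = process.waitFor();
        if (exit != 0) {
            throw new IOException("browser exited with status " + exit);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: RenderedPageDownloader <browser-binary> <url>");
            return;
        }
        System.out.println(download(args[0], args[1]));
    }
}
```

The printed HTML can then be handed to whatever parser does the actual screenscraping.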


Answer 2:

The page makes extensive use of XMLHttpRequest to populate its data, which is why a plain download is missing the tables and why your scraper will have to evaluate JavaScript. If you use the developer tools in Chrome, you can see the HTML used to construct the page and the JSON data used to build the tables.
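
Since the table data arrives as JSON over XMLHttpRequest, one option that sidesteps JavaScript evaluation is to copy the request URL from the Network tab of Chrome's developer tools and fetch it directly. The sketch below only illustrates this with the Java standard library. The class and method names are made up, the data URL has to be taken from the developer tools (none is given in this thread), and the Referer and X-Requested-With headers are guesses at what a server might check, not something confirmed for Morningstar.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: requests the data URL that the page's script would otherwise request.
final class XhrFetcher {

    /** Reads a stream to the end and decodes it as UTF-8. */
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Issues a GET for dataUrl, presenting pageUrl as the referring page. */
    static String fetch(String dataUrl, String pageUrl) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(dataUrl).openConnection();
        conn.setRequestMethod("GET");
        // Assumed headers: some servers only answer requests that look like the page's own XHR.
        conn.setRequestProperty("Referer", pageUrl);
        conn.setRequestProperty("X-Requested-With", "XMLHttpRequest");
        conn.setConnectTimeout(10000);
        conn.setReadTimeout(10000);
        try (InputStream in = conn.getInputStream()) {
            return readAll(in);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: XhrFetcher <data-url> <page-url>");
            return;
        }
        System.out.println(fetch(args[0], args[1]));
    }
}
```

The response body is printed as-is; turning the JSON into return figures and rankings would still need a JSON parser.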

For scraping this, I would try using Internet Explorer, as it can host the whole page and evaluate its JavaScript. There are probably other ways, such as using the WebKit APIs, but IE should work right out of the box.



Answer 3:

Have you tried irobot at http://irobotsoft.com? You can check whether it works for this page as follows:

  • Go to the url
  • Mark the data of interest
  • Add a take data action
  • Test the action and see if it extracts the data you want

They have a forum where you can ask general screenscraping questions.