This is the link I want to scrape: http://www.prudential.com.hk/PruServlet?module=fund&purpose=searchHistFund&fundCd=MMFU_U
The "English Version" tab in the upper right-hand corner switches the web page to English.
I have to press a button before the fund information on the page becomes readable. Otherwise the view is blocked, and scrapy shell always returns an empty list [].
<div onclick="AgreeClick()" style="width:200px; padding:8px; border:1px black solid;
background-color:#cccccc; cursor:pointer;">Confirmed</div>
And the AgreeClick function is:
function AgreeClick() {
    var cookieKey = "ListFundShowDisclaimer";
    SetCookie(cookieKey, "true", null);
    Get("disclaimerDiv").style.display = "none";
    Get("blankDiv").style.display = "none";
    Get("screenDiv").style.display = "none";
    //Get("contentTable").style.display = "block";
    ShowDropDown();
}
How do I overcome this onclick="AgreeClick()" function to scrape the web page?
Use the spynner library for Python to emulate a browser and execute the client-side JavaScript.
With it, you can programmatically invoke any JavaScript function available in the source of the page, including AgreeClick().
If you also need to parse results, I highly recommend BeautifulSoup.
You cannot just click the link inside scrapy (see Click a Button in Scrapy).
First of all, check whether the data you need is already in the HTML. The disclaimer only covers the page, so the data is there in the background.
Another option is selenium:
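A minimal sketch of the selenium route, in case it helps. It assumes selenium and a Firefox driver are installed, and that the fund data is present in the page source once AgreeClick() has run. The function name fetch_with_selenium is made up for illustration, and I have not run this against the site.

```python
URL = ("http://www.prudential.com.hk/PruServlet"
       "?module=fund&purpose=searchHistFund&fundCd=MMFU_U")


def fetch_with_selenium(url=URL):
    """Open the page in a real browser, dismiss the disclaimer, return the HTML.

    Hypothetical helper: assumes Firefox plus its driver are available.
    """
    # Imported here so this module still loads where selenium is not installed.
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # Call the page's own function instead of simulating a mouse click.
        driver.execute_script("AgreeClick();")
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page fails.
        driver.quit()
```

Calling fetch_with_selenium() should give you the rendered HTML as a string, which you can then hand to whatever parser you like.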
One more option is to use mechanize. It cannot execute JS code, but, according to the source code, AgreeClick just sets the cookie ListFundShowDisclaimer to true, so sending that cookie yourself is a starting point (not sure if it works). Then, you can parse the result with BeautifulSoup or whatever you prefer.
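Here is a rough sketch of the cookie idea. It uses the standard library's urllib.request in place of mechanize, purely to keep the example dependency-free. The helper names build_request and fetch_html are made up, and it assumes the server only checks for the ListFundShowDisclaimer cookie, which I have not verified.

```python
import urllib.request

URL = ("http://www.prudential.com.hk/PruServlet"
       "?module=fund&purpose=searchHistFund&fundCd=MMFU_U")


def build_request(url=URL):
    """Build a GET request that already carries the disclaimer cookie."""
    request = urllib.request.Request(url)
    # Same cookie that AgreeClick() sets in the browser.
    request.add_header("Cookie", "ListFundShowDisclaimer=true")
    return request


def fetch_html(url=URL, timeout=30):
    """Download the page with the cookie set and return it as text."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        # Fall back to UTF-8 if the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

html = fetch_html() would then give you the page text to parse. If the result still lacks the fund data, the page probably builds it with JavaScript and you will need one of the browser-based options instead.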