I am trying to use R to scrape data from a website.
I would like to go through each link on the following page: http://capitol.hawaii.gov/advreports/advreport.aspx?year=2013&report=deadline&rpt_type=&measuretype=hb&title=House Bills
I want to select only the items whose Current Status shows "transmitted to the governor". For example: http://capitol.hawaii.gov/measure_indiv.aspx?billtype=HB&billnumber=17&year=2013
And then scrape the cells within STATUS TEXT for the clause "Passed Final Reading". For example: Passed Final Reading as amended in SD 2 with Representative(s) Fale, Jordan, Tsuji voting aye with reservations; Representative(s) Cabanilla, Morikawa, Oshiro, Tokioka voting no (4) and none excused (0).
I have tried to follow previous examples using the RCurl and XML packages (in R), but I don't know how to use them correctly for .aspx sites. So what I would love to have is: 1. Some suggestions on how to build such a script, and 2. Recommendations on how to learn what is needed to perform such a task.
Thanks for any help,
Tom
Let the httr package handle all the background work by setting up a handle.
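A minimal sketch, assuming the httr and XML packages; the query parameters are taken from the URL above, and the XPath used to find the measure links is an assumption you should verify against the real page:

library(httr)
library(XML)

# One handle reuses the same connection and cookies across requests to the site.
h <- handle("http://capitol.hawaii.gov")

# Fetch the report page listing the 2013 House Bills.
res <- GET(handle = h, path = "/advreports/advreport.aspx",
           query = list(year = "2013", report = "deadline", rpt_type = "",
                        measuretype = "hb", title = "House Bills"))
doc <- htmlParse(content(res, as = "text"))

# Collect every link to an individual measure page.
links <- xpathSApply(doc, "//a[contains(@href, 'measure_indiv')]/@href")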
If you want to run over all 92 links:
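Again a sketch rather than tested code; filtering each page's status cells on the clause you want, and the politeness delay, are my assumptions:

# Visit each measure page and keep the "Passed Final Reading" entries.
results <- lapply(links, function(lnk) {
  page <- GET(paste0("http://capitol.hawaii.gov/", lnk), handle = h)
  pdoc <- htmlParse(content(page, as = "text"))
  cells <- xpathSApply(pdoc, "//table//td", xmlValue)
  Sys.sleep(1)  # be polite to the server
  cells[grepl("Passed Final Reading", cells)]
})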
I am trying to read a page with .aspx and followed your approach, but I am not able to get the data into R.
Here is the query:
library(httr)
library(XML)
# A handle for this site (the earlier handle pointed at capitol.hawaii.gov).
h <- handle("https://www.morningstar.in")
res <- GET(handle = h,
           path = "/mutualfunds/f0gbr06rnd/hdfc-medium-term-debt-plan-growth/detailed-portfolio.aspx")
# Parse the content and look for "Top 10 Holding".
resXML <- htmlParse(content(res, as = "text"))
resTable <- getNodeSet(resXML, '/*[@id="quotePageContent"]/div/div/div[2]/div/div[2]/div[2]/div[1]/div[1]/table/tbody/tr[12]/tr')
appRows <- sapply(resTable, xmlValue)
include <- grepl("Top 10 Holding", appRows)
And the result is as follows:
logical(0)
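logical(0) just means that appRows came back empty: the getNodeSet() call matched nothing. Two likely causes, judging only from the XPath shown: browser developer tools insert tbody into copied XPaths even when the raw HTML contains none, and tr[12]/tr asks for a row nested inside a row, which cannot match. As a sketch of a more forgiving check, reusing resXML from the query above, you could read every table in the static HTML and search them for the text:

# Search all tables for the text instead of trusting one long copied XPath.
tables <- readHTMLTable(resXML, stringsAsFactors = FALSE)
hits <- Filter(function(t) any(grepl("Top 10 Holding", unlist(t), fixed = TRUE)),
               tables)

If hits is empty as well, the holdings table is probably filled in by JavaScript after the page loads and never appears in the raw HTML; in that case a browser-driving tool such as RSelenium would be needed.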