Scraping from aspx website using R

Posted 2020-02-09 14:33


I am trying to use R to scrape data from a website.

  1. I would like to go through each link on the following page: Bills

  2. Select only items with Current Status showing "transmitted to the governor". For example,

  3. Then scrape the cells under STATUS TEXT for the clause "Passed Final Reading". For example: Passed Final Reading as amended in SD 2 with Representative(s) Fale, Jordan, Tsuji voting aye with reservations; Representative(s) Cabanilla, Morikawa, Oshiro, Tokioka voting no (4) and none excused (0).

I have tried adapting earlier examples that use the RCurl and XML packages (in R), but I don't know how to use them correctly for aspx sites. What I would love to have is: 1. Some suggestions on how to write such code. 2. Recommendations on how to learn what is needed for this kind of task.

Thanks for any help,




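library(httr)  # handle(), GET(), content()
library(XML)   # htmlParse(), getNodeSet(), xpathSApply(), xmlValue()

basePage <- ""  # base URL of the site goes here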

h <- handle(basePage)

GET(handle = h)

res <- GET(handle = h, path = "/advreports/advreport.aspx?year=2013&report=deadline&rpt_type=&measuretype=hb&title=House")

# parse content for "Transmitted to Governor" text
resXML <- htmlParse(content(res, as = "text"))
resTable <- getNodeSet(resXML, '//*/table[@id ="GridViewReports"]/tr/td[3]')
appRows <- sapply(resTable, xmlValue)
include <- grepl("Transmitted to Governor", appRows)
resUrls <- xpathSApply(resXML, '//*/table[@id ="GridViewReports"]/tr/td[2]//@href')

appUrls <- resUrls[include]

# look at just the first

res <- GET(handle = h, path = appUrls[1])

resXML <- htmlParse(content(res, as = "text"))

xpathSApply(resXML, '//*[text()[contains(.,"Passed Final Reading")]]', xmlValue)

[1] "Passed Final Reading as amended in SD 2 with Representative(s) Fale, Jordan,
 Tsuji voting aye with reservations; Representative(s) Cabanilla, Morikawa, Oshiro,
 Tokioka voting no (4) and none excused (0)."
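
The row filter used above can be tried without network access on a small inline table. This is only a sketch: the markup and the `/bill/...` paths are made up for illustration, and the real report page may be structured differently. It also assumes every row has exactly one link in the second column, so that the status vector and the link vector line up.

```r
library(XML)

# Hypothetical stand-in for the report page (the real markup may differ).
html <- '
<html><body>
<table id="GridViewReports">
  <tr><td>1</td><td><a href="/bill/1">HB1</a></td><td>Transmitted to Governor.</td></tr>
  <tr><td>2</td><td><a href="/bill/2">HB2</a></td><td>Referred to committee.</td></tr>
  <tr><td>3</td><td><a href="/bill/3">HB3</a></td><td>Transmitted to Governor.</td></tr>
</table>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# Third column holds the status text, second column holds the link.
status <- xpathSApply(doc, '//table[@id="GridViewReports"]/tr/td[3]', xmlValue)
links  <- xpathSApply(doc, '//table[@id="GridViewReports"]/tr/td[2]//a/@href')

# Keep only the links whose row status matches.
keep     <- grepl("Transmitted to Governor", status, fixed = TRUE)
selected <- unname(links[keep])
print(selected)
```

If the two vectors ever differ in length (for example, a row without a link), the logical index no longer lines up with the links, so it is worth checking `length(status) == length(links)` before filtering.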

Let the httr package handle the background work by setting up a handle, which is reused for every request to the site.

If you want to run over all 92 links:

 # get all the links, returned as a list (this will take some time)
 # the print statement is included for sanity
 res <- lapply(seq_along(appUrls), function(i) {
   print(sprintf("Got url no. %d", i))
   GET(handle = h, path = appUrls[[i]])
 })
 resXML <- lapply(res, function(x) htmlParse(content(x, as = "text")))
 appString <- sapply(resXML, function(x) {
   xpathSApply(x, '//*[text()[contains(.,"Passed Final Reading")]]', xmlValue)
 })


>  head(appString)
[1] "Passed Final Reading as amended in SD 2 with Representative(s) Fale, Jordan, Tsuji voting aye with reservations; Representative(s) Cabanilla, Morikawa, Oshiro, Tokioka voting no (4) and none excused (0)."

[1] "Passed Final Reading, as amended (CD 1). 25 Aye(s); Aye(s) with reservations: none . 0 No(es): none.  0 Excused: none."                                                  
[2] "Passed Final Reading as amended in CD 1 with Representative(s) Cullen, Har voting aye with reservations; Representative(s) McDermott voting no (1) and none excused (0)."

[1] "Passed Final Reading, as amended (CD 1). 25 Aye(s); Aye(s) with reservations: none . 0 No(es): none.  0 Excused: none."                                 
[2] "Passed Final Reading as amended in CD 1 with none voting aye with reservations; Representative(s) Hashem, McDermott voting no (2) and none excused (0)."

[1] "Passed Final Reading, as amended (CD 1). 24 Aye(s); Aye(s) with reservations: none . 0 No(es): none.  1 Excused: Ige."                    
[2] "Passed Final Reading as amended in CD 1 with none voting aye with reservations; none voting no (0) and Representative(s) Say excused (1)."

[1] "Passed Final Reading, as amended (CD 1). 25 Aye(s); Aye(s) with reservations: none . 0 No(es): none.  0 Excused: none."                        
[2] "Passed Final Reading as amended in CD 1 with Representative(s) Johanson voting aye with reservations; none voting no (0) and none excused (0)."

[1] "Passed Final Reading, as amended (CD 1). 25 Aye(s); Aye(s) with reservations: none . 0 No(es): none.  0 Excused: none."  
[2] "Passed Final Reading as amended in CD 1 with none voting aye with reservations; none voting no (0) and none excused (0)."
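
Since some pages yield one matching cell and others two, the result above is a list with elements of different lengths. One way to make it easier to work with, sketched here under assumptions, is to flatten it into a data frame with one row per match. The values of `appUrls` and `appString` below are short placeholders rather than real scraped data, and `matches` is a name introduced only for this example.

```r
# Placeholder values; in a real session these come from the scraping code.
appUrls <- c("/bill/1", "/bill/2")  # hypothetical paths
appString <- list(
  "Passed Final Reading (single match)",
  c("Passed Final Reading (first match)", "Passed Final Reading (second match)")
)

# Repeat each URL once per matching string so both columns have equal length.
matches <- data.frame(
  url  = rep(unname(appUrls), times = lengths(appString)),
  text = unlist(appString),
  stringsAsFactors = FALSE
)
print(matches)
```

A page with no match contributes zero rows, because `lengths()` returns 0 for it and `rep()` then drops that URL.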


I am trying to read an .aspx page and followed your approach, but I am not able to get the data into R.

Here is my code:

res <- GET(handle = h, path = "")

# parse content for "Top 10 Holding" text

resXML <- htmlParse(content(res, as = "text"))

resTable <- getNodeSet(resXML, '/*[@id="quotePageContent"]/div/div/div[2]/div/div[2]/div[2]/div[1]/div[1]/table/tbody/tr[12]/tr')

appRows <- sapply(resTable, xmlValue)

include <- grepl("Top 10 Holding", appRows)

The results are as follows:


appRows    list()
resTable   NULL
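
Without the page URL it is hard to say for certain why nothing is returned, but two features of the XPath above commonly produce an empty result and can be checked offline. First, a path starting with `/*[@id=...]` only tests the root element, whereas `//*[@id=...]` searches the whole document. Second, paths copied from a browser often contain `tbody`, which browsers insert into tables but which is usually absent from the HTML that `htmlParse` sees. The sketch below uses made-up markup to show both effects; `rooted`, `with_tbody` and `relative` are names chosen for this example.

```r
library(XML)

# Hypothetical markup; the real page is not shown in the post.
html <- '<html><body><div id="quotePageContent"><table><tr><td>Top 10 Holding</td></tr></table></div></body></html>'
doc <- htmlParse(html, asText = TRUE)

# Anchored at the root: /* is <html>, which does not carry the id.
rooted <- getNodeSet(doc, '/*[@id="quotePageContent"]//td')

# Browser-style path with tbody: the parsed tree has no <tbody> element.
with_tbody <- getNodeSet(doc, '//*[@id="quotePageContent"]/table/tbody/tr/td')

# Search anywhere for the id, then descend without naming tbody.
relative <- getNodeSet(doc, '//*[@id="quotePageContent"]//td')

print(length(rooted))
print(length(with_tbody))
print(sapply(relative, xmlValue))
```

If a shorter, relative path still returns nothing, the table may be filled in by JavaScript after the page loads, in which case it is not in the HTML that `GET` downloads and a different approach is needed.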