R: Scraping aspx with content from doPostBack script

Posted 2019-06-10 06:55

Question:

My plan:

I would like to scrape drug information published by the Swiss government, for a university research project, from:

http://www.spezialitaetenliste.ch/ShowPreparations.aspx?searchType=Substance&searchValue=

The site does offer a robots.txt file; however, its content is freely available to the public, and I assume that scraping this data is not prohibited.

What I've already achieved:

I managed to scrape the HTML table of the first page of search results:

library("rvest")
library("dplyr)")

url<-"http://www.spezialitaetenliste.ch/ShowPreparations.aspx?searchType=Substance&searchValue="

read_html(url) %>%
  html_nodes(xpath='//*[@id="ctl00_cphContent_gvwPreparations"]') %>%
  html_table() %>%
  bind_rows() %>%
  tibble()

Now I would like to scrape the details of the listed drugs (which appear at the bottom of the page when I click on the link in the Präparat (= preparation) column). However, this link is not plain HTML; it triggers a doPostBack JavaScript call.

I figured out that these calls follow this pattern:

javascript:__doPostBack('ctl00$cphContent$gvwPreparations$ctl02$ctl00','')
javascript:__doPostBack('ctl00$cphContent$gvwPreparations$ctl03$ctl00','')
...
javascript:__doPostBack('ctl00$cphContent$gvwPreparations$ctl16$ctl00','')

so the event target is ctl00$cphContent$gvwPreparations$ctl<NN>$ctl00, where <NN> is the position of the drug in the list plus 1, zero-padded to two digits.
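
For example, the event target for any row of the current page can be generated programmatically (a small sketch; the helper name event_target is mine):

# build the __EVENTTARGET for row i of the current page
# (row 1 maps to ctl02, row 15 to ctl16, per the pattern above)
event_target <- function(i) {
  sprintf("ctl00$cphContent$gvwPreparations$ctl%02d$ctl00", i + 1)
}

event_target(1)  # "ctl00$cphContent$gvwPreparations$ctl02$ctl00"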

Where I fail:

I tried to implement this solution to get the detailed drug information:

Scrape website with R by navigating doPostBack

however,

url<-"http://www.spezialitaetenliste.ch/ShowPreparations.aspx?searchType=Substance&searchValue="

pgsession<-html_session(url)
pgform<-html_form(pgsession)[[1]]
page<-rvest:::request_POST(pgsession,"http://www.spezialitaetenliste.ch/ShowPreparations.aspx?searchType=Substance&searchValue=",
                           body=list(
                             `__VIEWSTATE`=pgform$fields$`__VIEWSTATE`$value,
                             `__EVENTTARGET`="ctl00$cphContent$gvwPreparations$ctl02$ctl00",
                             `__EVENTARGUMENT`="",
                             `__VIEWSTATEGENERATOR`=pgform$fields$`__VIEWSTATEGENERATOR`$value,
                             `__VIEWSTATEENCRYPTED`=pgform$fields$`__VIEWSTATEENCRYPTED`$value,
                             `__EVENTVALIDATION`=pgform$fields$`__EVENTVALIDATION`$value
                           ),
                           encode="form"
)


read_html(page) %>% html_nodes(css="ctl00_cphContent_fvwPreparation") 

gives me {xml_nodeset (0)}

I assumed that my request_POST was not correct; however, I could not figure out what was wrong.

So the open points are:

  • What is the correct way to get the detailed information behind the doPostBack JavaScript links (i.e., the table that appears at the bottom of the page when one clicks on the link in the first column)?

  • How can I get the results from the other pages (935 in total)? Do I have to use RSelenium and click through the results? If so, how can the RSelenium and rvest approaches be combined? Or is there an easier way?

UPDATE

I could solve the first point (at least partially) with this hint from hrbrmstr:

https://www.queryxchange.com/q/27_51801321/getting-xml-nodeset-0-when-using-html-nodes-from-rvest-package-in-r/

read_html(page) %>% html_nodes(xpath = '//*[@id="ctl00_cphContent_fvwPreparation"]') %>% html_text

which gives me the detailed information (in a bit an unstructured form).
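
Building on this, a minimal sketch that collects the detail text for every row of the current page (assumptions: get_detail is a hypothetical helper of mine, the hidden state fields are re-read from the session passed in, and event_target() is the helper sketched above):

# hypothetical helper: POST the doPostBack event for one row and extract the detail text
get_detail <- function(session, url, row) {
  form <- html_form(session)[[1]]
  resp <- rvest:::request_POST(
    session, url,
    body = list(
      `__VIEWSTATE` = form$fields$`__VIEWSTATE`$value,
      `__EVENTTARGET` = event_target(row),
      `__EVENTARGUMENT` = "",
      `__VIEWSTATEGENERATOR` = form$fields$`__VIEWSTATEGENERATOR`$value,
      `__VIEWSTATEENCRYPTED` = form$fields$`__VIEWSTATEENCRYPTED`$value,
      `__EVENTVALIDATION` = form$fields$`__EVENTVALIDATION`$value
    ),
    encode = "form"
  )
  read_html(resp) %>%
    html_nodes(xpath = '//*[@id="ctl00_cphContent_fvwPreparation"]') %>%
    html_text()
}

details <- lapply(1:15, function(i) get_detail(pgsession, url, i))  # 15 rows per page by default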

However, I still do not know how to get the information from all the other pages.

With Selenium I would do something like:

library("seleniumPipes")  # remoteDr(), go(), findElement() and elementClick() come from seleniumPipes, not RSelenium

# start a Selenium session
sel <- remoteDr(browserName = "chrome", port = 4445L)

# go to the URL 
sel %>% 
  go("http://www.spezialitaetenliste.ch/ShowPreparations.aspx?searchType=Substance&searchValue=")


# choose the maximum number of results per page
sel %>%
  findElement(using = 'xpath', "//*/option[@value = '100']") %>%  # find the "100 per page" option
  elementClick()                                                  # click it

However, I do not know how to combine Selenium and rvest.
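
One way to combine the two: let Selenium render and paginate, then hand the current DOM to rvest. A minimal sketch, assuming seleniumPipes' getPageSource() returns the rendered page source:

# grab the rendered page from the Selenium session and parse it with rvest
# (if getPageSource() returns a raw string rather than a parsed document,
#  wrap it in read_html())
sel %>%
  getPageSource() %>%
  html_nodes(xpath = '//*[@id="ctl00_cphContent_gvwPreparations"]') %>%
  html_table() %>%
  bind_rows()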

Is it possible to select the maximum number of displayed results via the URL, with something like

http://www.spezialitaetenliste.ch/ShowPreparations.aspx?PageSize=500

And then choose the next page with

http://www.spezialitaetenliste.ch/ShowPreparations.aspx?PageSize=500&PageNr=2
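
Those query-string parameters are pure speculation on my part; I have not confirmed that the site supports them. An alternative worth trying, assuming the grid uses the standard ASP.NET GridView pager convention (the grid itself as __EVENTTARGET and "Page$<n>" as __EVENTARGUMENT), would be a hypothetical helper like:

# assumption: standard GridView paging postback
goto_page <- function(session, url, n) {
  form <- html_form(session)[[1]]
  rvest:::request_POST(
    session, url,
    body = list(
      `__VIEWSTATE` = form$fields$`__VIEWSTATE`$value,
      `__EVENTTARGET` = "ctl00$cphContent$gvwPreparations",
      `__EVENTARGUMENT` = paste0("Page$", n),
      `__VIEWSTATEGENERATOR` = form$fields$`__VIEWSTATEGENERATOR`$value,
      `__VIEWSTATEENCRYPTED` = form$fields$`__VIEWSTATEENCRYPTED`$value,
      `__EVENTVALIDATION` = form$fields$`__EVENTVALIDATION`$value
    ),
    encode = "form"
  )
}

page2 <- goto_page(pgsession, url, 2)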

UPDATE 2

Since I made some progress, I opened a new, more precise question: R: scraping data after POST only works for first page