Harvesting data via a drop-down list in R

Posted 2019-06-13 04:08

I am trying to harvest data from this website

http://www.lkcr.cz/seznam-lekaru-426.html (it's in Czech)

I need to go through every possible combination of "Okres" (region) and "Obor" (specialization).

I tried rvest, but it does not seem to find any drop-down list: html_form returns a list of length 0.

As I am still a newbie in R, how can I "ask" the webpage to show me the results for each new combination?

Thank you

JH

Tags: r rvest
1 answer
姐就是有狂的资本
#2 · 2019-06-13 04:44

I'd use the following:

library(rvest)
library(dplyr)
library(tidyr)

pg <- read_html("http://www.lkcr.cz/seznam-lekaru-426.html")

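# <option> nodes of the "Obor" (specialization) drop-down, selected by name
obor <- html_nodes(pg, "select[name='filterObor'] > option")
# tibble() replaces the deprecated data_frame(); html_attr()/html_text()
# come from rvest, so xml2 does not need to be attached separately
obor_df <- tibble(
  value=html_attr(obor, "value"),
  option=html_text(obor)
)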

glimpse(obor_df)
## Observations: 115
## Variables: 2
## $ value  <chr> "", "16", "107", "17", "1", "19", "20", "21", "22", "29...
## $ option <chr> "", "alergologie a klinická imunologie", "algeziologie"...

# <option> nodes of the "Okres" (region) drop-down, selected by name
okres <- html_nodes(pg, "select[name='filterOkresId'] > option")
okres_df <- tibble(
  value=html_attr(okres, "value"),
  option=html_text(okres)
)

glimpse(okres_df)
## Observations: 78
## Variables: 2
## $ value  <chr> "", "3201", "3202", "3701", "3702", "3703", "3801", "37...
## $ option <chr> "", "Benešov", "Beroun", "Blansko", "Brno-město", "Brno...

Targeting each select by its name attribute keeps this working in case the field order ever changes (plus it's good to get familiar with targeting nodes with CSS selectors and XPath selectors).
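
For comparison, here is a small sketch of the same lookup done once with a CSS selector and once with XPath. The HTML string is a hypothetical, cut-down stand-in for the real page (only the select name and a few option values are taken from the output above), so treat it as an illustration rather than the site's actual markup.

```r
library(rvest)

# Hypothetical, cut-down stand-in for the form markup on the page.
doc <- read_html('<html><body><form>
  <select name="filterObor">
    <option value=""></option>
    <option value="16">alergologie</option>
    <option value="107">algeziologie</option>
  </select>
</form></body></html>')

# CSS selector: <option> children of the <select> named filterObor.
css_opts <- html_nodes(doc, "select[name='filterObor'] > option")

# XPath expression that targets the same nodes.
xpath_opts <- html_nodes(doc, xpath = "//select[@name='filterObor']/option")

css_values   <- html_attr(css_opts, "value")
xpath_values <- html_attr(xpath_opts, "value")
```

Both calls should pick up the same three option nodes, so either style can be used for the two drop-downs.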

You still need to iterate over each pair (you can do that with nested purrr::map calls; I personally probably wouldn't use expand.grid or tidyr::complete for this).
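
A minimal sketch of that nested iteration might look like the following. The two value vectors are short stand-ins for the value columns of obor_df and okres_df (with the empty "" option assumed to be dropped), and fetch_pair is a hypothetical placeholder for whatever request you end up making per combination.

```r
library(purrr)

# Stand-ins for obor_df$value and okres_df$value, empty option removed.
obor_values  <- c("16", "107", "17")
okres_values <- c("3201", "3202")

# Hypothetical placeholder: here it only records the pair it was given.
fetch_pair <- function(okres, obor) {
  list(okres = okres, obor = obor)
}

# Outer map walks the regions, inner map walks the specializations.
results <- map(okres_values, function(okres) {
  map(obor_values, function(obor) fetch_pair(okres, obor))
})

# Flatten one level so there is a single list with one element per pair.
pairs <- unlist(results, recursive = FALSE)
```

With 2 regions and 3 specializations this gives 6 elements; with the full lists from the page it would be one element per Okres/Obor combination, so it is probably worth pausing between real requests.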

BUT…

You're going to have issues submitting the form with rvest since the site uses JavaScript to do some data processing before submitting.

You should use Chrome and open up Developer Tools to see what actually gets submitted field-wise, and probably switch to using httr::POST. If you have trouble with that, you should open up a new question on SO.
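
As a rough starting point, a POST with httr could be sketched like this. It assumes the submitted fields are simply the two select names (filterOkresId and filterObor) sent form-encoded to the same URL; that is an unverified guess, and the helper names build_body and fetch_results are made up for the example. Whatever Developer Tools shows in the real request should replace these assumptions.

```r
library(httr)

# Assumed form fields: the names of the two <select> elements.
# The real request may carry extra or differently named fields.
build_body <- function(okres_id, obor_id) {
  list(filterOkresId = okres_id, filterObor = obor_id)
}

# Hypothetical helper: POSTs one combination and returns the response
# HTML as text, which can then be parsed with rvest::read_html().
fetch_results <- function(okres_id, obor_id,
                          url = "http://www.lkcr.cz/seznam-lekaru-426.html") {
  resp <- POST(url, body = build_body(okres_id, obor_id), encode = "form")
  stop_for_status(resp)  # fail loudly on HTTP errors
  content(resp, as = "text", encoding = "UTF-8")
}

# Body for one example combination (values taken from the output above).
example_body <- build_body("3201", "16")
```

Calling fetch_results("3201", "16") would then send the request for one combination; it is left uncalled here so the sketch does not hit the site.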
