R data scraping / crawling with dynamic/multiple URLs

Posted 2019-10-28 08:36

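I am trying to get all decrees of the Federal Supreme Court of Switzerland. Unfortunately, no API is provided, so I have to scrape them from: https://www.bger.ch/ext/eurospider/live/de/php/aza/http/index.php?lang=de&type=simple_query&query_words=&lang=de&top_subcollection_aza=all&from_date=&to_date=&x=12&y=12

The CSS selector of the data I want to retrieve is .para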

I am aware of http://relevancy.bger.ch/robots.txt:

User-agent: *
Disallow: /javascript
Disallow: /css
Disallow: /hashtables
Disallow: /stylesheets
Disallow: /img
Disallow: /php/jurivoc
Disallow: /php/taf
Disallow: /php/azabvger
Sitemap: http://relevancy.bger.ch/sitemaps/sitemapindex.xml
Crawl-delay: 2

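To me it looks like the URL I am looking at is allowed to be crawled. Is that correct? In any case, the Federal Court explains that these rules target big search engines and that individual crawling is tolerated.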
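
A rough way to check this locally is to compare the path of the search URL against the Disallow prefixes listed above. The sketch below is a simplification: `is_allowed()` is a hypothetical helper that does plain prefix matching only (no wildcard or Allow handling), and it assumes the rules published on relevancy.bger.ch also apply to www.bger.ch.

```r
# Disallow rules copied from the robots.txt quoted above
disallow <- c("/javascript", "/css", "/hashtables", "/stylesheets",
              "/img", "/php/jurivoc", "/php/taf", "/php/azabvger")

# Hypothetical helper: TRUE when no Disallow prefix matches the path.
# Simplification: plain prefix matching, no wildcard or Allow handling.
is_allowed <- function(path, rules = disallow) {
  !any(startsWith(path, rules))
}

is_allowed("/ext/eurospider/live/de/php/aza/http/index.php")  # TRUE
is_allowed("/php/taf/index.php")                              # FALSE
```

Under these assumptions the search path is not covered by any Disallow rule, so only the Crawl-delay of 2 seconds needs to be respected.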

I can retrieve the data of a single decree (using https://www.analyticsvidhya.com/blog/2017/03/beginners-guide-on-web-scraping-in-r-using-rvest-with-hands-on-knowledge/ ):

library(rvest)

url <- 'https://www.bger.ch/ext/eurospider/live/de/php/aza/http/index.php?lang=de&type=highlight_simple_query&page=1&from_date=&to_date=&sort=relevance&insertion_date=&top_subcollection_aza=all&query_words=&rank=1&azaclir=aza&highlight_docid=aza%3A%2F%2F18-12-2017-6B_790-2017&number_of_ranks=113971'

webpage <- read_html(url)

decree_html <- html_nodes(webpage,'.para')

rank_data <- html_text(decree_html)

decree1_data <- html_text(decree_html)

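However, since rvest only extracts data from one specific page and my data is spread over multiple pages, I tried to do this with Rcrawler ( https://github.com/salimk/Rcrawler ), but I do not know how to crawl the given site structure on www.bger.ch to get all the URLs of the decrees.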




Answer 1:

I don't do error handling below since that's beyond the scope of this question.

Let's start with the usual suspects:

library(rvest)
library(httr)
library(tidyverse)


We'll define a function that will get us a page of search results by page number. I've hard-coded the search parameters since you provided the URL.

In this function, we:

  • get the page HTML
  • get the links to the documents we want to scrape
  • get document metadata
  • make a data frame
  • add attributes to the data frame for page number grabbed and whether there are more pages to grab

It's pretty straightforward:

get_page <- function(page_num=1) {

  GET(
    url = "https://www.bger.ch/ext/eurospider/live/de/php/aza/http/index.php",
    query = list( # search parameters from the URL in the question; "page" is assumed to select the results page
      lang = "de",
      type = "simple_query",
      query_words = "",
      top_subcollection_aza = "all",
      from_date = "",
      to_date = "",
      x = "12",
      y = "12",
      page = page_num
    )
  ) -> res

  warn_for_status(res) # shld be "stop" and you should do error handling

  pg <- content(res)

  links <- html_nodes(pg, "div.ranklist_content ol li")

  tibble(
    link = html_attr(html_nodes(links, "a"), "href"),
    title = html_text(html_nodes(links, "a"), trim=TRUE),
    court = html_text(html_nodes(links, xpath=".//a/../../div/div[contains(@class, 'court')]"), trim=TRUE), # these are "dangerous" if they aren't there but you can wrap error handling around this
    subject = html_text(html_nodes(links, xpath=".//a/../../div/div[contains(@class, 'subject')]"), trim=TRUE),
    object = html_text(html_nodes(links, xpath=".//a/../../div/div[contains(@class, 'object')]"), trim=TRUE)
  ) -> xdf

  # this looks for the "Vorwärts" link in the bottom paginator. if there's no link then we're done

  attr(xdf, "page") <- page_num
  attr(xdf, "has_next") <- xml2::xml_find_lgl(pg, "boolean(.//a[contains(., 'Vorwärts')])")

  xdf

}
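
To see what get_page() asks the server for, the request URL can be assembled by hand. This is a minimal sketch, assuming the parameter names from the search URL in the question and assuming the site accepts a `page` parameter (it appears in the single-decree URL); `build_search_url()` is a hypothetical helper, not part of httr.

```r
base_url <- "https://www.bger.ch/ext/eurospider/live/de/php/aza/http/index.php"

# Hypothetical helper: paste the hard-coded search parameters and the
# requested page number into a query string. None of the values used
# here need URL encoding.
build_search_url <- function(page_num = 1) {
  params <- c(
    lang = "de", type = "simple_query", query_words = "",
    top_subcollection_aza = "all", from_date = "", to_date = "",
    x = "12", y = "12", page = as.character(page_num)
  )
  paste0(base_url, "?", paste0(names(params), "=", params, collapse = "&"))
}

build_search_url(3)
```

Opening the resulting URL in a browser is a quick way to confirm that the page number really moves through the result list before starting a long crawl.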



Make a helper function since I can't stand typing attr(...) and it reads better in use:

has_next <- function(x) { attr(x, "has_next") } 

Now, make a scraping loop. I stop at 6 just b/c. You should remove that logic for scraping everything. Consider doing this in batches since internet connections are unstable things:

pg_num <- 0
all_links <- list()

repeat {
  cat(".") # poor dude's progress bar
  pg_num <- pg_num + 1
  pg_df <- get_page(pg_num)
  all_links <- append(all_links, list(pg_df)) # keep the page before checking for more, so the last page isn't dropped
  if (!has_next(pg_df)) break
  if (pg_num == 6) break # this is here for me since I don't need ~11,000 documents
  Sys.sleep(2) # robots.txt crawl delay
}
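
For the batching suggested above, one option is to split the page numbers into fixed-size chunks and run the loop once per chunk. A minimal sketch in base R; `page_batches()` and the batch size of 50 are assumptions for illustration.

```r
# Hypothetical helper: split page numbers first..last into batches of `size`
page_batches <- function(first, last, size = 50) {
  pages <- seq(first, last)
  split(pages, ceiling(seq_along(pages) / size))
}

batches <- page_batches(1, 120, size = 50)
lengths(batches)  # three batches with 50, 50 and 20 pages
```

Each batch can then be scraped with get_page() and written to disk before the next one starts, so a dropped connection only costs one batch.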

Turn the list of data frames into one big one. NOTE: You should do validity tests before this since web scraping is fraught with peril. You should also save off this data frame to an RDS file so you don't have to do it again.

lots_of_links <- bind_rows(all_links)

glimpse(lots_of_links)

## Observations: 60
## Variables: 5
## $ link    <chr> "https://www.bger.ch/ext/eurospider/live/de/php/aza/http/index.php?lang=de&type=highlight_simple_query&...
## $ title   <chr> "18.12.2017 6B 790/2017", "14.12.2017 6G 2/2017", "13.12.2017 5A 975/2017", "13.12.2017 5D 257/2017", "...
## $ court   <chr> "Strafrechtliche Abteilung", "Cour de droit pénal", "II. zivilrechtliche Abteilung", "II. zivilrechtlic...
## $ subject <chr> "Straf- und Massnahmenvollzug", "Procédure pénale", "Familienrecht", "Schuldbetreibungs- und Konkursrec...
## $ object  <chr> "Bedingte Entlassung aus der Verwahrung, Beschleunigungsgebot", "Demande d'interprétation et de rectifi...
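
The validity check and the RDS caching mentioned above could look roughly like this. It is a sketch on stand-in data: `demo_pages`, `is_valid_page()` and the file location are assumptions, and the real check should test for whatever columns you rely on.

```r
# Stand-in for the scraped pages: two small data frames with the same columns
demo_pages <- list(
  data.frame(link = "https://example.org/a", title = "A", stringsAsFactors = FALSE),
  data.frame(link = "https://example.org/b", title = "B", stringsAsFactors = FALSE)
)

# Hypothetical check: every page is a data frame with rows and the expected columns
is_valid_page <- function(df, cols = c("link", "title")) {
  is.data.frame(df) && nrow(df) > 0 && all(cols %in% names(df))
}
stopifnot(all(vapply(demo_pages, is_valid_page, logical(1))))

# Save the combined result, then read it back (e.g. in a later session)
rds_path <- file.path(tempdir(), "lots_of_links.rds")  # assumed location
saveRDS(do.call(rbind, demo_pages), rds_path)
restored <- readRDS(rds_path)
nrow(restored)  # 2
```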

With all the links in hand, we'll get the documents.

Define a helper function. NOTE we aren't parsing here. Do that separately. We'll store the inner content <div> HTML text so you can parse it later.

get_documents <- function(urls) {
  map_chr(urls, ~{
    cat(".") # poor dude's progress bar
    Sys.sleep(2) # robots.txt crawl delay
    read_html(.x) %>%
      html_node("div.content") %>%
      as.character() # we do this b/c we aren't parsing it yet but xml2 objects don't serialize at all
  })
}

Here's how to use it. Again, remove head() but also consider doing it in batches.

head(lots_of_links) %>% # I'm not waiting for 60 documents
  mutate(content = get_documents(link)) -> links_and_docs

glimpse(links_and_docs)

## Observations: 6
## Variables: 6
## $ link    <chr> "https://www.bger.ch/ext/eurospider/live/de/php/aza/http/index.php?lang=de&type=highlight_simple_query&...
## $ title   <chr> "18.12.2017 6B 790/2017", "14.12.2017 6G 2/2017", "13.12.2017 5A 975/2017", "13.12.2017 5D 257/2017", "...
## $ court   <chr> "Strafrechtliche Abteilung", "Cour de droit pénal", "II. zivilrechtliche Abteilung", "II. zivilrechtlic...
## $ subject <chr> "Straf- und Massnahmenvollzug", "Procédure pénale", "Familienrecht", "Schuldbetreibungs- und Konkursrec...
## $ object  <chr> "Bedingte Entlassung aus der Verwahrung, Beschleunigungsgebot", "Demande d'interprétation et de rectifi...
## $ content <chr> "<div class=\"content\">\n      \n<div class=\"para\"> </div>\n<div class=\"para\">Bundesgericht </div>...
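
The separate parsing step could then pull the `.para` text out of each stored string. A minimal sketch, assuming the stored HTML has the structure visible in the `content` column above; `extract_paras()` is a hypothetical helper and the sample string is a shortened stand-in, not a real decree.

```r
library(rvest)

# Stand-in for one stored `content` value (structure taken from the output above)
doc_html <- "<div class=\"content\">\n<div class=\"para\"> </div>\n<div class=\"para\">Bundesgericht </div>\n</div>"

# Hypothetical helper: return the non-empty .para texts of one document
extract_paras <- function(html_string) {
  paras <- html_text(html_nodes(read_html(html_string), ".para"), trim = TRUE)
  paras[nzchar(paras)]
}

extract_paras(doc_html)  # "Bundesgericht"
```

Applied to the whole data frame, something like `map(links_and_docs$content, extract_paras)` would give one character vector of paragraphs per decree.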

You still need error & validity checking in various places and may need to re-scrape pages if there are server errors or parsing issues. But this is how to build a site-specific crawler of this nature.
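
As one possible shape for the re-scraping part, a small retry wrapper in base R is sketched below. `with_retry()` and `flaky()` are hypothetical names, and the retry count and wait time are arbitrary choices.

```r
# Hypothetical helper: call f() up to `tries` times, pausing `wait` seconds
# between attempts; re-raises the last error if every attempt fails.
with_retry <- function(f, tries = 3, wait = 2) {
  for (i in seq_len(tries)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    if (i < tries) Sys.sleep(wait)
  }
  stop(result)
}

# Demo with a function that fails twice before succeeding
attempts <- 0
flaky <- function() {
  attempts <<- attempts + 1
  if (attempts < 3) stop("temporary server error")
  "ok"
}
with_retry(flaky, tries = 3, wait = 0)  # "ok"
```

In the crawler this would wrap a call such as `with_retry(function() get_page(pg_num))`. Note that get_page() as written only warns on HTTP errors, so it would need stop_for_status() instead of warn_for_status() for a failed request to trigger a retry.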

Source: R data scraping / crawling with dynamic/multiple URLs