R: posting search forms and scraping results

Posted 2019-04-16 16:34

Question:

I'm a beginner in web scraping and not yet familiar with the nomenclature for the problems I'm trying to solve. Nevertheless, I've searched exhaustively for this specific problem and was unable to find a solution. If it's already answered somewhere else, I apologize in advance and welcome your suggestions.

Getting to the point. I'm trying to build a script in R that will:
1. Search for specific keywords on a newspaper website;
2. Return the headlines, dates, and contents for the number of results/pages I want.

I already know how to post the form for the search and scrape the results from the first page, but I've had no success so far in getting the content from the next pages. To be honest, I don't even know where to start (I've read about RCurl and so on, but it still hasn't made much sense to me).

Below is a partial sample of the code I've written so far (scraping only the headlines from the first page to keep it simple).

library(RCurl)
library(XML)

# Set up a curl handle that keeps cookies and follows redirects
curl <- getCurlHandle()
curlSetOpt(cookiefile = "cookies.txt", curl = curl, followlocation = TRUE)
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))

# Submit the search form with the keyword
search <- getForm("http://www.washingtonpost.com/newssearch/search.html",
                  .params = list(st = "Dilma Rousseff"),
                  .opts = curlOptions(followlocation = TRUE),
                  curl = curl)

# Parse the HTML and extract the headline text from the first results page
results <- htmlParse(search)
results <- xmlRoot(results)
results <- getNodeSet(results, "//div[@class='pb-feed-headline']/h3")
results <- unlist(lapply(results, xmlValue))

I understand that I could perform the search directly on the website, inspect the URL for references to the page number or to the number of the news article displayed on each page, and then use a loop to scrape each page.

But please bear in mind that after I learn how to go from page 1 to pages 2, 3, and so on, I will try to extend my script to perform multiple searches with different keywords on different websites, all at the same time, so the solution in the previous paragraph doesn't seem like the best one to me so far.

If you have any other solution to suggest, I will gladly embrace it. I hope I've stated my issue clearly enough to get a share of your ideas and perhaps help others facing similar issues. Thank you all in advance.

Best regards

Answer 1:

First, I'd recommend using httr instead of RCurl; for most problems it's much easier to work with.

r <- GET("http://www.washingtonpost.com/newssearch/search.html", 
  query = list(
    st = "Dilma Rousseff"
  )
)
stop_for_status(r)
content(r)

Second, if you look at the URL in your browser, you'll notice that clicking a page number modifies the startat query parameter:

r <- GET("http://www.washingtonpost.com/newssearch/search.html", 
  query = list(
    st = "Dilma Rousseff",
    startat = 10
  )
)
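
Since startat is just an offset into the results, stepping through pages is a matter of incrementing it in a loop. A minimal sketch, assuming results come ten per page (which is what startat = 10 for the second page suggests):

library(httr)

# Request the first three pages of results: page n starts at offset (n - 1) * 10
offsets <- c(0, 10, 20)
responses <- lapply(offsets, function(offset) {
  GET("http://www.washingtonpost.com/newssearch/search.html",
      query = list(st = "Dilma Rousseff", startat = offset))
})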

Third, you might want to try out my experimental rvest package. It makes it easier to extract information from a web page:

# devtools::install_github("hadley/rvest")
library(rvest)

page <- read_html(r)   # parse the httr response
links <- html_elements(page, ".pb-feed-headline a")
html_attr(links, "href")   # article URLs
html_text(links)           # headline text

I highly recommend reading the selectorgadget tutorial and using that to figure out what CSS selectors you need.
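
Putting the pieces together, a rough sketch of a loop over result pages might look like the following. The ten-results-per-page assumption and the .pb-feed-headline selector are carried over from the code above, so check them against the live site before relying on this:

library(httr)
library(rvest)

# Hypothetical helper: scrape headlines and links for n_pages of results
scrape_search <- function(keyword, n_pages = 3) {
  do.call(rbind, lapply(seq_len(n_pages), function(p) {
    r <- GET("http://www.washingtonpost.com/newssearch/search.html",
             query = list(st = keyword, startat = (p - 1) * 10))
    stop_for_status(r)
    links <- html_elements(read_html(r), ".pb-feed-headline a")
    data.frame(keyword  = keyword,
               page     = p,
               headline = html_text(links),
               url      = html_attr(links, "href"),
               stringsAsFactors = FALSE)
  }))
}

# The same function can then be mapped over several keywords, or adapted to
# other sites that expose similar query parameters
results <- scrape_search("Dilma Rousseff", n_pages = 2)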



Tags: r rcurl