I am using the rvest
package to scrape information from the page http://www.radiolab.org/series/podcasts. After scraping the first page, I want to follow the "Next" link at the bottom, scrape that second page, move onto the third page, etc.
The following line gives an error:
html_session("http://www.radiolab.org/series/podcasts") %>% follow_link("Next")
## Navigating to
##
## ./2/
## Error in parseURI(u) : cannot parse URI
##
## ./2/
Inspecting the HTML shows there is some extra cruft around the "./2/" that rvest
apparently doesn't like:
html("http://www.radiolab.org/series/podcasts") %>% html_node(".pagefooter-next a")
## <a href=" ./2/ ">Next</a>
.Last.value %>% html_attrs()
## href
## "\n \n ./2/ "
Question 1:
How can I get rvest::follow_link
to treat this link correctly like my browser does? (I could manually grab the "Next" link and clean it up with regex, but prefer to take advantage of the automation provided with rvest
.)
At the end of the follow_link
code, it calls jump_to
. So I tried the following:
html_session("http://www.radiolab.org/series/podcasts") %>% jump_to("./2/")
## <session> http://www.radiolab.org/series/2/
## Status: 404
## Type: text/html; charset=utf-8
## Size: 10744
## Warning message:
## In request_GET(x, url, ...) : client error: (404) Not Found
Digging into the code, it looks like the issue is with XML::getRelativeURL
, which uses dirname
to strip off the last part of the original path ("/podcasts"):
XML::getRelativeURL("./2/", "http://www.radiolab.org/series/podcasts/")
## [1] "http://www.radiolab.org/series/./2"
XML::getRelativeURL("../3/", "http://www.radiolab.org/series/podcasts/2/")
## [1] "http://www.radiolab.org/series/3"
Question 2:
How can I get rvest::jump_to
and XML::getRelativeURL
to correctly handle relative paths?