I use the XML package to get the links from this URL.
# Parse HTML URL
v1WebParse <- htmlParse(v1URL)
# Read the links and get the quotes of the companies from the href attributes
t1Links <- data.frame(xpathSApply(v1WebParse, '//a', xmlGetAttr, 'href'))
While this method is very efficient, I've used rvest, which seems faster at parsing a web page than XML. I tried html_nodes and html_attrs but I can't get it to work.
Despite my comment, here's how you can do it with rvest. Note that we need to read in the page with htmlParse first, since the site has the content-type set to text/plain for that file and that tosses rvest into a tizzy.
library(rvest)
library(XML)
pg <- htmlParse("http://www.bvl.com.pe/includes/empresas_todas.dat")
pg %>% html_nodes("a") %>% html_attr("href")
## [1] "/inf_corporativa71050_JAIME1CP1A.html" "/inf_corporativa10400_INTEGRC1.html"
## [3] "/inf_corporativa66100_ACESEGC1.html" "/inf_corporativa71300_ADCOMEC1.html"
## ...
## [273] "/inf_corporativa64801_VOLCAAC1.html" "/inf_corporativa58501_YURABC11.html"
## [275] "/inf_corporativa98959_ZNC.html"
That further illustrates rvest's XML package underpinnings.
UPDATE
rvest::read_html() can handle this directly now:
pg <- read_html("http://www.bvl.com.pe/includes/empresas_todas.dat")
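For completeness, the same selector chain shown earlier in this answer works on the read_html() result, so the whole thing no longer needs the XML package at all:

```r
library(rvest)

pg <- read_html("http://www.bvl.com.pe/includes/empresas_todas.dat")
pg %>% html_nodes("a") %>% html_attr("href")
```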
I know you're looking for an rvest answer, but here's another way using the XML package that might be more efficient than what you're doing.
Have you seen the getLinks() function in example(htmlParse)? I use this modified version from the examples to get href links. It's a handler function, so we can collect the values as they are read, saving memory and increasing efficiency.
library(XML)

links <- function(URL) {
  # Handler that accumulates href attributes as <a> nodes are parsed
  getLinks <- function() {
    links <- character()
    list(a = function(node, ...) {
           links <<- c(links, xmlGetAttr(node, "href"))
           node
         },
         links = function() links)
  }
  h1 <- getLinks()
  htmlTreeParse(URL, handlers = h1)
  h1$links()
}
links("http://www.bvl.com.pe/includes/empresas_todas.dat")
# [1] "/inf_corporativa71050_JAIME1CP1A.html"
# [2] "/inf_corporativa10400_INTEGRC1.html"
# [3] "/inf_corporativa66100_ACESEGC1.html"
# [4] "/inf_corporativa71300_ADCOMEC1.html"
# [5] "/inf_corporativa10250_HABITAC1.html"
# [6] "/inf_corporativa77900_PARAMOC1.html"
# [7] "/inf_corporativa77935_PUCALAC1.html"
# [8] "/inf_corporativa77600_LAREDOC1.html"
# [9] "/inf_corporativa21000_AIBC1.html"
# ...
# ...
# Option 1
library(XML)  # getHTMLLinks() is in the XML package, not RCurl
getHTMLLinks('http://www.bvl.com.pe/includes/empresas_todas.dat')
# Option 2
library(rvest)
library(pipeR) # %>>% will be faster than %>%
html("http://www.bvl.com.pe/includes/empresas_todas.dat") %>>% html_nodes("a") %>>% html_attr("href")
Richard's answer works for HTTP pages but not the HTTPS page I needed (Wikipedia). I substituted RCurl's getURL function as below:
library(RCurl)
library(XML)

links <- function(URL) {
  # Same handler as in Richard's answer: collect hrefs as <a> nodes are parsed
  getLinks <- function() {
    links <- character()
    list(a = function(node, ...) {
           links <<- c(links, xmlGetAttr(node, "href"))
           node
         },
         links = function() links)
  }
  h1 <- getLinks()
  # Fetch over HTTPS with RCurl first, then parse the returned text
  xData <- getURL(URL)
  htmlTreeParse(xData, handlers = h1)
  h1$links()
}
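For example (using an arbitrary Wikipedia page as the HTTPS URL; any HTTPS page should work the same way):

```r
links("https://en.wikipedia.org/wiki/Main_Page")
```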