Read HTML Table Into Data Frame with Hyperlinks in

2019-08-16 17:28发布

问题:

I am trying to read an HTML table from a publicly-accessible website into a data frame in R. The final column of the table contains hyperlinks, and I would like to read these hyperlinks into the table rather than the text that is displayed on the webpage. I've reviewed several posts here on StackOverflow and on other sites and have gotten almost there, but I haven't been able to read the hyperlinks themselves.

The table I'm trying to read is here: http://mis.ercot.com/misapp/GetReports.do?reportTypeId=12300&reportTitle=LMPs%20by%20Resource%20Nodes,%20Load%20Zones%20and%20Trading%20Hubs&showHTMLView=&mimicKey.

The final column contains hyperlinks that point to the actual data in *.ZIP file format for download. I've managed to read the table into R as text, but I can't figure out how to resolve the hyperlinks in the final column.

Here's what I have so far:

library(XML)
webURL <- 'http://mis.ercot.com/misapp/GetReports.do?reportTypeId=12300&reportTitle=LMPs%20by%20Resource%20Nodes,%20Load%20Zones%20and%20Trading%20Hubs&showHTMLView=&mimicKey'
page <- htmlParse( webURL )
tableNodes <- getNodeSet( sitePage, "//table" )
myTable <- readHTMLTable( tableNodes[[3]] )

However, this contains the text in the final column, not the hyperlink. How do I replace the word "zip" in the final column of this table in R with the values for the corresponding hyperlink in each row?

回答1:

This code will let you target either the XML files or the CSV files and you get the filename as well as the URL so you can then iterate over the URLs and filenames and save them with names you'll recognize later on.

library(rvest)
library(dplyr)

pg <- read_html("http://mis.ercot.com/misapp/GetReports.do?reportTypeId=12300&reportTitle=LMPs%20by%20Resource%20Nodes,%20Load%20Zones%20and%20Trading%20Hubs&showHTMLView=&mimicKey")

csv_fils <- html_nodes(pg, xpath=".//td[contains(@class, 'labelOptional_ind') and contains(., 'csv')]/..")

data_frame(
  fil_name = html_nodes(csv_fils, "td.labelOptional_ind") %>% html_text(),
  url = html_nodes(csv_fils, xpath=".//td[4]/div/a") %>% html_attr("href")
) -> csv_df

glimpse(csv_df)
## Observations: 1,560
## Variables: 2
## $ fil_name <chr> "cdr.00012300.0000000000000000.20170729.094015151.LMPSROSNODENP6788_20170729_094011_csv.zip", "cdr...
## $ url      <chr> "/misdownload/servlets/mirDownload?mimic_duns=&doclookupId=572923018", "/misdownload/servlets/mirD...

xml_fils <- html_nodes(pg, xpath=".//td[contains(@class, 'labelOptional_ind') and contains(., 'xml')]/..")

data_frame(
  fil_name = html_nodes(xml_fils, "td.labelOptional_ind") %>% html_text(),
  url = html_nodes(xml_fils, xpath=".//td[4]/div/a") %>% html_attr("href")
) -> xml_df

glimpse(xml_df)
## Observations: 1,560
## Variables: 2
## $ fil_name <chr> "cdr.00012300.0000000000000000.20170729.094015016.LMPSROSNODENP6788_20170729_094011_xml.zip", "cdr...
## $ url      <chr> "/misdownload/servlets/mirDownload?mimic_duns=&doclookupId=572923015", "/misdownload/servlets/mirD...


回答2:

I find using the rvest package easier than XML.

Here is a solution to obtain a list of the links:

webURL <- 'http://mis.ercot.com/misapp/GetReports.do?reportTypeId=12300&reportTitle=LMPs%20by%20Resource%20Nodes,%20Load%20Zones%20and%20Trading%20Hubs&showHTMLView=&mimicKey'

library(rvest)

page<-read_html(webURL)
links<-page %>% html_nodes("a") %>% html_attr("href")