Read HTML Table Into Data Frame with Hyperlinks in R

Posted 2019-08-16 17:34

I am trying to read an HTML table from a publicly-accessible website into a data frame in R. The final column of the table contains hyperlinks, and I would like to read these hyperlinks into the table rather than the text that is displayed on the webpage. I've reviewed several posts here on StackOverflow and on other sites and have gotten almost there, but I haven't been able to read the hyperlinks themselves.

The table I'm trying to read is here: http://mis.ercot.com/misapp/GetReports.do?reportTypeId=12300&reportTitle=LMPs%20by%20Resource%20Nodes,%20Load%20Zones%20and%20Trading%20Hubs&showHTMLView=&mimicKey.

The final column contains hyperlinks that point to the actual data in *.ZIP file format for download. I've managed to read the table into R as text, but I can't figure out how to resolve the hyperlinks in the final column.

Here's what I have so far:

library(XML)
webURL <- 'http://mis.ercot.com/misapp/GetReports.do?reportTypeId=12300&reportTitle=LMPs%20by%20Resource%20Nodes,%20Load%20Zones%20and%20Trading%20Hubs&showHTMLView=&mimicKey'
page <- htmlParse( webURL )
tableNodes <- getNodeSet( page, "//table" )
myTable <- readHTMLTable( tableNodes[[3]] )

However, this contains the text in the final column, not the hyperlink. How do I replace the word "zip" in the final column of this table in R with the values for the corresponding hyperlink in each row?

2 Answers
干净又极端
#2 · 2019-08-16 17:58

This code lets you target either the XML files or the CSV files. You get the filename as well as the URL for each file, so you can then iterate over both and save the downloads under names you'll recognize later on.

library(rvest)
library(dplyr)

pg <- read_html("http://mis.ercot.com/misapp/GetReports.do?reportTypeId=12300&reportTitle=LMPs%20by%20Resource%20Nodes,%20Load%20Zones%20and%20Trading%20Hubs&showHTMLView=&mimicKey")

csv_fils <- html_nodes(pg, xpath=".//td[contains(@class, 'labelOptional_ind') and contains(., 'csv')]/..")

tibble(
  fil_name = html_nodes(csv_fils, "td.labelOptional_ind") %>% html_text(),
  url = html_nodes(csv_fils, xpath=".//td[4]/div/a") %>% html_attr("href")
) -> csv_df

glimpse(csv_df)
## Observations: 1,560
## Variables: 2
## $ fil_name <chr> "cdr.00012300.0000000000000000.20170729.094015151.LMPSROSNODENP6788_20170729_094011_csv.zip", "cdr...
## $ url      <chr> "/misdownload/servlets/mirDownload?mimic_duns=&doclookupId=572923018", "/misdownload/servlets/mirD...

xml_fils <- html_nodes(pg, xpath=".//td[contains(@class, 'labelOptional_ind') and contains(., 'xml')]/..")

tibble(
  fil_name = html_nodes(xml_fils, "td.labelOptional_ind") %>% html_text(),
  url = html_nodes(xml_fils, xpath=".//td[4]/div/a") %>% html_attr("href")
) -> xml_df

glimpse(xml_df)
## Observations: 1,560
## Variables: 2
## $ fil_name <chr> "cdr.00012300.0000000000000000.20170729.094015016.LMPSROSNODENP6788_20170729_094011_xml.zip", "cdr...
## $ url      <chr> "/misdownload/servlets/mirDownload?mimic_duns=&doclookupId=572923015", "/misdownload/servlets/mirD...
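
To follow through on iterating over the URLs and filenames, here is a minimal sketch rather than a tested downloader. It assumes the relative `url` values resolve against the host that served the page (`http://mis.ercot.com/`). The names `absolute_urls()`, `download_all()`, and `dest_dir` are hypothetical and not part of the answer above.

```r
library(xml2)

# Assumption: the scraped hrefs are relative to the host that served the page.
base_url <- "http://mis.ercot.com/"

# Turn the scraped relative hrefs into absolute URLs.
absolute_urls <- function(rel_urls, base = base_url) {
  xml2::url_absolute(rel_urls, base)
}

# Hypothetical helper: download every file listed in a data frame that has
# `fil_name` and `url` columns (such as csv_df or xml_df) into `dest_dir`.
download_all <- function(df, dest_dir = "ercot_zips") {
  dir.create(dest_dir, showWarnings = FALSE)
  dest <- file.path(dest_dir, df$fil_name)
  full <- absolute_urls(df$url)
  for (i in seq_along(full)) {
    # mode = "wb" keeps the ZIP archives intact on Windows
    download.file(full[i], dest[i], mode = "wb")
  }
  invisible(dest)
}
```

Calling `download_all(csv_df)` would then fetch each archive under its original file name. With more than a thousand rows it may be worth testing on `head(csv_df)` first.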
我命由我不由天
#3 · 2019-08-16 18:14

I find using the rvest package easier than XML.

Here is a solution to obtain a list of the links:

webURL <- 'http://mis.ercot.com/misapp/GetReports.do?reportTypeId=12300&reportTitle=LMPs%20by%20Resource%20Nodes,%20Load%20Zones%20and%20Trading%20Hubs&showHTMLView=&mimicKey'

library(rvest)

page<-read_html(webURL)
links<-page %>% html_nodes("a") %>% html_attr("href")
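
Note that selecting every `a` node returns all links on the page, not only those in the table. To put the links into the table itself, as the question asks, one option is to read the table as text and then overwrite its last column with one `href` per row. The sketch below runs on a small made-up HTML snippet (`sample_html` is hypothetical), so the selectors would need adjusting for the real page.

```r
library(rvest)

# Hypothetical stand-in for the real page: a table whose last column holds links.
sample_html <- '
<table>
  <tr><th>Name</th><th>Download</th></tr>
  <tr><td>report_a_csv.zip</td><td><a href="/download?id=1">zip</a></td></tr>
  <tr><td>report_b_xml.zip</td><td><a href="/download?id=2">zip</a></td></tr>
</table>'

page <- read_html(sample_html)
tbl_node <- html_node(page, "table")

# Read the table as displayed text first.
my_table <- html_table(tbl_node, header = TRUE)

# Collect one href per data row: the link inside each row's last cell.
# html_node() (singular) yields exactly one result per row (NA if the cell
# has no link), so the vector stays aligned with the table rows.
rows  <- html_nodes(tbl_node, xpath = ".//tr[td]")
hrefs <- rows %>%
  html_node(xpath = "./td[last()]//a") %>%
  html_attr("href")

# Replace the displayed text ("zip") with the link targets.
my_table[[ncol(my_table)]] <- hrefs
```

Using `html_node()` per row instead of `html_nodes()` on the whole table is the design choice that matters here: a row without a link produces `NA` instead of silently shifting the later links up by one.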