Read HTML Table Into Data Frame with Hyperlinks in R

I am trying to read an HTML table from a publicly-accessible website into a data frame in R. The final column of the table contains hyperlinks, and I would like to read these hyperlinks into the table rather than the text that is displayed on the webpage. I've reviewed several posts here on StackOverflow and on other sites and have gotten almost there, but I haven't been able to read the hyperlinks themselves.

The table I'm trying to read is here: http://mis.ercot.com/misapp/GetReports.do?reportTypeId=12300&reportTitle=LMPs%20by%20Resource%20Nodes,%20Load%20Zones%20and%20Trading%20Hubs&showHTMLView=&mimicKey.

The final column contains hyperlinks that point to the actual data in *.ZIP file format for download. I've managed to read the table into R as text, but I can't figure out how to resolve the hyperlinks in the final column.

Here's what I have so far:

library(XML)
webURL <- 'http://mis.ercot.com/misapp/GetReports.do?reportTypeId=12300&reportTitle=LMPs%20by%20Resource%20Nodes,%20Load%20Zones%20and%20Trading%20Hubs&showHTMLView=&mimicKey'
page <- htmlParse( webURL )
# Collect every <table> node from the parsed page; the report table is the third one
tableNodes <- getNodeSet( page, "//table" )
myTable <- readHTMLTable( tableNodes[[3]] )

However, this contains the text in the final column, not the hyperlink. How do I replace the word "zip" in the final column of this table in R with the values for the corresponding hyperlink in each row?

Tags: html r xml hyperlink rvest

2 Answers

干净又极端

#2 · 2019-08-16 17:58

This code lets you target either the XML files or the CSV files. You get the filename as well as the URL for each one, so you can iterate over both and save the downloads under names you'll recognize later on.

library(rvest)
library(dplyr)

pg <- read_html("http://mis.ercot.com/misapp/GetReports.do?reportTypeId=12300&reportTitle=LMPs%20by%20Resource%20Nodes,%20Load%20Zones%20and%20Trading%20Hubs&showHTMLView=&mimicKey")

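# Select the table rows whose file-name cell mentions "csv"
csv_fils <- html_nodes(pg, xpath=".//td[contains(@class, 'labelOptional_ind') and contains(., 'csv')]/..")

# Pair each file name with the href of its download link
# (tibble() replaces the deprecated data_frame())
tibble(
  fil_name = html_nodes(csv_fils, "td.labelOptional_ind") %>% html_text(),
  url = html_nodes(csv_fils, xpath=".//td[4]/div/a") %>% html_attr("href")
) -> csv_df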

glimpse(csv_df)
## Observations: 1,560
## Variables: 2
## $ fil_name <chr> "cdr.00012300.0000000000000000.20170729.094015151.LMPSROSNODENP6788_20170729_094011_csv.zip", "cdr...
## $ url      <chr> "/misdownload/servlets/mirDownload?mimic_duns=&doclookupId=572923018", "/misdownload/servlets/mirD...

# Same approach for the rows whose file-name cell mentions "xml"
xml_fils <- html_nodes(pg, xpath=".//td[contains(@class, 'labelOptional_ind') and contains(., 'xml')]/..")

tibble(
  fil_name = html_nodes(xml_fils, "td.labelOptional_ind") %>% html_text(),
  url = html_nodes(xml_fils, xpath=".//td[4]/div/a") %>% html_attr("href")
) -> xml_df

glimpse(xml_df)
## Observations: 1,560
## Variables: 2
## $ fil_name <chr> "cdr.00012300.0000000000000000.20170729.094015016.LMPSROSNODENP6788_20170729_094011_xml.zip", "cdr...
## $ url      <chr> "/misdownload/servlets/mirDownload?mimic_duns=&doclookupId=572923015", "/misdownload/servlets/mirD...
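
The URLs above are relative to the site root, so they need the host prepended before they can be downloaded. Here is a rough sketch of the "iterate over the URLs and filenames" step. The `links_df` data frame is a made-up stand-in for `csv_df` (its file names and `doclookupId` values are invented), and `site_root` and `download_all()` are hypothetical names introduced only for illustration. It assumes every href starts with `/`.

```r
# Made-up stand-in for csv_df: same column names, invented values.
links_df <- data.frame(
  fil_name = c("example_a_csv.zip", "example_b_csv.zip"),
  url = c("/misdownload/servlets/mirDownload?mimic_duns=&doclookupId=1",
          "/misdownload/servlets/mirDownload?mimic_duns=&doclookupId=2"),
  stringsAsFactors = FALSE
)

# Assumption: hrefs are root-relative, so prepending the host is enough.
site_root <- "http://mis.ercot.com"
links_df$full_url <- paste0(site_root, links_df$url)

# Hypothetical helper: save each file under its fil_name in dest_dir.
download_all <- function(df, dest_dir) {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  dest <- file.path(dest_dir, df$fil_name)
  for (i in seq_len(nrow(df))) {
    # mode = "wb" keeps the zip archives intact on Windows
    download.file(df$full_url[i], dest[i], mode = "wb")
  }
  invisible(dest)
}
```

Calling `download_all(links_df, "ercot_zips")` would then fetch the files. With 1,560 rows in the real table you may want to subset first, e.g. `head(csv_df, 5)`.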


我命由我不由天

#3 · 2019-08-16 18:14

I find using the rvest package easier than XML.

Here is a solution to obtain a list of the links:

webURL <- 'http://mis.ercot.com/misapp/GetReports.do?reportTypeId=12300&reportTitle=LMPs%20by%20Resource%20Nodes,%20Load%20Zones%20and%20Trading%20Hubs&showHTMLView=&mimicKey'

library(rvest)

page <- read_html(webURL)
# Collect the href attribute of every <a> element on the page
links <- page %>% html_nodes("a") %>% html_attr("href")
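
To put the links back into the table itself, which is what the question asks for, something along these lines might work. This is only a sketch on a small inline table that I made up. The real page's markup may differ, so the XPath expressions would likely need adjusting.

```r
library(rvest)

# Hypothetical miniature of the report table (invented names and hrefs).
html <- '
<table>
  <tr><th>Name</th><th>Download</th></tr>
  <tr><td>report_a_csv.zip</td><td><a href="/download?id=1">zip</a></td></tr>
  <tr><td>report_b_xml.zip</td><td><a href="/download?id=2">zip</a></td></tr>
</table>'

page <- read_html(html)
tbl_node <- html_node(page, "table")

# html_table() keeps only the displayed text, so the last column reads "zip".
tbl <- html_table(tbl_node, header = TRUE)

# Data rows are the <tr> elements that contain <td> cells.
rows <- html_nodes(tbl_node, xpath = ".//tr[td]")

# First link inside the last cell of each row; rows without a link give NA.
hrefs <- html_attr(html_node(rows, xpath = "./td[last()]//a"), "href")

# Overwrite the displayed text in the last column with the link targets.
tbl[[ncol(tbl)]] <- hrefs
```

Using `html_node()` (singular) on the set of rows returns exactly one result per row, so `hrefs` stays aligned with the rows of `tbl` even if a row has no link.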
