Scraping html table with images using XML R package

Posted 2020-06-28 01:48

Question:

I want to scrape HTML tables using the XML package for R, in a similar way to what is discussed in this thread:

Scraping html tables into R data frames using the XML package

The main difference with the data I want to extract is that I also want text relating to an image in the HTML table. For example, the table at http://www.theplantlist.org/tpl/record/kew-422570 contains a column for "Confidence" with an image showing one to three stars. If I use:

readHTMLTable("http://www.theplantlist.org/tpl/record/kew-422570")

then the output column for "Confidence" is blank apart from the header. Is there any way to get some form of data in this column, for example the HTML code linking to the appropriate image?

Any suggestions of how to go about this would be much appreciated!

Answer 1:

I was able to find the XPath query for the image name using SelectorGadget:

library(XML)
library(RCurl)
d = htmlParse(getURL("http://www.theplantlist.org/tpl/record/kew-422570"))
path = '//*[contains(concat( " ", @class, " " ), concat( " ", "synonyms", " " ))]//img'

xpathSApply(d, path, xmlAttrs)["src",]

[1] "/img/H.png" "/img/L.png" "/img/H.png" "/img/H.png" "/img/H.png"
[6] "/img/H.png" "/img/H.png"
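
As a small follow-up, the file names can be reduced to the level codes themselves. The sketch below is only illustrative: it parses a made-up two-row table (html_text is a hypothetical stand-in for the live page, so it runs offline) and assumes the confidence code is simply the image file name without its extension.

```r
library(XML)

# Hypothetical miniature of the page's table, so the sketch runs offline.
html_text <- '<html><body><table class="synonyms">
<thead><tr><th>Name</th><th>Status</th><th>Confidence level</th><th>Source</th></tr></thead>
<tbody>
<tr><td>Elymus arenarius L.</td><td>Synonym</td><td><img src="/img/H.png" alt="H"/></td><td>WCSP</td></tr>
<tr><td>Hordeum villosum Moench</td><td>Synonym</td><td><img src="/img/L.png" alt="L"/></td><td>TRO</td></tr>
</tbody></table></body></html>'

doc <- htmlParse(html_text, asText = TRUE)

# Same idea as the query above, with a simpler test on the class attribute.
src <- xpathSApply(doc, '//table[contains(@class, "synonyms")]//img',
                   xmlGetAttr, "src")

# Assumption: the level code is the file name minus ".png".
level <- sub("\\.png$", "", basename(src))
print(level)
```

Asking for one attribute with xmlGetAttr avoids relying on every image carrying the same set of attributes, which the matrix indexing with ["src",] needs.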


Answer 2:

Here's an rvest solution with an even simpler CSS selector:

library(rvest)

# read_html() replaces the deprecated html()
pg <- read_html("http://www.theplantlist.org/tpl/record/kew-422570")
pg %>% html_nodes("td > img") %>% html_attr("src")

## [1] "/img/H.png" "/img/L.png" "/img/H.png" "/img/H.png" "/img/H.png"
## [6] "/img/H.png" "/img/H.png"
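
If the goal is a full data frame rather than a bare vector, one possible approach is to read the table with html_table() and fill in the image column row by row. This is a sketch against a made-up three-row table (html_text is hypothetical, not the real page), and it assumes the alt attribute holds the level code. Calling html_node() on each row keeps the result aligned with the rows, giving NA where a cell has no image.

```r
library(rvest)

# Hypothetical miniature of the page's table; the last row has no image.
html_text <- '<table class="synonyms">
<thead><tr><th>Name</th><th>Status</th><th>Confidence level</th><th>Source</th></tr></thead>
<tbody>
<tr><td>Elymus arenarius L.</td><td>Synonym</td><td><img src="/img/H.png" alt="H"></td><td>WCSP</td></tr>
<tr><td>Hordeum villosum Moench</td><td>Synonym</td><td><img src="/img/L.png" alt="L"></td><td>TRO</td></tr>
<tr><td>Example without image</td><td>Synonym</td><td></td><td>WCSP</td></tr>
</tbody></table>'

pg <- read_html(html_text)

# Text content of the table; the image column comes back empty.
tbl <- html_table(html_node(pg, "table"))

# One lookup per body row, so the vector lines up with the rows of tbl.
rows <- html_nodes(pg, "tbody tr")
conf <- rows %>% html_node("td > img") %>% html_attr("alt")

# Overwrite the empty column with the extracted codes.
tbl[["Confidence level"]] <- conf
print(tbl)
```

A plain html_nodes("td > img") would silently drop rows without an image, so the per-row lookup is the safer choice when the vector has to be attached to the table.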


Answer 3:

You could also use the elFun argument of readHTMLTable to extract that attribute, following section 5.2.2.1 in the XML book (I had to add ... to avoid an unused argument error):

library(XML)

# Return the image's alt text for cells that contain an <img>, else the cell text
getCL <- function(node, ...) {
  if (xmlName(node) == "td" && !is.null(node[["img"]]))
    xmlGetAttr(node[["img"]], "alt")
  else
    xmlValue(node)
}

url <- "http://www.theplantlist.org/tpl/record/kew-422570"
readHTMLTable(url, which=1, elFun = getCL)

                                                Name  Status Confi­-dence level Source
1                                Elymus arenarius L. Synonym                 H   WCSP
2 Elymus arenarius subsp. geniculatus (Curtis) Husn. Synonym                 L    TRO
3                Elymus geniculatus Curtis [Invalid] Synonym                 H   WCSP
4              Frumentum arenarium (L.) E.H.L.Krause Synonym                 H   WCSP
5                       Hordeum arenarium (L.) Asch. Synonym                 H   WCSP
6                            Hordeum villosum Moench Synonym                 H   WCSP
7                    Triticum arenarium (L.) F.Herm. Synonym                 H   WCSP