Find cell in html table containing a specific icon

Posted 2019-05-12 16:30

I am looking for code that can inform me in which cell of an html table a particular icon resides. Here is what I am working with:

u <- "http://www.transfermarkt.nl/lionel-messi/leistungsdaten/spieler/28003/saison/2014/plus/1"
doc <- rvest::html(u)
tab <- rvest::html_table(doc, fill = TRUE)[[6]]

The column "Pos." gives the player's position on the field. Some of these cells contain an additional icon. I can confirm that the icons are present on the page as follows:

rvest::html_nodes(doc, ".kapitaenicon-table")

but this doesn't tell me WHERE they are. I would like my code to report that the icon occurs in rows 2, 10, 11 and 27 of the "Pos." column in the table. How can I do that?

Tags: r rvest
1 answer
祖国的老花朵
#2 · 2019-05-12 17:01

A little bit more rvest and XPath magic can get you the indices:

library(rvest)
library(magrittr)
library(XML)

pg <- html("http://www.transfermarkt.nl/lionel-messi/leistungsdaten/spieler/28003/saison/2014/plus/1")

pg %>% 
  html_nodes("table") %>% 
  extract2(6) %>% 
  html_nodes("tbody > tr") %>% 
  sapply(function(x) {
    length(xpathSApply(x, "./td[8]/span[@class='kapitaenicon-table icons_sprite']")) == 1
  }) %>% which

## [1]  2 10 11 27

That gets the 6th table, extracts its tr rows, then checks each row for an 8th td that contains a span with the right class. When the XPath search finds nothing it returns an empty list, so the length of the result tells you which rows have the icon in that td and which do not.
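
To try the row-by-row idea without fetching the live page, here is a minimal sketch on a small inline table. It swaps in xml2 (which current rvest is built on) for the XML package used above; the HTML snippet, the column position td[2] and the variable names are made up for illustration.

```r
library(xml2)

# Hypothetical stand-in for the real page: the icon sits in column 2 of rows 2 and 4
html <- '<table><tbody>
  <tr><td>1</td><td>RW</td></tr>
  <tr><td>2</td><td>RW <span class="kapitaenicon-table icons_sprite"></span></td></tr>
  <tr><td>3</td><td>CF</td></tr>
  <tr><td>4</td><td>CF <span class="kapitaenicon-table icons_sprite"></span></td></tr>
</tbody></table>'

doc  <- read_html(html)
rows <- xml_find_all(doc, "//table/tbody/tr")

# A row is a hit when the search inside its 2nd cell returns exactly one span
has_icon <- vapply(seq_along(rows), function(i) {
  hits <- xml_find_all(rows[[i]], "./td[2]/span[@class='kapitaenicon-table icons_sprite']")
  length(hits) == 1
}, logical(1))

which(has_icon)
## [1] 2 4
```

For the real page you would presumably point the row XPath at the 6th table and use td[8], as in the answer's code.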

This:

pg %>% 
  html_nodes(xpath="//table[6]/tbody/tr/td[8]") %>% 
  xmlSApply(xpathApply, "boolean(./span[@class='kapitaenicon-table icons_sprite'])") %>% 
  which

also works and is a bit tighter (and faster). It uses the XPath boolean() function to test for existence, which is handier if you have no other operations to perform on the node(s).

This is an xml2 version, though I have to believe there has to be a better way to do this in xml2:

library(xml2)
library(magrittr)

pg2 <- read_html("http://www.transfermarkt.nl/lionel-messi/leistungsdaten/spieler/28003/saison/2014/plus/1")
pg2 %>% 
  xml_find_all("//table[6]/tbody/tr/td[8]") %>% 
  as_list %>% 
  sapply(function(x) {
    inherits(try(xml_find_one(x, "./span"), silent=TRUE), "xml_node")
  }) %>% which
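
One possibly tidier route, sketched here under assumptions rather than tested against the live page: newer xml2 releases deprecate xml_find_one() in favour of xml_find_first(), and xml_find_lgl() can evaluate an XPath boolean() expression directly, so no try() is needed. The inline HTML, the td[2] column position and the variable names below are hypothetical.

```r
library(xml2)

# Hypothetical miniature table: the icon sits in column 2 of rows 1 and 3
html <- '<table><tbody>
  <tr><td>a</td><td>RW <span class="kapitaenicon-table icons_sprite"></span></td></tr>
  <tr><td>b</td><td>RW</td></tr>
  <tr><td>c</td><td>CF <span class="kapitaenicon-table icons_sprite"></span></td></tr>
</tbody></table>'

doc   <- read_html(html)
cells <- xml_find_all(doc, "//table/tbody/tr/td[2]")

# boolean() is TRUE when the cell contains at least one matching span;
# xml_find_lgl() returns that result as an R logical
icon_in_cell <- vapply(seq_along(cells), function(i) {
  xml_find_lgl(cells[[i]], "boolean(./span[@class='kapitaenicon-table icons_sprite'])")
}, logical(1))

which(icon_in_cell)
## [1] 1 3
```

Because the cells are selected one per row, the indices returned by which() line up with row numbers only if every row actually has a cell at that position.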

UPDATE

For version 0.1.0.9000 of xml2 I had to do the following:

pg2 %>% xml_find_all("//table") %>% 
  as_list %>% 
  extract2(6) %>% 
  xml_find_all("./tbody/tr/td[8]") %>% 
  as_list %>% 
  sapply(function(x) {
    inherits(try(xml_find_one(x, "./span"), silent=TRUE), "xml_node")
  }) %>% which

That should not be the case and I've filed a bug report.

Session info -------------------------------------------------------------------------
 setting  value                       
 version  R version 3.2.0 (2015-04-16)
 system   x86_64, darwin13.4.0        
 ui       RStudio (0.99.441)          
 language (EN)                        
 collate  en_US.UTF-8                 
 tz       America/New_York            

Packages -----------------------------------------------------------------------------
 package    * version date       source        
 curl       * 0.5     2015-02-01 CRAN (R 3.2.0)
 devtools   * 1.7.0   2015-01-17 CRAN (R 3.2.0)
 magrittr     1.5     2014-11-22 CRAN (R 3.2.0)
 Rcpp       * 0.11.5  2015-03-06 CRAN (R 3.2.0)
 rstudioapi * 0.3.1   2015-04-07 CRAN (R 3.2.0)
 xml2         0.1.0   2015-04-20 CRAN (R 3.2.0)