Rvest: getting node text and not its children's

Posted 2019-07-14 03:46

Question:

The function html_text() (from the R package rvest) concatenates the text of a node and all of its children. I would like to extract only the parent node's own text.

For the following example, html_text() gives HELLO GOODBYE.

I want to get just GOODBYE. How can I get it?

<div class="joke">
  <div class="div_inside">
    <div class="title_inside">
      <a class="link" href="sompage.htm">HELLO</a>
    </div>
  </div>
  GOODBYE
</div>

Answer 1:

Try selecting the outer div with class "joke" and taking only its direct child nodes that are not themselves div elements, using XPath:

library(rvest)

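# Replace 'your_html_script' with your HTML string, file path, or URL.
read_html('your_html_script') %>%
    # Keep the direct children of div.joke that are not <div> elements.
    html_nodes(xpath = '//div[@class="joke"]/node()[not(self::div)]') %>%
    html_text()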
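
For reference, here is a self-contained sketch of the same idea run against the HTML from the question. It is a variation, not the exact expression above: it assumes you only care about text nodes sitting directly under the div, so it uses text() instead of node(), filters out whitespace-only nodes with normalize-space(), and trims the result. The variable names html and own_text are just illustrative.

```r
library(rvest)

# The HTML from the question, inlined so the example runs on its own.
html <- '<div class="joke">
  <div class="div_inside">
    <div class="title_inside">
      <a class="link" href="sompage.htm">HELLO</a>
    </div>
  </div>
  GOODBYE
</div>'

own_text <- read_html(html) %>%
  # text() matches only text nodes that are direct children of div.joke;
  # [normalize-space()] drops the ones that contain nothing but whitespace.
  html_nodes(xpath = '//div[@class="joke"]/text()[normalize-space()]') %>%
  # trim = TRUE strips the surrounding newlines and indentation.
  html_text(trim = TRUE)

own_text
```

On this input, own_text should hold only GOODBYE. If the parent had several separate pieces of text around its children, you would get one element per piece and could join them with paste(own_text, collapse = " ").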

Thanks!