I am scraping blog text using rvest and am struggling to figure out a simple way to exclude specific nodes. The following pulls the text:
AllandSundry_test <- read_html("http://www.sundrymourning.com/2017/03/03/lets-go-back-to-commenting-on-the-weather/")
testpost <- AllandSundry_test %>%
html_node("#contentmiddle") %>%
html_text() %>%
as.character()
I want to exclude the two nodes with the IDs "contenttitle" and "commentblock". Below, I try excluding just the comments using the ID "commentblock".
testpost <- AllandSundry_test %>%
html_node("#contentmiddle") %>%
html_node(":not(#commentblock)") %>%
html_text() %>%
as.character()
When I run this, the result is simply the date -- all the rest of the text is gone. Any suggestions?
I have spent a lot of time searching for an answer, but I am new to R (and html), so I appreciate your patience if this is something obvious.
It certainly looks like GGamba solved it for you. However, on my machine, I had to remove the > after #contentmiddle. Therefore, this section was instead:
Best of luck! Jesse
You were almost there. You should use html_nodes instead of html_node. html_node retrieves the first element it encounters, while html_nodes returns each matching element in the page as a list.
The toString() function collapses the list of strings into one.
You still need to clean up the string a bit.
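To make the difference concrete, here is a minimal sketch that runs without hitting the live site. The inline HTML is a hypothetical stand-in for the blog page (only the IDs "contentmiddle", "contenttitle" and "commentblock" come from the question; the class name and text are made up), and the whitespace cleanup at the end is just one possible way to tidy the string. Recent rvest versions also offer html_element and html_elements as the newer names for these two functions.

```r
library(rvest)  # re-exports read_html() and the %>% pipe

# Hypothetical stand-in for the blog page: a container holding a title,
# a date, two body paragraphs, and a comment block.
page <- read_html('<div id="contentmiddle">
  <div id="contenttitle">Post title</div>
  <div class="date">March 3, 2017</div>
  <p>First   paragraph.</p>
  <p>Second paragraph.</p>
  <div id="commentblock"><p>A comment.</p></div>
</div>')

# Direct children of #contentmiddle that carry neither excluded ID.
selector <- "#contentmiddle > :not(#commentblock):not(#contenttitle)"

# html_node() keeps only the first match.
first_only <- page %>%
  html_node(selector) %>%
  html_text()

# html_nodes() keeps every match; html_text() then returns one string per node.
pieces <- page %>%
  html_nodes(selector) %>%
  html_text()

# Collapse the pieces into one string, then squeeze runs of whitespace.
testpost <- toString(pieces)
testpost <- trimws(gsub("\\s+", " ", testpost))
```

In this sketch first_only holds just the date, while testpost holds the date plus both paragraphs, with the title and the comment left out.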