Removing data with tags from a vector

Posted 2020-03-31 04:44

I have a character vector that contains HTML tags, e.g.

  abc <- "welcome <span class=\"r\"><a href=\"abc\">abc</a></span> Have fun!"

I want to remove these tags and get the following vector, e.g.

  abc <- "welcome Have fun"

Tags: r
2 answers
祖国的老花朵
#2 · 2020-03-31 05:26

You can convert your piece of HTML to an XML document with htmlParse or htmlTreeParse. You can then convert it to text, i.e., strip all the tags, with xmlValue.

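abc <- "welcome <span class=\"r\"><a href=\"abc\">abc</a></span> Have fun!"
library(XML)
# Parse the string as HTML; asText=TRUE treats abc as content, not a file name.
#doc <- htmlParse(abc, asText=TRUE)
doc <- htmlTreeParse(abc, asText=TRUE)
# Collect the text of all nodes under the root, which drops the tags.
xmlValue( xmlRoot(doc) )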

If you also want to remove the contents of the links, you can use xmlDOMApply to transform the XML tree.

# Replace every <span> node, including everything inside it, with a single space.
f <- function(x) if(xmlName(x) == "span") xmlTextNode(" ") else x
d <- xmlDOMApply( xmlRoot(doc), f )
xmlValue(d)
戒情不戒烟
#3 · 2020-03-31 05:27

Try

> gsub("(<[^>]*>)","",abc)

What this says is: "substitute every instance of < followed by anything that isn't a >, up to a >, with nothing".

You can't just do gsub("<.*>","",abc) because regexps are greedy by default, and the .* would match up to the last > in your text (and you'd lose the 'abc' in your example).
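
To make the greedy-versus-bounded difference concrete, here is a small sketch in base R that runs both patterns on the string from the question. The variable names greedy, negated and lazy are only illustrative, and the lazy form with perl = TRUE is an extra variant not mentioned in the answer above.

```r
abc <- "welcome <span class=\"r\"><a href=\"abc\">abc</a></span> Have fun!"

# Greedy: .* runs from the first < to the last >, so the link text goes too.
greedy <- gsub("<.*>", "", abc)

# Negated class: each match ends at the first > after its <.
negated <- gsub("<[^>]*>", "", abc)

# Lazy quantifier: another way to stop at the first > (PCRE syntax).
lazy <- gsub("<.*?>", "", abc, perl = TRUE)

print(greedy)   # "welcome  Have fun!"  (two spaces, 'abc' is gone)
print(negated)  # "welcome abc Have fun!"
print(lazy)     # "welcome abc Have fun!"
```

Note that the negated-class version keeps the link text 'abc'; it strips tags only, not the content between them.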

This solution might fail if a tag contains a > inside a quoted attribute value, as in <foo class=">" >, which HTML does allow. Doubtless someone will come up with another answer that involves parsing the HTML with a heavyweight XML package.
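
The failure case with a > inside an attribute value can be reproduced in base R. The sketch below also tries a longer pattern that skips over quoted attribute values; that pattern is an assumption of mine rather than part of the answer, the names tricky, simple and quoted_aware are hypothetical, and it is still no substitute for a real parser (comments, scripts and malformed markup will defeat it).

```r
# A tag whose attribute value contains a literal >.
tricky <- "before <foo class=\">\" >inside</foo> after"

# The simple pattern stops at the > inside the quotes and leaves debris behind.
simple <- gsub("<[^>]*>", "", tricky)

# Sketch: inside a tag, accept either a non-quote, non-> character
# or a complete single- or double-quoted string.
pattern <- "<(?:[^>\"']|\"[^\"]*\"|'[^']*')*>"
quoted_aware <- gsub(pattern, "", tricky, perl = TRUE)

print(simple)        # "before \" >inside after"
print(quoted_aware)  # "before inside after"
```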
