I have a string vector which contains HTML tags, e.g.
abc <- "welcome <span class=\"r\"><a href=\"abc\">abc</a></span> Have fun!"
I want to remove these tags and get the following vector, e.g.
abc <- "welcome Have fun"

You can convert your piece of HTML to an XML document with htmlParse or htmlTreeParse. You can then convert it to text, i.e., strip all the tags, with xmlValue. If you also want to remove the contents of the links, you can use xmlDOMApply
to transform the XML tree.

Try a regular expression that says: substitute every instance of < followed by anything that isn't a >, up to a >, with nothing.
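
A call matching that description might look like the sketch below. The exact pattern `<[^>]+>` is an assumption inferred from the wording of the explanation, not something quoted from the post; it uses only base R's `gsub`.

```r
abc <- "welcome <span class=\"r\"><a href=\"abc\">abc</a></span> Have fun!"

# Match "<", then one or more characters that are not ">", then ">",
# and replace each match with the empty string.
stripped <- gsub("<[^>]+>", "", abc)

print(stripped)
```

Note that this only removes the tags themselves: the link text between `<a ...>` and `</a>` stays in the result, so it does not quite produce the vector asked for in the question.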
You can't just do
gsub("<.*>", "", abc)
because regexps are greedy, and the .* would match up to the last > in your text (and you'd lose the 'abc' in your example).
This solution might fail if you've got > in your tags, but is <foo class=">" > legal?
Doubtless someone will come up with another answer that involves parsing the HTML with a heavyweight XML package.
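
For comparison, the parser-based route can be sketched as follows. This assumes the XML package is installed. It also swaps in `removeNodes` with `getNodeSet` to drop the links, which is an illustrative substitute for the `xmlDOMApply` approach mentioned in the other answer, not that answer's own code.

```r
library(XML)  # assumption: the XML package is installed

abc <- "welcome <span class=\"r\"><a href=\"abc\">abc</a></span> Have fun!"

# Parse the string as HTML; asText = TRUE marks the input as markup, not a file name.
doc <- htmlParse(abc, asText = TRUE)

# xmlValue() concatenates the text content of the node and all its descendants,
# so the tags are gone but the link text is still present.
with_links <- xmlValue(xmlRoot(doc))

# Remove every <a> element from the tree, then extract the text again.
removeNodes(getNodeSet(doc, "//a"))
without_links <- xmlValue(xmlRoot(doc))

print(with_links)
print(without_links)
```

Because the parser understands quoting inside attributes, this route should not be tripped up by a > inside an attribute value the way a regular expression can be. Removing a node can leave the whitespace on either side of it behind, so a final `gsub("\\s+", " ", without_links)` may be wanted.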