There are lots of examples of how to strip HTML tags from a document using Ruby; Hpricot and Nokogiri both have inner_text methods that remove all the HTML for you easily and quickly.
What I am trying to do is the opposite: remove all the text from an HTML document, leaving just the tags and their attributes.
I considered looping through the document and setting inner_html to nil, but you'd really have to do this in reverse: the first element (the root) has an inner_html containing the entire rest of the document, so ideally I'd start at the innermost elements and set inner_html to nil while moving up through the ancestors.
Does anyone know a neat little trick for doing this efficiently? I was thinking perhaps regexes might do it, but probably not as efficiently as an HTML tokenizer/parser would.
To grab everything not in a tag, you can use Nokogiri like this:
Of course, that will grab stuff like the contents of <script> or <style> tags, so you could also remove blacklisted tags:
You could also whitelist if you preferred, but that's probably going to be more time-intensive:
You could also just build a huge XPath expression and do one search. I honestly don't know which way is faster, or if there is even an appreciable difference.
You can scan the string to create an array of "tokens", and then select only those that are HTML tags:
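Roughly like the following, using only String#scan from core Ruby (tags_via_tokens is a made-up name for illustration). The pattern splits the string into tag tokens and text tokens, and the select keeps the ones that start with an angle bracket:

```ruby
# Hypothetical helper: tokenize into tags and text runs, keep only the tags.
def tags_via_tokens(html)
  tokens = html.scan(/<[^>]+>|[^<]+/)
  tokens.select { |token| token.start_with?('<') }.join
end

puts tags_via_tokens('<p class="x">Hi <b>there</b></p>')
# prints <p class="x"><b></b></p>
```

This assumes reasonably tidy markup; a stray < or > inside script code, comments, or attribute values will confuse the pattern.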
==Edit==
Or even better, just scan for HTML tags ;)
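In other words, something like this (a sketch with a made-up helper name, tags_only), where the text tokens are never collected in the first place:

```ruby
# Hypothetical helper: collect only the tag-shaped matches and glue them together.
def tags_only(html)
  html.scan(/<[^>]+>/).join
end

puts tags_only('<p class="x">Hi <b>there</b></p>')
# prints <p class="x"><b></b></p>
```

The same caveat applies as with any regex over HTML: it treats anything between < and > as a tag, so it is only as reliable as the markup is regular.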
I just came up with this, but @andre-r's solution is so much better!
This works too: