I have some HTML pages where the content to be extracted is marked with HTML comments, as shown below.
<html>
.....
<!-- begin content -->
<div>some text</div>
<div><p>Some more elements</p></div>
<!-- end content -->
...
</html>
I am using Nokogiri and trying to extract the HTML between the <!-- begin content --> and <!-- end content --> comments.
I want to extract the full elements between these two HTML comments:
<div>some text</div>
<div><p>Some more elements</p></div>
I can get the text-only version using this characters callback:
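require 'nokogiri'

class TextExtractor < Nokogiri::XML::SAX::Document
  def initialize
    @interesting = false # true while between the two marker comments
    @text = ""
    @html = ""
  end

  # Called for every comment; toggles capturing on the marker comments.
  def comment(string)
    case string.strip # strip leading and trailing whitespace
    when /^begin content/ # match starting comment
      @interesting = true
    when /^end content/ # match closing comment
      @interesting = false
    end
  end

  # Called for every run of text; collected only between the markers.
  def characters(string)
    @text << string if @interesting
  end
end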
I get the text-only version in @text, but I need the full HTML stored in @html.