The tidy gem is no longer maintained and has multiple memory leak issues.
Some people suggested using Nokogiri.
I'm currently cleaning the HTML using:
Nokogiri::HTML::DocumentFragment.parse(html).to_html
I've got two issues though:
Nokogiri removes the DOCTYPE.
Is there an easy way to force the cleaned HTML to have html and body tags?
If you are processing a full document, you want:
That will force html and body tags, and introduce or preserve the DOCTYPE.
Note that the output is not guaranteed to be syntactically valid. For example, if I provide a broken document that lies and claims that it is HTML4.01 strict, Nokogiri will output a document with that DOCTYPE but without the required <head><title>...</title></head> section.
The Tidy gem might not be supported, but the underlying tidy app is maintained, and that is what you really need. It's flexible and has quite a list of options.
You can pass HTML to it in many different ways, and define its configuration in a .tidyrc file or pass options on the command line. You could use Ruby's %x{} to pass it a file, or use IO.popen or IO.pipe to treat it as a pipe.
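The pipe approach could look roughly like the sketch below. The helper names pipe_through and tidy_html are hypothetical, and the tidy flags shown (-q, -utf8, --show-warnings no) are only one assumed configuration; adjust them, or rely on a .tidyrc, to suit your needs. It also assumes a tidy executable is on your PATH.

```ruby
# Pipe a string through an external filter command and return its stdout.
# `command` is an array, so no shell is involved and no quoting is needed.
def pipe_through(command, input)
  IO.popen(command, 'r+') do |io|
    # Write from a separate thread so that reading and writing can overlap;
    # this avoids a deadlock if the pipe buffers fill up on large input.
    writer = Thread.new do
      io.write(input)
      io.close_write
    end
    output = io.read
    writer.join
    output
  end
end

# Assumed invocation: quiet mode, UTF-8 input/output, warnings hidden.
TIDY_COMMAND = ['tidy', '-q', '-utf8', '--show-warnings', 'no'].freeze

# Hypothetical wrapper: run HTML through tidy and return the cleaned markup.
def tidy_html(html, command: TIDY_COMMAND)
  pipe_through(command, html)
end
```

With tidy installed, calling tidy_html('<p>Hello <b>world</p>') should return the cleaned document as a string. Anything tidy prints to stderr still goes to your process's stderr, and tidy signals warnings and errors through its exit status, which you can inspect via $? after the call.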