Cleaning HTML with Nokogiri (instead of Tidy)

The tidy gem is no longer maintained and has multiple memory leak issues.

Some people suggested using Nokogiri.

I'm currently cleaning the HTML using:

Nokogiri::HTML::DocumentFragment.parse(html).to_html

I've got two issues though:

Nokogiri removes the DOCTYPE
Is there an easy way to force the cleaned HTML to have html and body tags?

Tags: ruby nokogiri tidy

2 Answers

时光不老，我们不散

#2 · 2019-04-24 09:30

If you are processing a full document, you want:

Nokogiri::HTML(html).to_html

That will force html and body tags, and introduce or preserve the DOCTYPE:

puts Nokogiri::HTML('<p>Hi!</p>').to_html
#=> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
#=>  "http://www.w3.org/TR/REC-html40/loose.dtd">
#=> <html><body><p>Hi!</p></body></html>

puts Nokogiri::HTML('<!DOCTYPE html><p>Hi!</p>').to_html
#=> <!DOCTYPE html>
#=> <html><body><p>Hi!</p></body></html>

Note that the output is not guaranteed to be valid against its DOCTYPE. For example, if I provide a broken document that falsely claims to be HTML 4.01 Strict, Nokogiri will output a document with that DOCTYPE but without the required <head><title>...</title></head> section:

dtd = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">'
puts Nokogiri::HTML("#{dtd}<p>Hi!</p>").to_html
#=> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
#=>  "http://www.w3.org/TR/html4/strict.dtd">
#=> <html><body><p>Hi!</p></body></html>


对你真心纯属浪费

#3 · 2019-04-24 09:30

The Tidy gem might not be supported, but the underlying tidy app is maintained, and that is what you really need. It's flexible and has quite a list of options.

You can pass HTML to it in many different ways, and set its options in a .tidyrc file or pass them on the command line. You could use Ruby's %x{} to pass it a file, or use IO.popen or IO.pipe to treat it as a pipe.
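
As a rough sketch of the IO.popen approach: the helper below writes the HTML to a command's stdin and reads the result back from its stdout. The name pipe_through and the default option list are my own assumptions; it presumes a tidy binary on your PATH, so adjust the flags (or rely on your .tidyrc) to taste.

```ruby
# Assumption: a `tidy` executable is installed and on PATH.
# -quiet and -utf8 are tidy command-line options; --show-warnings no
# silences warning output.
TIDY_CMD = ['tidy', '-quiet', '-utf8', '--show-warnings', 'no'].freeze

# Hypothetical helper: send html to cmd's stdin and return its stdout.
# Passing cmd as an array runs it directly, without going through a shell.
def pipe_through(html, cmd = TIDY_CMD)
  IO.popen(cmd, 'r+') do |io|
    io.write(html)
    io.close_write # signal EOF so the child starts producing output
    io.read
  end
end

# Usage (requires tidy to be installed):
#   cleaned = pipe_through('<p>Hi!</p>')
```

tidy reads all of its input before writing, so the write-then-read order is fine here. For a filter that streams output while still reading, Open3.capture2 from the standard library is the safer choice for large documents.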
