I noticed something strange using Nokogiri recently. All of the HTML I had been parsing had been given start and end <html> and <body> tags.
<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body>\n
How can I prevent Nokogiri from doing this?
That is, when I do:
doc = Nokogiri::HTML("<div>some content</div>")
doc.to_s
or:
doc.to_html
I get the original wrapped in extra tags:
<html blah><body><div>some content</div></body></html>
The to_s method on a Nokogiri::HTML::Document outputs a valid HTML page, complete with its required elements. This is not necessarily what was passed in to the parser. If you want to output less than a complete document, you use methods such as inner_html, inner_text, etc., on a node.
Edit: if you are not expecting to parse a complete, well-formed XML document as input, then theTinMan's answer is best.
The problem occurs because you're using the wrong method in Nokogiri to parse your content.
Rather than using HTML, which results in a complete document, use HTML.fragment, which tells Nokogiri you only want the fragment parsed.