When an indented XML document is parsed, non-significant whitespace text nodes are created from the whitespace between a closing and an opening tag. For example, from the following XML:
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
whose string representation is as follows,
"<note>\n <to>Tove</to>\n <from>Jani</from>\n <heading>Reminder</heading>\n <body>Don't forget me this weekend!</body>\n</note>\n"
the following Document is created:
#(Document:0x3fc07e4540d8 {
name = "document",
children = [
#(Element:0x3fc07ec8629c {
name = "note",
children = [
#(Text "\n "),
#(Element:0x3fc07ec8089c {
name = "to",
children = [ #(Text "Tove")]
}),
#(Text "\n "),
#(Element:0x3fc07e8d8064 {
name = "from",
children = [ #(Text "Jani")]
}),
#(Text "\n "),
#(Element:0x3fc07e8d588c {
name = "heading",
children = [ #(Text "Reminder")]
}),
#(Text "\n "),
#(Element:0x3fc07e8cf590 {
name = "body",
children = [ #(Text "Don't forget me this weekend!")]
}),
#(Text "\n")]
})]
})
Here, there are lots of whitespace nodes of type Nokogiri::XML::Text.
I would like to count the children of each node in a Nokogiri XML Document, and access the first or last child, excluding non-significant whitespace. I would like either not to parse those whitespace nodes at all, or to distinguish them from significant text nodes such as "Tove" inside the element <to>. Here is an rspec of what I am looking for:
require 'nokogiri'
require_relative 'spec_helper'
xml_text = <<XML
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
XML
xml = Nokogiri::XML(xml_text)
def significant_nodes(node)
return 0
end
describe "Stackoverflow Question" do
it "should return the number of significant nodes in nokogiri." do
expect(significant_nodes(xml.css('note'))).to eq 4
end
end
I want to know how to create the significant_nodes function.
If I change the XML to:
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
<footer></footer>
</note>
then when I create the Document, I would still like the footer to be represented; using config.noblanks is not an option.
You can create a query that only returns element nodes and ignores text nodes. In XPath, * only matches elements, so a query over the whole document could look like //note/*, or, if you want to use CSS, note > *.
If you want to implement your significant_nodes method, you would need to make the query relative to the node passed in, for example with ./* in XPath. I don't know how to do a relative query with CSS, so you might need to stick with XPath.
You can use the NOBLANKS option when parsing the XML string. NOBLANKS shouldn't remove empty nodes, so an empty element such as <footer></footer> stays in the document. As the OP pointed out, the documentation on the Nokogiri website (and also on the libxml website) about the parser options is quite cryptic about how the NOBLANKS option behaves.
Nokogiri's noblanks config option doesn't remove all whitespace Text nodes when they have siblings.
I'm not sure why Nokogiri was programmed to work that way. I think it would be better to either ignore all whitespace Text nodes or not ignore any Text nodes.