I'm trying to use Ruby's Nokogiri to parse large (1 GB or more) XML files. I'm testing my code on a smaller file containing only 4 records, available here. I'm using Nokogiri version 1.5.0 and Ruby 1.8.7 on Ubuntu 10.10. Since I don't understand SAX very well, I'm trying Nokogiri::XML::Reader to start.
My first attempt, to retrieve the content of the PMID tag, looks like this:
#!/usr/bin/ruby
require "rubygems"
require "nokogiri"
file = ARGV[0]
reader = Nokogiri::XML::Reader(File.open(file))
p = []
reader.each do |node|
  if node.name == "PMID"
    p << node.inner_xml
  end
end
puts p.inspect
Here's what I hoped to see:
["21714156", "21693734", "21692271", "21692260"]
Here's what I actually saw:
["21714156", "", "21693734", "", "21692271", "", "21692260", ""]
It seems that for some reason, my code is finding, or generating, an extra, empty PMID tag for every instance of PMID. Either that or inner_xml does not work as I thought.
I'd be grateful if anyone could confirm that my code and data generate the result shown and suggest where I'm going wrong.
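Each element in the stream comes through as two events: one to open the element and one to close it. The opening event will have node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT and the closing event will have node.node_type == Nokogiri::XML::Reader::TYPE_END_ELEMENT.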
The empty strings you're seeing are just the element closing events. Remember that with stream parsing (SAX or Reader), you're basically walking through a tree, so you need the second event to tell you when you're going back up and closing an element.
You probably want something more like this:
Or perhaps:
Or some other variation on that.