I'm trying to use Nokogiri to parse large (1 GB or more) XML files in Ruby. I'm testing my code on a smaller file containing only 4 records, available here. I'm using Nokogiri 1.5.0 and Ruby 1.8.7 on Ubuntu 10.10. Since I don't understand SAX very well, I'm starting with Nokogiri::XML::Reader.
My first attempt, to retrieve the content of the PMID tag, looks like this:
#!/usr/bin/ruby
require "rubygems"
require "nokogiri"
file = ARGV[0]
reader = Nokogiri::XML::Reader(File.open(file))
p = []
reader.each do |node|
  if node.name == "PMID"
    p << node.inner_xml
  end
end
puts p.inspect
Here's what I hoped to see:
["21714156", "21693734", "21692271", "21692260"]
Here's what I actually saw:
["21714156", "", "21693734", "", "21692271", "", "21692260", ""]
It seems that for some reason, my code is finding, or generating, an extra, empty PMID tag for every instance of PMID. Either that or inner_xml does not work as I thought.
I'd be grateful if anyone could confirm that my code and data generate the result shown and suggest where I'm going wrong.