I have a large XML file (about 10K rows) I need to parse regularly that is in this format:
<summarysection>
<totalcount>10000</totalcount>
</summarysection>
<items>
<item>
<cat>Category</cat>
<name>Name 1</name>
<value>Val 1</value>
</item>
...... 10,000 more times
</items>
What I'd like to do is parse each of the individual nodes using Nokogiri to count the number of items in one category. Then, I'd like to subtract that number from the totalcount to get an output that reads "Count of Interest_Category: n, Count of All Else: z".
This is my code now:
#!/usr/bin/ruby
require 'rubygems'
require 'nokogiri'
require 'open-uri'
icount = 0
xmlfeed = Nokogiri::XML(open("/path/to/file/all.xml"))
all_items = xmlfeed.xpath("//items")
all_items.each do |adv|
  if (adv.children.filter("cat").first.child.inner_text.include? "partofcatname")
    icount = icount + 1
  end
end
othercount = xmlfeed.xpath("//totalcount").inner_text.to_i - icount
puts icount
puts othercount
This seems to work, but is very slow! I'm talking more than 10 minutes for 10,000 items. Is there a better way to do this? Am I doing something in a less than optimal fashion?
Check out Greg Weber's version of Paul Dix's sax-machine gem: http://blog.gregweber.info/posts/2011-06-03-high-performance-rb-part1
Parsing a large file with SaxMachine on its own seems to load the whole file into memory.
sax-machine makes the code much, much simpler; Greg's variant makes it stream.
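As a rough sketch of what the sax-machine declarations for this feed could look like (the class names are illustrative, and the final counting step is one possible way to use the result):
require 'sax-machine'

# Declarative mapping of each <item> element.
class Item
  include SAXMachine
  element :cat
  element :name
  element :value
end

# Declarative mapping of the whole feed.
class Feed
  include SAXMachine
  element :totalcount
  elements :item, as: :items, class: Item
end

feed = Feed.parse(File.read("/path/to/file/all.xml"))
icount = feed.items.count { |i| i.cat.include?("partofcatname") }
puts icount
puts feed.totalcount.to_i - icount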
I'd recommend using a SAX parser rather than a DOM parser for a file this large. Nokogiri has a nice SAX parser built in: http://nokogiri.org/Nokogiri/XML/SAX.html
The SAX way of doing things is nice for large files simply because it doesn't build a giant DOM tree, which in your case is overkill; you can build up your own structures when events fire (for counting nodes, for example).
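Applied to the question's file, a minimal handler along these lines would keep memory flat (the "partofcatname" test and the output format come from the question; the class name and the rest of the wiring are one possible way to set it up):
require 'nokogiri'

# Streaming counter: tallies <item>s whose <cat> text contains a substring
# and picks up <totalcount>, without ever building a DOM tree.
class ItemCounter < Nokogiri::XML::SAX::Document
  attr_reader :icount, :totalcount

  def initialize(pattern)
    @pattern    = pattern
    @icount     = 0
    @totalcount = 0
    @buffer     = ''
  end

  def start_element(name, attrs = [])
    @buffer = ''  # start collecting text for this element
  end

  # characters may fire more than once per text node, so accumulate
  # here and only inspect the buffer when the element closes.
  def characters(text)
    @buffer << text
  end

  def end_element(name)
    case name
    when 'cat'        then @icount += 1 if @buffer.include?(@pattern)
    when 'totalcount' then @totalcount = @buffer.to_i
    end
  end
end

handler = ItemCounter.new('partofcatname')
Nokogiri::XML::SAX::Parser.new(handler).parse(File.open('/path/to/file/all.xml'))
puts "Count of Interest_Category: #{handler.icount}, " \
     "Count of All Else: #{handler.totalcount - handler.icount}"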
You can dramatically decrease your execution time by changing your code to the following. Just change the "99" to whatever category you want to check:
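Something along these lines (a sketch of the approach this answer describes: collect the individual "item" nodes and test each one's "cat"; the file path and the exact counting style are assumptions):
#!/usr/bin/ruby
require 'nokogiri'

doc = Nokogiri::XML(File.open("/path/to/file/all.xml"))

# Collect the individual <item> nodes (not the single <items> wrapper)
# and count the ones whose <cat> text matches.
icount = doc.xpath("//item").count { |item| item.at("cat").text.include?("99") }
othercount = doc.at("//totalcount").text.to_i - icount

puts icount
puts othercount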
This took about three seconds on my machine. I think a key error you made was that you chose "items" to iterate over instead of creating a collection of the "item" nodes. That made your iteration code awkward and slow.
Here's an example comparing a SAX parser count with a DOM-based count, counting 500,000 <item>s with one of seven categories. First, the output: both techniques produce the same hash counting the number of each category seen.
The SAX version takes 12.9s to count and categorize, while the DOM version takes only 1.6s to create the DOM elements and 2.5s more to find and categorize all the <cat> values. The DOM version is around 3x as fast! …but that's not the entire story. We have to look at RAM usage as well.
I had enough memory on my machine to handle 1,000,000 items, but at 2,000,000 I ran out of RAM and had to start using virtual memory. Even with an SSD and a fast machine I let the DOM code run for almost ten minutes before finally killing it.
It is very likely that the long times you are reporting are because you are running out of RAM and hitting the disk continuously as part of virtual memory. If you can fit the DOM into memory, use it, as it is FAST. If you can't, however, you really have to use the SAX version.
Here's the test code:
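A harness in roughly this shape shows the comparison (a sketch only: Benchmark for timing, seven placeholder category names, and the CategoryCounter handler defined in the SAX section below):
require 'nokogiri'
require 'benchmark'

ITEM_COUNT = 500_000
CATEGORIES = %w[Cats Dogs Llamas Birds Fish Horses Goats]  # placeholder names

# Build one big XML string of ITEM_COUNT items cycling through the categories.
xml = "<items>"
ITEM_COUNT.times do |i|
  xml << "<item><cat>#{CATEGORIES[i % CATEGORIES.size]}</cat>" \
         "<name>Name #{i}</name><value>Val #{i}</value></item>"
end
xml << "</items>"

Benchmark.bm(4) do |x|
  x.report('sax:') do
    handler = CategoryCounter.new  # defined below
    Nokogiri::XML::SAX::Parser.new(handler).parse(xml)
  end
  x.report('dom:') do
    counts = Hash.new(0)
    Nokogiri::XML(xml).xpath('//cat').each { |cat| counts[cat.text] += 1 }
  end
end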
How does the DOM counting work?
If we strip away some of the test structure, the DOM-based counter looks like this:
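A sketch of that shape, assuming the document string is in xml:
# Parse the whole document into a DOM tree, then tally the text
# of every <cat> element with a single XPath search.
doc = Nokogiri::XML(xml)
counts = Hash.new(0)
doc.xpath('//cat').each { |cat| counts[cat.text] += 1 }
p counts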
How does the SAX counting work?
First, let's focus on this code:
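Something along these lines (the class name CategoryCounter is illustrative; the @count flag and the two callback methods are the ones described below):
class CategoryCounter < Nokogiri::XML::SAX::Document
  attr_reader :categories

  def initialize
    @categories = Hash.new(0)  # every category starts at 0
  end

  # Called for every opening tag; remember whether we just entered a <cat>.
  def start_element(name, attrs = [])
    @count = (name == 'cat')
  end

  # Called for each chunk of text; if the enclosing element was <cat>,
  # treat the text as a category name and tally it.
  def characters(text)
    @categories[text] += 1 if @count
  end
end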
When we create a new instance of this class we get an object that has a Hash that defaults to 0 for all values, and a couple of methods that can be called on it. The SAX Parser will call these methods as it runs through the document.
Each time the SAX parser sees a new element it will call the start_element method on this class. When that happens, we set a flag based on whether this element is named "cat" or not (so that we can pick up its text, the category name, later).
Each time the SAX parser slurps up a chunk of text it calls the characters method of our object. When that happens, we check to see if the last element we saw was a category (i.e. if @count was set to true); if so, we use the value of this text node as the category name and add one to our counter.
To use our custom object with Nokogiri's SAX parser we do this:
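Something like this, reusing the CategoryCounter sketch above (the file path is a stand-in):
handler = CategoryCounter.new
Nokogiri::XML::SAX::Parser.new(handler).parse(File.open('/path/to/file/all.xml'))
p handler.categories  # => the Hash of counts per category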
You may like to try this out: https://github.com/amolpujari/reading-huge-xml
HugeXML.read xml, elements_lookup do |element|
  # => element { :name, :value, :attributes }
end
I also tried using Ox.
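For comparison, a similar streaming counter with Ox's SAX API might look like this (a sketch; note that Ox passes element names as symbols by default):
require 'ox'

class CatCounter < Ox::Sax
  attr_reader :counts

  def initialize
    @counts = Hash.new(0)
    @in_cat = false
  end

  def start_element(name)
    @in_cat = (name == :cat)  # Ox yields element names as symbols
  end

  def text(value)
    @counts[value] += 1 if @in_cat
  end
end

handler = CatCounter.new
File.open('/path/to/file/all.xml') { |f| Ox.sax_parse(handler, f) }
p handler.counts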