Parsing Large XML files w/ Ruby & Nokogiri

I have a large XML file (about 10K rows) that I need to parse regularly. It has this format:

<summarysection>
    <totalcount>10000</totalcount>
</summarysection>
<items>
     <item>
         <cat>Category</cat>
         <name>Name 1</name>
         <value>Val 1</value>
     </item>
     ...... 10,000 more times
</items>

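What I want to do is use Nokogiri to go through each individual node and count the number of items in one category. Then I'd like to subtract that number from the totalcount value to get an output that reads "Count of Interest_Category: N, Count of All Else: Z".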

This is my code now:

#!/usr/bin/ruby

require 'rubygems'
require 'nokogiri'
require 'open-uri'

icount = 0 
xmlfeed = Nokogiri::XML(open("/path/to/file/all.xml"))
all_items = xmlfeed.xpath("//items")

  all_items.each do |adv|
            if (adv.children.filter("cat").first.child.inner_text.include? "partofcatname")
                icount = icount + 1
            end
  end

othercount = xmlfeed.xpath("//totalcount").inner_text.to_i - icount 

puts icount
puts othercount

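This seems to work, but it is very slow! I'm talking more than 10 minutes for 10,000 items. Is there a better way? Am I doing something in a less than optimal fashion?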

Answer 1:

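You can dramatically reduce your execution time by changing your code to the following. Just change the "99" to whatever category you want to check: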

require 'rubygems'
require 'nokogiri'
require 'open-uri'

icount = 0 
xmlfeed = Nokogiri::XML(open("test.xml"))
items = xmlfeed.xpath("//item")
items.each do |item|
  text = item.children.children.first.text  
  if ( text =~ /99/ )
    icount += 1
  end
end

othercount = xmlfeed.xpath("//totalcount").inner_text.to_i - icount 

puts icount
puts othercount

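This took about three seconds on my machine. I think the key mistake was that you iterated over the "items" element instead of building a collection of the "item" nodes. That made your iteration code awkward and slow.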

Answer 2:

Here's an example comparing a SAX parser count with a DOM-based count, counting 500,000 <item>s that each have one of seven categories. First, the output:

Create XML file: 1.7s
Count via SAX: 12.9s
Create DOM: 1.6s
Count via DOM: 2.5s

Both techniques produce the same hash, counting the number of times each category was seen:

{"Cats"=>71423, "Llamas"=>71290, "Pigs"=>71730, "Sheep"=>71491, "Dogs"=>71331, "Cows"=>71536, "Hogs"=>71199}

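The SAX version takes 12.9s to count and categorize, while the DOM version takes only 1.6s to create the DOM elements and 2.5s more to find and categorize all the <cat> values (4.1s in total). The DOM version is around 3x as fast!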

...but that's not the entire story. We have to look at RAM usage as well.

For 500,000 items SAX (12.9s) peaks at 238MB of RAM; DOM (4.1s) peaks at 1.0GB.
For 1,000,000 items SAX (25.5s) peaks at 243MB of RAM; DOM (8.1s) peaks at 2.0GB.
For 2,000,000 items SAX (55.1s) peaks at 250MB of RAM; DOM (???) peaks at 3.2GB.

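I had enough memory on my machine to handle 1,000,000 items, but at 2,000,000 I ran out of RAM and had to start using virtual memory. Even with an SSD and a fast machine, I let the DOM code run for almost ten minutes before finally killing it.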

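It is very likely that the long times you are reporting happen because you are running out of RAM and continuously hitting the disk as part of virtual memory. If you can fit the DOM into memory, use it, as it is fast. If you can't, however, you really have to use the SAX version.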

Here's the test code:

require 'nokogiri'

CATEGORIES = %w[ Cats Dogs Hogs Cows Sheep Pigs Llamas ]
ITEM_COUNT = 500_000

def test!
  create_xml
  sleep 2; GC.start # Time to read memory before cleaning the slate
  test_sax
  sleep 2; GC.start # Time to read memory before cleaning the slate
  test_dom
end

def time(label)
  t1 = Time.now
  yield.tap{ puts "%s: %.1fs" % [ label, Time.now-t1 ] }
end

def test_sax
  item_counts = time("Count via SAX") do
    counter = CategoryCounter.new
    # Use parse_file so we can stream data from disk instead of flooding RAM
    Nokogiri::XML::SAX::Parser.new(counter).parse_file('tmp.xml')
    counter.category_counts
  end
  # p item_counts
end

def test_dom
  doc = time("Create DOM"){ File.open('tmp.xml','r'){ |f| Nokogiri.XML(f) } }
  counts = time("Count via DOM") do
    counts = Hash.new(0)
    doc.xpath('//cat').each do |cat|
      counts[cat.children[0].content] += 1
    end
    counts
  end
  # p counts
end

class CategoryCounter < Nokogiri::XML::SAX::Document
  attr_reader :category_counts
  def initialize
    @category_counts = Hash.new(0)
  end
  def start_element(name,att=nil)
    @count = name=='cat'
  end
  def characters(str)
    if @count
      @category_counts[str] += 1
      @count = false
    end
  end
end

def create_xml
  time("Create XML file") do
    File.open('tmp.xml','w') do |f|
      f << "<root>
      <summarysection><totalcount>10000</totalcount></summarysection>
      <items>
      #{
        ITEM_COUNT.times.map{ |i|
          "<item>
            <cat>#{CATEGORIES.sample}</cat>
            <name>Name #{i}</name>
            <value>Value #{i}</value>
          </item>"
        }.join("\n")
      }
      </items>
      </root>"
    end
  end
end

test! if __FILE__ == $0

How does the DOM counting work?

If we strip away some of the test structure, the DOM-based counter looks like this:

# Open the file on disk and pass it to Nokogiri so that it can stream read;
# Better than  doc = Nokogiri.XML(IO.read('tmp.xml'))
# which requires us to load a huge string into memory just to parse it
doc = File.open('tmp.xml','r'){ |f| Nokogiri.XML(f) }

# Create a hash with default '0' values for any 'missing' keys
counts = Hash.new(0) 

# Find every `<cat>` element in the document (assumes one per <item>)
doc.xpath('//cat').each do |cat|
  # Get the child text node's content and use it as the key to the hash
  counts[cat.children[0].content] += 1
end

How does the SAX counting work?

First, let's focus on this code:

class CategoryCounter < Nokogiri::XML::SAX::Document
  attr_reader :category_counts
  def initialize
    @category_counts = Hash.new(0)
  end
  def start_element(name,att=nil)
    @count = name=='cat'
  end
  def characters(str)
    if @count
      @category_counts[str] += 1
      @count = false
    end
  end
end

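When we create a new instance of this class we get an object that has a Hash defaulting to 0 for all values, and a couple of methods that can be called on it. The SAX parser will call these methods as it runs through the document.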

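Each time the SAX parser sees a new element, it calls the start_element method on this class. When that happens, we set a flag based on whether this element is named "cat" or not (so that we can pick up its text afterwards).
Each time the SAX parser slurps up a chunk of text, it calls the characters method of our object. When that happens, we check whether the last element we saw was a category (i.e. whether @count was set to true); if so, we use the value of this text node as the category name and add one to our counter.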

To use our custom object with Nokogiri's SAX parser, we do this:

# Create a new instance, with its empty hash
counter = CategoryCounter.new

# Create a new parser that will call methods on our object, and then
# use `parse_file` so that it streams data from disk instead of flooding RAM
Nokogiri::XML::SAX::Parser.new(counter).parse_file('tmp.xml')

# Once that's done, we can get the hash of category counts back from our object
counts = counter.category_counts
p counts["Pigs"]
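To get from a hash of category counts to the output the question asks for, something like the following sketch could work. The method name `summarize`, the sample hash, and the total are assumptions for illustration; in practice the total would come from the file's totalcount element, and the hash from either the DOM or the SAX counter.

```ruby
# Hypothetical helper: sums the counts of every category whose name contains
# `fragment`, then subtracts that from the reported total.
def summarize(counts, total_count, fragment)
  interest = counts.sum { |name, n| name.include?(fragment) ? n : 0 }
  "Count of Interest_Category: #{interest}, " \
    "Count of All Else: #{total_count - interest}"
end

# Made-up numbers standing in for real parser output
sample_counts = { "Pigs" => 3, "Piglets" => 2, "Cows" => 5 }
puts summarize(sample_counts, 10, "Pig")
```

Matching on a substring mirrors the `include?` check in the question; an exact match would simply be `counts[fragment]`.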