SAX parsing strange elements with Nokogiri

Posted 2019-08-02 17:55

Question:

I want to SAX-parse with Nokogiri, but when an XML element has a long, unusual name or carries an attribute, everything goes wrong.

For instance, if I want to parse this XML file and grab all the title elements, how do I do that with Nokogiri's SAX parser?

<titles>
    <title xml:lang="sv">Arkivvetenskap</title>
    <title xml:lang="en">Archival science</title>
</titles>

Answer 1:

In your example, title is the name of the element and xml:lang="sv" is an attribute. The parser below assumes that no elements are nested inside title elements.

require 'rubygems'
require 'nokogiri'

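class MyDocument < Nokogiri::XML::SAX::Document
  # Called for every opening tag; attrs is an array of [name, value] pairs.
  def start_element(name, attrs = [])
    return unless name == 'title'
    @attrs = attrs
    @content = ''
  end

  # Called for every closing tag; print what was collected for a title.
  def end_element(name)
    return unless name == 'title'
    puts Hash[@attrs]['xml:lang']
    puts @content.inspect
    @content = nil
  end

  # Text can arrive in several chunks, so append instead of assigning.
  def characters(string)
    @content << string if @content
  end

  # Treat CDATA sections like ordinary text.
  def cdata_block(string)
    characters(string)
  end
end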

parser = Nokogiri::XML::SAX::Parser.new(MyDocument.new)
parser.parse(DATA)

__END__
<titles>
    <title xml:lang="sv">Arkivvetenskap</title>
    <title xml:lang="en">Archival science</title>
</titles>

This prints

sv
"Arkivvetenskap"
en
"Archival science"

SAX parsing is usually more complex than the task requires, because you have to track the parser state yourself. For that reason I recommend Nokogiri's standard in-memory parser or, if you really need speed and memory efficiency, Nokogiri's Reader parser.

For comparison, here is the same document parsed with the standard Nokogiri parser:

require 'rubygems'
require 'nokogiri'

doc = Nokogiri::XML(DATA)
doc.css('title').each do |title|
  puts title['lang']
  puts title.text.to_s.inspect
end

__END__
<titles>
    <title xml:lang="sv">Arkivvetenskap</title>
    <title xml:lang="en">Archival science</title>
</titles>

And here is the same document parsed with the Reader parser:

require 'rubygems'
require 'nokogiri'

reader = Nokogiri::XML::Reader(DATA)
while reader.read
  if reader.name == 'title' && reader.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
    puts reader.attribute('xml:lang')
    puts reader.inner_xml.inspect # TODO xml decode this, if necessary.
  end
end

__END__
<titles>
    <title xml:lang="sv">Arkivvetenskap</title>
    <title xml:lang="en">Archival science</title>
</titles>