Parsing very large XML lazily

Posted 2020-02-28 05:37

Question:

I have a huge XML file (40 GB). I would like to extract some fields from it without loading the entire file into memory. Any suggestions?

Answer 1:

A quick example with XMLEventReader based on a tutorial for SAXParser here (as posted by Rinat Tainov).

I'm sure it can be done better but just to show basic usage:

import scala.annotation.tailrec
import scala.io.Source
import scala.xml.pull._

object Main extends App {
  // XMLEventReader yields parse events on demand instead of building a tree.
  val xml = new XMLEventReader(Source.fromFile("test.xml"))

  // currNode is the path of open elements, innermost element first.
  def printText(text: String, currNode: List[String]): Unit = {
    currNode match {
      case List("firstname", "staff", "company") => println("First Name: " + text)
      case List("lastname", "staff", "company") => println("Last Name: " + text)
      case List("nickname", "staff", "company") => println("Nick Name: " + text)
      case List("salary", "staff", "company") => println("Salary: " + text)
      case _ => ()
    }
  }

  def parse(xml: XMLEventReader): Unit = {
    // Tail-recursive, so the stack does not grow with the number of events.
    @tailrec
    def loop(currNode: List[String]): Unit = {
      if (xml.hasNext) {
        xml.next() match {
          case EvElemStart(_, label, _, _) =>
            println("Start element: " + label)
            loop(label :: currNode) // push the opened element onto the path
          case EvElemEnd(_, label) =>
            println("End element: " + label)
            loop(currNode.tail) // pop the closed element
          case EvText(text) =>
            printText(text, currNode)
            loop(currNode)
          case _ => loop(currNode) // ignore comments, entity refs, etc.
        }
      }
    }
    loop(List.empty)
  }

  parse(xml)
}
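
For setups where scala.xml.pull is not available (as far as I know it was deprecated in later scala-xml releases), the same path-tracking loop can be sketched with the JDK's own StAX pull parser, javax.xml.stream. This swaps in a different library from the one used above, and the names StaxSketch, fields and emit are made up for illustration. It is only a rough sketch, not a tuned implementation:

```scala
import java.io.InputStream
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}

object StaxSketch {
  // Hypothetical helper: calls `emit` with (path, text) for every
  // non-whitespace text node. `path` lists open elements, innermost first.
  def fields(in: InputStream)(emit: (List[String], String) => Unit): Unit = {
    val factory = XMLInputFactory.newInstance()
    // Ask the parser to merge adjacent character events into one.
    factory.setProperty(XMLInputFactory.IS_COALESCING, java.lang.Boolean.TRUE)
    val reader = factory.createXMLStreamReader(in)
    var path: List[String] = Nil
    try {
      while (reader.hasNext()) {
        reader.next() match {
          case XMLStreamConstants.START_ELEMENT =>
            path = reader.getLocalName :: path // push opened element
          case XMLStreamConstants.END_ELEMENT =>
            path = path.tail // pop closed element
          case XMLStreamConstants.CHARACTERS if !reader.isWhiteSpace =>
            emit(path, reader.getText)
          case _ => () // comments, processing instructions, etc.
        }
      }
    } finally reader.close()
  }
}
```

Called with a java.io.FileInputStream, the reader only holds the current event, so memory use should stay roughly constant regardless of file size. The callback can pattern match on the path exactly as printText does above.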


Answer 2:

Use SAXParser; it does not load the entire XML document into memory. Here is a good Java example that can easily be adapted to Scala.
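
A minimal sketch of what the SAX approach could look like in Scala, using only the JDK classes javax.xml.parsers.SAXParserFactory and org.xml.sax.helpers.DefaultHandler. The names SaxSketch, FieldHandler, extract, wanted and emit are assumptions made up for this example, and it matches elements by name only rather than by full path:

```scala
import java.io.InputStream
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.Attributes
import org.xml.sax.helpers.DefaultHandler

object SaxSketch {
  // Hypothetical handler: buffers the text of the current element and
  // reports it through `emit` when a wanted element closes.
  final class FieldHandler(wanted: Set[String], emit: (String, String) => Unit)
      extends DefaultHandler {
    private val text = new java.lang.StringBuilder

    override def startElement(uri: String, localName: String,
                              qName: String, attrs: Attributes): Unit =
      text.setLength(0) // start collecting text for the new element

    // SAX may deliver one text node in several chunks, so append each chunk.
    override def characters(ch: Array[Char], start: Int, length: Int): Unit =
      text.append(ch, start, length)

    override def endElement(uri: String, localName: String, qName: String): Unit = {
      if (wanted(qName)) emit(qName, text.toString.trim)
      text.setLength(0)
    }
  }

  // Streams the document through the handler; nothing is kept in memory
  // beyond the text of the element currently being read.
  def extract(in: InputStream, wanted: Set[String])(emit: (String, String) => Unit): Unit = {
    val parser = SAXParserFactory.newInstance().newSAXParser()
    parser.parse(in, new FieldHandler(wanted, emit))
  }
}
```

For a real file this would be called with something like new java.io.FileInputStream("test.xml") and a callback that prints or stores each field.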



Answer 3:

If you are open to alternative XML libraries, Scales Xml provides three main pull-parsing approaches:

  1. Iterator based - simply use hasNext, next to get more items
  2. iterate function - provides an Iterator but for trees identified by a simple path
  3. Iteratee based - allows combinations of multiple paths

The focus of the upcoming 0.5 version is asynchronous parsing via aalto-xml, allowing for additional non-blocking control options.

In all cases you can control both memory usage and how the document is processed with Scales.