I have a huge XML file (40 GB). I would like to extract some fields from it without loading the entire file into memory. Any suggestions?
Answer 1:
A quick example using XMLEventReader, based on a SAXParser tutorial (as posted by Rinat Tainov).
I'm sure it can be done better, but it shows the basic usage:
import scala.annotation.tailrec
import scala.io.Source
import scala.xml.pull._

object Main extends App {
  // Events are pulled one at a time, so the file is never loaded as a whole.
  val xml = new XMLEventReader(Source.fromFile("test.xml"))

  // currNode is the stack of open elements, innermost first.
  def printText(text: String, currNode: List[String]): Unit = {
    currNode match {
      case List("firstname", "staff", "company") => println("First Name: " + text)
      case List("lastname", "staff", "company") => println("Last Name: " + text)
      case List("nickname", "staff", "company") => println("Nick Name: " + text)
      case List("salary", "staff", "company") => println("Salary: " + text)
      case _ => ()
    }
  }

  def parse(xml: XMLEventReader): Unit = {
    // Tail-recursive, so the stack does not grow with the size of the file.
    @tailrec
    def loop(currNode: List[String]): Unit = {
      if (xml.hasNext) {
        xml.next() match {
          case EvElemStart(_, label, _, _) =>
            println("Start element: " + label)
            loop(label :: currNode) // push the newly opened element
          case EvElemEnd(_, label) =>
            println("End element: " + label)
            loop(currNode.tail) // pop the element that just closed
          case EvText(text) =>
            printText(text, currNode)
            loop(currNode)
          case _ => loop(currNode) // ignore comments, entity refs, etc.
        }
      }
    }
    loop(List.empty)
  }

  parse(xml)
}
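As far as I know, newer scala-xml releases no longer ship scala.xml.pull, so here is a rough sketch of the same pull-style loop on top of the JDK's StAX reader (javax.xml.stream). The object name StaxExtract, the onField callback and the wanted set are made up for illustration, and the company/staff layout is assumed from the example above:

```scala
import java.io.InputStream
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}

object StaxExtract {
  // Hypothetical field names, assumed from the company/staff layout above.
  val wanted = Set("firstname", "lastname", "nickname", "salary")

  /** Pulls events one by one and calls onField(name, text) for each wanted field. */
  def extract(in: InputStream)(onField: (String, String) => Unit): Unit = {
    val factory = XMLInputFactory.newInstance()
    // Ask the parser to deliver each text node as a single CHARACTERS event.
    factory.setProperty(XMLInputFactory.IS_COALESCING, java.lang.Boolean.TRUE)
    val reader = factory.createXMLStreamReader(in)
    var path = List.empty[String] // open elements, innermost first
    try {
      while (reader.hasNext) {
        reader.next() match {
          case XMLStreamConstants.START_ELEMENT =>
            path = reader.getLocalName :: path
          case XMLStreamConstants.END_ELEMENT =>
            path = path.tail
          case XMLStreamConstants.CHARACTERS if !reader.isWhiteSpace =>
            path match {
              case field :: "staff" :: "company" :: Nil if wanted(field) =>
                onField(field, reader.getText)
              case _ => ()
            }
          case _ => ()
        }
      }
    } finally reader.close()
  }
}
```

You would call it with something like StaxExtract.extract(new java.io.FileInputStream("test.xml"))((name, text) => println(name + ": " + text)). Only the current event and the element path are kept in memory.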
Answer 2:
Use SAXParser; it does not load the entire XML into memory. Here is a good Java example, which can easily be used from Scala.
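To give an idea of what that looks like from Scala, here is a minimal sketch using the JDK's javax.xml.parsers.SAXParser with a DefaultHandler. It is not the linked Java example; the object name SaxExtract, the onField callback and the wanted set are invented for illustration, and the company/staff element layout is an assumption:

```scala
import java.io.InputStream
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.Attributes
import org.xml.sax.helpers.DefaultHandler

object SaxExtract {
  // Hypothetical field names, assuming a company/staff/<field> layout.
  val wanted = Set("firstname", "lastname", "nickname", "salary")

  /** Streams the input and calls onField(name, text) for each wanted field. */
  def extract(in: InputStream)(onField: (String, String) => Unit): Unit = {
    val handler = new DefaultHandler {
      private var path = List.empty[String]         // open elements, innermost first
      private val text = new java.lang.StringBuilder // text of the current element

      override def startElement(uri: String, localName: String,
                                qName: String, attrs: Attributes): Unit = {
        path = qName :: path
        text.setLength(0)
      }

      // characters() may fire several times per text node, so accumulate.
      override def characters(ch: Array[Char], start: Int, length: Int): Unit =
        text.append(ch, start, length)

      override def endElement(uri: String, localName: String, qName: String): Unit = {
        path match {
          case field :: "staff" :: "company" :: Nil if wanted(field) =>
            onField(field, text.toString.trim)
          case _ => ()
        }
        path = path.tail
        text.setLength(0)
      }
    }
    // The parser pushes events to the handler as it reads; no tree is built.
    SAXParserFactory.newInstance().newSAXParser().parse(in, handler)
  }
}
```

Passing a callback rather than returning a list means the extracted fields do not pile up in memory either; for a 40 GB file you would probably write them straight to a file or a database inside onField.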
Answer 3:
If you are open to alternative XML libraries, Scales Xml provides three main pull-parsing approaches:
- Iterator based - simply call hasNext and next to get more items
- iterate function - provides an Iterator, but over trees identified by a simple path
- Iteratee based - allows combinations of multiple paths
The focus of the upcoming 0.5 version is asynchronous parsing via aalto-xml, allowing for additional non-blocking control options.
In all cases you can control both memory usage and how the document is processed with Scales.