I need to parse XML files that are 40 GB in size, then normalize the data and insert it into a MySQL database. It is not yet clear how much of the file I need to store in the database, nor do I know the XML structure.
Which parser should I use, and how would you go about doing this?
In PHP, you can read extremely large XML files with `XMLReader`.
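For instance, a minimal sketch of that node-by-node walk (the file name is only a placeholder):

```php
<?php
// Minimal sketch: walk a large XML file node by node with XMLReader.
// Only the node the cursor currently points at is held in memory.
$reader = new XMLReader();
$reader->open('large.xml'); // placeholder file name

while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT) {
        // e.g. inspect $reader->localName, $reader->depth, attributes, ...
    }
}

$reader->close();
```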
Extremely large XML files should be stored in a compressed format on disk. At least that makes sense, since XML files have a high compression ratio, for example gzipped as `large.xml.gz`. PHP supports that quite well with `XMLReader` via the compression stream wrappers.
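A sketch of opening such a gzipped file through the `compress.zlib://` stream wrapper, so it is decompressed on the fly while being read (again, the file name is a placeholder):

```php
<?php
// Sketch: open the gzipped file directly through the zlib stream wrapper;
// XMLReader streams through it while it is decompressed on the fly.
$xmlfile = 'compress.zlib://large.xml.gz';

$reader = new XMLReader();
$reader->open($xmlfile);

while ($reader->read()) {
    // ... same forward-only processing as with the uncompressed file ...
}

$reader->close();
```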
`XMLReader` only lets you operate on the current element. That means it is forward-only; if you need to keep parser state, you have to build it yourself.

I often find it helpful to wrap the basic movements into a set of iterators that know how to operate on `XMLReader`, for example iterating through elements or only through child elements, as in the sketch below. You will find this outlined in Parse XML with PHP and XMLReader.
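As one possible shape of such a wrapper, a sketch of a generator that yields each matching element as SimpleXML; the function name `elements` and the element name `record` are invented for the example, and it assumes the target elements are not nested inside each other:

```php
<?php
// Hypothetical helper: wrap the XMLReader movements in a generator so the
// calling code can simply foreach over the elements it is interested in.
function elements(XMLReader $reader, string $name): Generator
{
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->localName === $name) {
            // Serialize just this element and re-parse the small fragment,
            // which is convenient for normalization before the MySQL insert.
            yield simplexml_load_string($reader->readOuterXml());
        }
    }
}

$reader = new XMLReader();
$reader->open('compress.zlib://large.xml.gz');

foreach (elements($reader, 'record') as $record) {
    // ... normalize $record and insert it into MySQL ...
}

$reader->close();
```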
It would be nice to know what you actually intend to do with the XML. The way you parse it depends very much on the processing you need to carry out, as well as the size.
If this is a one-off task, then in the past I've started by discovering the XML structure before doing anything else. My DTDGenerator (see saxon.sf.net) was written for this purpose a long time ago and still does the job; there are other tools available now, but I don't know whether they do streamed processing, which is a prerequisite here.
You can write an application that processes the data using either a push or pull streaming parser (SAX or StAX). How easy this is depends on how much processing you have to do and how much state you have to maintain, which you haven't told us. Alternatively, you could try streamed XSLT processing, which is available in Saxon-EE.
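For completeness, and staying with PHP as in the answer above, push-style streamed parsing looks roughly like the following sketch using PHP's Expat-based `xml_parser` functions; the file name, chunk size, and handler bodies are only placeholders:

```php
<?php
// Sketch of push-style (SAX-like) streamed parsing with PHP's Expat parser.
// The parser calls your handlers as events arrive; any state you need
// (current element, accumulated row, etc.) has to be tracked yourself.
$parser = xml_parser_create();

xml_set_element_handler(
    $parser,
    function ($parser, $name, $attrs) {
        // element start: e.g. remember which element we are inside
    },
    function ($parser, $name) {
        // element end: e.g. flush a completed, normalized row to MySQL
    }
);
xml_set_character_data_handler($parser, function ($parser, $data) {
    // text content between tags
});

// Feed the file in manageable chunks instead of loading it all at once.
$fp = fopen('large.xml', 'rb'); // placeholder file name
while (!feof($fp)) {
    xml_parse($parser, (string) fread($fp, 1 << 20), feof($fp));
}
fclose($fp);
xml_parser_free($parser);
```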