Node.js Example to convert Xml to JSON for large X

Posted 2019-05-29 08:58

Question:

I'm relatively new to Node.js. I'm trying to convert 83 XML files that are each around 400MB in size into JSON.

Each file contains data like this (except each element has a large number of additional statements):

<case-file>
  <serial-number>75563140</serial-number>
  <registration-number>0000000</registration-number>
  <transaction-date>20130101</transaction-date>
  <case-file-header>
     <filing-date>19981002</filing-date>
     <status-code>686</status-code>
     <status-date>20130101</status-date>
  </case-file-header>
  <case-file-statements>
     <case-file-statement>
        <type-code>D10000</type-code>
        <text>"MUSIC"</text>
     </case-file-statement>
     <case-file-statement>
        <type-code>GS0351</type-code>
        <text>compact discs</text>
     </case-file-statement>
  </case-file-statements>
  <case-file-event-statements>
     <case-file-event-statement>
        <code>PUBO</code>
        <type>A</type>
        <description-text>PUBLISHED FOR OPPOSITION</description-text>
        <date>20130101</date>
        <number>28</number>
     </case-file-event-statement>
     <case-file-event-statement>
        <code>NPUB</code>
        <type>O</type>
        <description-text>NOTICE OF PUBLICATION</description-text>
        <date>20121212</date>
        <number>27</number>
     </case-file-event-statement>
  </case-file-event-statements>
</case-file>

I have tried a lot of different Node modules, including sax, node-xml, node-expat and xml2json. Obviously, I need to stream the data from the file and pipe it through an XML parser and then convert it to JSON.

I have also read a number of blog posts that attempt to explain, albeit superficially, how to parse XML.

In the Node universe, I tried sax first, but I can't figure out how to extract the data in a format that I can convert to JSON. xml2json won't work on streams. node-xml looks encouraging, but I can't figure out how it parses chunks in any manner that makes sense. node-expat points to the libexpat documentation, which appears to require a Ph.D. Node elementtree does the same, pointing to the Python implementation, but doesn't explain what has been implemented or how to use it.

Can someone point me to an example that I could use to get started?

Answer 1:

I doubt this is still relevant after 2-3 years, but in case anyone else stumbles on this: xml-stream on npm looked rather straightforward to me.

If you're a Windows user who wants to avoid GYP, however, I put together a very simple solution that uses sax to extract children from an XML file one by one. It's called no-gyp-xml-stream, and it may not have a lot of features, but it certainly is simple to use: https://www.npmjs.com/package/no-gyp-xml-stream



Answer 2:

Although this question is quite old, I am sharing my problem and solution, which might be helpful to anyone trying to convert XML to JSON.

The actual problem here is not the conversion but processing huge XML files without holding them in memory all at once.

Working with almost all of the widely used packages, I came across the following problems:

  • A lot of packages support XML to JSON conversion and cover all scenarios, but they don't work well with large files.

  • Very few packages (like xml-flow, xml-stream) support large XML file conversion, and their conversion misses a few corner cases where it either fails or produces an unpredictable JSON structure (explained in this SO question).

The ideal solution would be to combine the advantages of both approaches, which is exactly what I did when I came up with the xtreamer Node package.

In simple words, xtreamer accepts a repeating node name just like xml-flow / xml-stream, but emits the repeating XML nodes instead of converted JSON. This provides the following advantages:

  • We can pipe xtreamer with any readable stream as it extends transform stream.
  • The emitted XML nodes can be transferred to any XML to JSON parser to get desired JSON.
  • We can go one step further and hook up the JSON parser with xtreamer & it will invoke the JSON parser and emit JSON accordingly.
  • xtreamer has stream as its only dependency & being a transform stream extension, it can be piped with other streams flexibly.
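
To make the idea of "a transform stream that emits repeating XML nodes" concrete, here is a minimal sketch that uses only Node's built-in modules. It is not xtreamer's actual code or API: the NodeSplitter class and its property names are hypothetical, and it assumes the repeating node never nests inside itself, is never self-closing, and never appears inside comments or CDATA.

```javascript
const { Transform } = require('stream');
const { StringDecoder } = require('string_decoder');

// Hypothetical NodeSplitter: NOT xtreamer's real code, only the same idea.
// Emits every complete <nodeName>...</nodeName> block as one string.
class NodeSplitter extends Transform {
  constructor(nodeName) {
    super({ readableObjectMode: true });
    const escaped = nodeName.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    // "<name" must be followed by whitespace or ">" so that "<case-file>"
    // does not also match "<case-file-header>".
    this.openTag = new RegExp('<' + escaped + '(?=[\\s>])');
    this.closeTag = '</' + nodeName + '>';
    this.tailLength = nodeName.length + 1; // longest cut-off opening tag
    this.utf8Decoder = new StringDecoder('utf8'); // copes with split multi-byte characters
    this.pending = ''; // text received but not yet emitted
  }

  _transform(chunk, encoding, callback) {
    this.pending += this.utf8Decoder.write(chunk);
    this.emitCompleteNodes();
    callback();
  }

  _flush(callback) {
    this.pending += this.utf8Decoder.end();
    this.emitCompleteNodes();
    callback();
  }

  emitCompleteNodes() {
    for (;;) {
      const open = this.openTag.exec(this.pending);
      if (!open) {
        // No node has started: keep only a tail that could be a cut-off tag.
        this.pending = this.pending.slice(-this.tailLength);
        return;
      }
      const close = this.pending.indexOf(this.closeTag, open.index);
      if (close === -1) {
        // Node is incomplete: drop what precedes it and wait for more data.
        this.pending = this.pending.slice(open.index);
        return;
      }
      const end = close + this.closeTag.length;
      this.push(this.pending.slice(open.index, end));
      this.pending = this.pending.slice(end);
    }
  }
}
```

Piping fs.createReadStream(path) into new NodeSplitter('case-file') and handing each 'data' string to an XML to JSON converter would complete the pipeline. Memory use then depends on the size of a single node rather than on the size of the whole file.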

What if the XML structure is not fixed?

I managed to come up with another sax-based Node package, xtagger, which reads the XML file and reports its structure in the following format:

structure: { [name: string]: { [hierarchy: number]: number } };

This package lets you figure out the repeating node name, which can then be passed to xtreamer for parsing.
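
As a rough illustration of that structure shape, the sketch below counts how often each tag name occurs at each nesting depth. It is not xtagger's code: the tagStructure function is hypothetical, it assumes "hierarchy" means nesting depth with the root at 0 and that the value is an occurrence count, and a naive regex scanner stands in for a real SAX parser (it ignores comments, CDATA and attribute values containing ">").

```javascript
// Hypothetical sketch: build { [name]: { [depth]: count } } for an XML string.
// Assumption: depth 0 is the root element.
function tagStructure(xml) {
  const structure = {};
  let depth = 0;
  // Groups: 1 = "/" on a closing tag, 2 = tag name, 3 = "/" on a self-closing tag.
  const tagPattern = /<(\/?)([A-Za-z_][\w.:-]*)[^>]*?(\/?)>/g;
  let match;
  while ((match = tagPattern.exec(xml)) !== null) {
    const [, closing, name, selfClosing] = match;
    if (closing) {
      depth -= 1; // leaving an element
      continue;
    }
    const byDepth = structure[name] || (structure[name] = {});
    byDepth[depth] = (byDepth[depth] || 0) + 1;
    if (!selfClosing) depth += 1; // entering an element
  }
  return structure;
}

const sample =
  '<root><case-file><text>a</text><text>b</text></case-file>' +
  '<case-file><text>c</text></case-file></root>';
console.log(tagStructure(sample));
```

In a result like this, a name with a count above 1 at a shallow depth (here case-file at depth 1) is a natural candidate for the repeating node. For files of the size in the question, the same counting would have to be driven by a streaming parser's open and close tag events rather than by a string held in memory.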

I hope this helps. :)



Answer 3:

I guess that by now you have a working process, considering your last answer.

Anyway, if you've already successfully parsed the incoming data with SAX, the solution might simply be to put the data in an object of your design and use yourStream.write(JSON.stringify(yourObject)) to stream it out.
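
A minimal sketch of that approach, under some assumptions: createRecordBuilder and its handler names are hypothetical, the object layout (repeated child names become arrays, attributes are ignored) is just one possible design, and the parser wiring is left out. The three handlers are meant to be called from a SAX parser's open tag, text and close tag events; the sax package, for example, emits opentag, text and closetag.

```javascript
// Hypothetical builder: turns SAX-style events into one object per record
// element and writes each finished record as a line of JSON to `out`.
function createRecordBuilder(recordName, out) {
  const stack = []; // one frame per element currently open inside a record

  return {
    onOpenTag(name) {
      // Start collecting only once a record element has been entered.
      if (stack.length > 0 || name === recordName) {
        stack.push({ name, children: Object.create(null), text: '' });
      }
    },

    onText(text) {
      if (stack.length > 0) stack[stack.length - 1].text += text;
    },

    onCloseTag() {
      if (stack.length === 0) return; // closing tag outside any record
      const frame = stack.pop();
      const hasChildren = Object.keys(frame.children).length > 0;
      // Leaf elements become strings, elements with children become objects.
      const value = hasChildren ? frame.children : frame.text.trim();

      if (stack.length === 0) {
        // Record complete: stream it out and let it be garbage collected.
        out.write(JSON.stringify(value) + '\n');
        return;
      }

      const parent = stack[stack.length - 1].children;
      if (!(frame.name in parent)) {
        parent[frame.name] = value;
      } else if (Array.isArray(parent[frame.name])) {
        parent[frame.name].push(value);
      } else {
        // Second occurrence of the same child name: switch to an array.
        parent[frame.name] = [parent[frame.name], value];
      }
    },
  };
}
```

Here out can be any writable stream, such as the result of fs.createWriteStream. Writing one JSON document per line keeps only a single record in memory at a time. A real pipeline should also honour the return value of out.write and pause the parser until 'drain' fires, which this sketch does not do.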