Before anyone suggests duplicates: I don't think this is one, so maybe read this first. I'll try to be as brief as possible; the title gives the basic idea.
Here is an example XML(case 1) :
<root>
<Item>
<ItemID>4504216603</ItemID>
<ListingDetails>
<StartTime>10:00:10.000Z</StartTime>
<EndTime>10:00:30.000Z</EndTime>
<ViewItemURL>http://url</ViewItemURL>
....
</Item>
Here is an example XML(case 2) :
<Item>
<ItemID>4504216604</ItemID>
<ListingDetails>
<StartTime>10:30:10.000Z</StartTime>
<!-- Start difference from case 1 -->
<averages>
<AverageTime>value1</AverageTime>
<category type="TX">9823</category>
<category type="TY">9112</category>
<AveragePrice>value2</AveragePrice>
</averages>
<!-- End difference from case 1 -->
<EndTime>11:00:10.000Z</EndTime>
<ViewItemURL>http://url</ViewItemURL>
....
</Item>
</root>
I borrowed this XML from Google. Anyway, my objects are not always the same: sometimes there are extra elements, as in case 2. Now I'd like to produce CSV like this from both cases:
ItemID,StartTime,EndTime,ViewItemURL,AverageTime,AveragePrice
4504216603,10:00:10.000Z,10:00:30.000Z,http://url
4504216604,10:30:10.000Z,11:00:10.000Z,http://url,value1,value2
The first line is the header; it should also be included in the CSV. I got some useful links about StAX today, but I really don't know what the right/optimal approach for this is. I've been struggling with this for three days now and I'm not willing to give up yet.
Tell me what you think: how would you solve this?
I forgot to mention that this is a very large XML file, up to 1 GB.
BOUNTY UPDATE :
I'm looking for a more generic approach, meaning that it should work for any number of nodes at any depth. As in the example XML, it can also happen that one item object has more nodes than the next/previous one, so that case needs handling too (so that all columns and values match up in the CSV).
It can also happen that nodes have the same name/localName but different values and attributes; if that is the case, a new column should appear in the CSV with the appropriate value. (I added an example of this case inside the <averages> tag, called category.)
I'm not sure I understand how generic the solution should be. Do you really want to parse a 1 GB file twice for a generic solution? And if you want something generic, why did you skip the <category> element in your example? How many different formats do you need to handle? Do you really not know what the format can be (even if some elements can be omitted)? Can you clarify? In my experience, it's generally preferable to parse specific files in a specific way (this doesn't exclude using a generic API, though). My answer goes in this direction (and I'll update it after the clarification).
If you don't feel comfortable with XML, you could consider using some existing (commercial) libraries, for example Ricebridge XML Manager and CSV Manager. See How to convert CSV into XML and XML into CSV using Java for a full example. The approach is pretty straightforward: you define the data fields using XPath expressions (which is perfect in your case, since you can have "extra" elements), parse the file, and then pass the resulting List to the CSV component to generate the CSV file. The API looks simple, the code is tested (the source code of their test cases is available under a BSD-style license), and they claim support for gigabyte-sized files. You can get a Single Developer license for $170, which is not very expensive compared to developer daily rates.
They offer a 30-day trial version; have a look.
Another option would be to use Spring Batch. Spring Batch offers everything required to work with XML files as input or output (using StAX and the XML binding framework of your choice) and flat files as input or output. See the Spring Batch reference documentation.
You could also use Smooks to do XML-to-CSV transformations.
Another option would be to roll your own solution using a StAX parser or, why not, VTD-XML and XPath.
Based on your described requirement, the easiest way to code this is to use FreeMarker's XML processing feature. See the docs.
In this case you will only need the template that will produce a CSV.
An alternative to this is XMLGen, which is very similar in approach. Just look at that diagram and the examples; instead of SQL statements, you will output CSV.
These two similar approaches are not "conventional", but they do the job very quickly for your situation, and you don't have to learn XSL (which is quite hard to master, I think).
Note that this would be a prime example of using XSLT except that most XSLT processors read in the whole XML file into memory which is not an option as it is large. Note, however, that the enterprise version of Saxon can do streaming XSLT processing (if the XSLT script adheres to the restrictions).
You may also want to use an external XSLT processor outside your JVM instead, if applicable. This opens up for several more options.
Streaming in Saxon-EE: http://www.saxonica.com/documentation/sourcedocs/serial.html
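For files small enough to fit in memory, a plain JAXP transform illustrates the idea; the class name XsltToCsv and the embedded stylesheet below are my own sketch, not code from the original answer. Optional elements come out as empty CSV fields for free, because xsl:value-of on a missing node yields an empty string. For the 1 GB case you would still need a streaming processor such as Saxon-EE.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltToCsv {
    // Minimal XSLT 1.0 stylesheet: one CSV line per Item; a missing
    // element (e.g. AverageTime) simply produces an empty field.
    private static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/'>"
      + "<xsl:text>ItemID,StartTime,EndTime,ViewItemURL,AverageTime,AveragePrice&#10;</xsl:text>"
      + "<xsl:for-each select='//Item'>"
      + "<xsl:value-of select='ItemID'/><xsl:text>,</xsl:text>"
      + "<xsl:value-of select='.//StartTime'/><xsl:text>,</xsl:text>"
      + "<xsl:value-of select='.//EndTime'/><xsl:text>,</xsl:text>"
      + "<xsl:value-of select='.//ViewItemURL'/><xsl:text>,</xsl:text>"
      + "<xsl:value-of select='.//AverageTime'/><xsl:text>,</xsl:text>"
      + "<xsl:value-of select='.//AveragePrice'/><xsl:text>&#10;</xsl:text>"
      + "</xsl:for-each>"
      + "</xsl:template></xsl:stylesheet>";

    public static String transform(String xml) throws Exception {
        Transformer t = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><Item><ItemID>4504216603</ItemID>"
            + "<ListingDetails><StartTime>10:00:10.000Z</StartTime>"
            + "<EndTime>10:00:30.000Z</EndTime>"
            + "<ViewItemURL>http://url</ViewItemURL>"
            + "</ListingDetails></Item></root>";
        System.out.print(transform(xml));
    }
}
```

Note the hard-coded column list: this variant is deliberately non-generic, which is exactly the trade-off the comments above discuss.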
The code provided should be considered a sketch rather than the definitive article. I am not an expert on SAX and the implementation could be improved for better performance, simpler code etc. That said SAX should be able to cope with streaming large XML files.
I would approach this problem with two passes using the SAX parser. (Incidentally, I would also use a CSV-generating library to create the output, as this would deal with all the fiddly character escaping that CSV involves, but I haven't implemented this in my sketch.)
First pass: Establish number of header columns
Second pass: Output CSV
I assume that the XML file is well formed. I assume that we don't have a schema/DTD with a predefined order.
In the first pass I have assumed that a CSV column will be added for every XML element containing text content or for any attribute (I have assumed attributes will contain something!).
The second pass, having established the number of target columns, will do the actual CSV output.
Based on your example XML my code sketch would produce:
Please note I have used the Google Collections LinkedHashMultimap, as this is helpful when associating multiple values with a single key. I hope you find this useful!
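The original code sketch is not reproduced above, but to make the first pass concrete, here is a minimal handler of my own (HeaderPass is a hypothetical name). It collects one column per text-bearing leaf element, and simplifies the attribute rule to one column per element/attribute pair named "element@attribute":

```java
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// First pass only: discover the CSV header columns.
public class HeaderPass extends DefaultHandler {
    private final Set<String> columns = new LinkedHashSet<>(); // first-seen order
    private String openElement;  // name of the most recently opened element
    private boolean hasText;     // did it contain non-whitespace text?

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        openElement = qName;
        hasText = false;
        // One column per element/attribute pair, e.g. "category@type".
        for (int i = 0; i < atts.getLength(); i++) {
            columns.add(qName + "@" + atts.getQName(i));
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (new String(ch, start, length).trim().length() > 0) {
            hasText = true;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // A leaf element: it was the last one opened and contained text.
        if (qName.equals(openElement) && hasText) {
            columns.add(qName);
        }
        openElement = null;
        hasText = false;
    }

    public Set<String> getColumns() { return columns; }

    public static void main(String[] args) throws Exception {
        String xml = "<root><Item><ItemID>4504216604</ItemID>"
            + "<averages><AverageTime>value1</AverageTime>"
            + "<category type='TX'>9823</category>"
            + "<AveragePrice>value2</AveragePrice></averages></Item></root>";
        HeaderPass pass = new HeaderPass();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), pass);
        System.out.println(String.join(",", pass.getColumns()));
    }
}
```

The second pass would then walk the file again with the discovered column set fixed, filling in an empty field for every column an Item does not provide.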
I am not convinced that SAX is the best approach for you. There are different ways you could use SAX here, though.
If element order is not guaranteed within certain elements, like ListingDetails, then you need to be proactive.
When you start a ListingDetails, initialize a map as a member variable on the handler. In each subelement, set the appropriate key-value in that map. When you finish a ListingDetails, examine the map and explicitly mock values such as nulls for the missing elements. Assuming you have one ListingDetails per item, save it to a member variable in the handler.
Now, when your item element is over, have a function that writes the line of CSVs based on the map in the order you wanted.
The risk with this is if you have corrupted XML. I would strongly consider setting all these variables to null when an item starts, and then checking for errors and announcing them when the item ends.
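A minimal sketch of that handler, assuming a fixed, known column order (the COLUMNS list and the class name ItemCsvHandler are my own illustration, not code from the answer). Each Item gets a fresh map; at the end of the Item, missing keys simply become empty CSV fields:

```java
import java.io.StringReader;
import java.util.*;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ItemCsvHandler extends DefaultHandler {
    // Hypothetical fixed column order; extend as needed.
    private static final List<String> COLUMNS = Arrays.asList(
        "ItemID", "StartTime", "EndTime", "ViewItemURL", "AverageTime", "AveragePrice");

    private Map<String, String> current;          // values for the Item being parsed
    private final StringBuilder text = new StringBuilder();
    private final StringBuilder csv =
        new StringBuilder(String.join(",", COLUMNS)).append('\n');

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals("Item")) {
            current = new HashMap<>();            // reset state for each Item
        }
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current == null) return;              // outside of any Item
        if (qName.equals("Item")) {
            // Emit one CSV line; missing columns become empty fields.
            StringJoiner line = new StringJoiner(",");
            for (String col : COLUMNS) {
                line.add(current.getOrDefault(col, ""));
            }
            csv.append(line).append('\n');
            current = null;
        } else if (COLUMNS.contains(qName)) {
            current.put(qName, text.toString().trim());
        }
    }

    public String getCsv() { return csv.toString(); }

    public static void main(String[] args) throws Exception {
        String xml = "<root><Item><ItemID>4504216603</ItemID>"
            + "<ListingDetails><StartTime>10:00:10.000Z</StartTime>"
            + "<EndTime>10:00:30.000Z</EndTime>"
            + "<ViewItemURL>http://url</ViewItemURL></ListingDetails></Item></root>";
        ItemCsvHandler handler = new ItemCsvHandler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        System.out.print(handler.getCsv());
    }
}
```

Because the map is consulted only once per Item, element order inside ListingDetails does not matter, which is the point of the approach above.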
Here is some code that implements the conversion of the XML to CSV using StAX. Although the XML you gave is only an example, I hope it shows you how to handle the optional elements.
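The answer's code is not shown above; a self-contained StAX sketch along those lines (the class name StaxToCsv and the fixed COLUMNS list are my assumptions) could look like this. Because the reader pulls events one at a time, memory use stays flat even for a 1 GB file:

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.*;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxToCsv {
    // Hypothetical fixed column order; absent elements become empty fields.
    private static final List<String> COLUMNS = Arrays.asList(
        "ItemID", "StartTime", "EndTime", "ViewItemURL", "AverageTime", "AveragePrice");

    public static String convert(Reader in) throws Exception {
        XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(in);
        StringBuilder csv = new StringBuilder(String.join(",", COLUMNS)).append('\n');
        Map<String, String> row = null;   // values of the Item being parsed
        String field = null;              // column element currently open
        StringBuilder text = new StringBuilder();

        while (r.hasNext()) {
            switch (r.next()) {
            case XMLStreamConstants.START_ELEMENT:
                String name = r.getLocalName();
                if (name.equals("Item")) {
                    row = new LinkedHashMap<>();
                } else if (row != null && COLUMNS.contains(name)) {
                    field = name;
                    text.setLength(0);
                }
                break;
            case XMLStreamConstants.CHARACTERS:
                if (field != null) text.append(r.getText());
                break;
            case XMLStreamConstants.END_ELEMENT:
                name = r.getLocalName();
                if (name.equals("Item")) {
                    StringJoiner line = new StringJoiner(",");
                    for (String c : COLUMNS) line.add(row.getOrDefault(c, ""));
                    csv.append(line).append('\n');
                    row = null;
                } else if (name.equals(field)) {
                    row.put(field, text.toString().trim());
                    field = null;
                }
                break;
            }
        }
        return csv.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root>"
            + "<Item><ItemID>4504216603</ItemID><ListingDetails>"
            + "<StartTime>10:00:10.000Z</StartTime><EndTime>10:00:30.000Z</EndTime>"
            + "<ViewItemURL>http://url</ViewItemURL></ListingDetails></Item>"
            + "<Item><ItemID>4504216604</ItemID><ListingDetails>"
            + "<StartTime>10:30:10.000Z</StartTime>"
            + "<averages><AverageTime>value1</AverageTime>"
            + "<AveragePrice>value2</AveragePrice></averages>"
            + "<EndTime>11:00:10.000Z</EndTime>"
            + "<ViewItemURL>http://url</ViewItemURL></ListingDetails></Item>"
            + "</root>";
        System.out.print(convert(new StringReader(xml)));
    }
}
```

For real CSV output you would still want a CSV library to handle quoting and escaping, as noted in the SAX answer above.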