I have unstructured geneally unclean data in a database field. There are common structures which are consistent in the data
namely:
field:
name:value
fieldset:
nombre <FieldSet>
field,
.
.
.
field(n)
table
nombre <table>
head(1)... head(n)
val(1)... val(n)
.
.
.
I was wondering if there was a tool (preferably in Java) that could extract learn/understand these data structures, parse the file and convert to a Map or object which I could run validation checks on?
I am aware of Antlr but understand this is more geared towards tree construction, an not independent bits of data (am I wrong about this?)
Does anyone have any suggestions for the problem as a whole?
We ended up using antlr for this, it required us to make multiple lexers where one lexer would manipulated the input for the next lexer.
Another project is pads - wrote in C
You should use "bnflite" https://github.com/r35382/bnflite Using this template library you need to develop BNF like gramma for your text by means of classes and overloaded operators directly in C++ code. The benefit is that such gramma is easily adjustable to your source
I recommend Talend. It is very versatile, open source data integration tool. It is based on java. You can use build in tools/components to extract data from unstructured data sources. You can also write complex custom java code to do what you want.
I used Talend in couple of scientific proof of concept projects of mine. It worked for me. Good part is, it is free!