Tool to extract data structures from unclean data

Posted 2019-08-23 13:30

I have unstructured, generally unclean data in a database field. There are common structures that appear consistently in the data,

namely:

field:

name:value 

fieldset: 

nombre <FieldSet>
field,
  .
  .
  .
field(n)

table

nombre <table>
head(1)... head(n)
val(1)...  val(n)
      .
      .
      .

I was wondering whether there is a tool (preferably in Java) that could learn these data structures, parse the file, and convert it to a Map or object that I could run validation checks on.

I am aware of ANTLR, but I understand it is geared more towards tree construction than towards independent bits of data (am I wrong about this?).

Does anyone have any suggestions for the problem as a whole?

3 answers
[This account has been banned]
#2 · 2019-08-23 14:02

We ended up using ANTLR for this. It required us to write multiple lexers, where one lexer would manipulate the input for the next lexer.
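
A rough sketch of the staged idea, written in plain Java rather than as ANTLR lexers, so it illustrates the shape of the approach and nothing more. The class and method names are hypothetical, and it handles only the name:value structure from the question: the first pass cleans the raw text, and the second pass consumes that cleaned output and builds a Map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical two-stage pipeline in plain Java (an assumption, not ANTLR).
final class StagedFieldParser {

    // Stage 1: trim each line and keep only lines that could be a field.
    static List<String> normalize(String raw) {
        List<String> lines = new ArrayList<>();
        for (String line : raw.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.contains(":")) {
                lines.add(trimmed);
            }
        }
        return lines;
    }

    // Stage 2: split each cleaned line at its first ':' into name and value.
    static Map<String, String> parseFields(List<String> lines) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : lines) {
            int colon = line.indexOf(':');
            if (colon < 1) {
                continue; // no colon, or nothing before it: not a field
            }
            String name = line.substring(0, colon).trim();
            if (!name.isEmpty()) {
                fields.put(name, line.substring(colon + 1).trim());
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String raw = "  name: Alice \n### junk ###\nage :42\n\nurl:http://example.com";
        // Prints {name=Alice, age=42, url=http://example.com}
        System.out.println(parseFields(normalize(raw)));
    }
}
```

The resulting Map can then be handed to whatever validation checks are needed. Fieldsets and tables would need further stages of their own.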

Another project is PADS, written in C.

放荡不羁爱自由
#3 · 2019-08-23 14:10

You should use "bnflite": https://github.com/r35382/bnflite. Using this template library, you develop a BNF-like grammar for your text by means of classes and overloaded operators directly in C++ code. The benefit is that such a grammar is easily adjusted to your source.

迷人小祖宗
#4 · 2019-08-23 14:25

I recommend Talend. It is a very versatile, open-source data integration tool, and it is based on Java. You can use the built-in tools/components to extract data from unstructured data sources. You can also write complex custom Java code to do what you want.

I used Talend in a couple of my scientific proof-of-concept projects, and it worked for me. The good part is that it is free!
