
Pentaho Kettle split CSV into multiple records

Posted 2019-07-27 18:02

Question:

I'm new to Kettle, but getting on well with it so far. However I can't figure out how to do this.

I have a CSV which looks something like this:

a, col1, col2, col3
a, col1, col2, col3
a, col1, col2, col3
b, col1, col2, col3
b, col1, col2, col3
c, col1, col2, col3
c, col1, col2, col3

The first column starts with a key (a, b, c), and then the rest of the columns follow. What I want to do is read in the CSV (got that covered) and then split it based on the key, so I have 3 chunks/groups of data, and then convert each of those chunks into a separate JSON file, which I think I can get.

What I can't get my head around is grouping the data and then performing a separate action (convert to JSON) on each of those separate groups. It's not creating the JSON that I have an issue with.

The data is from a sensor network with many environmental sensors, so there are many keys, hundreds, and new ones get added. I've used MapReduce to process this data before; the concept of partitioning is what I'm trying to replicate here, without using the Hadoop elements of Kettle, as the deployment is different. Once I've partitioned the data it needs to be loaded into different places as separate records. The key is a unique ID (serial number) of a sensor.

Any ideas please?

Thanks

Answer 1:

I would create a JavaScript step to output the fields of a row as a JSON-like string, added to the row:

{"id":"a","col1":"1","col2":"2","col3":"3"}
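A minimal sketch of what that JavaScript (Modified Java Script Value) step could contain; the field names `id`, `col1`, `col2`, `col3` are assumptions based on the sample data:

```javascript
// Sketch: build a JSON-like string from a row's fields.
// Field names (id, col1, col2, col3) are assumed from the sample CSV.
function rowToJson(id, col1, col2, col3) {
  return '{"id":"' + id + '","col1":"' + col1 +
         '","col2":"' + col2 + '","col3":"' + col3 + '"}';
}

// Inside Kettle the step would expose the result as a new field, e.g.:
// var json_str = rowToJson(id, col1, col2, col3);
```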

Next you can use the Group By step: set the group field to 'id' and aggregate the JavaScript value with the aggregation type 'Concatenate strings separated by ,':

{"id":"a","col1":"1","col2":"2","col3":"3"},{"id":"a","col1":"4","col2":"5","col3":"6"},{"id":"a","col1":"7","col2":"8","col3":"9"}

Wrap some brackets around it and you have valid JSON. Next you can assemble a file name in a JavaScript step:

var file_name="C:\\dir\\"+ id + ".txt";

Use the Text file output step and set the file name field to 'file_name'. Clear the separator and enclosure options so no extra formatting is added, and you are done.
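Putting the pieces together, a second JavaScript step after the Group By could both wrap the concatenated value and build the file name. This is a sketch; the field name `grouped_json` and the output directory are assumptions, not Kettle built-ins:

```javascript
// Sketch: wrap the concatenated JSON objects in brackets so the result
// is a valid JSON array, and build a per-key file name.
function wrapGroup(id, grouped_json) {
  var json = "[" + grouped_json + "]";         // now valid JSON
  var file_name = "C:\\dir\\" + id + ".txt";   // one file per key
  return { json: json, file_name: file_name };
}
```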



Answer 2:

If I have understood your question correctly, you can use the "Group By" step to group the rows on the key column (i.e. the first column in your data set) and then store these in memory.

Once this is done, use parameter looping to "get the variables" and dynamically generate multiple JSON outputs. Check the image below:

In the JSON output step, use variables like header1 to generate multiple files. The changes I made in the JSON Output step are highlighted below.

In case you find it confusing, I have uploaded sample code here.
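For reference, the partition-by-key idea behind both answers can be sketched in plain JavaScript outside Kettle; the function and field names here are illustrative only:

```javascript
// Sketch: group rows by their first column (the sensor key) and emit
// one JSON document per key, mirroring the Group By + output steps.
function partitionByKey(rows) {
  var groups = {};
  for (var i = 0; i < rows.length; i++) {
    var key = rows[i][0];
    if (!groups[key]) groups[key] = [];
    groups[key].push({
      id: key, col1: rows[i][1], col2: rows[i][2], col3: rows[i][3]
    });
  }
  var files = {};
  for (var k in groups) {
    files[k + ".txt"] = JSON.stringify(groups[k]); // one file per key
  }
  return files;
}
```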

Hope it helps :)