I am trying to figure out how to do aggregation with Spring Batch.
For example, I have a CSV file with a list of names:
name
John
Amy
John
Ryan
And I want the count of each name in a text file:
name, count
Amy, 1
John, 2
Ryan, 1
From what I have learned about Spring Batch, the ETL batch process (ItemReader -> ItemProcessor -> ItemWriter) is more like just the mapping phase in map-reduce lingo. How do I do the reduce (aggregation) phase in Spring Batch?
Is Spring Batch the right tool to use? Or should I use Spark for this? Thanks.
The ItemProcessor is typically used to filter, validate or map data from one type to another, but can also be used for any kind of processing like counting in your case. For your example, the item processor can hold a map of name -> count and count names as they go through the pipeline.
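The idea of an item processor that holds a name -> count map can be sketched roughly as follows. This is only a minimal illustration under some assumptions: to stay runnable without Spring on the classpath it declares a local ItemProcessor interface that mirrors the process method of Spring Batch's ItemProcessor, and the class names NameCountSketch and NameCountProcessor are hypothetical. In a real job you would implement Spring Batch's own interface, and the totals would still need to be written out once the step has seen all items, which is left out here.

```java
import java.util.Map;
import java.util.TreeMap;

public class NameCountSketch {

    // Local stand-in (assumption): mirrors the process method of
    // Spring Batch's ItemProcessor so the sketch compiles on its own.
    interface ItemProcessor<I, O> {
        O process(I item) throws Exception;
    }

    // Hypothetical processor that counts names as they pass through.
    static class NameCountProcessor implements ItemProcessor<String, String> {

        // TreeMap keeps the names sorted, matching the desired output order.
        // Not thread-safe: assumes a single-threaded step.
        private final Map<String, Integer> counts = new TreeMap<>();

        @Override
        public String process(String name) {
            counts.merge(name, 1, Integer::sum); // insert 1 or add 1 to the existing count
            return name;                         // pass the item through unchanged
        }

        Map<String, Integer> getCounts() {
            return counts;
        }
    }

    public static void main(String[] args) {
        NameCountProcessor processor = new NameCountProcessor();
        // Simulates the items a reader would hand to the processor.
        for (String name : new String[] {"John", "Amy", "John", "Ryan"}) {
            processor.process(name);
        }
        System.out.println("name, count");
        processor.getCounts().forEach((name, count) -> System.out.println(name + ", " + count));
    }
}
```

Because the counts live in memory inside the processor, this approach presumably only suits data sets whose distinct keys fit in memory, and the state is lost if the job has to be restarted.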
The chunk-oriented processing model does not map directly to the map-reduce model. However, partitioning is what you are looking for. The StepExecutionSplitter and StepExecutionAggregator are the key concepts for doing map-reduce-like operations, either locally or remotely. More details are in the Partitioning section of the reference documentation.
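To make the split/aggregate shape concrete, here is a plain-Java sketch of the same flow: split the input into partitions, count each partition independently, then merge the partial counts. It is not the Spring Batch partitioning API; the class PartitionedCountSketch and the methods split, countPartition and aggregate are hypothetical names chosen only to show where a splitter and an aggregator would sit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PartitionedCountSketch {

    // "Split": cut the input into at most gridSize contiguous partitions.
    // Assumes gridSize is positive.
    static List<List<String>> split(List<String> items, int gridSize) {
        List<List<String>> partitions = new ArrayList<>();
        int size = (items.size() + gridSize - 1) / gridSize; // ceiling division
        for (int start = 0; start < items.size(); start += size) {
            partitions.add(items.subList(start, Math.min(start + size, items.size())));
        }
        return partitions;
    }

    // "Map": each partition computes its own partial counts independently.
    static Map<String, Integer> countPartition(List<String> partition) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String name : partition) {
            partial.merge(name, 1, Integer::sum);
        }
        return partial;
    }

    // "Reduce": merge the partial counts of all partitions into one result.
    static Map<String, Integer> aggregate(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((name, count) -> total.merge(name, count, Integer::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> names = List.of("John", "Amy", "John", "Ryan");
        List<Map<String, Integer>> partials = new ArrayList<>();
        for (List<String> partition : split(names, 2)) {
            partials.add(countPartition(partition));
        }
        System.out.println("name, count");
        aggregate(partials).forEach((name, count) -> System.out.println(name + ", " + count));
    }
}
```

In an actual partitioned step, the per-partition work would run as separate step executions (possibly on other threads or machines), and the merge would happen where the partition results are aggregated.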
There is a similar question to this one; I'm adding it here for reference: Howto aggregate on full data set in Spring Batch jobs?
Hope this helps.