I have a flat file with the following structure:
key1|"value-001"
key2|"value-002"
key2|"value-003"
key3|"value-004"
key2|"value-005"
key1|"value-006"
key3|"value-007"
I need to map this data file to key-value pairs where the value is the list of all values for a given key, such as:
key1:["value-001","value-006"]
key2:["value-002","value-003","value-005"]
key3:["value-004","value-007"]
I need to do this from Java code. As I understood from the Spark Programming Guide, this operation should be implemented with `sc.flatMapValues(..)`, `sc.flatMap(..)`, or `sc.groupByKey(..)`, but I don't know which one. How do I do this?
I would recommend `reduceByKey` :)

This list imitates your input:
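A minimal Java sketch (the variable name `input` is just illustrative):

```java
import java.util.Arrays;
import java.util.List;

// Sample data mirroring the lines of your flat file
List<String> input = Arrays.asList(
        "key1|\"value-001\"",
        "key2|\"value-002\"",
        "key2|\"value-003\"",
        "key3|\"value-004\"",
        "key2|\"value-005\"",
        "key1|\"value-006\"",
        "key3|\"value-007\"");
```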
Converting to an RDD (you will of course just read in your file with `sc.textFile()`):
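A sketch assuming a `JavaSparkContext` named `sc`:

```java
import org.apache.spark.api.java.JavaRDD;

JavaRDD<String> lines = sc.parallelize(input);
// In practice: JavaRDD<String> lines = sc.textFile("path/to/your/file");
```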
We now have an RDD of strings. The following maps each line to a key-value pair (note that the value is wrapped in a list), and then `reduceByKey` combines all values for each key into one list, yielding the result you want:
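Sketched in Java, again assuming the `lines` RDD from above:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

JavaPairRDD<String, List<String>> result = lines
        // e.g. key1|"value-001"  ->  (key1, ["value-001"])
        .mapToPair(line -> {
            String[] parts = line.split("\\|", 2);
            List<String> value = new ArrayList<>();
            value.add(parts[1]);
            return new Tuple2<>(parts[0], value);
        })
        // Concatenate the per-record lists; partial merges
        // happen map-side before the shuffle
        .reduceByKey((a, b) -> {
            List<String> merged = new ArrayList<>(a);
            merged.addAll(b);
            return merged;
        });

// result.collect() then contains something like
// (key2,["value-002", "value-003", "value-005"]), ...
```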
EDIT: I feel I should mention that you could also use a `groupByKey`. However, you usually want to favor `reduceByKey` over `groupByKey`, because `reduceByKey` does a map-side reduce BEFORE shuffling the data around, whereas `groupByKey` shuffles everything. In your particular case you will probably end up shuffling the same amount of data as with a `groupByKey`, since you want all values to be gathered anyway, but using `reduceByKey` is just a better habit to be in :)
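For comparison, the `groupByKey` version would look something like this sketch (same hypothetical `lines` RDD; the values arrive as an `Iterable` rather than a `List`):

```java
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

JavaPairRDD<String, Iterable<String>> grouped = lines
        // Emit plain (key, value) pairs; no per-record lists needed
        .mapToPair(line -> {
            String[] parts = line.split("\\|", 2);
            return new Tuple2<>(parts[0], parts[1]);
        })
        // Ship every value across the network and gather them per key
        .groupByKey();
```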