I have a CSV file that looks like this:
user acct prdCd city state
user1|acct01|A |Fairfax |VA
user1|acct02|B |Gettysburg|PA
user1|acct03|C |York |PA
user2|acct21|A |Reston |VA
user2|acct42|C |Fairfax |VA
user3|acct66|A |Reston |VA
In Spark, I transform it into key-value pairs like this:
val accts = sc.textFile("accts.csv").map(line => line.split('|'))
val acctFilter = accts.filter(a => a.length > 1).map(a => (a(0) , (a(1), a(2), a(3), a(4))))
acctFilter.groupByKey.collect
res0: Array[(String, Iterable[(String, String, String, String)])] = Array((user3,CompactBuffer((acct66,A,Reston,VA))), (user1,CompactBuffer((acct01,A,Fairfax,VA), (acct02,B,Gettysburg,PA), (acct03,C,York,PA))), (user2,CompactBuffer((acct21,A,Reston,VA), (acct42,C,Fairfax,VA))))
which is correct.
What I want to do now is convert it to something like this:
user1 | {'acct01': {prdCd: 'A', city: 'Fairfax', state: 'VA'}, 'acct02': {prdCd: 'B', city: 'Gettysburg', state: 'PA'}, 'acct03': {prdCd: 'C', city: 'York', state: 'PA'}}
user3 | {'acct66': {prdCd: 'A', city: 'Reston', state: 'VA'}}
user2 | {'acct21': {prdCd: 'A', city: 'Reston', state: 'VA'}, 'acct42': {prdCd: 'C', city: 'Fairfax', state: 'VA'}}
I want to write the result above to a CSV file in that format: the key in one column and the grouped values in a second column, with the two columns delimited by a |. I also need to carry the column names down into each value, but I'm not sure how to add a string to the beginning of an array element, or how to use {} instead of ().
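A minimal sketch of the labeling and brace formatting described above, in plain Scala with no Spark dependency. The names FormatAcct, formatAcct and formatAccts are hypothetical, and the field labels are assumed to be fixed by the header row rather than read from the file. String interpolation prepends each column name, and mkString supplies the outer {} in place of the tuple's ().

```scala
object FormatAcct {
  // (acct, prdCd, city, state), the same value tuple the question builds
  type Acct = (String, String, String, String)

  // Label each field of one account and wrap the fields in braces.
  def formatAcct(a: Acct): String = {
    val (acct, prdCd, city, state) = a
    s"'$acct': {prdCd: '$prdCd', city: '$city', state: '$state'}"
  }

  // Join any number of accounts; mkString adds the surrounding braces.
  def formatAccts(accts: Iterable[Acct]): String =
    accts.map(formatAcct).mkString("{", ", ", "}")

  def main(args: Array[String]): Unit =
    println(formatAccts(Seq(("acct66", "A", "Reston", "VA"))))
}
```

Because formatAccts accepts any Iterable, it would presumably also accept the CompactBuffer values that groupByKey returns.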
EDIT:
I changed acctFilter to this:
val acctFilter = accts.filter(a => a.length > 1).map(a => s"{${a(0)} , {acct: '${a(1)}', prdCd: '${a(2)}', city: '${a(3)}', state: '${a(4)}'}}")
which gives back this:
Array({user1 , {acct: 'acct01', prdCd: 'A', city: 'Fairfax', state: 'VA'}}, {user1 , {acct: 'acct02', prdCd: 'B', city: 'Gettysburg', state: 'PA'}}, {user1 , {acct: 'acct03', prdCd: 'C', city: 'York', state: 'PA'}}, {user2 , {acct: 'acct21', prdCd: 'A', city: 'Reston', state: 'VA'}}, {user2 , {acct: 'acct42', prdCd: 'C', city: 'Fairfax', state: 'VA'}}, {user3 , {acct: 'acct66', prdCd: 'A', city: 'Reston', state: 'VA'}})
Now I can't seem to do a groupByKey on this. It gives me this error:
value groupByKey is not a member of org.apache.spark.rdd.RDD[String]
I think I understand why I get this error: I convert the array to a string, so there is no "key" anymore. So do I do groupByKey first and then convert each group to a string? How do I do that, given that one customer could have 1 account and another could have 3?
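One possible ordering, sketched with plain Scala collections so that it runs without a cluster: keep the (key, value) pairs, group them, and only then build the string for each group. The object name GroupThenFormat and the in-memory sample rows are assumptions for illustration; groupBy on a Seq stands in for groupByKey on the pair RDD. Since each group is an ordinary collection, mkString handles one account or several in the same way.

```scala
object GroupThenFormat {
  // A few rows from the question, held in memory instead of read via sc.textFile.
  val lines = Seq(
    "user1|acct01|A|Fairfax|VA",
    "user1|acct02|B|Gettysburg|PA",
    "user2|acct21|A|Reston|VA"
  )

  // Step 1: keep (key, value) pairs, as the original acctFilter does.
  val pairs: Seq[(String, (String, String, String, String))] =
    lines.map(_.split('|')).filter(_.length > 1)
      .map(a => (a(0), (a(1), a(2), a(3), a(4))))

  // Step 2: group by user (stand-in for groupByKey on a pair RDD).
  val grouped: Map[String, Seq[(String, String, String, String)]] =
    pairs.groupBy(_._1).map { case (user, ps) => (user, ps.map(_._2)) }

  // Step 3: format each group into one output line, whatever its size.
  val rows: Map[String, String] = grouped.map { case (user, accts) =>
    val body = accts.map { case (acct, prd, city, st) =>
      s"'$acct': {prdCd: '$prd', city: '$city', state: '$st'}"
    }.mkString("{", ", ", "}")
    user -> s"$user | $body"
  }

  def main(args: Array[String]): Unit = rows.values.foreach(println)
}
```

On the RDD itself, the equivalent would presumably be acctFilter.groupByKey followed by a map over each (user, accts) pair that builds the same string, and then saveAsTextFile, which writes a directory of part files rather than a single CSV file.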