Formatting a Spark array: adding a string to an element

Posted 2019-08-19 01:00

Question:

I have a csv file that looks like this:

user   acct  prdCd  city      state
user1|acct01|A    |Fairfax   |VA
user1|acct02|B    |Gettysburg|PA
user1|acct03|C    |York      |PA
user2|acct21|A    |Reston    |VA
user2|acct42|C    |Fairfax   |VA
user3|acct66|A    |Reston    |VA

In Spark I transform it into key-value pairs like this:

val accts = sc.textFile("accts.csv").map(line => line.split('|'))
val acctFilter = accts.filter(a => a.length > 1).map(a => (a(0) , (a(1), a(2), a(3), a(4)))) 
acctFilter.groupByKey.collect

res0: Array[(String, Iterable[(String, String, String, String)])] = Array((user3,CompactBuffer((acct66,A,Reston,VA))), (user1,CompactBuffer((acct01,A,Fairfax,VA), (acct02,B,Gettysburg,PA), (acct03,C,York,PA))), (user2,CompactBuffer((acct21,A,Reston,VA), (acct42,C,Fairfax,VA))))

which is correct.

What I want to do now, is convert it to something like this:

user1 | {'acct01': {prdCd: 'A', city: 'Fairfax', state: 'VA'}, 'acct02': {prdCd: 'B', city: 'Gettysburg', state: 'PA'}, 'acct03': {prdCd: 'C', city: 'York', state: 'PA'}}
user3 | {'acct66': {prdCd: 'A', city: 'Reston', state: 'VA'}}
user2 | {'acct21': {prdCd: 'A', city: 'Reston', state: 'VA'}, 'acct42': {prdCd: 'C', city: 'Fairfax', state: 'VA'}}

I want to write the above result to a CSV file formatted like that: the key in one column and the value pairs in the second, delimited by a |. I also need to bring the column names down into the values, but I'm not sure how to add a string to the beginning of an array element, or how to use {} instead of ().
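For the "add a string to the beginning of an array element" part, I'm guessing plain string interpolation over the split fields is the idea (a rough sketch, where `a` stands for one already-split line and the prdCd/city/state labels are just the header names I want to carry down):

```scala
// a is one already-split line, e.g.:
val a = Array("user1", "acct01", "A", "Fairfax", "VA")

// Prefix each value with its column name and wrap the account in {}:
val entry = s"'${a(1)}': {prdCd: '${a(2)}', city: '${a(3)}', state: '${a(4)}'}"
// entry: 'acct01': {prdCd: 'A', city: 'Fairfax', state: 'VA'}
```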

EDIT:

I changed acctFilter to this:

val acctFilter = accts.filter(a => a.length > 1).map(a => s"{${a(0)} , {acct: '${a(1)}', prdCd: '${a(2)}', city: '${a(3)}', state: '${a(4)}'}}") 

which gives back this:

Array({user1 , {acct: 'acct01', prdCd: 'A', city: 'Fairfax', state: 'VA'}}, {user1 , {acct: 'acct02', prdCd: 'B', city: 'Gettysburg', state: 'PA'}}, {user1 , {acct: 'acct03', prdCd: 'C', city: 'York', state: 'PA'}}, {user2 , {acct: 'acct21', prdCd: 'A', city: 'Reston', state: 'VA'}}, {user2 , {acct: 'acct42', prdCd: 'C', city: 'Fairfax', state: 'VA'}}, {user3 , {acct: 'acct66', prdCd: 'A', city: 'Reston', state: 'VA'}})

Now I can't seem to do a groupByKey on this; it gives me back this error:

 value groupByKey is not a member of org.apache.spark.rdd.RDD[String]

I think I understand why I get this error: because I convert the array to a string, there is no "key" anymore. So should I do groupByKey first and then convert to a string? How do I do that, given that one customer could have 1 account and another could have 3 accounts?
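Is this roughly the right direction? A sketch of what I'm imagining: keep the (user, tuple) pairs so groupByKey still works, and only build the string per group afterwards, with mkString handling however many accounts each user has. (The output path is just a placeholder.)

```scala
// Keep the pair structure so groupByKey is available, then format per user.
val acctFilter = accts
  .filter(a => a.length > 1)
  .map(a => (a(0), (a(1), a(2), a(3), a(4))))

val formatted = acctFilter.groupByKey.map { case (user, userAccts) =>
  // Turn each (acct, prdCd, city, state) tuple into an "'acct': {...}" entry,
  // then join however many entries the user has with ", ".
  val entries = userAccts.map { case (acct, prdCd, city, state) =>
    s"'$acct': {prdCd: '$prdCd', city: '$city', state: '$state'}"
  }
  s"$user | {${entries.mkString(", ")}}"
}

formatted.saveAsTextFile("accts_formatted")  // one "user | {...}" line per user
```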