I am trying to convert a DataFrame to an RDD and then perform the operation below to return tuples:
df.rdd.map { t=>
(t._2 + "_" + t._3 , t)
}.take(5)
Then I got the error below. Does anyone have any ideas? Thanks!
<console>:37: error: value _2 is not a member of org.apache.spark.sql.Row
(t._2 + "_" + t._3 , t)
^
When you convert a DataFrame to an RDD, you get an RDD[Row], so when you use map, your function receives a Row as its parameter. Therefore, you must use the Row methods to access its members (note that the index starts from 0).

You can view more examples and check all methods available for Row objects in the Spark scaladoc.

Edit: I don't know the reason why you are doing this operation, but for concatenating String columns of a DataFrame you may consider the following option:
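A minimal sketch of one such option, staying in the DataFrame API instead of dropping to the RDD. It assumes the built-in concat_ws function from org.apache.spark.sql.functions; the column names (id, first, second), the new column name key, and the local SparkSession are hypothetical and only there to make the example self-contained.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, concat_ws}

object ConcatColumnsSketch {
  // Adds a "key" column made of two String columns joined by "_".
  // The column names "first" and "second" are assumptions for illustration.
  def addKey(df: DataFrame): DataFrame =
    df.withColumn("key", concat_ws("_", col("first"), col("second")))

  def main(args: Array[String]): Unit = {
    // Local session used only so the sketch can run on its own
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("concat-columns-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data standing in for the asker's DataFrame
    val df = Seq((1, "a", "b"), (2, "c", "d")).toDF("id", "first", "second")

    // The original columns are kept; "key" is appended
    addKey(df).show()

    spark.stop()
  }
}
```

If the (key, row) pairs are still needed as an RDD, calling .rdd on the result gives rows that already carry the concatenated value in the last position.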
You can access every element of a Row as if it were a List or Array, that is, using (index); however, you can also use the get method. For example:
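A minimal sketch of the access styles mentioned above, applied to the expression from the question. Since Row positions are 0-based, the question's t._2 and t._3 become positions 1 and 2. The DataFrame contents, column names, and helper names (keyByApply, keyByGet, keyByGetString) are hypothetical; apply, get, and getString are methods of org.apache.spark.sql.Row.

```scala
import org.apache.spark.sql.{Row, SparkSession}

object RowAccessSketch {
  // t(i) is Row.apply and returns Any, so the values are interpolated into a String
  def keyByApply(t: Row): (String, Row) = (s"${t(1)}_${t(2)}", t)

  // t.get(i) also returns Any and behaves like t(i)
  def keyByGet(t: Row): (String, Row) = (s"${t.get(1)}_${t.get(2)}", t)

  // Typed getter: assumes the columns at positions 1 and 2 really hold Strings
  def keyByGetString(t: Row): (String, Row) =
    (t.getString(1) + "_" + t.getString(2), t)

  def main(args: Array[String]): Unit = {
    // Local session used only so the sketch can run on its own
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("row-access-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data standing in for the asker's DataFrame
    val df = Seq((1, "a", "b"), (2, "c", "d")).toDF("id", "first", "second")

    // df.rdd is an RDD[Row], so the function passed to map receives a Row
    df.rdd.map(keyByApply).take(5).foreach(println)

    spark.stop()
  }
}
```

The typed getters such as getString are usually the more convenient choice when the column type is known, because the untyped forms hand back Any and need a cast or string conversion before further use.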