SparkSession.createDataset() only accepts a List, RDD, or Seq; it doesn't support JavaPairRDD.
So if I have a JavaPairRDD<String, User> that I want to create a Dataset from, would a viable workaround for the SparkSession.createDataset() limitation be to create a wrapper class UserMap that contains two fields, a String and a User, and then call spark.createDataset(userMap, Encoders.bean(UserMap.class))?
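If you can convert the JavaPairRDD to a List<Tuple2<K, V>>, you can use the createDataset overload that takes a List. Note that collect() pulls every element to the driver, so this only suits data that fits in driver memory. See the sample code below.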
JavaPairRDD<String, User> pairRDD = ...;
Dataset<Row> df = spark.createDataset(pairRDD.collect(), Encoders.tuple(Encoders.STRING(),Encoders.bean(User.class))).toDF("key","value");
Alternatively, you can convert it to an RDD, which keeps the data distributed:
Dataset<Row> df = spark.createDataset(JavaPairRDD.toRDD(pairRDD), Encoders.tuple(Encoders.STRING(),Encoders.bean(User.class))).toDF("key","value");
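For comparison, the wrapper-class idea from the question could look roughly like the sketch below. It is only an illustration: the question does not show the fields of User, so the User class here is a hypothetical stand-in with a single name property, and the property names key and value in UserMap are likewise assumptions. Both classes follow the JavaBean conventions (public no-arg constructor, getters and setters) that Encoders.bean relies on.

```java
import java.io.Serializable;

// Hypothetical stand-in for the User class from the question;
// its real fields are not shown there, so "name" is an assumption.
class User implements Serializable {
    private String name;

    public User() {}  // public no-arg constructor for bean-style encoding

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
}

// Wrapper bean pairing the String key with its User value.
// The property names "key" and "value" are illustrative choices.
class UserMap implements Serializable {
    private String key;
    private User value;

    public UserMap() {}  // public no-arg constructor for bean-style encoding

    public UserMap(String key, User value) {
        this.key = key;
        this.value = value;
    }

    public String getKey() { return key; }
    public void setKey(String key) { this.key = key; }

    public User getValue() { return value; }
    public void setValue(User value) { this.value = value; }
}
```

With such a class, each pair could be mapped to a wrapper, for example pairRDD.map(t -> new UserMap(t._1(), t._2())), and the resulting JavaRDD<UserMap> passed to spark.createDataset(mapped.rdd(), Encoders.bean(UserMap.class)). The tuple encoder shown above avoids the extra class, so the wrapper is mainly worthwhile when named columns on a typed Dataset<UserMap> are wanted.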