I have the following code in my Spark driver. When I execute the program, it works correctly and saves the required data as a Parquet file.
String indexFile = "index.txt";
JavaRDD<String> indexData = sc.textFile(indexFile).cache();
JavaRDD<String> jsonStringRDD = indexData.map(new Function<String, String>() {
    @Override
    public String call(String patientId) throws Exception {
        return "json array as string";
    }
});
//1. Read the JSON string array into a DataFrame (execution 1)
DataFrame dataSchemaDF = sqlContext.read().json(jsonStringRDD);
//2. Save the DataFrame as a Parquet file (execution 2)
dataSchemaDF.write().parquet("md.parquet");
However, I observed that my mapper function on the RDD indexData is getting executed twice: first, when I read jsonStringRDD as a DataFrame using SQLContext, and second, when I write dataSchemaDF to the Parquet file.
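One workaround I am considering is caching jsonStringRDD itself, so that the first job (which I assume is the schema-inference pass) materializes the mapped data and the Parquet write reuses the cached partitions instead of re-running the map function. A minimal sketch, assuming that is really what triggers the second execution:

// Cache the mapped RDD: the schema-inference job computes and caches it,
// and the Parquet write then reads the cached partitions instead of
// invoking the map function a second time.
JavaRDD<String> jsonStringRDD = indexData.map(new Function<String, String>() {
    @Override
    public String call(String patientId) throws Exception {
        return "json array as string";
    }
}).cache();

DataFrame dataSchemaDF = sqlContext.read().json(jsonStringRDD);
dataSchemaDF.write().parquet("md.parquet");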
Can you guide me on how to avoid this repeated execution? Is there a better way of converting a JSON string into a DataFrame?
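For example, one option I am looking at is supplying an explicit schema to the reader, so that read().json() does not need its own pass over the RDD to infer the schema. A rough sketch of what I mean (the field names below are only placeholders, not my actual JSON structure):

import java.util.Arrays;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Placeholder schema; the real field names and types would match my JSON.
StructType schema = DataTypes.createStructType(Arrays.asList(
        DataTypes.createStructField("patientId", DataTypes.StringType, true),
        DataTypes.createStructField("value", DataTypes.StringType, true)));

// With the schema supplied up front, read().json() should not trigger a
// separate schema-inference job, so the map would only run for the write.
DataFrame dataSchemaDF = sqlContext.read().schema(schema).json(jsonStringRDD);
dataSchemaDF.write().parquet("md.parquet");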