I'm wondering what the runtime of Spark is when sampling an RDD/DF compared with the runtime on the full RDD/DF. I don't know whether it makes a difference, but I'm currently using Java + Spark 1.5.1 + Hadoop 2.6.
JavaRDD<Row> rdd = sc.textFile(HdfsDirectoryPath()).map(new Function<String, Row>() {
    @Override
    public Row call(String line) throws Exception {
        String[] fields = line.split(usedSeparator);
        // Assume that the schema has 4 integer columns, so convert the string fields accordingly
        Object[] values = new Object[fields.length];
        for (int i = 0; i < fields.length; i++) {
            values[i] = Integer.valueOf(fields[i]);
        }
        return new GenericRowWithSchema(values, schema);
    }
});
DataFrame df = sqlContext.createDataFrame(rdd, schema);
df.registerTempTable("df");
DataFrame selectdf = sqlContext.sql("Select * from df");
Row[] res = selectdf.collect();
DataFrame sampleddf = sqlContext.createDataFrame(rdd, schema).sample(false, 0.1); // without replacement, ~10% of the original dataset
sampleddf.registerTempTable("sampledf");
DataFrame selecteSampledf = sqlContext.sql("Select * from sampledf");
res = selecteSampledf.collect();
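To be concrete, the comparison I'm making is just the wall-clock time of the two collect() calls above, measured roughly like the sketch below (the timing code here is only illustrative, not my exact benchmark):

long startFull = System.nanoTime();
Row[] fullRows = selectdf.collect();
long fullMillis = (System.nanoTime() - startFull) / 1000000;

long startSampled = System.nanoTime();
Row[] sampledRows = selecteSampledf.collect();
long sampledMillis = (System.nanoTime() - startSampled) / 1000000;

System.out.println("full: " + fullMillis + " ms, sampled: " + sampledMillis + " ms");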
I would expect the sampled select to be, in the best case, close to ~90% faster. But it looks like Spark goes through the whole DF (or does a count), which takes nearly the same time as the select on the full DF, and only after the sample is generated does it execute the select.
Am I correct with these assumptions, or am I using the sampling in a wrong way that causes both selects to end up with roughly the same runtime?
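In case it helps narrow things down: a variant I could also run is to cache the base DataFrame first, so the HDFS read and the line parsing happen only once, and then time both selects against the cached data. This is just a sketch of what I mean (the table and variable names are illustrative), not something I have benchmarked yet:

DataFrame base = sqlContext.createDataFrame(rdd, schema).cache();
base.count(); // force the cached DataFrame to be materialized once

base.registerTempTable("df_cached");
Row[] fullCached = sqlContext.sql("Select * from df_cached").collect();

DataFrame sampledCached = base.sample(false, 0.1); // ~10% sample drawn from the cached data
sampledCached.registerTempTable("sampledf_cached");
Row[] sampleCached = sqlContext.sql("Select * from sampledf_cached").collect();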