Spark Sampling - How much faster is it than using the full RDD/DF?

Published 2019-09-07 03:37

Question:

I'm wondering how Spark's runtime when sampling an RDD/DF compares with the runtime over the full RDD/DF. I don't know if it makes a difference, but I'm currently using Java + Spark 1.5.1 + Hadoop 2.6.

JavaRDD<Row> rdd = sc.textFile(HdfsDirectoryPath()).map(new Function<String, Row>() {
    @Override
    public Row call(String line) throws Exception {
        String[] fields = line.split(usedSeparator);
        // Assume the schema has 4 integer columns (note: the fields are still Strings here)
        GenericRowWithSchema row = new GenericRowWithSchema(fields, schema);
        return row;
    }
});

DataFrame df   = sqlContext.createDataFrame(rdd, schema);
df.registerTempTable("df");
DataFrame selectdf   =  sqlContext.sql("Select * from df");
Row[] res = selectdf.collect();

DataFrame sampleddf  = sqlContext.createDataFrame(rdd, schema).sample(false, 0.1);// expected ~10% of the original dataset
sampleddf.registerTempTable("sampledf");
DataFrame selecteSampledf = sqlContext.sql("Select * from sampledf");
res = selecteSampledf.collect();

I would expect the sampled run to be, in the best case, close to ~90% faster. But to me it looks like Spark goes through the whole DF, or does a count, which takes nearly the same time as the select over the full DF; only after the sample is generated does it execute the select.

Is this assumption correct, or am I using sampling in a wrong way that causes both selects to end up with the same runtime?

Answer 1:

I would expect that the sampling is optimally close to ~90% faster.

Well, there are a few reasons why these expectations are unrealistic:

  • without any prior assumptions about the data distribution, obtaining a uniform sample requires a full dataset scan. This is more or less what happens when you use the sample or takeSample methods in Spark
  • SELECT * is a relatively lightweight operation. Depending on the amount of resources you have, the time to process a single partition can be negligible
  • sampling doesn't reduce the number of partitions. If you don't coalesce or repartition, you can end up with a large number of almost-empty partitions, which means suboptimal resource usage
  • while RNGs are usually quite efficient, generating random numbers is not free
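The first and last points can be illustrated with a plain-Java sketch of Bernoulli sampling, which is the strategy behind sample(false, fraction). The class and method here are illustrative, not Spark's internals: the point is that every element is still visited and an RNG is invoked per element, so the scan cost stays linear in the full dataset size, while only the output shrinks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class BernoulliSampleSketch {

    // Keep each element independently with probability `fraction`.
    // The loop touches EVERY element: sampling does not skip work,
    // it only reduces the size of the output.
    static List<Integer> sample(List<Integer> data, double fraction, long seed) {
        Random rng = new Random(seed);
        List<Integer> out = new ArrayList<>();
        for (Integer element : data) {          // full scan, as over a whole partition
            if (rng.nextDouble() < fraction) {  // one RNG call per element - not free
                out.add(element);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 100000; i++) {
            data.add(i);
        }
        List<Integer> sampled = sample(data, 0.1, 42L);
        // The expected size is ~10% of the input, but the scan itself was still 100%.
        System.out.println(sampled.size());
    }
}
```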

There are at least two important benefits of sampling:

  • lower memory usage including less work for the garbage collector
  • less data to serialize / deserialize and transfer in case of shuffling or collecting

If you want to get the most out of sampling, it makes sense to sample, coalesce, and cache.
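A sketch of that pattern in the Spark 1.x Java API used in the question (this assumes a live sqlContext, rdd, and schema as above; the fraction and the partition count of 8 are illustrative, so pick a count that matches the sampled data volume):

```java
// Sample once, shrink the partition count to match the smaller data volume,
// and cache the result so repeated queries don't re-scan and re-sample the source.
DataFrame sampled = sqlContext.createDataFrame(rdd, schema)
        .sample(false, 0.1)   // expected ~10% of the rows, but still a full scan
        .coalesce(8)          // avoid many almost-empty partitions (8 is illustrative)
        .cache();             // pay the scan cost once

sampled.registerTempTable("sampledf");
DataFrame selectSampled = sqlContext.sql("Select * from sampledf");
Row[] res = selectSampled.collect(); // subsequent actions reuse the cached sample
```

The first action still scans the full dataset (that is unavoidable, per the points above), but every query after it runs only against the small cached sample.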