i am currently evaluating Spark 2.1.0 on a small cluster (3 Nodes with 32 CPUs and 128 GB Ram) with a benchmark in linear regression (Spark ML). I only measured the time for the parameter calculation (not including start, data loading, …) and recognized the following behavior. For small datatsets 0.1 Mio – 3 Mio datapoints the measured time is not really increasing and stays at about 40 seconds. Only with larger datasets like 300 Mio datapoints the processing time went up to 200 seconds. So it seems, the cluster does not scale at all to small datasets.
I also compared the small dataset on my local pc with the cluster using only 10 worker and 16GB ram. The processing time of the cluster is larger by a factor of 3. So is this considered normal behavior of SPARK and explainable by communication overhead or am I doing something wrong (or is linear regression not really representative)?
The cluster is a standalone cluster (without Yarn or Mesos) and the benchmarks where submitted with 90 worker, each with 1 core and 4 GB ram.
Spark submit:
./spark-submit --master spark://server:7077 --class Benchmark --deploy-mode client --total-executor-cores 90 --executor-memory 4g --num-executors 90 .../Benchmark.jar pathToData
The optimum cluster size and configuration varies based on the data and the nature of the job. In this case, I think that your intuition is correct, the job seems to take disproportionately longer to complete on smaller dataset, because of the excess overhead given the size of the cluster (cores and executors).
Notice that increasing the amount of data by two orders of magnitude increases the processing time only 5-fold. You are increasing the data toward an optimum size for your cluster setup.
Spark is a great tool for processing lots of data, but it isn't going to be competitive with running a single process on a single machine if the data will fit. However it can be much faster than other distributed processing tools that are disk-based, where the data does not fit on a single machine.
I was at a talk a couple years back and the speaker gave an analogy that Spark is like a locomotive racing a bicycle:- the bike will win if the load is light, it is quicker to accelerate and more agile, but with a heavy load the locomotive might take a while to get up to speed, but it's going to be faster in the end. (I'm afraid I forget the speakers name, but it was at a Cassandra meetup in London, and the speaker was from a company in the energy sector).
I agree with @ImDarrenG's assessment and generally also the locomotive/bicycle analogy.
With such a small amount of data, I would strongly recommend
A) caching the entire dataset and
B) broadcasting the dataset to each node (especially if you need to do something like your 300M row table join to the small datasets)
Another thing to consider is the # of files (if you're not already cached), because if you're reading in a single unsplittable file, only 1 core will be able to read that file in. However once you cache the dataset (coalescing or repartitioning as appropriate), performance will no longer be bound by disk/serializing the rows.