Spark working faster in Standalone rather than YARN

Posted 2020-03-30 05:21

Question:

I am looking for some insight into Spark execution on standalone versus YARN. We have a 4-node Cloudera cluster, and currently the performance of our application in YARN mode is less than half of what we get in standalone mode. Does anyone have an idea which factors might be contributing to this?

Answer 1:

Basically, your data and cluster are too small.

Big Data technologies are really meant to handle data that cannot fit on a single system. Given that your cluster has only 4 nodes, it might be fine for proof-of-concept (POC) work, but you should not consider it acceptable for benchmarking your application.

To give you a frame of reference, Hortonworks's article "BENCHMARK: SUB-SECOND ANALYTICS WITH APACHE HIVE AND DRUID" uses a cluster of:

10 nodes

2x Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz with 16 CPU threads each

256 GB RAM per node

6x WDC WD4000FYYZ-0 1K02 4TB SCSI disks per node

This works out to 320 CPU threads, 2,560 GB of RAM, and 240 TB of disk in total.
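
The totals above are simply the per-node figures multiplied by the node count (the 320 figure counts hardware threads: 10 nodes x 2 CPUs x 16 threads). A minimal Python sketch of that arithmetic follows. The names `NodeSpec` and `cluster_totals` are hypothetical and chosen only for illustration; the numbers are the ones listed for the Hortonworks cluster.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """Per-node hardware figures (hypothetical helper type for this example)."""
    cpus: int             # physical CPU sockets per node
    threads_per_cpu: int  # hardware threads per CPU
    ram_gb: int           # RAM per node, in GB
    disks: int            # data disks per node
    disk_tb: int          # capacity of each disk, in TB


def cluster_totals(nodes: int, spec: NodeSpec) -> dict:
    """Multiply per-node resources by the node count."""
    return {
        "cpu_threads": nodes * spec.cpus * spec.threads_per_cpu,
        "ram_gb": nodes * spec.ram_gb,
        "disk_tb": nodes * spec.disks * spec.disk_tb,
    }


# Figures taken from the cluster description above.
hortonworks_node = NodeSpec(cpus=2, threads_per_cpu=16, ram_gb=256, disks=6, disk_tb=4)
totals = cluster_totals(10, hortonworks_node)
print(totals)
```

Plugging in your own 4-node cluster's per-node figures gives a quick sense of how far it sits from the hardware these published benchmarks were run on.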

Another benchmark, from Cloudera's article "New SQL Benchmarks: Apache Impala (incubating) Uniquely Delivers Analytic Database Performance", uses a 21-node cluster with each node at: