I'd like some insight into Spark execution on standalone vs. YARN. We have a 4-node Cloudera cluster, and right now our application runs at less than half the performance in YARN mode compared to standalone mode. Does anyone have an idea of which factors might be contributing to this?
Basically, your data and cluster are too small.
Big Data technologies are really meant to handle data that cannot fit on a single system. Given that your cluster has only 4 nodes, it may be fine for POC work, but you should not consider it acceptable for benchmarking your application.
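That said, one concrete factor worth ruling out before blaming the mode itself: in standalone mode, Spark's executors grab all available worker cores by default, while on YARN resources must be requested explicitly, and the defaults are very small. A sketch of what an explicit request might look like (the executor counts, core/memory sizes, and jar name below are placeholders, not measured recommendations for your hardware):

```shell
# Hypothetical spark-submit for a small 4-node cluster; tune the numbers
# to your actual machine specs. Without these flags, YARN mode falls back
# to tiny default executor allocations, whereas standalone mode would have
# used every core on every worker automatically.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-cores 4 \
  --executor-memory 8g \
  your_app.jar
```

Comparing the executor count and per-executor cores/memory shown in the Spark UI for both modes will tell you quickly whether the two runs were even given the same resources.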
To give you a frame of reference, Hortonworks's article BENCHMARK: SUB-SECOND ANALYTICS WITH APACHE HIVE AND DRUID uses a cluster totaling 320 CPU cores, 2560 GB of RAM, and 240 TB of disk.
Another benchmark, from Cloudera's article New SQL Benchmarks: Apache Impala (incubating) Uniquely Delivers Analytic Database Performance, uses a 21-node cluster totaling 504 CPU cores, 8064 GB of RAM, and 231 TB of disk.
This should give you an idea of the scale at which a system produces meaningful benchmark results.