My Spark program takes a large number of zip files containing JSON data from S3. It performs some cleaning on the data in the form of Spark transforms, and then saves the result as Parquet files. When I run the program on 1 GB of data with a 10-node, 8 GB-per-node configuration on AWS, it takes about 11 minutes. I changed to a 20-node, 32 GB-per-node configuration, and it still takes about 10 minutes, only around 1 minute less. Why does it behave this way?
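For reference, here is a minimal PySpark sketch of the kind of pipeline described above; the bucket paths, the unzipping helper, and the "cleaning" transforms are illustrative placeholders, not the asker's actual code.

```python
import io
import zipfile

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("zip-json-to-parquet").getOrCreate()
sc = spark.sparkContext


def extract_json_lines(path_and_bytes):
    """Unzip one archive in memory and yield each JSON line as a string."""
    _, raw = path_and_bytes
    with zipfile.ZipFile(io.BytesIO(raw)) as zf:
        for name in zf.namelist():
            for line in zf.read(name).decode("utf-8").splitlines():
                if line.strip():
                    yield line


# binaryFiles returns each archive as a single (path, bytes) pair.
json_lines = sc.binaryFiles("s3a://my-bucket/input/*.zip").flatMap(extract_json_lines)

# Build a DataFrame from the JSON strings, apply placeholder cleaning
# transforms, and write the result out as Parquet.
df = spark.read.json(json_lines)
cleaned = df.dropDuplicates().na.drop()
cleaned.write.mode("overwrite").parquet("s3a://my-bucket/output/")
```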
Adding more machines isn't always the solution. More machines means more data transfer over the network, and in many cases that network transfer becomes the bottleneck.

Also, 1 GB of data isn't large enough for meaningful scalability or performance benchmarking.
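As a rough way to see this in practice, you can check how many partitions your input actually produces; if it is only a handful, most of the 20 nodes sit idle and extra hardware can't help. This continues the sketch above, and the partition count of 100 is an arbitrary illustration, not a tuned value.

```python
# 'df' is the DataFrame from the sketch above.
print("Input partitions:", df.rdd.getNumPartitions())

# Optionally spread the work across more tasks before the expensive
# transforms; note that repartitioning itself triggers a network shuffle.
cleaned = df.repartition(100).dropDuplicates().na.drop()
```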