My Spark program takes a large number of zip files containing JSON data from S3. It performs some cleaning on the data in the form of Spark transforms, and then saves the result as Parquet files. When I run the program on 1 GB of data with a 10-node, 8 GB-per-node configuration on AWS, it takes about 11 minutes. I changed to a 20-node, 32 GB-per-node configuration, and it still takes about 10 minutes, only around 1 minute less. Why does it behave this way?
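For reference, here is a minimal PySpark sketch of the kind of pipeline described above; the bucket paths, the unzipping helper, and the "cleaning" transforms are illustrative placeholders, not the asker's actual code.

```python
import io
import zipfile

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("zip-json-to-parquet").getOrCreate()
sc = spark.sparkContext


def extract_json_lines(path_and_bytes):
    """Unzip one archive in memory and yield each JSON line as a string."""
    _, raw = path_and_bytes
    with zipfile.ZipFile(io.BytesIO(raw)) as zf:
        for name in zf.namelist():
            for line in zf.read(name).decode("utf-8").splitlines():
                if line.strip():
                    yield line


# binaryFiles returns each archive as a single (path, bytes) pair.
json_lines = sc.binaryFiles("s3a://my-bucket/input/*.zip").flatMap(extract_json_lines)

# Build a DataFrame from the JSON strings, apply placeholder cleaning
# transforms, and write the result out as Parquet.
df = spark.read.json(json_lines)
cleaned = df.dropDuplicates().na.drop()
cleaned.write.mode("overwrite").parquet("s3a://my-bucket/output/")
```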
Adding more machines isn't always the solution. More machines means more data transfer over the network, and in many cases that network transfer becomes the bottleneck.

Also, 1 GB of data isn't large enough for meaningful scalability or performance benchmarking.
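As a rough way to see this in practice, you can check how many partitions your input actually produces; if it is only a handful, most of the 20 nodes sit idle and extra hardware can't help. This continues the sketch above, and the partition count of 100 is an arbitrary illustration, not a tuned value.

```python
# 'df' is the DataFrame from the sketch above.
print("Input partitions:", df.rdd.getNumPartitions())

# Optionally spread the work across more tasks before the expensive
# transforms; note that repartitioning itself triggers a network shuffle.
cleaned = df.repartition(100).dropDuplicates().na.drop()
```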