When I run some of the Apache Spark examples in the Spark shell or as a job, I am not able to achieve full core utilization on a single machine. For example:
var textColumn = sc.textFile("/home/someuser/largefile.txt").cache()
var distinctWordCount = textColumn.flatMap(line => line.split('\0'))
                                  .map(word => (word, 1))
                                  .reduceByKey(_ + _)
                                  .count()
When running this script, I mostly see only 1 or 2 active cores on my 8-core machine. Isn't Spark supposed to parallelise this?
When you run a local Spark shell, you still have to specify the number of cores that your Spark tasks will use. If you want to use 8 cores, make sure you set the master URL to local[8] before starting your shell, for example by launching it as spark-shell --master local[8].
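If you run the same computation as a standalone job rather than in the shell, a minimal sketch of the equivalent setting looks like this (the app name is just a placeholder, not anything from the question):

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder app name; the relevant part is setMaster("local[8]").
val conf = new SparkConf()
  .setAppName("WordCountExample")
  .setMaster("local[8]")   // run locally with 8 worker threads
val sc = new SparkContext(conf)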
Also, as @zsxwing says, you may need to ensure that your data is split into enough partitions to keep all of the cores busy, or explicitly specify the level of parallelism you want.
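As an illustration of both knobs, here is a sketch reusing the file path from the question; the 8 is an assumption chosen to match the 8-core machine. The second argument to textFile is a hint for the minimum number of partitions, and repartition reshuffles an existing RDD:

// Ask for at least 8 partitions when reading, so up to 8 tasks can run in parallel.
val lines = sc.textFile("/home/someuser/largefile.txt", 8).cache()

// Alternatively, reshuffle an existing RDD into 8 partitions.
val repartitioned = lines.repartition(8)

// The number of partitions bounds the number of concurrent tasks per stage.
println(lines.partitions.length)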
You can use local[*] to run Spark locally with as many worker threads as your machine has logical cores.
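For example, assuming the shell was started with spark-shell --master local[*], you can check what Spark actually picked up; in local mode defaultParallelism usually matches the number of worker threads:

// In a shell started with: spark-shell --master local[*]
println(sc.master)              // should print something like "local[*]"
println(sc.defaultParallelism)  // typically the number of logical cores in local mode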