Why is Spark not using all cores on local machine

Posted 2019-04-28 13:34

When I run some of the Apache Spark examples in the Spark-Shell or as a job, I am not able to achieve full core utilization on a single machine. For example:

val textColumn = sc.textFile("/home/someuser/largefile.txt").cache()
val distinctWordCount = textColumn.flatMap(line => line.split('\0'))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .count()

When running this, I mostly see only 1 or 2 active cores on my 8-core machine. Isn't Spark supposed to parallelise this?

2 Answers

狗以群分 · 2019-04-28 13:59

When you run a local Spark shell, you still have to specify the number of cores your Spark tasks will use. If you want to use 8 cores, make sure you run

export MASTER=local[8]

before starting your shell.
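If you would rather set this from code than through the environment, here is a minimal sketch (the application name is arbitrary):

import org.apache.spark.{SparkConf, SparkContext}

// Request 8 local worker threads instead of relying on the MASTER env var
val conf = new SparkConf()
  .setAppName("WordCount")
  .setMaster("local[8]")
val sc = new SparkContext(conf)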

Also, as @zsxwing says, you may need to ensure that your data is split into enough partitions to keep all of the cores busy, or explicitly specify the level of parallelism you want, as sketched below.
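For example, textFile accepts an optional minimum partition count and reduceByKey accepts an optional number of partitions; a sketch based on the code from the question:

// Ask for at least 8 partitions when reading, so every core gets a split to work on
val textColumn = sc.textFile("/home/someuser/largefile.txt", 8).cache()

// The second argument to reduceByKey controls parallelism at the shuffle stage
val distinctWordCount = textColumn.flatMap(line => line.split('\0'))
  .map(word => (word, 1))
  .reduceByKey(_ + _, 8)
  .count()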

小情绪 Triste * · 2019-04-28 14:15

You can use local[*] to run Spark locally with as many worker threads as your machine has logical cores.
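A quick way to verify how many worker threads Spark actually got, using the sc that the shell already provides:

// In a shell started with local[*], this should print the number of logical cores
println(sc.defaultParallelism)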
