When I run some of the Apache Spark examples in the Spark shell or as a job, I am not able to achieve full core utilization on a single machine. For example:
var textColumn = sc.textFile("/home/someuser/largefile.txt").cache()
var distinctWordCount = textColumn.flatMap(line => line.split('\0'))
                                  .map(word => (word, 1))
                                  .reduceByKey(_ + _)
                                  .count()
When running this script, I mostly see only 1 or 2 active cores on my 8-core machine. Isn't Spark supposed to parallelise this?
When you run a local Spark shell, you still have to specify the number of cores that your Spark tasks will use. If you want to use 8 cores, make sure you set the master URL to local[8] before starting your shell, for example by launching it as spark-shell --master local[8].
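If you run the same computation as a standalone job rather than in the shell, a minimal sketch of the equivalent setting looks like this (the app name is just a placeholder, not anything from the question):

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder app name; the relevant part is setMaster("local[8]").
val conf = new SparkConf()
  .setAppName("WordCountExample")
  .setMaster("local[8]")   // run locally with 8 worker threads
val sc = new SparkContext(conf)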
Also, as @zsxwing says, you may need to ensure that your data is split into enough partitions to keep all of the cores busy, or explicitly specify the level of parallelism you want.
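As an illustration of both knobs, here is a sketch reusing the file path from the question; the 8 is an assumption chosen to match the 8-core machine. The second argument to textFile is a hint for the minimum number of partitions, and repartition reshuffles an existing RDD:

// Ask for at least 8 partitions when reading, so up to 8 tasks can run in parallel.
val lines = sc.textFile("/home/someuser/largefile.txt", 8).cache()

// Alternatively, reshuffle an existing RDD into 8 partitions.
val repartitioned = lines.repartition(8)

// The number of partitions bounds the number of concurrent tasks per stage.
println(lines.partitions.length)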
You can use local[*] to run Spark locally with as many worker threads as your machine has logical cores.
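For example, assuming the shell was started with spark-shell --master local[*], you can check what Spark actually picked up; in local mode defaultParallelism usually matches the number of worker threads:

// In a shell started with: spark-shell --master local[*]
println(sc.master)              // should print something like "local[*]"
println(sc.defaultParallelism)  // typically the number of logical cores in local mode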