To increase the maximum available memory I use:

export SPARK_MEM=1g
Alternatively, I can use:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local")
  .setAppName("My application")
  .set("spark.executor.memory", "1g")
val sc = new SparkContext(conf)
The process I'm running requires much more than 1g. I would like to use 20g, but I only have 8g of RAM available. Can RAM be supplemented with disk as part of a Spark job, and if so, how is this achieved?
Is there a Spark doc which describes how to distribute jobs across multiple Spark installations?
For Spark configuration I'm using all defaults (as specified at http://spark.apache.org/docs/0.9.0/configuration.html) except for what I have specified above.
I have a single-machine instance with the following:
CPU : 4 cores
RAM : 8GB
HD : 40GB
Update:
I think this is the doc I'm looking for : http://spark.apache.org/docs/0.9.1/spark-standalone.html
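For reference, once a standalone master is running, the same SparkConf pattern can point at it instead of local mode. A minimal sketch, where the host name and port are placeholders of the form described in the standalone doc above, not values from this setup:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: "spark://master-host:7077" is a placeholder standalone master URL.
val clusterConf = new SparkConf()
  .setMaster("spark://master-host:7077")
  .setAppName("My application")
  .set("spark.executor.memory", "1g")
val clusterSc = new SparkContext(clusterConf)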
If you are trying to solve a problem on a single computer, I do not think it is practical to use Spark. The point of Spark is that it provides a way to distribute computation across multiple machines, especially in cases where the data does not fit on a single machine.
That said, you can just set spark.executor.memory to 20g to get 20 GB of virtual memory. Once the physical memory is exhausted, swap will be used instead. If you have enough swap configured, you will be able to make use of 20 GB, but your process will most likely slow to a crawl once it starts swapping.
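A minimal sketch of what that setting looks like in code, following the same SparkConf pattern from the question (20g is the value discussed here, not a recommendation):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch of the setting this answer refers to; whether the JVM can actually
// back 20 GB depends on physical RAM plus swap, as described above.
val conf = new SparkConf()
  .setMaster("local")
  .setAppName("My application")
  .set("spark.executor.memory", "20g")
val sc = new SparkContext(conf)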
If your job does not fit into memory, Spark will automatically spill to disk - you do NOT need to set up swap - i.e. Daniel's answer is a bit inaccurate. You can configure which kinds of processing will and will not spill to disk using the configuration settings: http://spark.apache.org/docs/0.9.1/configuration.html
Also, it IS a good idea to use Spark on a single machine, because it means that if you need your application to scale, you get scaling for free - the same code you write to run on one node will work on N nodes. Of course, if your data is never expected to grow, then yes, stick with pure Scala.
Use spark.shuffle.spill to control whether shuffles spill to disk, and read the "RDD Persistence" section of the programming guide to control how RDD caching spills: http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
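As a rough illustration of both knobs (a sketch, not code from the question; someRdd and the input path are placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// spark.shuffle.spill controls whether shuffle data may spill to disk
// (it defaults to true in these Spark versions).
val conf = new SparkConf()
  .setMaster("local")
  .setAppName("My application")
  .set("spark.shuffle.spill", "true")
val sc = new SparkContext(conf)

// For cached data, MEMORY_AND_DISK keeps the partitions that fit in RAM
// and writes the rest to disk instead of dropping them.
val someRdd = sc.textFile("input.txt") // placeholder input path
someRdd.persist(StorageLevel.MEMORY_AND_DISK)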