I am working with Spark 1.5.0 on Amazon's EMR. I have multiple properties files that I need to use in my spark-submit program. I explored the --properties-file
option, but it only lets you load properties from a single file. I need to read properties from a directory whose structure looks like this:
├── AddToCollection
│ ├── query
│ ├── root
│ ├── schema
│ └── schema.json
├── CreateCollectionSuccess
│ ├── query
│ ├── root
│ ├── schema
│ └── schema.json
├── FeedCardUnlike
│ ├── query
│ ├── root
│ ├── schema
│ └── schema.json
In standalone mode I can get away with this by specifying the locations of the files on the local filesystem, but that doesn't work in cluster mode, where I submit a JAR with the spark-submit command.
How can I do this in Spark?
This works on Spark 1.6.1 (I haven't tested earlier versions).
spark-submit supports the --files
argument, which accepts a comma-separated list of "local" files to be submitted along with your JAR file to the driver.
spark-submit \
--class com.acme.Main \
--master yarn \
--deploy-mode cluster \
--driver-memory 2g \
--executor-memory 1g \
--driver-class-path "./conf" \
--files "./conf/app.properties,./conf/log4j.properties" \
./lib/my-app-uber.jar \
"$@"
In this example I have created an uber JAR that does not contain any properties files. When I deploy my application, the app.properties and log4j.properties files are placed into the local ./conf directory.
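If it helps, here is a minimal Scala sketch of how the driver can then read app.properties: since ./conf is on the driver's class path (via --driver-class-path), the file resolves as a class-path resource. The key db.url is made up for illustration.

import java.util.Properties

// Because ./conf is on the driver's class path, app.properties can be
// resolved as a class-path resource even though it is not packaged
// inside the uber JAR. The key "db.url" is made up for illustration.
val props = new Properties()
val in = Thread.currentThread().getContextClassLoader
  .getResourceAsStream("app.properties")
require(in != null, "app.properties not found on the class path")
try props.load(in) finally in.close()
val dbUrl = props.getProperty("db.url")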
The source for SparkSubmitArguments states:
--files FILES
Comma-separated list of files to be placed in the working directory of each executor.
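So inside your job you should be able to open such a file by a plain relative path. A minimal sketch in Scala (the key some.key is made up for illustration):

import java.io.FileInputStream
import java.util.Properties

// --files places each file in the working directory of the driver
// (in cluster mode) and of every executor, so a plain relative path
// is enough. "some.key" is made up for illustration.
def loadWorkingDirProps(fileName: String): Properties = {
  val props = new Properties()
  val in = new FileInputStream(fileName)
  try props.load(in) finally in.close()
  props
}

val props = loadWorkingDirProps("app.properties")
println(props.getProperty("some.key"))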
I think you can package these files into your JAR file, and the JAR will be submitted to the Spark cluster.
To read these files, you can use java.util.Properties; also refer to these Java Properties file examples.
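If you go that route, a minimal Scala sketch could look like this; the resource path below is only a guess based on the directory layout in the question:

import java.util.Properties

object PropsFromJar {
  // Reads a properties file that is packaged inside the JAR as a
  // class-path resource. The path "/AddToCollection/query" is only an
  // assumption based on the directory layout in the question.
  def load(resourcePath: String): Properties = {
    val props = new Properties()
    val in = getClass.getResourceAsStream(resourcePath)
    require(in != null, s"$resourcePath not found on the class path")
    try props.load(in) finally in.close()
    props
  }
}

val props = PropsFromJar.load("/AddToCollection/query")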
Hope it helps.