Spark: additional properties in a directory

Posted 2019-08-12 00:06

Question:

I am working with Spark 1.5.0 on Amazon's EMR. I have multiple properties files that I need to use with my spark-submit program. I explored the --properties-file option, but it only lets you load properties from a single file. I need to read properties from a directory whose structure looks like this:

├── AddToCollection
│   ├── query
│   ├── root
│   ├── schema
│   └── schema.json
├── CreateCollectionSuccess
│   ├── query
│   ├── root
│   ├── schema
│   └── schema.json
├── FeedCardUnlike
│   ├── query
│   ├── root
│   ├── schema
│   └── schema.json

In standalone mode I can get away with this by specifying the location of the files on the local filesystem, but that doesn't work in cluster mode, where I'm submitting a JAR with the spark-submit command. How can I do this in Spark?

Answer 1:

This works on Spark 1.6.1 (I haven't tested earlier versions).

spark-submit supports the --files argument, which accepts a comma-separated list of "local" files to be submitted along with your JAR file to the driver.

spark-submit \
    --class com.acme.Main \
    --master yarn \
    --deploy-mode cluster \
    --driver-memory 2g \
    --executor-memory 1g \
    --driver-class-path "./conf" \
    --files "./conf/app.properties,./conf/log4j.properties" \
    ./lib/my-app-uber.jar \
    "$@"

In this example I have created an uber JAR that does not contain any properties files. When I deploy my application, the app.properties and log4j.properties files are placed into the local ./conf directory.

The source for SparkSubmitArguments states:

--files FILES
Comma-separated list of files to be placed in the working directory of each executor.
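
To make that last step concrete, here is a minimal Scala sketch of reading one of the shipped files, assuming a SparkContext has already been created and reusing the app.properties name from the command above. SparkFiles.get resolves the local path of anything distributed via --files:

import java.io.FileInputStream
import java.util.Properties

import org.apache.spark.SparkFiles

object ShippedProps {
  // Loads a properties file that was distributed with --files.
  // On YARN the file lands in each container's working directory;
  // SparkFiles.get returns the canonical local path for anything
  // added via --files or SparkContext.addFile.
  def load(name: String): Properties = {
    val props = new Properties()
    val in = new FileInputStream(SparkFiles.get(name))
    try props.load(in) finally in.close()
    props
  }
}

// Usage: val appProps = ShippedProps.load("app.properties")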



Answer 2:

I think you can package these files into your JAR file, so that the JAR submitted to the Spark cluster already contains them.

To read those files from inside the JAR, you can use java.util.Properties; see also these Java Properties file examples.
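
A minimal Scala sketch of that approach, assuming the directory from the question was packaged at the root of the JAR (the resource path is illustrative):

import java.util.Properties

object ClasspathProps {
  // Loads a properties file that was packaged inside the application JAR.
  // The resource path passed in is hypothetical -- adjust it to wherever
  // the files actually live inside your JAR.
  def load(resource: String): Properties = {
    val in = getClass.getClassLoader.getResourceAsStream(resource)
    if (in == null) sys.error(s"resource not found on classpath: $resource")
    val props = new Properties()
    try props.load(in) finally in.close()
    props
  }
}

// Usage: val query = ClasspathProps.load("AddToCollection/query")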

Hope it helps.