I am trying to run a Spark job. This is my shell script, which is located at /home/full/path/to/file/shell/my_shell_script.sh:
confLocation=../conf/my_config_file.conf &&
executors=8 &&
memory=2G &&
entry_function=my_function_in_python &&
dos2unix $confLocation &&
spark-submit \
--master yarn-client \
--num-executors $executors \
--executor-memory $memory \
--py-files /home/full/path/to/file/python/my_python_file.py $entry_function $confLocation
When I run this, I get an error that says:
Error: Cannot load main class from JAR file: /home/full/path/to/file/shell/my_function_in_python
My impression here is that it is looking in the wrong place (the python file is located in the python directory, not the shell directory).
What worked for me was to simply pass the python file in directly, without the --py-files flag.
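A minimal sketch of that call, assuming the same variables as the script above and that my_python_file.py now invokes the entry function itself (so $entry_function is no longer passed on the command line):

spark-submit \
--master yarn-client \
--num-executors $executors \
--executor-memory $memory \
/home/full/path/to/file/python/my_python_file.py $confLocation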
The --py-files flag is for additional python file dependencies used by your program; you can see in SparkSubmit.scala that it uses the so-called "primary argument", meaning the first non-flag argument, to decide whether to run in "submit jarfile" mode or "submit python main" mode. That's why you see it trying to load your $entry_function as a jarfile that doesn't exist: it only assumes you are running Python if that primary argument ends with ".py", and otherwise defaults to assuming you have a .jar file.
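In other words, which mode you get depends only on that primary argument; an informal illustration using the values from the question:

# primary argument ends in .py -> "submit python main" mode
spark-submit --master yarn-client /home/full/path/to/file/python/my_python_file.py $confLocation

# primary argument does not end in .py -> Spark treats it as a jarfile and tries
# to load a main class from it, which is exactly the error above
spark-submit --master yarn-client --py-files /home/full/path/to/file/python/my_python_file.py $entry_function $confLocation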
Instead of using --py-files, just make /home/full/path/to/file/python/my_python_file.py the primary argument; then you can either do some fancy python to take the "entry function" as a program argument, or simply call your entry function from the main function inside the python file itself.

Alternatively, you can still use --py-files and create a new main.py file which calls your entry function, then pass that main.py file as the primary argument instead.
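A sketch of that alternative, assuming a hypothetical /home/full/path/to/file/python/main.py whose only job is to import my_python_file and call my_function_in_python:

spark-submit \
--master yarn-client \
--num-executors $executors \
--executor-memory $memory \
--py-files /home/full/path/to/file/python/my_python_file.py \
/home/full/path/to/file/python/main.py $confLocation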