Could somebody please clarify the expected behavior when using `save_main_session` with custom modules imported in `__main__`? My Dataflow pipeline imports two non-standard modules: one via `requirements.txt` and the other via `setup_file`. Unless I move the imports into the functions where they are used, I keep getting import/pickling errors. From the documentation I assumed that setting `save_main_session` would solve this problem, but it does not (see the error below), so I wonder whether I missed something or this behavior is by design. The same import works fine when placed inside a function (see the sketch after the links below).
Error:

```
  File "/usr/lib/python2.7/pickle.py", line 1130, in find_class
    __import__(module)
ImportError: No module named jmespath
```
Related docs:
https://cloud.google.com/dataflow/faq#how-do-i-handle-nameerrors
https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/
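For reference, the function-level import workaround mentioned above looks roughly like this. This is a minimal sketch; the `ExtractField` DoFn and the element shape are made up for illustration and are not from my actual pipeline:

```python
import apache_beam as beam


class ExtractField(beam.DoFn):
    def process(self, element):
        # Imported lazily inside the DoFn, so unpickling the function on the
        # worker does not require jmespath to be importable in __main__.
        import jmespath
        yield jmespath.search('foo.bar', element)


def run(argv=None):
    with beam.Pipeline(argv=argv) as p:
        (p
         | beam.Create([{'foo': {'bar': 1}}])
         | beam.ParDo(ExtractField()))


if __name__ == '__main__':
    run()
```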
When to use `--save_main_session`: the setup that works best for me is having a `dataflow_launcher.py` sitting at the project root with your `setup.py`. The only thing the launcher does is import your pipeline file and launch it; use `setup.py` to handle all your dependencies. This is the best example I've found so far: https://github.com/apache/beam/tree/master/sdks/python/apache_beam/examples/complete/juliaset
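A minimal sketch of that layout, assuming a hypothetical package named `my_pipeline` that exposes a `run()` function (the names and versions below are illustrative, not taken from the juliaset example):

```python
# dataflow_launcher.py -- lives at the project root next to setup.py.
# Its only job is to import the pipeline package and kick it off; all
# worker-side dependencies are declared in setup.py.
import logging

from my_pipeline import pipeline  # hypothetical package containing run()

if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    pipeline.run()
```

```python
# setup.py -- shipped to the workers via --setup_file, so packages such as
# jmespath get installed there without relying on --save_main_session.
import setuptools

setuptools.setup(
    name='my-pipeline',
    version='0.0.1',
    packages=setuptools.find_packages(),
    install_requires=['jmespath'],
)
```

You would then launch with something like `python dataflow_launcher.py --runner DataflowRunner --setup_file ./setup.py ...` plus your usual project/staging options; `--setup_file` is the Beam pipeline option that stages the local package onto the workers.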