How to run parallel Spark jobs using Airflow

Posted 2020-04-14 08:52

We have existing code in production that runs Spark jobs in parallel. We successfully orchestrated some simple Spark jobs with Airflow, but now we are not sure how to proceed with the Spark jobs that run in parallel.

Can CeleryExecutor help in this case?
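From what we understand, CeleryExecutor distributes Airflow tasks across workers, but the parallelism itself still has to come from independent tasks in the DAG. A minimal sketch of what we imagine, assuming the SparkSubmitOperator from the apache-spark provider package (older 1.10 installs use airflow.contrib.operators.spark_submit_operator instead) and placeholder paths and inputs:

# A minimal sketch, not a drop-in solution: one SparkSubmitOperator per
# input, with no dependencies between the tasks, so the executor
# (CeleryExecutor included) is free to run them concurrently.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="parallel_spark_jobs",
    start_date=datetime(2020, 4, 1),
    schedule_interval=None,  # trigger manually for this sketch
    catchup=False,
) as dag:
    # Hypothetical inputs; in our setup these would be the lines of outfile.txt.
    inputs = ["input_a", "input_b", "input_c"]

    for item in inputs:
        SparkSubmitOperator(
            task_id=f"spark_job_{item}",
            application="/path/to/job.py",          # placeholder application
            application_args=[item, "/data/list"],  # mirrors our CLI arguments
            conn_id="spark_default",
        )

If something like this is the right direction, we could cap concurrency with Airflow's usual parallelism settings to match the -j2 behaviour we have today.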

Or should we modify our existing Spark jobs so they no longer run in parallel? Personally, I do not like the latter approach.

Our existing shell script, which runs the Spark jobs in parallel via GNU parallel, looks like this, and we would like to run it from Airflow:

cat outfile.txt | parallel -k -j2 submitspark {} /data/list
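The simplest option we can think of is to wrap that line in a single BashOperator and leave the fan-out to GNU parallel. A minimal sketch, assuming the Airflow 2 import path (1.10 uses airflow.operators.bash_operator) and that submitspark and outfile.txt are available on the worker:

# A minimal sketch: run the existing command unchanged as one Airflow task.
# Parallelism stays inside GNU parallel; Airflow only schedules the task.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="wrap_existing_parallel_script",
    start_date=datetime(2020, 4, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    run_parallel_spark = BashOperator(
        task_id="run_parallel_spark",
        # Same command as our shell script, run on whichever worker
        # picks the task up.
        bash_command="cat outfile.txt | parallel -k -j2 submitspark {} /data/list",
    )

The drawback we see is that Airflow would treat the whole fan-out as one task, so retries and monitoring stay coarse-grained, which is partly why we are asking about CeleryExecutor.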

Please suggest.
