How to run parallel Spark jobs using Airflow

Posted 2020-04-14 09:03

Question:

We have existing code in production that runs Spark jobs in parallel. We tried orchestrating some mundane Spark jobs with Airflow and had success, but now we are not sure how to proceed with the Spark jobs that run in parallel.
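
For context, the sequential orchestration we already have working looks roughly like this. It is only a sketch: the DAG id, schedule, paths, and the spark-submit invocation are illustrative, not our real ones.

# Sketch of the kind of sequential DAG we already run (Airflow 1.10-style imports).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="mundane_spark_jobs",      # illustrative name
    start_date=datetime(2020, 4, 1),
    schedule_interval="@daily",
    catchup=False,
)

# One Spark job per task, submitted via spark-submit on the worker.
submit_job = BashOperator(
    task_id="submit_spark_job",
    bash_command="spark-submit --master yarn /jobs/mundane_job.py",  # placeholder job
    dag=dag,
)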

Can CeleryExecutor help in this case?

Or should we modify our existing Spark jobs so that they do not run in parallel? Personally, I do not like the latter approach.

Our existing shell script that runs the Spark jobs in parallel looks something like this, and we would like to run this script from Airflow:

cat outfile.txt | parallel -k -j2 submitspark {} /data/list
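
To make the question concrete, here is a rough sketch of the two options we are weighing, in a single DAG file. The paths and DAG id are placeholders, and submitspark is our in-house wrapper around spark-submit.

# Sketch only; names and paths are illustrative (Airflow 1.10-style imports).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="parallel_spark_jobs",
    start_date=datetime(2020, 4, 1),
    schedule_interval=None,
)

# Option A: keep the shell script as-is and wrap it in a single task.
# Airflow then sees one opaque task, so per-job status and retries are hidden.
run_script = BashOperator(
    task_id="run_parallel_script",
    bash_command="cat /path/to/outfile.txt | parallel -k -j2 submitspark {} /data/list",
    dag=dag,
)

# Option B: fan out one task per input line and let the executor
# (e.g. CeleryExecutor or LocalExecutor) supply the parallelism.
# Reading the file at DAG-parse time is a simplification for this sketch.
with open("/path/to/outfile.txt") as fh:
    for i, item in enumerate(line.strip() for line in fh if line.strip()):
        BashOperator(
            task_id="submitspark_{}".format(i),
            bash_command="submitspark {} /data/list".format(item),
            dag=dag,
        )

Option B is what makes us wonder whether CeleryExecutor alone is the right lever, or whether we also have to restructure the job submission itself.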

Please suggest.