I'm using Airflow to orchestrate some Python scripts. I have a "main" DAG from which several subdags are run. My main DAG is supposed to run according to the following overview:
I've managed to get to this structure in my main DAG by using the following lines:
etl_internal_sub_dag1 >> etl_internal_sub_dag2 >> etl_internal_sub_dag3
etl_internal_sub_dag3 >> etl_adzuna_sub_dag
etl_internal_sub_dag3 >> etl_adwords_sub_dag
etl_internal_sub_dag3 >> etl_facebook_sub_dag
etl_internal_sub_dag3 >> etl_pagespeed_sub_dag
etl_adzuna_sub_dag >> etl_combine_sub_dag
etl_adwords_sub_dag >> etl_combine_sub_dag
etl_facebook_sub_dag >> etl_combine_sub_dag
etl_pagespeed_sub_dag >> etl_combine_sub_dag
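As an aside, recent Airflow versions also accept a list on either side of >>, so the same fan-out/fan-in can be written more compactly; this is equivalent to the lines above:

# fan out from etl_internal_sub_dag3, then fan back in to etl_combine_sub_dag
etl_internal_sub_dag1 >> etl_internal_sub_dag2 >> etl_internal_sub_dag3
etl_internal_sub_dag3 >> [etl_adzuna_sub_dag, etl_adwords_sub_dag, etl_facebook_sub_dag, etl_pagespeed_sub_dag] >> etl_combine_sub_dag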
What I want Airflow to do is to first run etl_internal_sub_dag1, then etl_internal_sub_dag2, and then etl_internal_sub_dag3. When etl_internal_sub_dag3 is finished, I want etl_adzuna_sub_dag, etl_adwords_sub_dag, etl_facebook_sub_dag, and etl_pagespeed_sub_dag to run in parallel. Finally, when these last four scripts are finished, I want etl_combine_sub_dag to run.
However, when I run the main DAG, etl_adzuna_sub_dag, etl_adwords_sub_dag, etl_facebook_sub_dag, and etl_pagespeed_sub_dag run one by one rather than in parallel.
Question: How do I make sure that etl_adzuna_sub_dag, etl_adwords_sub_dag, etl_facebook_sub_dag, and etl_pagespeed_sub_dag run in parallel?
Edit: My default_args and DAG look like this:
from datetime import timedelta

from airflow import DAG

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': start_date,  # start_date and end_date are defined earlier in the file
    'end_date': end_date,
    'email': ['myname@gmail.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
    'retry_delay': timedelta(minutes=5),
}

DAG_NAME = 'main_dag'

dag = DAG(DAG_NAME, default_args=default_args, catchup=False)
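For completeness, each etl_*_sub_dag task in the main DAG is a SubDagOperator. A minimal sketch of how one of them is wired up (subdag_factory is a stand-in for my actual subdag-building function, not its real name):

from airflow.operators.subdag_operator import SubDagOperator

# sketch: subdag_factory is a placeholder for the function that builds each subdag;
# a subdag's dag_id must be '<parent_dag_id>.<task_id>', hence the two name arguments
etl_adzuna_sub_dag = SubDagOperator(
    task_id='etl_adzuna_sub_dag',
    subdag=subdag_factory(DAG_NAME, 'etl_adzuna_sub_dag', default_args),
    dag=dag,
)
# the other etl_*_sub_dag tasks are created the same way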