When I try to run a DAG in Airflow 1.8.0, I find that there is a long delay between the completion of a predecessor task and the time at which the successor task is picked up for execution (usually greater than the execution times of the individual tasks). The same happens with the Sequential, Local, and Celery executors. Is there a way to reduce this overhead (e.g. any parameters in airflow.cfg that can speed up DAG execution)? A Gantt chart has been added for reference:
Answer 1:
As Nick said, Airflow is not a real-time tool. Tasks are scheduled and executed ASAP, but the next Task will never run immediately after the last one.
When you have more than ~100 DAGs with ~3 tasks each, or DAGs with many tasks (~100 or more), you have to consider 3 things:
- Increase the number of threads that the DagFileProcessorManager will use to load and process the DAGs (in airflow.cfg):
[scheduler]
max_threads = 2
The max_threads setting controls how many DAG-processing threads the scheduler uses to pick up and execute/terminate DAGs (see here).
Increasing this configuration may reduce the time between tasks.
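As a sketch, the relevant airflow.cfg section would look like this (the value 4 below is illustrative, not a recommendation; tune it against your CPU core count):

```ini
[scheduler]
# Default is 2; a higher value lets the scheduler process more DAG
# files in parallel, bounded by the machine's available cores.
max_threads = 4
```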
- Monitor your Airflow database for bottlenecks. The scheduler and workers rely on it to manage and execute tasks:
Recently we were suffering from the same problem. The time between tasks was ~10-15 minutes; we were using PostgreSQL on AWS.
The instance was not using its resources heavily (~20 IOPS, 20% of memory and ~10% of CPU), but Airflow was still very slow.
After looking at the database performance using PgHero, we discovered that even a query using an Index on a small table was spending more than one second.
So we increased the Database size, and Airflow is now running as fast as a rocket. :)
- To see how long Airflow spends loading DAGs, run the command:
airflow list_dags -r
DagBag parsing time: 7.9497220000000075
If the DagBag parsing time is higher than ~5 minutes, it could be an issue.
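As a rough, Airflow-free illustration of what that number measures, you can time how long it takes to read and compile every .py file in a DAG folder. This is only an approximation (the function name below is made up; a real DagBag import also executes each module and builds DAG objects, so real parse times will be higher):

```python
import pathlib
import time

def time_dag_folder_parse(dag_folder):
    """Approximate DagBag parsing time by compiling each .py file.

    Illustrative only: Airflow additionally imports the modules, so
    heavy module-level work (network calls, big imports) in DAG files
    inflates the real parsing time well beyond this estimate.
    """
    start = time.monotonic()
    for path in pathlib.Path(dag_folder).rglob("*.py"):
        compile(path.read_text(), str(path), "exec")
    return time.monotonic() - start
```

If even this simple compile pass is slow for your DAG folder, the files themselves (size, count) are part of the problem rather than the scheduler.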
All of this helped us run Airflow faster. I really advise you to upgrade to version 1.9, as many performance issues were fixed in that release.
BTW, we are using the Airflow master in production, with LocalExecutor and PostgreSQL as the metadata database.
Answer 2:
Your Gantt chart shows delays on the order of seconds. Airflow is not meant to be a real-time scheduling engine; it deals with things on the order of minutes. If you need things to run faster, you may consider a different scheduling tool than Airflow. Alternatively, you can put all of the work into a single task so you do not suffer from the delays of the scheduler.
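The "single task" approach can be sketched without any Airflow-specific code: merge the steps into one callable and hand that to a single operator, instead of chaining one operator per step (the function names and data below are made up for illustration):

```python
def extract():
    # Step 1: fetch raw records (hypothetical example data).
    return [1, 2, 3]

def transform(rows):
    # Step 2: per-row processing.
    return [r * 2 for r in rows]

def load(rows):
    # Step 3: persist/aggregate the results.
    return sum(rows)

def run_pipeline():
    # One callable running all steps back-to-back: there is no
    # scheduler hand-off between them, so no inter-task delay.
    return load(transform(extract()))
```

In Airflow, this single callable would back one PythonOperator instead of three chained tasks, trading per-step visibility in the UI for the removal of scheduler latency between steps.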
Answer 3:
I had to patch the DAG-filling code because each worker spent over 30 seconds filling the DagBag. The issue is the detect_downstream_cycle code in models.py, which takes a long time to run. In my testing with the list_dags command, here are my results: