I want to use Airflow to implement data flows that periodically poll external systems (ftp servers, etc), check for new files matching certain conditions, and then run a bunch of tasks for those files. Now, I'm a newbie to Airflow and read that Sensors are something you would use for this kind of a case, and I actually managed to write a sensor that works ok when I run "airflow test" for it. But I'm a bit confused regarding the relation of poke_interval for the sensor and the DAG scheduling. How should I define those settings for my use case? Or should I use some other approach? I just want Airflow to run the tasks when those files become available, and not flood the dashboard with failures when no new files were available for a while.
Your understanding is correct, using a sensor is the way to go when you want to poll, either by using an existing sensor or by implementing your own.
They are, however, always part of a DAG and do not execute outside of its boundaries. DAG execution depends on the `start_date` and `schedule_interval`, but you can leverage this together with a sensor to implement a kind of DAG that depends on the status of an external server: one possible approach is to start the whole DAG with a sensor that checks for the condition to occur and skips the rest of the DAG if the condition is not met (you can make sure that sensors mark downstream tasks as skipped, and not failed, by setting their `soft_fail` parameter to `True`). You can get a polling interval of one minute by using the most frequent scheduling option (`* * * * *`). If you need a shorter polling time, you can tweak the sensor's `poke_interval` and `timeout` parameters.

Keep in mind, however, that execution times are probably not guaranteed by Airflow itself, so for very short polling times you may want to investigate alternatives (or at least consider different approaches to the one I've just shared).