Between Apache Oozie, Spotify/Luigi and airbnb/airflow, what are the pros and cons for each of them?
I have used oozie and airflow in the past for building a data ingestion pipeline using PIG and Hive. Currently, I am in the process of building a pipeline that looks at logs and extracts out useful events and puts them on redshift.
I found that airflow was much easier to use/test/setup. It has a much cooler UI and lets users perform actions from the UI itself, which is not the case with Oozie. Any information about Luigi or other insights regarding stability and issues are welcome.
This post will give you an initial idea about different possible workflows
http://bytepawn.com/luigi-airflow-pinball.html
IMHO, Azkaban enforces simplicity (can’t use features that don’t exist) and the others subtly encourage complexity.
Simpler pipelines are better than complex pipelines: Easier to create, easier to understand (especially when you didn’t create) and easier to debug/fix.
When complex actions are needed you want to encapsulate them in a way that either completely succeeds or completely fails.
If you can make it idempotent (running it again creates identical results) then that’s even better.