Scheduling Spark jobs on a regular basis

Posted 2019-04-16 00:01

Which is the recommended tool for scheduling Spark jobs on a daily/weekly basis?

  • Oozie
  • Luigi
  • Azkaban
  • Chronos
  • Airflow

Thanks in advance.

1 Answer

不美不萌又怎样 · answered 2019-04-16 00:33

Updating my previous answer from here: Suggestion for scheduling tool(s) for building hadoop based data pipelines

  • Airflow: Try this first. Decent UI, Python-ish job definitions, semi-accessible to non-programmers, though the dependency-declaration syntax is a bit odd.
    • Airflow has built-in support for the fact that scheduled jobs often need to be rerun and/or backfilled. Make sure you build your pipelines to support this; a minimal DAG sketch follows this list.
  • Azkaban: Nice UI, relatively simple, accessible for non-programmers. Has a longish history at LinkedIn.
  • Luigi: OK UI, workflows are pure Python, requires a solid grasp of Python coding and object-oriented concepts, hence not suitable for non-programmers.
  • Oozie: Insane XML based job definitions. Here be dragons. ;-)
  • Chronos: ¯\_(ツ)_/¯
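
As a concrete illustration of the Airflow point above, here is a minimal sketch of a daily Spark job defined as an Airflow DAG (Airflow 1.x-style API, roughly contemporary with this answer). The DAG id, script path, owner, and cluster settings are hypothetical placeholders; the key ideas are `schedule_interval="@daily"`, `catchup=True` so missed days can be backfilled, and passing the templated execution date (`{{ ds }}`) to the job so reruns process the correct day.

```python
# Minimal sketch: a daily Spark job scheduled by Airflow (1.x-style imports).
# Names and paths are hypothetical examples, not from the original answer.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "data-eng",                 # hypothetical owner
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="daily_spark_job",            # hypothetical DAG id
    default_args=default_args,
    start_date=datetime(2019, 4, 1),
    schedule_interval="@daily",          # run once per day
    catchup=True,                        # allow backfilling missed days
) as dag:

    run_spark = BashOperator(
        task_id="run_spark_job",
        # {{ ds }} is the logical execution date, so reruns and backfills
        # each process the day they belong to rather than "today".
        bash_command=(
            "spark-submit --master yarn --deploy-mode cluster "
            "/opt/jobs/daily_job.py --date {{ ds }}"
        ),
    )
```

With a DAG like this, rerunning a past window is (in Airflow 1.x) a single CLI call such as `airflow backfill daily_spark_job -s 2019-04-01 -e 2019-04-07`, which is exactly the rerun/backfill behaviour the bullet above refers to.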

Philosophy:

Simpler pipelines are better than complex pipelines: easier to create, easier to understand (especially when you didn't create them), and easier to debug/fix.

When complex actions are needed, you want to encapsulate them in a way that either completely succeeds or completely fails.

If you can make it idempotent (running it again creates identical results) then that’s even better.
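
For example, a daily Spark step can be made idempotent by overwriting the output for the day being processed instead of appending to it. The sketch below is a hypothetical PySpark job (input/output paths and column names are made up for illustration) whose rerun for the same date produces identical output:

```python
# Hedged sketch of an idempotent daily Spark step: the output for a given date
# is fully rewritten, so running the job again for that date yields the same result.
import sys

from pyspark.sql import SparkSession

run_date = sys.argv[1]  # e.g. "2019-04-16", passed in by the scheduler

spark = SparkSession.builder.appName("daily_job").getOrCreate()

# Read only the day's slice of the (hypothetical) raw events table.
events = (
    spark.read.parquet("/data/raw/events")
    .where(f"event_date = '{run_date}'")
)

daily_summary = events.groupBy("user_id").count()

# Overwrite this date's output instead of appending, so a rerun replaces any
# partial results left by a failed attempt and always produces identical output.
(
    daily_summary.write
    .mode("overwrite")
    .parquet(f"/data/summaries/event_counts/event_date={run_date}")
)
```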
