Scheduling Spark jobs on a regular basis

Published 2019-04-15 23:59

Question:

Which is the recommended tool for scheduling Spark jobs on a daily/weekly basis? 1) Oozie 2) Luigi 3) Azkaban 4) Chronos 5) Airflow

Thanks in advance.

Answer 1:

Updating my previous answer from here: Suggestion for scheduling tool(s) for building hadoop based data pipelines

  • Airflow: Try this first. Decent UI, Python-ish job definitions, semi-accessible for non-programmers, though the dependency declaration syntax is weird (see the DAG sketch after this list).
    • Airflow has built-in support for the fact that scheduled jobs often need to be rerun and/or backfilled. Make sure you build your pipelines to support this.
  • Azkaban: Nice UI, relatively simple, accessible for non-programmers. Has a longish history at LinkedIn.
    • Azkaban enforces simplicity (you can’t use features that don’t exist), whereas the others subtly encourage complexity.
    • Check out the Azkaban CLI project for programmatic job creation. https://github.com/mtth/azkaban (examples https://github.com/joeharris76/azkaban_examples)
  • Luigi: OK UI, workflows are pure Python, requires a solid grasp of Python coding and object-oriented concepts, hence not suitable for non-programmers (see the Luigi sketch below the list).
  • Oozie: Insane XML based job definitions. Here be dragons. ;-)
  • Chronos: ¯\_(ツ)_/¯
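
To make the Airflow points concrete, here’s a rough sketch of a daily DAG that submits two Spark jobs. All the paths, IDs, and column choices are made up, and `SparkSubmitOperator` assumes you’ve installed the Apache Spark provider package and configured the default `spark_default` connection. `catchup=True` is what enables the rerun/backfill behaviour mentioned above, and the `>>` operator is the weird-but-you-get-used-to-it dependency syntax:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="daily_spark_etl",            # hypothetical name
    start_date=datetime(2019, 4, 1),
    schedule_interval="@daily",
    catchup=True,                        # lets Airflow backfill any days it missed
) as dag:
    extract = SparkSubmitOperator(
        task_id="extract",
        application="/jobs/extract.py",  # hypothetical job file
        application_args=["--date", "{{ ds }}"],  # pass the run date so reruns are reproducible
    )
    aggregate = SparkSubmitOperator(
        task_id="aggregate",
        application="/jobs/aggregate.py",
        application_args=["--date", "{{ ds }}"],
    )

    extract >> aggregate  # bitshift dependency syntax: aggregate runs after extract
```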
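
And for contrast, here’s what a pure-Python Luigi workflow for the same kind of daily job might look like. Everything here (paths, task names, the trivial line-count “aggregation”) is invented for illustration; the point is that dependencies, outputs, and logic are all expressed as Python classes, which is exactly why Luigi needs programmers:

```python
import luigi

class RawEvents(luigi.ExternalTask):
    """Data produced outside this pipeline (hypothetical path)."""
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget(f"/data/raw/{self.date:%Y-%m-%d}/events.json")

class DailyAggregate(luigi.Task):
    """Depends on RawEvents; Luigi skips the task if its output already exists."""
    date = luigi.DateParameter()

    def requires(self):
        return RawEvents(date=self.date)

    def output(self):
        return luigi.LocalTarget(f"/data/agg/{self.date:%Y-%m-%d}/counts.txt")

    def run(self):
        with self.input().open() as src, self.output().open("w") as dst:
            dst.write(str(sum(1 for _ in src)))  # stand-in "aggregation": count lines

if __name__ == "__main__":
    luigi.run()  # e.g. python daily.py DailyAggregate --date 2019-04-15
```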

Philosophy:

Simpler pipelines are better than complex pipelines: easier to create, easier to understand (especially when you didn’t create them), and easier to debug/fix.

When complex actions are needed, you want to encapsulate them in a way that either completely succeeds or completely fails.
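
One common way to get that all-or-nothing behaviour is to stage output in a scratch directory and only rename it into place once the job has fully finished. This is just a sketch with hypothetical paths, assuming POSIX semantics and that scratch and final live on the same filesystem (the rename itself is atomic; the cleanup of an old partition beforehand is not):

```python
import shutil
import tempfile
from pathlib import Path

def publish_partition(build, date: str, base: Path = Path("/data/daily")) -> Path:
    """Run `build` (any callable that writes files into the directory it is
    given) off to the side, then swap the result into the date partition.
    Readers never see a half-written partition: it appears in one rename."""
    final = base / f"dt={date}"
    scratch = Path(tempfile.mkdtemp(dir=base))  # same filesystem, so rename is cheap
    try:
        build(scratch)                    # all the real work happens here
        if final.exists():
            shutil.rmtree(final)          # drop any previous attempt's output
        scratch.rename(final)             # the single step that "commits" the data
    except Exception:
        shutil.rmtree(scratch, ignore_errors=True)  # a failure leaves nothing visible
        raise
    return final
```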

If you can make it idempotent (running it again creates identical results), then that’s even better.
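
In Spark terms, a simple way to get idempotence is to recompute a whole date partition from its raw inputs and overwrite it on every run. A minimal PySpark sketch, with made-up paths and a made-up `user_id` column; `mode("overwrite")` is what makes a rerun or backfill for the same date land identically instead of appending duplicates:

```python
from pyspark.sql import SparkSession

def run_daily(date: str) -> None:
    """Recompute one day's aggregate from scratch; safe to run any number of times."""
    spark = SparkSession.builder.appName(f"daily-agg-{date}").getOrCreate()
    events = spark.read.json(f"/data/raw/{date}/")     # hypothetical input path
    counts = events.groupBy("user_id").count()         # hypothetical column
    # Overwrite, never append: rerunning for the same date replaces the
    # partition wholesale, so the results are identical on every run.
    counts.write.mode("overwrite").parquet(f"/data/agg/dt={date}/")
    spark.stop()
```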