Scheduling a Spark Job Java

Posted 2019-06-25 05:52

Question:

I have a Spark job that reads an HBase table, performs some aggregations, and stores the results in MongoDB. Currently I run this job manually with the spark-submit script. I want to schedule it to run at a fixed interval.

How can I achieve this using Java?

Is there a library for this, or can I do it with a Thread in Java?

Any suggestions appreciated!

Answer 1:

If you want to keep using spark-submit, I would prefer crontab or something similar that runs a bash script, for example.

But if you need to run "spark-submit" from Java, take a look at the package org.apache.spark.launcher. With this approach you can start the application programmatically with SparkLauncher.

import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

...

     public void startApacheSparkApplication(){
        SparkAppHandle handler = new SparkLauncher()
         .setAppResource("pathToYourSparkApp.jar")
         .setMainClass("your.package.main.Class")
         .setMaster("local")
         .setConf(...)
         .startApplication(); // <-- and start spark job app
     }
...

But your question was about a scheduling library. You can use a simple java.util.Timer with a java.util.TimerTask from the JDK, but I would prefer the Quartz Job Scheduling Library. It is really popular (as far as I know, Spring integrates with the Quartz Scheduler too).

Spring also features integration classes for supporting scheduling with the Timer, part of the JDK since 1.3, and the Quartz Scheduler (http://quartz-scheduler.org) ...
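If the plain JDK is enough, a minimal sketch of the Timer-style approach could look like the following. It uses ScheduledExecutorService (the java.util.concurrent alternative to Timer) rather than Timer itself. The class name FixedIntervalRunner and the placeholder task are hypothetical; in a real job the task would call something like startApacheSparkApplication().

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper, not part of Spark or Quartz.
final class FixedIntervalRunner {

    // Runs `task` once immediately and then every `period` units on a single background thread.
    static ScheduledExecutorService scheduleAtFixedRate(Runnable task, long period, TimeUnit unit) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            try {
                task.run();
            } catch (RuntimeException e) {
                // An uncaught exception would suppress all later runs, so log it and keep going.
                System.err.println("Scheduled run failed: " + e);
            }
        }, 0, period, unit);
        return scheduler;
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger runs = new AtomicInteger();
        // Placeholder task (assumption): a real job would launch the Spark application here.
        ScheduledExecutorService scheduler = scheduleAtFixedRate(
                () -> System.out.println("run #" + runs.incrementAndGet()),
                200, TimeUnit.MILLISECONDS);
        Thread.sleep(700);       // let a few runs happen
        scheduler.shutdownNow(); // stop the background thread so the JVM can exit
    }
}
```

This only helps while the JVM that owns the scheduler stays alive, and it has no cron-style expressions or persistence, which is where Quartz comes in.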

With Quartz you can use cron-style schedules, and for me it is easier to work with.

Just add the Maven dependency:

<!-- https://mvnrepository.com/artifact/org.quartz-scheduler/quartz -->
<dependency>
    <groupId>org.quartz-scheduler</groupId>
    <artifactId>quartz</artifactId>
    <version>2.2.3</version>
</dependency>

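Create a Quartz job that launches Spark:

   public class SparkLauncherQuartzJob implements Job {
       @Override
       public void execute(JobExecutionContext context) throws JobExecutionException {
           startApacheSparkApplication(); // launch the Spark application on every trigger firing
       }
   ...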

Now create a trigger and schedule it:

 // trigger fires at the top of every hour (Quartz cron fields start with seconds)
 Trigger trigger = TriggerBuilder.newTrigger()
             .withIdentity("sparkJob1Trigger", "sparkJobsGroup")
             .withSchedule(
                 CronScheduleBuilder.cronSchedule("0 0 * * * ?"))
             .build();


  JobDetail sparkQuartzJob = JobBuilder.newJob(SparkLauncherQuartzJob.class).withIdentity("SparkLauncherQuartzJob", "sparkJobsGroup").build();

  Scheduler scheduler = new StdSchedulerFactory().getScheduler();
  scheduler.start();
  scheduler.scheduleJob(sparkQuartzJob , trigger);
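A note on the cron string: Quartz expressions have six required fields, starting with seconds (seconds, minutes, hours, day-of-month, month, day-of-week, plus an optional year), which is easy to get wrong if you are used to five-field Unix cron. The sketch below just labels the fields so the difference between "0 0 * * * ?" (once an hour) and "0 * * * * ?" (once a minute) is visible. QuartzCronFields is a hypothetical helper, not part of Quartz, and it does not validate the expression.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; Quartz itself parses the expression.
final class QuartzCronFields {
    private static final String[] NAMES = {
        "seconds", "minutes", "hours", "dayOfMonth", "month", "dayOfWeek", "year"
    };

    // Splits a Quartz cron expression on whitespace and pairs each part with its field name.
    static Map<String, String> label(String expression) {
        String[] parts = expression.trim().split("\\s+");
        if (parts.length < 6 || parts.length > 7) {
            throw new IllegalArgumentException("Expected 6 or 7 fields, got " + parts.length);
        }
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i < parts.length; i++) {
            fields.put(NAMES[i], parts[i]);
        }
        return fields;
    }

    public static void main(String[] args) {
        // prints {seconds=0, minutes=0, hours=*, dayOfMonth=*, month=*, dayOfWeek=?}
        System.out.println(label("0 0 * * * ?"));
        // prints {seconds=0, minutes=*, hours=*, dayOfMonth=*, month=*, dayOfWeek=?}
        System.out.println(label("0 * * * * ?"));
    }
}
```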

Alternatively, if you have a Spring Boot application, you can schedule methods very easily: just add @EnableScheduling to a configuration class and write something like this (fixedRate is in milliseconds, so 300000 means every five minutes):

@Scheduled(fixedRate = 300000)
public void periodicalRunningSparkJob() {
    log.info("Spark job periodically execution");
    startApacheSparkApplication();
}