I have a Spark job that reads an HBase table, performs some aggregations, and stores the data in MongoDB. Currently the job is run manually using the spark-submit script. I want to schedule it to run at a fixed interval.
How can I achieve this using Java?
Is there a library for this?
Or can I do this with a Thread in Java?
Any suggestions appreciated!
If you still want to use spark-submit, I would prefer crontab or something similar that runs, for example, a bash script.
But if you need to run "spark-submit" from Java, take a look at the package org.apache.spark.launcher. With this approach you can start the application programmatically with SparkLauncher:
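import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;
...
public void startApacheSparkApplication() {
    try {
        SparkAppHandle handler = new SparkLauncher()
            .setAppResource("pathToYourSparkApp.jar")
            .setMainClass("your.package.main.Class")
            .setMaster("local")
            .setConf(...)         // your Spark configuration key/value pairs
            .startApplication();  // <-- launches the Spark application and returns a handle to it
    } catch (IOException e) {
        // startApplication() declares IOException, so it has to be handled here
        throw new UncheckedIOException("Could not launch the Spark application", e);
    }
}
...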
But your question was about a scheduling library. You can use a simple Timer with a Date, both provided in java.util (together with java.util.TimerTask), but I would prefer the Quartz Job Scheduling Library. It is really popular (as far as I know, Spring uses the Quartz Scheduler too).
Spring also features integration classes for supporting scheduling with the Timer, part of the JDK since 1.3, and the Quartz Scheduler (http://quartz-scheduler.org) ....
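For the JDK-only route, a minimal sketch with java.util.Timer and TimerTask might look like the following. The class name FixedIntervalLauncher, the schedule helper, and the launchSparkJob stand-in are illustrative assumptions, not part of any library. In a real setup the Runnable would call something like startApacheSparkApplication().

```java
import java.util.Timer;
import java.util.TimerTask;

public class FixedIntervalLauncher {

    /**
     * Runs the given job immediately and then again every periodMillis milliseconds.
     * The caller must cancel the returned Timer to stop the schedule.
     */
    public static Timer schedule(Runnable job, long periodMillis) {
        // Non-daemon timer thread, so it keeps the JVM alive until cancel() is called.
        Timer timer = new Timer("spark-job-timer");
        timer.scheduleAtFixedRate(new TimerTask() {
            @Override
            public void run() {
                try {
                    job.run();
                } catch (RuntimeException e) {
                    // An uncaught exception would terminate the timer thread
                    // and silently stop all future runs, so log it instead.
                    e.printStackTrace();
                }
            }
        }, 0L, periodMillis);
        return timer;
    }

    public static void main(String[] args) throws InterruptedException {
        // Hypothetical stand-in for the method that launches the Spark job.
        Runnable launchSparkJob = () -> System.out.println("launching Spark job");

        Timer timer = schedule(launchSparkJob, 1000L);
        Thread.sleep(3500L);   // let the schedule run for a few seconds
        timer.cancel();        // stop further executions
    }
}
```

Note that a Timer executes all of its tasks on a single thread, so a run that takes longer than the period delays the following ones. java.util.concurrent.ScheduledExecutorService is the more modern JDK alternative if that becomes a problem.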
With Quartz you can set up cron-style schedules, and for me it is easier to work with.
Just add the Maven dependency:
<!-- https://mvnrepository.com/artifact/org.quartz-scheduler/quartz -->
<dependency>
<groupId>org.quartz-scheduler</groupId>
<artifactId>quartz</artifactId>
<version>2.2.3</version>
</dependency>
Create the Spark Quartz job:
public class SparkLauncherQuartzJob implements Job {
    public void execute(JobExecutionContext context) throws JobExecutionException {
        startApacheSparkApplication();
    }
    ...
}
Now create a trigger and schedule it:
// trigger fires at the start of every hour
// (Quartz cron fields: seconds minutes hours day-of-month month day-of-week)
Trigger trigger = TriggerBuilder.newTrigger()
    .withIdentity("sparkJob1Trigger", "sparkJobsGroup")
    .withSchedule(CronScheduleBuilder.cronSchedule("0 0 * * * ?"))
    .build();
JobDetail sparkQuartzJob = JobBuilder.newJob(SparkLauncherQuartzJob.class)
    .withIdentity("SparkLauncherQuartzJob", "sparkJobsGroup")
    .build();
Scheduler scheduler = new StdSchedulerFactory().getScheduler();
scheduler.start();
scheduler.scheduleJob(sparkQuartzJob, trigger);
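Quartz cron expressions begin with a seconds field, unlike classic crontab lines, which is easy to overlook when writing a schedule. As a rough illustration only (plain JDK, no Quartz classes involved), the hypothetical helper below labels the fields of an expression in the order Quartz reads them. The class QuartzCronFields and its label method are made up for this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class QuartzCronFields {

    // Field names in the order Quartz expects them; the seventh (year) is optional.
    private static final String[] NAMES = {
        "seconds", "minutes", "hours", "day-of-month", "month", "day-of-week", "year"
    };

    /** Splits a Quartz cron expression on whitespace and labels each field. */
    public static Map<String, String> label(String expression) {
        String[] parts = expression.trim().split("\\s+");
        if (parts.length < 6 || parts.length > 7) {
            throw new IllegalArgumentException(
                "Expected 6 or 7 fields, got " + parts.length);
        }
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i < parts.length; i++) {
            fields.put(NAMES[i], parts[i]);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Seconds = 0 and minutes = 0 with every hour: fires once per hour, on the hour.
        System.out.println(label("0 0 * * * ?"));
        // Seconds = 0 with every minute: fires once per minute.
        System.out.println(label("0 * * * * ?"));
    }
}
```

Reading the labeled output makes it clear which field controls the interval: a `*` in the minutes field means the trigger fires every minute, whereas a fixed minute value with `*` in the hours field gives an hourly schedule.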
Alternatively, if you have a Spring Boot application, you can schedule methods very easily: just add @EnableScheduling to a configuration class and write something like this:
// fixedRate is given in milliseconds: 300000 ms = 5 minutes
@Scheduled(fixedRate = 300000)
public void periodicalRunningSparkJob() {
    log.info("Periodic Spark job execution");
    startApacheSparkApplication();
}