I am looking to schedule a Dataflow job that reads from a Pub/Sub topic's subscription with PubsubIO. How can I have the job terminate after a configured interval? My use case is not to keep the job running through the entire day, so I want to schedule it to start and then stop after a configured interval from within the job.
pipeline
    .apply(PubsubIO.readMessages().fromSubscription("some-subscription"))
From the docs:
If you need to stop a running Cloud Dataflow job, you can do so by
issuing a command using either the Cloud Dataflow Monitoring Interface
or the Cloud Dataflow Command-line Interface.
I would assume you are not interested in stopping jobs manually via the Console, which leaves you with the command-line solution. If you intend to schedule your Dataflow job to run e.g. daily, then you also know at which time you want it to stop (launch time + "configured interval"). In that case, you could configure a cron job to run gcloud dataflow jobs cancel at that time every day. For instance, the following script would cancel all active jobs launched within the last day:
#!/bin/bash
# List active Dataflow jobs created within the last day, keep only the
# JOB_ID column, skip the header row, and cancel each job.
gcloud dataflow jobs list --status=active --created-after=-1d \
  | awk '{print $1;}' | tail -n +2 \
  | while read -r JOB_ID; do gcloud dataflow jobs cancel "$JOB_ID"; done
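To schedule it, a crontab entry along the following lines would run the script at, say, 18:00 every day (the script path is hypothetical):

# m h dom mon dow command
0 18 * * * /path/to/cancel-dataflow-jobs.sh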
Another solution would be to invoke the gcloud command from within your Java code, using Runtime.getRuntime().exec(). You can schedule this to run after a specific interval using java.util.Timer#schedule(), as noted here. This way you can ensure your job is going to stop after the provided time interval, regardless of when you launched it.
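A minimal sketch of that idea (the job id and the one-hour interval are placeholders, and it assumes the gcloud CLI is installed and authenticated wherever this code runs):

import java.io.IOException;
import java.util.Timer;
import java.util.TimerTask;

public class GcloudCanceller {

    // Shells out to the gcloud CLI to cancel the given Dataflow job after delayMillis.
    static void cancelAfter(String jobId, long delayMillis) {
        new Timer().schedule(new TimerTask() {
            @Override
            public void run() {
                try {
                    Runtime.getRuntime()
                            .exec(new String[] {"gcloud", "dataflow", "jobs", "cancel", jobId})
                            .waitFor();
                } catch (IOException | InterruptedException e) {
                    e.printStackTrace();
                }
            }
        }, delayMillis);
    }

    public static void main(String[] args) {
        cancelAfter("some-job-id", 60 * 60 * 1000L); // cancel after one hour
    }
}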
UPDATE
@RoshanFernando correctly noted in the comments that there's actually an SDK method to cancel a running pipeline: PipelineResult.cancel().
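A minimal sketch of that approach, assuming the Beam Java SDK with the Dataflow runner (the subscription name and the one-hour interval are placeholders):

import java.io.IOException;
import java.util.Timer;
import java.util.TimerTask;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class SelfCancellingPipeline {
    public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        pipeline.apply(PubsubIO.readMessages().fromSubscription("some-subscription"));
        // ... rest of the pipeline ...

        PipelineResult result = pipeline.run();

        // Cancel the job after the configured interval.
        new Timer().schedule(new TimerTask() {
            @Override
            public void run() {
                try {
                    result.cancel();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }, 60 * 60 * 1000L); // one hour

        result.waitUntilFinish();
    }
}

Note that cancel() here runs in the process that launched the pipeline, so that process has to stay alive for the whole interval.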