How to fail a (cron) job after a certain number of

2020-04-02 07:41发布

问题:

We have a Kubernetes cluster of web scraping cron jobs set up. All seems to go well until a cron job starts to fail (e.g., when a site structure changes and our scraper no longer works). It looks like every now and then a few failing cron jobs will continue to retry to the point it brings down our cluster. Running kubectl get cronjobs (prior to a cluster failure) will show too many jobs running for a failing job.

I've attempted following the note described here regarding a known issue with the pod backoff failure policy; however, that does not seem to work.

Here is our config for reference:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: scrape-al
spec:
  schedule: '*/15 * * * *'
  concurrencyPolicy: Allow
  failedJobsHistoryLimit: 0
  successfulJobsHistoryLimit: 0
  jobTemplate:
    metadata:
      labels:
        app: scrape
        scrape: al
    spec:
      template:
        spec:
          containers:
            - name: scrape-al
              image: 'govhawk/openstates:1.3.1-beta'
              command:
                - /opt/openstates/openstates/pupa-scrape.sh
              args:
                - al bills --scrape
          restartPolicy: Never
      backoffLimit: 3

Ideally we would prefer that a cron job would be terminated after N retries (e.g., something like kubectl delete cronjob my-cron-job after my-cron-job has failed 5 times). Any ideas or suggestions would be much appreciated. Thanks!

回答1:

You can tell your Job to stop retrying using backoffLimit.

Specifies the number of retries before marking this job failed.

In your case

spec:
  template:
    spec:
      containers:
        - name: scrape-al
          image: 'govhawk/openstates:1.3.1-beta'
          command:
            - /opt/openstates/openstates/pupa-scrape.sh
          args:
            - al bills --scrape
      restartPolicy: Never
  backoffLimit: 3

You set 3 asbackoffLimit of your Job. That means when a Job is created by CronJob, It will retry 3 times if fails. This controls Job, not CronJob

When Job is failed, another Job will be created again as your scheduled period.

You want: If I am not wrong, you want to stop scheduling new Job, when your scheduled Jobs are failed for 5 times. Right?

Answer: In that case, this is not possible automatically.

Possible solution: You need to suspend CronJob so than it stop scheduling new Job.

Suspend: true

You can do this manually. If you do not want to do this manually, you need to setup a watcher, that will watch your CronJob status, and will update CronJob to suspend if necessary.



标签: kubernetes