So I have a feature in a Django Elastic Beanstalk app that works like so:
- Download a file
- Parse the file and make some API calls with the data from the file
- Update the database of the EB instance with the new data
In testing, I just set up a local cron job that called wget on a specific URL of my Django application, which runs the command.
My problem is how to handle this in a multi-instance Elastic Beanstalk application. Only one instance of my EB application should run this command. I want to avoid race conditions on the database and redundant calls to external APIs from multiple instances, i.e. only one instance should be writing to the database.
However, Googling around shows that setting up cron jobs is awkward, particularly if you're new to EB like I am. The most promising-sounding method seems to be the cron.yaml method, but there does not seem to be an example of setting up a cron worker environment anywhere on the web, from what I can see.
My understanding is:
- You include a cron.yaml file in the root of your EB project.
- Deploy the project
- The cron jobs are automatically set up in a worker environment (?).
- The command you defined is run at the specified time(s).
My question is: how do you make sure that only one instance will run this command? Do I have the right idea of how cron.yaml works, or is there something I'm missing?
Only one instance will run the command because the cron job does not actually run in a cron daemon per se.
There are a few concepts that might help you quickly grok Amazon's Elastic Beanstalk mindset.
A message in the queue is picked up only once by one of the instances in the worker environment at a time.
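To make the "picked up only once" idea concrete, here is a toy sketch in plain Python. It is not the SQS or Elastic Beanstalk API; it just assumes a shared in-process queue and a few threads standing in for worker instances. The names run_workers and handled are made up for illustration.

```python
import queue
import threading


def run_workers(messages, n_workers=3):
    """Toy model: n_workers threads compete for messages on one shared queue.

    Returns a list of (worker_id, message) pairs, one per handled message.
    """
    q = queue.Queue()
    for message in messages:
        q.put(message)

    handled = []
    lock = threading.Lock()

    def worker(worker_id):
        while True:
            try:
                # Taking a message removes it from the queue, so no
                # other worker can receive the same one.
                message = q.get_nowait()
            except queue.Empty:
                return
            with lock:
                handled.append((worker_id, message))

    threads = [
        threading.Thread(target=worker, args=(i,)) for i in range(n_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return handled


if __name__ == "__main__":
    # One scheduled message, three competing workers: one handler runs.
    print(run_workers(["poll"]))
```

Whichever worker wins is not deterministic, but each message is handled by exactly one of them, which is the property that matters for the scheduled job.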
Now the cron.yaml file actually just tells the leader to create a message in the queue with special attributes, at the times specified in the schedule. When this message is then picked up from the queue, it's dispatched to one instance only, as a POST request to the specified URL.
When I use Django in a worker environment I create a cron app with views that map to the action I want. For example, if I wanted to periodically poll a Facebook endpoint I might have a path /cron/facebook/poll/ which calls a poll_facebook() function in views.py.
That way if I have a cron.yaml as follows, it'll poll Facebook once every hour: