At unpredictable times (on user request) I need to run a memory-intensive job. For this I launch a spot or on-demand instance and tag it as non_idle. When the job is done (which may take hours), I tag it as idle. Because of the hourly billing model of AWS, I want to keep that instance alive until just before another billable hour is incurred, in case another job comes in. If a job comes in, the instance should be reused and marked as non_idle again. If no job comes in during that time, the instance should terminate.
Does AWS offer a ready-made solution for this? As far as I know, CloudWatch can't set alarms that fire at a specific time, never mind using CPUUtilization or the instance's tags. Otherwise, perhaps I could simply set up, for every created instance, a Java timer or Scala actor that runs every hour after the instance is created and checks for the tag idle.
There is no readily available AWS solution for this fine-grained optimization, but you can indeed use the existing building blocks to build your own, based on the launch time of the current instance (see Dmitriy Samovskiy's smart solution for deducing How Long Ago Was This EC2 Instance Started?).
Playing 'Chicken'
Shlomo Swidler has explored this optimization in his article Play “Chicken” with Spot Instances, albeit with a slightly different motivation in the context of Amazon EC2 Spot Instances:
AWS Spot Instances have an interesting economic characteristic that
make it possible to game the system a little. Like all EC2 instances,
when you initiate termination of a Spot Instance then you incur a
charge for the entire hour, even if you’ve used less than a full hour.
But, when AWS terminates the instance due to the spot price exceeding
the bid price, you do not pay for the current hour.
The mechanics are the same of course, so you might be able to simply reuse the script he assembled, i.e. execute this script instead of or in addition to tagging the instance as idle:
#! /bin/bash
t=/tmp/ec2.running.seconds.$$
if wget -q -O $t http://169.254.169.254/latest/meta-data/local-ipv4 ; then
    # add 60 seconds artificially as a safety margin
    let runningSecs=$(( `date +%s` - `date -r $t +%s` ))+60
    rm -f $t
    let runningSecsThisHour=$runningSecs%3600
    let runningMinsThisHour=$runningSecsThisHour/60
    let leftMins=60-$runningMinsThisHour
    # start shutdown one minute earlier than actually required
    let shutdownDelayMins=$leftMins-1
    # compare numerically; inside [[ ]] the > and < operators compare strings,
    # so e.g. 7 would not count as less than 60
    if (( shutdownDelayMins > 1 && shutdownDelayMins < 60 )); then
        echo "Shutting down in $shutdownDelayMins mins."
        # TODO: Notify off-instance listener that the game of chicken has begun
        sudo shutdown -h +$shutdownDelayMins
    else
        echo "Shutting down now."
        sudo shutdown -h now
    fi
    exit 0
fi
echo "Failed to determine remaining minutes in this billable hour. Terminating now."
sudo shutdown -h now
exit 1
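To make the arithmetic easier to check without touching a real instance, here is a minimal sketch that isolates it in a function. The name shutdown_delay_mins is my own and not part of Shlomo's script; it takes the instance's running time in seconds and prints the delay the script would pass to shutdown.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): same arithmetic,
# but with the running time passed in as an argument so it can be tested.
shutdown_delay_mins() {
    local running_secs=$(( $1 + 60 ))                # 60 s safety margin
    local secs_this_hour=$(( running_secs % 3600 ))  # position within the current billable hour
    local mins_this_hour=$(( secs_this_hour / 60 ))
    local left_mins=$(( 60 - mins_this_hour ))
    echo $(( left_mins - 1 ))                        # shut down one minute early
}
```

For example, shutdown_delay_mins 6000 (an instance that has been up for 100 minutes) prints 18: with the safety margin it is 41 minutes into its second hour, 19 minutes remain, and one more minute is subtracted. A result of 1 or less corresponds to the "Shutting down now." branch.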
Once a job comes in, you could then cancel the scheduled termination instead of or in addition to tagging the instance with non_idle, as follows:
sudo shutdown -c
This is also the 'red button' emergency command during testing/operation; see e.g. Shlomo's warning:
Make sure you really understand what this script does before you use
it. If you mistakenly schedule an instance to be shut down you can
cancel it with this command, run on the instance: sudo shutdown -c
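Putting the two halves together, a job wrapper could cancel any pending shutdown, run the job, and then start the game of chicken again. This is only a sketch under assumptions: run_job, the SHUTDOWN variable and the CHICKEN_SCRIPT path are hypothetical names of mine, and the script is assumed to be installed at that path.

```shell
#!/bin/bash
# Hypothetical knobs: override them to test without root or a real instance.
SHUTDOWN="${SHUTDOWN:-sudo shutdown}"
CHICKEN_SCRIPT="${CHICKEN_SCRIPT:-/usr/local/bin/play-chicken.sh}"  # assumed install path

run_job() {
    # A job arrived: cancel a pending shutdown; ignore the error if none is scheduled.
    $SHUTDOWN -c || true
    # Run the job command given as arguments and remember its exit status.
    "$@"
    local status=$?
    # Job finished: schedule termination for the end of the current billable hour.
    "$CHICKEN_SCRIPT"
    return $status
}
```

You would call it as run_job followed by your actual job command. Note that this does not protect against a job arriving in the last minute, when the shutdown may already be in progress, so the dispatcher should still be prepared to launch a fresh instance.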
Adding CloudWatch to the game
You could take Shlomo's self-contained approach even further by integrating with Amazon CloudWatch, which recently added an option to Use Amazon CloudWatch to Detect and Shut Down Unused Amazon EC2 Instances; see the introductory blog post Amazon CloudWatch - Alarm Actions for details:
Today we are giving you the ability to stop or terminate your EC2
instances when a CloudWatch alarm is triggered. You can use this as a
failsafe (detect an abnormal condition and then act) or as part of
your application's processing logic (await an expected condition and
then act). [emphasis mine]
Your use case is covered specifically in the section Application Integration:
You can also create CloudWatch alarms based on Custom Metrics that you
observe on an instance-by-instance basis. You could, for example,
measure calls to your own web service APIs, page requests, or message
postings per minute, and respond as desired.
So you could leverage this new functionality by Publishing Custom Metrics to CloudWatch to indicate whether an instance should terminate (is idle), based on Dmitriy's launch time detection, and reset the metric once a job comes in and the instance should keep running (is non_idle). That way EC2 would take care of the termination: 2 out of 3 automation steps would move from the instance into the operations environment, and management and visibility of the automation process would improve accordingly.
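As a rough sketch of how that could look, assuming the current AWS CLI (aws cloudwatch put-metric-data and put-metric-alarm) rather than whatever tooling you already use: the namespace Custom/JobRunner, the metric name Idle, the function names and the alarm settings below are all illustrative choices of mine, not anything prescribed by AWS or the blog post.

```shell
#!/bin/bash
AWS="${AWS:-aws}"               # overridable so the commands can be dry-run
NAMESPACE="Custom/JobRunner"    # hypothetical namespace
METRIC="Idle"                   # hypothetical metric: 1 = should terminate, 0 = keep running

# Publish the idle flag for one instance (call with 0 as soon as a job comes in).
publish_idle() {
    local instance_id=$1 value=$2
    $AWS cloudwatch put-metric-data \
        --namespace "$NAMESPACE" \
        --metric-name "$METRIC" \
        --dimensions "InstanceId=$instance_id" \
        --value "$value"
}

# One-time setup per instance: terminate it when the flag reaches 1.
create_terminate_alarm() {
    local instance_id=$1 region=$2
    $AWS cloudwatch put-metric-alarm \
        --alarm-name "terminate-idle-$instance_id" \
        --namespace "$NAMESPACE" \
        --metric-name "$METRIC" \
        --dimensions "Name=InstanceId,Value=$instance_id" \
        --statistic Maximum \
        --period 60 \
        --evaluation-periods 1 \
        --threshold 1 \
        --comparison-operator GreaterThanOrEqualToThreshold \
        --alarm-actions "arn:aws:automate:$region:ec2:terminate"
}
```

The instance would then publish 1 only once it is idle and close to the end of its billable hour, and 0 otherwise. Alarm evaluation is not instantaneous, so the period and the safety margin would need tuning before relying on this.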