At unpredictable times (user request) I need to run a memory-intensive job. For this I get a spot or on-demand instance and mark it with a tag as non_idle. When the job is done (which may take hours), I give it the tag idle. Due to the hourly billing model of AWS, I want to keep that instance alive until another billable hour is incurred, in case another job comes in. If a job comes in, the instance should be reused and marked as non_idle. If no job comes in during that time, the instance should terminate.
Does AWS offer a ready solution for this? As far as I know, CloudWatch can't set alarms that should run at a specific time, never mind using the CPUUtilization or the instance's tags. Otherwise, perhaps I could simply set up, for every created instance, a Java timer or Scala actor that runs every hour after the instance is created and checks for the tag idle.
There is no readily available AWS solution for this fine-grained optimization, but you can indeed use the existing building blocks to build your own based on the launch time of the current instance (see Dmitriy Samovskiy's smart solution for deducing How Long Ago Was This EC2 Instance Started?).
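To make the launch-time idea concrete, here is a minimal sketch of the arithmetic involved, assuming you already have the instance's launch time from somewhere (for example via Dmitriy's approach). The function name and the one-hour constant are illustrative assumptions, not part of any AWS API.

```python
from datetime import datetime, timedelta, timezone

# Assumption: instances are billed in whole hours counted from launch time.
BILLING_PERIOD = timedelta(hours=1)


def time_left_in_billable_hour(launch_time: datetime, now: datetime) -> timedelta:
    """Return how long remains until the next billable hour begins.

    Hypothetical helper: both arguments must be comparable datetimes
    (e.g. both timezone-aware UTC).
    """
    elapsed = now - launch_time
    if elapsed < timedelta(0):
        raise ValueError("now is earlier than launch_time")
    # Time already consumed within the current billable hour.
    used = elapsed % BILLING_PERIOD
    return BILLING_PERIOD - used


launch = datetime(2012, 1, 1, 10, 15, tzinfo=timezone.utc)
now = datetime(2012, 1, 1, 12, 0, tzinfo=timezone.utc)
remaining = time_left_in_billable_hour(launch, now)
```

In this example the instance has run for 1 hour 45 minutes, so 15 minutes of the already paid hour remain. An idle instance could be kept around for roughly that long before terminating.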
Playing 'Chicken'
Shlomo Swidler has explored this optimization in his article Play “Chicken” with Spot Instances, albeit with a slightly different motivation in the context of Amazon EC2 Spot Instances:
The mechanics are the same of course, so you might be able to simply reuse the script he assembled, i.e. execute his script instead of or in addition to tagging the instance as idle.
Once a job comes in, you could then cancel the scheduled termination instead of or in addition to tagging the instance with non_idle. Cancelling the scheduled shutdown is also the 'red button' emergency command during testing/operation; see e.g. the warning in Shlomo's article.
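As a rough sketch of the schedule-and-cancel mechanics described above (this is not Shlomo's script), the following assumes a Linux instance where the standard shutdown command is available via sudo. The helper names and the two-minute safety margin are assumptions for illustration. Whether halting the machine terminates or merely stops the instance depends on the instance's shutdown behavior setting.

```python
import subprocess
from typing import List

# Assumption: shut down a little early so the next billable hour never starts.
SAFETY_MARGIN_MINUTES = 2


def shutdown_delay_minutes(minutes_left_in_hour: int) -> int:
    """Minutes to wait before halting, never negative."""
    return max(0, minutes_left_in_hour - SAFETY_MARGIN_MINUTES)


def schedule_command(delay_minutes: int) -> List[str]:
    """Build the command that halts the machine after delay_minutes."""
    if delay_minutes < 0:
        raise ValueError("delay_minutes must not be negative")
    return ["sudo", "shutdown", "-h", f"+{delay_minutes}"]


def cancel_command() -> List[str]:
    """Build the command that cancels a pending shutdown."""
    return ["sudo", "shutdown", "-c"]


def run(command: List[str]) -> None:
    """Execute a command built above; raises if it exits non-zero."""
    subprocess.run(command, check=True)
```

A job runner might call run(schedule_command(shutdown_delay_minutes(15))) when tagging the instance as idle, and run(cancel_command()) when a new job arrives and the instance is tagged non_idle again.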
Adding CloudWatch to the game
You could take Shlomo's self-contained approach even further by integrating with Amazon CloudWatch, which recently added an option to Use Amazon CloudWatch to Detect and Shut Down Unused Amazon EC2 Instances, see the introductory blog post Amazon CloudWatch - Alarm Actions for details:
Your use case is listed in section Application Integration specifically:
So you could leverage this new functionality by Publishing Custom Metrics to CloudWatch to indicate whether an instance should terminate (is idle), based on Dmitriy's launch time detection, and reset the metric again once a job comes in and the instance should keep running (is non_idle). That way EC2 would take care of the termination: 2 out of 3 automation steps would have been moved from the instance into the operations environment, and management and visibility of the automation process would improve accordingly.
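A minimal sketch of the metric-publishing side might look like the following. It assumes the third-party boto3 library is installed and AWS credentials are configured. The namespace, metric name, and helper names are hypothetical choices. The alarm that watches this metric and triggers the terminate action still has to be configured separately and is not shown.

```python
from typing import Any, Dict

# Hypothetical names; pick whatever fits your environment.
NAMESPACE = "Custom/JobRunner"
METRIC_NAME = "Idle"


def idle_metric(instance_id: str, idle: bool) -> Dict[str, Any]:
    """Build one metric datum: 1.0 while idle, 0.0 while a job is running."""
    return {
        "MetricName": METRIC_NAME,
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Value": 1.0 if idle else 0.0,
        "Unit": "Count",
    }


def publish_idle_state(instance_id: str, idle: bool, region: str) -> None:
    """Send the datum to CloudWatch as a custom metric."""
    import boto3  # third-party; imported lazily so the builder above works without it

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    cloudwatch.put_metric_data(
        Namespace=NAMESPACE,
        MetricData=[idle_metric(instance_id, idle)],
    )
```

The job runner would call publish_idle_state(instance_id, True, region) when a job finishes and publish_idle_state(instance_id, False, region) when a new job is picked up, leaving the actual termination decision to the alarm.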