Spot instances are 1.5-4x more cost-efficient than on-demand instances, so the answer seems obvious: use them whenever you can. In practice it can be tricky, and a poorly chosen setup can cost you more than you expected.
Somebody may ask: why are we discussing a $10 difference for a CI/CD job? Well, $10 for a single run of a daily job adds up to $300 per month. According to a recent survey we conducted at our company, 80% of respondents said their companies run more than 50 jobs. Let’s assume that only 20% of those jobs (10 out of 50) consume cloud resources and run daily: that gives about $36,000 in cost savings per year.
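The arithmetic behind that estimate can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the figures come from the paragraph above, while the 30-day month and the function and constant names are assumptions made for illustration.

```python
# Back-of-the-envelope estimate using the figures from the text.
COST_DIFF_PER_RUN = 10     # dollars saved per run (from the text)
RUNS_PER_MONTH = 30        # one run per day, assuming a 30-day month
TOTAL_JOBS = 50            # lower bound reported by most survey respondents
CLOUD_DAILY_SHARE = 0.20   # share of jobs that use cloud resources daily


def yearly_savings(cost_diff, runs_per_month, total_jobs, share):
    """Return the estimated yearly savings in dollars."""
    monthly_per_job = cost_diff * runs_per_month   # $300 per job per month
    eligible_jobs = round(total_jobs * share)      # 10 jobs out of 50
    return monthly_per_job * eligible_jobs * 12


print(yearly_savings(COST_DIFF_PER_RUN, RUNS_PER_MONTH, TOTAL_JOBS, CLOUD_DAILY_SHARE))
# prints 36000
```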
What is a spot instance and how to use it?
Spot instances are short-lived instances that public clouds offer when they have spare capacity and take back when that capacity is needed by customers willing to pay the on-demand price. The cloud sends a short advance notice before it terminates your instance (the lead time depends on the provider and ranges from about 30 seconds to a few minutes), so you can either shut your task down gracefully or move it to other nodes. Because clouds are selling spare capacity and there is no guarantee that you can keep the instances for the whole desired period, they cost several times less than regular machines.
When to use spot instances?
If your applications are stateless and run in containers, or you can easily handle the case when some nodes go down or are taken away, you are doing a great job and should use spot instances all the time to save on cloud costs.
If you are not running in containers, or some parts of your application are stateful, spot instances are still worth trying, but you must be prepared for the termination notification and handle it properly.
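One way to handle the notification in a stateful job is to check for it between units of work and save progress before the instance disappears. The sketch below is a minimal illustration under stated assumptions, not a definitive implementation: how the notice is detected is cloud-specific, so it is abstracted as a hypothetical `notice_received` callable, and `handle` and `save_checkpoint` are likewise placeholder names for your own logic.

```python
from typing import Callable, Iterable, List


def process_with_checkpoints(
    items: Iterable[str],
    handle: Callable[[str], None],
    notice_received: Callable[[], bool],
    save_checkpoint: Callable[[List[str]], None],
) -> bool:
    """Process items one by one and stop cleanly on a termination notice.

    Returns True if every item was processed, False if the run was interrupted.
    """
    done: List[str] = []
    for item in items:
        # Check before starting each unit of work, so that a unit
        # is never abandoned halfway through.
        if notice_received():
            save_checkpoint(done)   # persist progress for a later restart
            return False
        handle(item)
        done.append(item)
    save_checkpoint(done)
    return True


# Simulated run: the (hypothetical) notice arrives before the third item.
checks = iter([False, False, True])
saved: List[List[str]] = []
finished = process_with_checkpoints(
    ["a", "b", "c"],
    handle=lambda item: None,
    notice_received=lambda: next(checks),
    save_checkpoint=saved.append,
)
print(finished, saved)  # prints: False [['a', 'b']]
```

Each unit of work should be short compared with the notice period of your cloud, otherwise the instance may be gone before the next check runs.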
Here are some tips on when and where to use spot instances:
If your job must complete with a 100% success rate or by a fixed deadline, don’t use spot instances: there is always a chance that the job fails and has to be restarted on either on-demand or spot instances. There is no guarantee that any given spot instance won’t be terminated.
Limit your use of spot instances during hot periods like Black Friday and Christmas, or force majeure events like COVID-19, as there is a high chance that public cloud regions will be overloaded and your spot instances revoked.
Prepare a fallback in case some of the spot instances are terminated. You can either restart the whole job on spot or on-demand instances, or let it complete with some failed steps.
If your job or task runs for less than 10 hours, you should be OK using spot instances, keeping the first tip above in mind. If you run a job for 24+ hours, the chance of termination is very high.
Night hours and weekend daytime, in the time zone where the cloud region is located, are the best slots for spot instances, as cloud utilization drops and the chance of termination is significantly lower.
Try to use ‘General Purpose’ instance types for the spot instances. Public cloud regions are built in a way where they have the biggest capacity of the general purpose compute nodes (that’s why they are called ‘general purpose’) and a limited capacity of instances with special configurations like NVMe SSD, GPU, high compute or memory nodes.
Public clouds always reserve some capacity for on-demand instances, so even if there are spare NVMe machines you may not get them or your instances will be revoked.
Prefer launching a larger number of small ‘General Purpose’ instances over a few large, fully packed ones. It is much easier for a cloud to fit small instances onto compute nodes, and an instance is less likely to be terminated if the cloud can rebalance it between nodes.
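Several of the tips above (no spot for deadline-critical jobs, caution in peak periods, the 10-hour guideline, and a prepared fallback) can be combined into a small decision-and-retry helper. The following is a simplified sketch, not a real scheduler: `Job`, `prefer_spot`, `run_with_fallback`, and `SpotInterrupted` are hypothetical names, the runner callables stand in for whatever launches your job, and treating a peak period as a hard "no" is a deliberate simplification of "limit your use".

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Job:
    expected_hours: float
    has_hard_deadline: bool   # must not fail or must finish by a fixed time
    peak_period: bool         # e.g. Black Friday or Christmas

MAX_SPOT_HOURS = 10  # guideline taken from the tips above


class SpotInterrupted(Exception):
    """Raised by the hypothetical spot runner when an instance is revoked."""


def prefer_spot(job: Job) -> bool:
    """Heuristic: use spot only for short, non-critical, off-peak jobs."""
    if job.has_hard_deadline or job.peak_period:
        return False
    return job.expected_hours < MAX_SPOT_HOURS


def run_with_fallback(
    job: Job,
    run_on_spot: Callable[[Job], None],
    run_on_demand: Callable[[Job], None],
    max_spot_attempts: int = 2,
) -> str:
    """Try spot a limited number of times, then fall back to on-demand."""
    if prefer_spot(job):
        for _ in range(max_spot_attempts):
            try:
                run_on_spot(job)
                return "spot"
            except SpotInterrupted:
                continue   # restart the whole job on a new spot instance
    run_on_demand(job)
    return "on-demand"


def always_interrupted(job: Job) -> None:
    raise SpotInterrupted()


nightly = Job(expected_hours=2, has_hard_deadline=False, peak_period=False)
print(run_with_fallback(nightly, lambda j: None, lambda j: None))         # prints: spot
print(run_with_fallback(nightly, always_interrupted, lambda j: None))     # prints: on-demand
```

Capping the number of spot attempts keeps the worst-case cost bounded: after repeated revocations the job pays the on-demand price once instead of restarting indefinitely.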