Problem description: storage classes in Amazon S3
Amazon S3 (Simple Storage Service) provides a convenient and affordable way to store your data as objects in a global, scalable, performant, highly available and durable (up to 11 nines of durability) manner. Nowadays, almost every R&D company uses object storage for different purposes – from small artifacts, such as software packages and documents, to the multi-GB golden images generated by CI/CD runs or the millions of objects in multi-PB data lakes used for ML/AI training.
Amazon S3 is a convenient shared storage for different services in your infrastructure; therefore, its cost-saving options are valuable for every organization.
S3, as object storage, is much more affordable than EC2’s EBS volumes (block storage). If we compare the most commonly used basic storage classes of S3 and EBS, we’ll find that S3 is more than 3x more cost-effective than EBS – e.g. $0.023 per GB-month for S3 Standard vs $0.08 per GB-month for EBS General Purpose SSD storage. However, unlike EBS, where you explicitly reserve the amount of data you are going to store by creating a volume of a fixed size, S3 doesn’t limit how much data you can put into the storage. In practice, even small organizations usually accumulate more than 50 TB of object storage after 2-3 years of development, so the question of cost savings on object storage becomes more and more significant for everyone.
Amazon S3 provides different storage classes for different usage patterns. Most of the saving options rely on explicitly choosing a storage class that matches your GET access pattern. If your data is read often, the Standard storage class is the right choice (it is applied by default, so most probably it is what you already use), while you can save if you clearly identify objects that don’t need to be accessed frequently, or if you are ready to wait for some time (up to 12 hours) to retrieve your data from cold storage. See the Amazon S3 pricing page for all the available options.
While you can specify a storage class to match your usage pattern, doing so requires an explicit action on your part, as well as careful planning and research, especially if you want to optimize savings for existing data.
That’s why I recommend starting with the Intelligent-Tiering storage class for your existing S3 storage and then proceeding with more advanced techniques for new use cases.
Intelligent-Tiering lets S3 detect objects that are not accessed frequently and change their tier automatically, moving them into the Infrequent Access tier (about 50% cheaper) or even into the Archive or Deep Archive Access tiers (about 80% and 95% savings respectively). However, as the Archive and Deep Archive Access tiers don’t provide millisecond access times, limiting Intelligent-Tiering to the Infrequent Access tier is the easiest and safest way to save on object storage.
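Note that the Archive and Deep Archive Access tiers only activate if you explicitly opt in with a bucket-level Intelligent-Tiering archive configuration; without one, objects only move between the Frequent and Infrequent Access tiers, which is exactly the safe behavior described above. For completeness, here is a minimal boto3 sketch of such an opt-in (the bucket name, configuration ID and day thresholds are illustrative assumptions, not recommendations):

```python
import boto3

s3 = boto3.client("s3")

# Opt this bucket's Intelligent-Tiering objects into the archive tiers.
# Skip this step entirely if you only want the Infrequent Access tier.
s3.put_bucket_intelligent_tiering_configuration(
    Bucket="my-artifacts-bucket",          # hypothetical bucket name
    Id="archive-cold-objects",             # hypothetical configuration ID
    IntelligentTieringConfiguration={
        "Id": "archive-cold-objects",
        "Status": "Enabled",
        "Tierings": [
            # Objects untouched for 90+ days move to Archive Access,
            # for 180+ days to Deep Archive Access (both lose millisecond access).
            {"Days": 90, "AccessTier": "ARCHIVE_ACCESS"},
            {"Days": 180, "AccessTier": "DEEP_ARCHIVE_ACCESS"},
        ],
    },
)
```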
When should I use Intelligent-Tiering
Generally, I recommend setting Intelligent-Tiering as the default for all data you put into object storage, except data for which you have clearly identified the usage pattern and have external services that consume the data and are designed for that specific pattern.
The most common cases when objects are not frequently accessed:
- software artifacts from CI/CD jobs
- logs history
- long-term backups of your infrastructure (those which still need to have immediate RTO)
- raw data for your ML/AI services that has already been processed for them (e.g. via AWS Glue).
How to set up Intelligent-Tiering in the most efficient way
You can set the Intelligent-Tiering storage class on objects while performing PUT/POST operations or initiating a multipart upload. However, this obviously is not applicable to existing data.
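For example, here is a minimal boto3 sketch of uploading a new object directly into Intelligent-Tiering (the bucket and key names are hypothetical):

```python
import boto3

s3 = boto3.client("s3")

# Upload a new object and place it straight into Intelligent-Tiering.
with open("app-1.2.3.tar.gz", "rb") as body:
    s3.put_object(
        Bucket="my-artifacts-bucket",        # hypothetical bucket name
        Key="builds/app-1.2.3.tar.gz",       # hypothetical key
        Body=body,
        StorageClass="INTELLIGENT_TIERING",  # instead of the default STANDARD
    )
```

The same `StorageClass` parameter is accepted by `create_multipart_upload` for large objects.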
The most obvious way to apply Intelligent-Tiering to existing data is to change the storage class via the S3 console: select the objects and navigate to Actions -> Edit storage class. However, you either have to perform this manually or use a PUT-Copy (CopyObject) API call. Changing the storage class this way also has a side effect – the objects are copied, so they become new objects with a new last-modified date and updated metadata; you will also need to clean up the old object copies yourself if you use the API calls.
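If you do go the API route, an in-place copy looks roughly like the boto3 sketch below (bucket and key names are hypothetical):

```python
import boto3

s3 = boto3.client("s3")

bucket, key = "my-artifacts-bucket", "builds/app-1.2.3.tar.gz"  # hypothetical names

# Copy the object onto itself with a new storage class. The result is a new
# object (new last-modified date); on a versioned bucket the old version stays
# behind and has to be cleaned up separately. Note that CopyObject handles
# objects up to 5 GB only; larger objects need a multipart copy.
s3.copy_object(
    Bucket=bucket,
    Key=key,
    CopySource={"Bucket": bucket, "Key": key},
    StorageClass="INTELLIGENT_TIERING",
    MetadataDirective="COPY",  # keep the existing user metadata
)
```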
The most effective way to apply Intelligent-Tiering to existing and new objects is to set up an S3 Lifecycle configuration rule.
To do that, open the S3 console and select the bucket you want to configure.
Then:
- Navigate to the Management tab
- Click the “Create lifecycle rule” button in the Lifecycle rules section
- Name your rule, e.g. “Transition to Intelligent-Tiering”
- You can limit the scope of this rule by object prefix (part of the path to the target objects inside the bucket) or object tags, or apply it to all objects in the bucket
- Check “Transition current versions of objects between storage classes” and “Transition previous versions of objects between storage classes” in Lifecycle rule actions section
- Set the target storage class to “Intelligent-Tiering” and a time threshold (e.g. 30 days) for both transitions
- Click “Create rule”
The rule is created and will be applied daily to the scope you specified.
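If you prefer to script this instead of clicking through the console, the same rule can be expressed with boto3 roughly as follows (the bucket name and the 30-day threshold are illustrative assumptions):

```python
import boto3

s3 = boto3.client("s3")

# Caution: this call replaces the bucket's entire lifecycle configuration,
# so include any rules you already have in the same Rules list.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-artifacts-bucket",  # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "Transition to Intelligent-Tiering",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix = all objects in the bucket
                # Current versions move to Intelligent-Tiering 30 days after creation.
                "Transitions": [
                    {"Days": 30, "StorageClass": "INTELLIGENT_TIERING"}
                ],
                # Previous (noncurrent) versions move 30 days after becoming noncurrent.
                "NoncurrentVersionTransitions": [
                    {"NoncurrentDays": 30, "StorageClass": "INTELLIGENT_TIERING"}
                ],
            }
        ]
    },
)
```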
Congratulations, you’ve just improved your S3 usage and started to save on the object storage!
Max Bozhenko, FinOps enthusiast and practitioner, CTO at Hystax