
Enhancing storage efficiency and reducing storage costs with s3-batch-object-store

Embrace reduced storage costs by 70% by creating an open source module for efficiently storing and retrieving objects from a single file in S3.

At Embrace, we constantly strive to improve our infrastructure to handle the massive amount of data we process daily. As a mobile app observability platform, we found that one of our most significant challenges was managing the storage and retrieval of large objects, in particular sessions, which are our largest objects. For context, our infrastructure processes tens of millions of user sessions per hour, with individual sessions ranging from 30 KB to 80 KB in size.

Our previous solution used Cassandra to store these objects and fetch them by ID when needed, but this approach led to several issues, prompting us to seek a more efficient method. The result was a new component that lets us store multiple objects in a single S3 file, drastically reducing costs and simplifying our infrastructure.

Given that this solution has worked well for our needs and can be applied to many large-object cloud storage use cases, we decided to open source the module and share it with the community, so that anyone can use it!

TL;DR: We reduced storage costs by 70% with our new approach. You can head to our s3-batch-object-store GitHub repository to check out the module and try it out. Let us know what you think, as we welcome feedback, issues, and pull requests to continue improving the module.

The problem

Initially, we stored large objects as JSON blobs in Cassandra. This method quickly became inefficient for several reasons:

  1. Infrastructure costs: Our setup required substantial EBS volumes to accommodate Cassandra’s storage needs, leading to high costs and complicated backup strategies. Given that we only read about 0.01% of the stored sessions and that most of our costs were just EBS volumes for Cassandra, this solution was not cost-efficient.
  2. Frequent compactions: Cassandra’s need to frequently compact large amounts of data degraded its performance. Compactions are resource-intensive processes that clean up and optimize the data storage, and our large blobs were exacerbating the frequency and duration of these operations.
  3. Write-heavy payloads, low read rate: Simply moving these objects to S3 as-is would be even more expensive, as S3 charges for three things: PUT operations, GET operations, and storage size. In our scenario, storing each session as its own file would have required so many PUT operations that it would have cost even more than Cassandra itself.

The solution

We developed a module to address these issues effectively, and here is how it works:

  • The module lets you create a temp file on local disk and append objects to it.
  • The module stores information about each individual object in the file, including an offset and a length that represent the byte range for that particular object.
  • The module allows you to read a single object without having to read the entire file.
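
Conceptually, the pattern looks roughly like the Go sketch below. The type and function names are illustrative, not the module's actual API; the key idea is that every append returns the offset and length you later need to read just that object back.

```go
package main

import (
	"fmt"
	"os"
)

// ObjectIndex records where one object lives inside a batch file.
// These names are illustrative, not the module's actual types.
type ObjectIndex struct {
	FileKey string `json:"file_key"` // S3 key of the batch file
	Offset  int64  `json:"offset"`   // byte offset of this object within the file
	Length  int64  `json:"length"`   // length of this object in bytes
}

// appendObject writes one serialized object to the end of a local temp file
// and returns the index needed to read just that object back later.
func appendObject(f *os.File, fileKey string, data []byte) (ObjectIndex, error) {
	info, err := f.Stat()
	if err != nil {
		return ObjectIndex{}, err
	}
	offset := info.Size() // the new object starts at the current end of the file

	if _, err := f.Write(data); err != nil {
		return ObjectIndex{}, err
	}
	return ObjectIndex{FileKey: fileKey, Offset: offset, Length: int64(len(data))}, nil
}

func main() {
	f, err := os.CreateTemp("", "batch-*.bin")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())

	// Append two objects; each append returns its own index.
	idx1, _ := appendObject(f, "batches/file-0001.bin", []byte("first object"))
	idx2, _ := appendObject(f, "batches/file-0001.bin", []byte("second object"))
	fmt.Printf("%+v\n%+v\n", idx1, idx2)
}
```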

Writing the sessions

In our particular case, we process sessions from a Kafka topic. We now append each session to one of these temp files, first marshaling it as JSON and then compressing it with zstd.
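
That step looks roughly like the snippet below, assuming the klauspost/compress zstd bindings; the specific library, the Session fields, and the encoder level are assumptions for illustration (the post mentions ZSTD(5), and level mappings differ between zstd implementations).

```go
package main

import (
	"encoding/json"
	"fmt"

	"github.com/klauspost/compress/zstd"
)

// Session is a stand-in for our real session payload; field names are hypothetical.
type Session struct {
	ID      string `json:"id"`
	AppID   string `json:"app_id"`
	Payload string `json:"payload"`
}

func main() {
	// Marshal the session as JSON, then compress it with zstd before
	// appending the resulting byte chunk to the current temp file.
	s := Session{ID: "abc123", AppID: "my-app", Payload: "..."}
	raw, err := json.Marshal(s)
	if err != nil {
		panic(err)
	}

	enc, err := zstd.NewWriter(nil, zstd.WithEncoderLevel(zstd.SpeedBetterCompression))
	if err != nil {
		panic(err)
	}
	defer enc.Close()

	compressed := enc.EncodeAll(raw, nil)
	fmt.Printf("raw=%d bytes, compressed=%d bytes\n", len(raw), len(compressed))
	// `compressed` is what gets appended to the temp file.
}
```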

We keep appending sessions until one of the following conditions is reached:

  • Max time: We want to store sessions quickly, so if a file has been open for more than a few seconds, we process it.
  • Max size: We want to avoid very large files, so if a file reaches our size limit, we process it. This is mostly to avoid dealing with a very large file when debugging, and to avoid spending a long time uploading the file to S3.
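
Putting those two conditions together, a minimal sketch of the flush decision might look like this; the types are illustrative, and the thresholds simply mirror the values we mention later in the Metrics section (a few seconds and ~100 MB).

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// batchFile tracks what we need to decide when to flush: when the file was
// opened and how many bytes have been appended so far.
type batchFile struct {
	f        *os.File
	openedAt time.Time
	size     int64
}

// Illustrative thresholds; tune them for your own latency and file-size needs.
const (
	maxAge  = 5 * time.Second
	maxSize = 100 << 20 // 100 MB
)

// shouldFlush reports whether the file should be uploaded to S3 and rotated.
func (b *batchFile) shouldFlush() bool {
	return time.Since(b.openedAt) >= maxAge || b.size >= maxSize
}

func main() {
	f, err := os.CreateTemp("", "batch-*.bin")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())

	b := &batchFile{f: f, openedAt: time.Now()}
	// ... append sessions here, adding each chunk's length to b.size ...

	if b.shouldFlush() {
		fmt.Println("upload the file to S3, store the indexes, and start a new temp file")
	}
}
```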

Once we decide to process a file, we simply use the module to upload it to S3, and then we store the index information for each session in Cassandra. This information is so small in size (just the file name, byte offset, and length) that it could be stored in any simple database.
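
For illustration, inserting one of those index rows with the gocql driver might look like the following sketch; the keyspace, table, and column names are hypothetical, and any small key-value or relational store would work just as well.

```go
package main

import (
	"github.com/gocql/gocql"
)

func main() {
	// Connect to Cassandra; host, keyspace, and table names are hypothetical.
	cluster := gocql.NewCluster("127.0.0.1")
	cluster.Keyspace = "sessions"
	cs, err := cluster.CreateSession()
	if err != nil {
		panic(err)
	}
	defer cs.Close()

	// After uploading the batch file to S3, store one small index row per
	// session: just the S3 key, byte offset, and length.
	err = cs.Query(
		`INSERT INTO session_index (session_id, file_key, byte_offset, length) VALUES (?, ?, ?, ?)`,
		"abc123", "batches/2024/07/01/file-0001.zst", int64(1_048_576), int64(5_632),
	).Exec()
	if err != nil {
		panic(err)
	}
}
```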

In our case we decided to keep Cassandra for storing indexes for two main reasons:

  • We already had the infrastructure in place.
  • Our biggest issues with Cassandra were writing the big blobs, making sure compactions happened correctly, and handling backups efficiently. Now that we only store small index rows, it’s much easier to maintain.

The module also allows you to upload a meta file alongside the data file that includes the indexes of all objects in that file, serialized as JSON. This is useful for debugging, and if for any reason you lose the indexes, you can recover them by reading the meta files.

Reading the sessions

When a client wants to fetch a session, we first get the index information from Cassandra and then fetch the data from S3 using the module. The module uses the Range HTTP header to fetch only that portion of the file from S3, and we then parse the session from the zstd-compressed JSON byte chunk.
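
Under the hood this is a ranged GetObject call. Here is a hedged sketch using the AWS SDK for Go v2 and a zstd decoder; the bucket, key, offset, and length are placeholder values standing in for what the index lookup returns.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"io"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
	"github.com/klauspost/compress/zstd"
)

func main() {
	ctx := context.Background()

	// Index information fetched from Cassandra for the requested session
	// (hypothetical values).
	bucket, key := "my-sessions-bucket", "batches/2024/07/01/file-0001.zst"
	var offset, length int64 = 1_048_576, 5_632

	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		panic(err)
	}
	client := s3.NewFromConfig(cfg)

	// Fetch only the byte range for this session, not the whole batch file.
	out, err := client.GetObject(ctx, &s3.GetObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
		Range:  aws.String(fmt.Sprintf("bytes=%d-%d", offset, offset+length-1)),
	})
	if err != nil {
		panic(err)
	}
	defer out.Body.Close()

	compressed, err := io.ReadAll(out.Body)
	if err != nil {
		panic(err)
	}

	// Decompress the zstd chunk and parse the session JSON.
	dec, err := zstd.NewReader(nil)
	if err != nil {
		panic(err)
	}
	defer dec.Close()

	raw, err := dec.DecodeAll(compressed, nil)
	if err != nil {
		panic(err)
	}

	var session map[string]any
	if err := json.Unmarshal(raw, &session); err != nil {
		panic(err)
	}
	fmt.Println(session["id"])
}
```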

Deleting the sessions

Given that we only store complete sessions for 14 or 30 days (depending on the client’s tier), we set custom tags on the S3 objects so that an AWS lifecycle rule automatically deletes them after the specified time period. That way, we don’t need to expire the data manually.
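
For example, with the AWS SDK for Go v2, the tag can be attached at upload time; a bucket lifecycle rule filtering on that tag then handles expiration. The tag key/value, bucket, and key names below are hypothetical.

```go
package main

import (
	"bytes"
	"context"
	"os"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	ctx := context.Background()

	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		panic(err)
	}
	client := s3.NewFromConfig(cfg)

	data, err := os.ReadFile("/tmp/batch-file.zst")
	if err != nil {
		panic(err)
	}

	// Tag the object at upload time; a lifecycle rule that filters on this
	// tag expires the object automatically after the retention period.
	_, err = client.PutObject(ctx, &s3.PutObjectInput{
		Bucket:  aws.String("my-sessions-bucket"),
		Key:     aws.String("batches/2024/07/01/file-0001.zst"),
		Body:    bytes.NewReader(data),
		Tagging: aws.String("retention=30d"), // hypothetical tag
	})
	if err != nil {
		panic(err)
	}
}
```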

Here’s why this approach is beneficial:

  1. Efficient storage with S3: Instead of storing large objects directly in Cassandra, we now batch multiple objects and store them in a single S3 file. Cassandra only holds the index information, which significantly reduces its storage requirements. This approach minimizes the need for frequent compactions, improving Cassandra’s performance.
  2. Optimized for write-heavy workloads: This solution is ideal for scenarios with many write operations but few reads. In our specific case of handling Embrace sessions, we process tens of millions of sessions hourly, but only about 0.01% of them are queried. By batching many objects into one file, we drastically reduce the PUT operations against S3, optimizing our write efficiency and costs.
  3. Cost savings: Moving the bulk of our data storage to S3 has resulted in substantial cost savings. S3 is more cost-effective than EBS volumes, and we no longer need to manage large-scale backups. The index information, being only a few bytes per object, could eventually be easily stored in a simple relational database.
  4. Reduced number of transactions: As mentioned before, simply switching to S3 as a data store without batching would have been prohibitively expensive, but this approach significantly reduces the number of transactions by batching multiple objects into each file.
  5. Reduced dev time cost: Managing and maintaining Cassandra was costly in terms of developer time, as we had to ensure it was running smoothly, that backups were happening correctly, etc. By switching to S3, we reduced the number of things our team needed to maintain.

Cost comparison

Before making any big change, we had to run some numbers to validate that this new storage approach would actually save us money. We started by estimating S3 costs:

Assume for simplicity that we process 100 million sessions per day and that each session is around 5.5 KB in size after ZSTD(5) compression. Some of these sessions are kept for 14 days and some for 30 days, depending on the client’s tier. For simplicity, we will assume 60% of them are stored for 14 days, and the other 40% for 30 days.

Storage cost

At steady state, we would be storing: 100,000,000 * 0.6 * 5,500 * 14 + 100,000,000 * 0.4 * 5,500 * 30 bytes = ~11 TB

AWS S3 storage cost is $0.022 USD per GB per month ⇒ ~$247 USD per month

PUT cost

Given that we would batch 1,000 sessions on average into each file, that’s 100K files per day. Each file requires two PUT operations – one for the data file and one for the meta file with the index information – for a total of 200K PUT operations per day, or 6M monthly.

PUT cost is $0.005 USD per 1000 requests ⇒ $30 USD per month

GET cost

GET cost is almost negligible as we read less than 0.01% of the sessions, meaning that we would read 10K sessions daily, or 300K monthly.

GET cost is $0.0004 USD per 1000 requests ⇒ $0.12 USD per month
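
For reference, a small back-of-the-envelope calculation using the assumptions above (100M sessions/day, ~5.5 KB each, 1,000 sessions per file, two PUTs per file, 0.01% read rate) reproduces these monthly estimates:

```go
package main

import "fmt"

func main() {
	const (
		sessionsPerDay  = 100_000_000
		bytesPerSession = 5_500
		shortRetention  = 14 // days, 60% of sessions
		longRetention   = 30 // days, 40% of sessions
		sessionsPerFile = 1_000
		daysPerMonth    = 30
	)

	// Steady-state storage: each session occupies space for its retention period.
	storageBytes := float64(sessionsPerDay) * float64(bytesPerSession) *
		(0.6*shortRetention + 0.4*longRetention)
	storageGB := storageBytes / 1e9
	storageCost := storageGB * 0.022 // USD per GB per month

	// Two PUTs per batch file: the data file plus the meta file.
	putsPerMonth := float64(sessionsPerDay) / sessionsPerFile * 2 * daysPerMonth
	putCost := putsPerMonth / 1000 * 0.005

	// We read only about 0.01% of sessions.
	getsPerMonth := float64(sessionsPerDay) * 0.0001 * daysPerMonth
	getCost := getsPerMonth / 1000 * 0.0004

	fmt.Printf("storage: %.0f GB -> $%.0f/month\n", storageGB, storageCost)
	fmt.Printf("PUT:     %.0f requests -> $%.0f/month\n", putsPerMonth, putCost)
	fmt.Printf("GET:     %.0f requests -> $%.2f/month\n", getsPerMonth, getCost)
	fmt.Printf("total S3 -> ~$%.0f/month\n", storageCost+putCost+getCost)
}
```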

[Table: Example of estimated AWS monthly storage costs using the s3-batch-object-store module]

Now, with Cassandra, the cost of the storage would be $0.08 USD per GB per month, and with a replication factor of 3, that would give:

11 TB * 3 (replication) * 1,024 GB/TB * $0.08 = ~$2,700 USD per month

This means that with Cassandra, we would be spending around $2,700 USD per month on EBS storage alone. With these changes, the EBS costs could be reduced to just $350 USD per month, and we could also get some savings by shrinking our Cassandra EC2 instances.

In this example, our total cost to store and retrieve sessions would drop from ~$2,700 to just ~$630, a reduction of roughly 75%. This is just a simple example with 100 million sessions daily, but as volumes scale up, the savings become quite significant. In addition, since this approach scales better, engineering and DevOps don’t need to spend time scaling your storage infrastructure.

Metrics

After we started using our new storage approach, we reduced our storage costs by 70%, which is huge! However, we wanted to validate that we weren’t sacrificing performance in our loader and query service, so we put in place some metrics to make sure our system was still operating efficiently.

Upload time

[Graph: duration to upload a file to S3]

We measure the time it takes to upload a single file, as we want to make sure that uploading large files won’t cause problems, like longer file upload times or bottlenecks in our Kafka pipeline. Our upload times are still fast, with an average of <300 ms and a P90 of <500 ms.

Throughput

[Graph: sessions uploaded per minute, by TTL]
[Graph: PUT operations per minute]

We measure the number of sessions uploaded per minute and the number of PUT operations per minute. Currently, our system is configured to flush a file once it has been open for five seconds or reaches 100 MB in size. We could tune this further, but it wouldn’t change the cost or the upload time much, since PUT operations account for only ~10% of the total cost. If there’s no requirement to upload files quickly, you could, for example, use batches 10x bigger than in our example; that would cut PUT costs to roughly a tenth of their current value and reduce the total cost by ~9%.

You may note that during the day the number of sessions increases by ~3x compared to at night, but the number of PUT operations only increases by ~2x. This is a result of the rules, described above, that decide when to upload a file.

During the day, we upload bigger files (more sessions in each file) than at night, because we still want to flush files quickly so that sessions are available to query within a few seconds.

Conclusion

Our batch object storage approach has significantly improved our storage efficiency and reduced our storage costs. We are excited to share this solution with the broader community and look forward to the innovations and enhancements that will emerge from collaborative efforts.

We invite you to explore the module on GitHub, contribute to its development, and implement it in your projects to experience the benefits firsthand.

Get started

To get started with s3-batch-object-store, visit our GitHub repository for installation instructions, usage examples, and documentation. We welcome feedback, issues, and pull requests to continue improving the module.
