In industries with large event streams, such as adtech, ride-hailing, and streaming, managing data efficiently is an ongoing task. As traffic grows, so do logs and raw records. At Appodeal, we handle around 35 PB of data, and our cloud costs increased by roughly 30% between May and June 2025.
We had three ways to address the storage load.
1. Delete history
We could cut retention, but analytics would suffer. In adtech, you need two to three years of data to tell a coincidence from a trend.
2. Sample events
We could sample, keeping every tenth row. Analytics quality would drop only slightly: a 5% deviation is fine for dashboards. But storage would keep growing, so sampling is only a partial fix. For us it is not an option anyway, because our demand-side platform requires complete datasets for training and optimization.
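We rejected this path, but for context, here is a minimal sketch of deterministic 1-in-10 sampling, assuming events carry a stable ID (the field name and rate are illustrative, not our pipeline's):

```python
# Minimal sketch of deterministic 1-in-N event sampling (the option we rejected).
# Hashing a stable event ID keeps the sample reproducible across reruns.
import hashlib

def keep_event(event_id: str, rate: int = 10) -> bool:
    """Keep roughly one event in `rate`, chosen deterministically by ID hash."""
    digest = hashlib.md5(event_id.encode()).hexdigest()
    return int(digest, 16) % rate == 0

events = [{"id": f"evt-{i}"} for i in range(1000)]  # toy input
sampled = [e for e in events if keep_event(e["id"])]
print(f"kept {len(sampled)} of {len(events)}")  # roughly 100 of 1000
```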
3. Compress, but differently
We could change how we store those 35 PB under management.
Snappy vs Zstandard
In big-data stacks, Snappy is the comfortable default. It prioritizes speed with modest compression. Zstandard (zstd) achieves higher compression ratios with comparable decode speed. It’s not new, yet it still isn’t the default in many pipelines. We’re sharing this case so others can benchmark their own stacks.
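If you want to check this on your own data, below is a minimal benchmark sketch using pyarrow; the input path is a placeholder for one of your representative Parquet files, and the zstd levels are just starting points:

```python
# Minimal sketch: compare Snappy vs Zstandard size and speed on one Parquet table.
import os
import time

import pyarrow.parquet as pq

table = pq.read_table("sample_partition.parquet")  # placeholder: a representative file

for codec, level in [("snappy", None), ("zstd", 3), ("zstd", 9)]:
    out = f"bench_{codec}_{level}.parquet"

    start = time.perf_counter()
    pq.write_table(table, out, compression=codec, compression_level=level)
    write_s = time.perf_counter() - start

    start = time.perf_counter()
    pq.read_table(out)
    read_s = time.perf_counter() - start

    size_mb = os.path.getsize(out) / 1e6
    print(f"{codec} (level {level}): {size_mb:.1f} MB, write {write_s:.2f}s, read {read_s:.2f}s")
```

Absolute numbers will vary with schema and hardware; the interesting part is the size ratio between codecs at roughly equal read times.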
How we engineered it
We started writing all new data with zstd and gradually backfilled old partitions in the background, so we never disrupted compute or the daily work of over 100 internal users. To make this sustainable, our engineers built a small utility that profiles each dataset and recommends an optimal zstd level and compaction settings for it. This delivered better compression ratios without added read latency: heavy raw logs compressed more, curated tables less.
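The utility itself is internal, so the snippet below is only a hedged sketch of the core idea: trial a few zstd levels on a sample file from each dataset and keep the smallest output that still writes within a time budget. The candidate levels, budget, and paths are all assumptions for illustration:

```python
# Hypothetical sketch of per-dataset zstd level selection (not our internal tool).
import os
import time

import pyarrow.parquet as pq

CANDIDATE_LEVELS = [3, 6, 9, 12]  # assumed search space
WRITE_BUDGET_S = 2.0              # assumed per-file write-time budget

def recommend_level(sample_path: str) -> int:
    """Return the zstd level giving the smallest file within the write budget."""
    table = pq.read_table(sample_path)
    best_level, best_size = CANDIDATE_LEVELS[0], float("inf")
    for level in CANDIDATE_LEVELS:
        out = f"/tmp/probe_zstd_{level}.parquet"
        start = time.perf_counter()
        pq.write_table(table, out, compression="zstd", compression_level=level)
        elapsed = time.perf_counter() - start
        size = os.path.getsize(out)
        # Prefer the smallest output that still writes within the budget.
        if elapsed <= WRITE_BUDGET_S and size < best_size:
            best_level, best_size = level, size
    return best_level

# Placeholder path; in practice this would run over a sample from each dataset.
print(recommend_level("events/date=2025-06-01/part-00000.parquet"))
```

A background backfill job can then rewrite old partitions with the recommended level during off-peak hours.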
As a result, we cut our cloud storage costs by half between June and September 2025.
Who should consider this
You’re in an industry with massive event streams: adtech, social platforms, ride-hailing, large marketplaces, or streaming. You’re likely running in the cloud rather than on your own racks, paying for storage by the gigabyte-month, and keeping raw history for ML and BI because seasonality and long tails matter.
If you’re operating at Netflix scale with your own hardware, you’re playing a different game. For everyone else, it’s worth questioning Snappy as the default.
Next: balanced data tiers
Data scientists need every small signal, but mostly from the last couple of months. BI teams, on the other hand, rely on multi-year aggregates to track trends. To meet the needs of both, we plan to separate data management into the following tiers by Q1 2026 (a rough sketch of the routing logic follows the list):
- Hot raw: recent months of full-precision events for data science
- Warm curated: multi-year BI-grade tables for trend analysis
- Cold aggregates or deletion: anything beyond the useful horizon
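The exact cutoffs are still being worked out. As a hedged sketch of the routing logic, assuming a 60-day hot window and a three-year warm horizon (both placeholders):

```python
# Hedged sketch of the planned tiering policy: route a partition by its age.
from datetime import date

HOT_DAYS = 60       # assumed cutoff for "last couple of months" of raw events
WARM_YEARS = 3      # assumed multi-year horizon for BI-grade tables

def tier_for(partition_date: date, today: date) -> str:
    age_days = (today - partition_date).days
    if age_days <= HOT_DAYS:
        return "hot-raw"        # full-precision events for data science
    if age_days <= WARM_YEARS * 365:
        return "warm-curated"   # BI-grade tables for trend analysis
    return "cold-or-delete"     # beyond the useful horizon

print(tier_for(date(2025, 9, 1), date(2025, 10, 1)))  # -> hot-raw
```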
Credits
Ihor Bobak, Senior Data Science Engineer
Roman Malyushkin, Data Lead, Appodeal
Dmytrii Mamatiusupov, Senior Data Engineer
Data Engineering team, Appodeal