In industries with large event streams, such as adtech, ride-hailing, and streaming, managing data efficiently is an ongoing task. As traffic grows, so do logs and raw records. At Appodeal, we handle around 35 PB of data, and our cloud costs increased by roughly 30% between May and June 2025.
We had three ways to address the storage load.
1. Delete history
We could cut retention, but analytics would suffer. In adtech, you need two to three years of data to tell a coincidence from a trend.
2. Sample events
We could sample, keeping every tenth row. Analytics quality would drop only slightly: a 5% deviation is fine for dashboards. But storage would keep growing, so sampling is only a partial fix. For us it is not an option anyway, because our demand-side platform requires complete datasets for training and optimization.
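We rejected this path, but for context, here is a minimal sketch of deterministic 1-in-10 sampling, assuming events carry a stable ID (the field name and rate are illustrative, not our pipeline's):

```python
# Minimal sketch of deterministic 1-in-N event sampling (the option we rejected).
# Hashing a stable event ID keeps the sample reproducible across reruns.
import hashlib

def keep_event(event_id: str, rate: int = 10) -> bool:
    """Keep roughly one event in `rate`, chosen deterministically by ID hash."""
    digest = hashlib.md5(event_id.encode()).hexdigest()
    return int(digest, 16) % rate == 0

events = [{"id": f"evt-{i}"} for i in range(1000)]  # toy input
sampled = [e for e in events if keep_event(e["id"])]
print(f"kept {len(sampled)} of {len(events)}")  # roughly 100 of 1000
```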
3. Compress, but differently
We could change how we store those 35 PB under management.
Snappy vs Zstandard
In big-data stacks, Snappy is the comfortable default. It prioritizes speed with modest compression. Zstandard (zstd) achieves higher compression ratios with comparable decode speed. It’s not new, yet it still isn’t the default in many pipelines. We’re sharing this case so others can benchmark their own stacks.
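If you want to check this on your own data, below is a minimal benchmark sketch using pyarrow; the input path is a placeholder for one of your representative Parquet files, and the zstd levels are just starting points:

```python
# Minimal sketch: compare Snappy vs Zstandard size and speed on one Parquet table.
import os
import time

import pyarrow.parquet as pq

table = pq.read_table("sample_partition.parquet")  # placeholder: a representative file

for codec, level in [("snappy", None), ("zstd", 3), ("zstd", 9)]:
    out = f"bench_{codec}_{level}.parquet"

    start = time.perf_counter()
    pq.write_table(table, out, compression=codec, compression_level=level)
    write_s = time.perf_counter() - start

    start = time.perf_counter()
    pq.read_table(out)
    read_s = time.perf_counter() - start

    size_mb = os.path.getsize(out) / 1e6
    print(f"{codec} (level {level}): {size_mb:.1f} MB, write {write_s:.2f}s, read {read_s:.2f}s")
```

Absolute numbers will vary with schema and hardware; the interesting part is the size ratio between codecs at roughly equal read times.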
How we engineered it
We started writing all new data with zstd and gradually backfilled old partitions in the background, so we never disrupted compute or the daily work of over 100 internal users. To make this sustainable, our engineers built a small utility that profiles each dataset and recommends an optimal zstd level and compaction settings for it. This delivered better compression ratios without added read latency: heavy raw logs compressed more, curated tables less.
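The utility itself is internal, so the snippet below is only a hedged sketch of the core idea: trial a few zstd levels on a sample file from each dataset and keep the smallest output that still writes within a time budget. The candidate levels, budget, and paths are all assumptions for illustration:

```python
# Hypothetical sketch of per-dataset zstd level selection (not our internal tool).
import os
import time

import pyarrow.parquet as pq

CANDIDATE_LEVELS = [3, 6, 9, 12]  # assumed search space
WRITE_BUDGET_S = 2.0              # assumed per-file write-time budget

def recommend_level(sample_path: str) -> int:
    """Return the zstd level giving the smallest file within the write budget."""
    table = pq.read_table(sample_path)
    best_level, best_size = CANDIDATE_LEVELS[0], float("inf")
    for level in CANDIDATE_LEVELS:
        out = f"/tmp/probe_zstd_{level}.parquet"
        start = time.perf_counter()
        pq.write_table(table, out, compression="zstd", compression_level=level)
        elapsed = time.perf_counter() - start
        size = os.path.getsize(out)
        # Prefer the smallest output that still writes within the budget.
        if elapsed <= WRITE_BUDGET_S and size < best_size:
            best_level, best_size = level, size
    return best_level

# Placeholder path; in practice this would run over a sample from each dataset.
print(recommend_level("events/date=2025-06-01/part-00000.parquet"))
```

A background backfill job can then rewrite old partitions with the recommended level during off-peak hours.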
As a result, we cut our cloud storage costs by half between June and September 2025.
Who should consider this
You’re in an industry with massive event streams: adtech, social platforms, ride-hailing, large marketplaces, or streaming. You’re likely running in the cloud rather than on your own racks, paying for storage by the gigabyte-month, and keeping raw history for ML and BI because seasonality and long tails matter.
If you’re operating at Netflix scale with your own hardware, you’re playing a different game. For everyone else, it’s worth questioning Snappy as the default.
Next: balanced data tiers
Data scientists need every small signal, but mostly from the last couple of months. BI teams, on the other hand, rely on multi-year aggregates to track trends. To meet the needs of both, we plan to separate data management into the following tiers by Q1 2026 (a rough sketch of the routing logic follows the list):
- Hot raw: recent months of full-precision events for data science
- Warm curated: multi-year BI-grade tables for trend analysis
- Cold aggregates or deletion: anything beyond the useful horizon
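The exact cutoffs are still being worked out. As a hedged sketch of the routing logic, assuming a 60-day hot window and a three-year warm horizon (both placeholders):

```python
# Hedged sketch of the planned tiering policy: route a partition by its age.
from datetime import date

HOT_DAYS = 60       # assumed cutoff for "last couple of months" of raw events
WARM_YEARS = 3      # assumed multi-year horizon for BI-grade tables

def tier_for(partition_date: date, today: date) -> str:
    age_days = (today - partition_date).days
    if age_days <= HOT_DAYS:
        return "hot-raw"        # full-precision events for data science
    if age_days <= WARM_YEARS * 365:
        return "warm-curated"   # BI-grade tables for trend analysis
    return "cold-or-delete"     # beyond the useful horizon

print(tier_for(date(2025, 9, 1), date(2025, 10, 1)))  # -> hot-raw
```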
Credits
Ihor Bobak, Senior Data Science Engineer
Roman Malyushkin, Data Lead, Appodeal
Dmytrii Mamatiusupov, Senior Data Engineer
Data Engineering team, Appodeal