When Half a Billion Tiny Files Broke Our Best Laid Plans
A tale of hubris, scale, and sometimes just using a really big machine
So here's how we learned that 500 million tiny files will humble every "proper" engineering approach you can think of.
It started innocently enough. We needed to backfill our data warehouse as part of Ingestion Pipeline V2: migrate all our historical consumption data from the old S3 bucket format to our shiny new warehouse-ready structure. How hard could it be?
We had two buckets to deal with:
datamover bucket: 50M files (11M actually had data)
raw-consumptions bucket: 500M files across ~1.7M users
The datamover bucket was going to be our practice round. Get the approach right on 11M files, then scale it up to 500M. Classic engineering: start small, iterate, then scale.
Spoiler alert: sometimes "scale it up" breaks everything you thought you knew.
Act I: The Proper Engineering Approach
"Let's do this right. Database indexes, sequential processing, proper state management."
We built what any engineer would build for a data migration:
create table migration_file_index
(
id bigserial primary key,
s3_key text not null unique,
platform text not null,
job_enqueued_at timestamp with time zone not null,
file_size bigint not null,
processed boolean default false,
created_at timestamp with time zone default now()
);
With proper indexes, of course; we are professionals:
create index idx_migration_sort
on migration_file_index (platform, job_enqueued_at)
where (NOT processed);
create index idx_migration_resume
on migration_file_index (processed, id);
create index idx_migration_platform_stats
on migration_file_index (platform, processed);
The architecture was clean:
DatabaseIndexer: Catalog all S3 files that had data into Postgres
SequentialProcessor: Process files in chronological order (roughly sketched after this list)
PartitionBuffer: Smart batching for optimal warehouse file sizes
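To make that concrete, here's a minimal sketch of the shape of the sequential loop. This is not the real implementation (that was application code); it assumes a local Postgres database named migration and a hypothetical process_file helper, and just shows the pattern: claim a chronological batch of unprocessed keys, process them one by one, and write state back for every single file.
# Minimal sketch of the claim-and-process loop (database name and
# process_file helper are assumptions, not the real implementation)
while true; do
  # Next 1,000 unprocessed keys in chronological order
  # (the query idx_migration_sort was built for)
  BATCH=$(psql migration -At -c "
    SELECT id || ' ' || s3_key
    FROM migration_file_index
    WHERE NOT processed
    ORDER BY platform, job_enqueued_at
    LIMIT 1000;")
  [ -z "$BATCH" ] && break

  while read -r id key; do
    ./process_file "$key" \
      && psql migration -c "UPDATE migration_file_index SET processed = true WHERE id = $id;"
  done <<< "$BATCH"
done
Note the shape of the work: every single file gets its own row, its own read, and its own bookkeeping write.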
It felt good. It felt right. This is how you handle a large-scale data migration.
Then we ran it on our practice dataset.
11 million files took one week.
For the datamover migration, a week seemed totally reasonable. Less than a sprint! Exactly what product wants to hear. "Yeah, we'll have the historical data migrated by next Friday." Perfect.
Let me repeat that for emphasis: 11 million files. One week.
I remember staring at the progress logs, doing the math in my head. If 11M files = 1 week, then 500M files = ... reaches for calculator ... 45 weeks.
Forty-five weeks. Almost a year. To backfill the raw-consumptions bucket historical data.
That's when we realized that all our careful database design, our proper indexes, our sequential processing - it was all overhead. We were spending more time managing the migration state than actually migrating data.
Act II: The Cloud-Native Pivot
"Surely AWS can handle what we can't, right?"
Time for a complete rethink. Why are we trying to manage 500M individual files when AWS has services built exactly for this kind of massive parallel processing?
Enter the Athena approach:
Create an external table mapping to our raw-consumptions bucket
Let Athena discover and process all 500M files in a single massive query
Output everything to a staging bucket in warehouse format
Do a controlled cross-bucket migration to production
The promise was beautiful:
2-6 hours instead of months
~$100-200 in Athena costs vs. massive compute bills
Leverage cloud services for what they're actually built for
// Instead of this:
for (const file of 500_000_000_files) {
await processFileSequentially(file); // 🤮
}
// We'd do this:
CREATE TABLE warehouse_data AS
SELECT
platform,
user_id,
payload,
ingested_at,
date_partition,
hour_partition
FROM external_consumptions_table;
// Let Athena figure it out in parallel!
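For completeness, the CTAS above reads from an external table we had mapped over the raw bucket. We won't pretend this is the exact DDL, but the mapping looked roughly like this sketch; the column list, the JSON SerDe, the migration database, and the athena-scratch results bucket are all assumptions here:
# Sketch of the external table mapping (names and SerDe are assumptions)
aws athena start-query-execution \
  --work-group primary \
  --query-execution-context Database=migration \
  --result-configuration OutputLocation=s3://athena-scratch/results/ \
  --query-string "
    CREATE EXTERNAL TABLE external_consumptions_table (
      platform       string,
      user_id        string,
      payload        string,
      ingested_at    timestamp,
      date_partition string,
      hour_partition string
    )
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    LOCATION 's3://consumptions-raw/';"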
We built comprehensive tooling around it:
Quality validation (95% success rate required)
Cross-bucket migration with configurable batch sizes
Progress monitoring and error recovery
Staging-to-production promotion pipeline
It was elegant. It was cloud-native. It was exactly the kind of solution you'd present at a conference, lol.
And Athena choked.
Turns out even AWS services have limits when you point them at half a billion tiny files. Queries timed out, consistently. 500M files is just too many goddamn files!
The cloud-native dream tried hard, but it died.
Act III: The Big Machine Epiphany
"You know what? Let's just use a really big machine."
Sometimes the most sophisticated solution is admitting when sophistication isn't working.
We spun up an i4i.metal instance:
128 vCPUs
1TB of RAM
30TB of NVMe storage
75 Gbps network
The plan was beautifully simple:
Download everything to local NVMe storage
Process it with maximum parallelism
Upload the results
No databases to maintain. No cloud service limits to hit. No complex state management. Just raw compute power and storage bandwidth.
Step 1: Setup the Beast
# Create RAID 0 across all instance-store NVMe drives (skipping the root volume, nvme0)
NVME_DEVICES=$(lsblk -nd -o NAME | grep nvme | grep -v nvme0 | sed 's|^|/dev/|' | tr '\n' ' ')
NVME_COUNT=$(echo $NVME_DEVICES | wc -w)
sudo mdadm --create /dev/md0 --level=0 --raid-devices=$NVME_COUNT $NVME_DEVICES
sudo mkfs.ext4 -F /dev/md0
sudo mkdir -p /mnt/nvme
sudo mount /dev/md0 /mnt/nvme
Step 2: Download Everything
# s5cmd with 4096 workers to max out the 75 Gbps network
s5cmd --numworkers 4096 cp "s3://consumptions-raw/*" /mnt/nvme/raw-data/
Enter s5cmd - if you haven't used it, it's like the AWS CLI's faster, more parallel cousin. While aws s3 cp might give you decent performance, s5cmd is built specifically for high-throughput S3 operations. It can spin up thousands of concurrent workers and actually saturate high-bandwidth connections.
With our 75 Gbps network and s5cmd cranked up to 4096 workers, we could download at speeds that would make your standard S3 sync weep.
Step 3: The Organize Breakthrough
This is where it got interesting. We needed to transform the data from:
platform/user123/2024-01-15T14:30:15.json
platform/user123/2024-01-15T14:45:22.json
platform/user456/2024-01-15T14:32:18.json
To warehouse partitions:
platform=spotify/date=2024-01-15/hour=14/part1.gz
Our first approach was to process files sequentially, organizing by timestamp. Still too slow.
Then came the insight: "Wait, we have ~1.7M users here. What if we parallelize by user instead of by file?"
# Instead of processing 500M files sequentially:
for file in $(find . -name "*.json"); do
  organize_by_timestamp "$file"   # Still 500M operations!
done

# Process ~1.7M users in parallel:
find raw-data/ -maxdepth 2 -type d -name "user_*" | \
  parallel -j 128 ./organize-single-user.sh {} /mnt/nvme/partitions
Each worker would (a rough sketch follows this list):
Take a user directory
Parse all their timestamp.json files
Route each file to the correct date/hour partition
Coordinate at the partition level to merge everything
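We won't reproduce the real script, but here's a minimal sketch of what organize-single-user.sh boiled down to, assuming filenames shaped like the examples above; partition-level locking and error handling are omitted:
#!/usr/bin/env bash
# organize-single-user.sh (sketch): route one user's raw files into
# date/hour partition directories by parsing the timestamp in the filename.
set -euo pipefail
shopt -s nullglob  # skip users with no JSON files

USER_DIR="$1"        # e.g. raw-data/spotify/user123
PARTITIONS_ROOT="$2" # e.g. /mnt/nvme/partitions

platform=$(basename "$(dirname "$USER_DIR")")

for file in "$USER_DIR"/*.json; do
  ts=$(basename "$file" .json)          # 2024-01-15T14:30:15
  date_part=${ts%%T*}                   # 2024-01-15
  hour_part=$(echo "$ts" | cut -c12-13) # 14

  dest="$PARTITIONS_ROOT/platform=$platform/date=$date_part/hour=$hour_part"
  mkdir -p "$dest"

  # Record which raw files belong to this partition; the partition step
  # merges and gzips them later.
  echo "$file" >> "$dest/file_list.txt"
done
That shared file_list.txt per partition is where the "coordinate at the partition level" part comes in; with 128 workers appending at once, the real script had to make those writes safe, which the sketch glosses over.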
This reframed the problem completely. Instead of "how do we process 500M files?" it became "how do we process 1.7M users in parallel?"
One is hard. The other is Tuesday.
Step 4: Process Partitions
Now each partition could be processed independently:
# Each partition just worries about:
# 1. Collecting all files for that date/hour
# 2. Gzipping them efficiently
# 3. Creating warehouse-format records
find /mnt/nvme/partitions -name "file_list.txt" | \
  parallel -j 128 --progress ./process-single-partition.sh
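And again, a rough sketch of process-single-partition.sh, with the warehouse record shaping simplified down to "concatenate and gzip" (the real script did more massaging than this):
#!/usr/bin/env bash
# process-single-partition.sh (sketch): take one partition's file_list.txt
# and produce a single gzipped, newline-delimited part file for it.
set -euo pipefail

FILE_LIST="$1"                                    # .../platform=spotify/date=2024-01-15/hour=14/file_list.txt
PARTITION_DIR=$(dirname "$FILE_LIST")
REL_PATH=${PARTITION_DIR#/mnt/nvme/partitions/}   # platform=.../date=.../hour=...

OUT_DIR="/mnt/nvme/output/$REL_PATH"
mkdir -p "$OUT_DIR"

# awk 1 prints every record with a trailing newline, so files that lack a
# final newline still come out as clean newline-delimited JSON.
xargs awk 1 < "$FILE_LIST" | gzip > "$OUT_DIR/part1.gz"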
Step 5: Upload and Snowpipe Magic
# Upload to partitioned S3 structure
s5cmd --numworkers 4096 cp /mnt/nvme/output/* s3://raw-extractions-prod/data/
And here's the beautiful part: once the files land in S3 with the correct partitioned structure (platform=spotify/date=2024-01-15/hour=14/), S3 notifications automatically trigger Snowpipe to ingest them into our data warehouse. No manual intervention. No complex orchestration. Just files appearing in the right place, and the rest happens automatically.
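On the Snowflake side that automation is just a pipe with auto-ingest turned on. Here's a sketch of the shape of it; the table, stage, and pipe names are assumptions, the stage is presumed to be an external stage on the bucket above, and the bucket's event notifications have to point at the pipe's SQS queue:
# Sketch of the auto-ingest pipe (object names are assumptions)
snowsql -q "
  CREATE PIPE raw_extractions_pipe
    AUTO_INGEST = TRUE
  AS
    COPY INTO raw_extractions
    FROM @raw_extractions_stage
    FILE_FORMAT = (TYPE = 'JSON');"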
It worked. Fast.
And here's the kicker: the entire 500M file migration took about a week. The same time it took our initial database approach to handle 11M files.
Let that sink in for a moment. We went from processing 11M files in a week to processing 500M files in a week. That's a 45x improvement in efficiency, not just from throwing more hardware at the problem, but from completely reframing how we thought about it.
What We Actually Learned
About Scale
500 million files isn't just "a lot of files." It's a scale problem that broke our assumptions two approaches in a row.
The overhead of "doing it right" - indexes, state management, proper abstractions - can exceed the actual work when you hit extreme scale. Sometimes the most elegant solution is the one that avoids the problem entirely.
About Problem Reframing
The breakthrough wasn't better algorithms or more sophisticated architecture. It was changing the unit of work:
❌ "How do we process 500M files efficiently?"
✅ "How do we process 1.7M users in parallel?"
Sometimes the bottleneck isn't compute power, it's how you think about the problem.
About Cloud Services
AWS services are incredibly powerful, but they have limits. You just have to test them!
"Cloud-native" isn't automatically better than "local processing with a big machine." The right tool depends on your specific constraints and scale.
About Engineering Pragmatism
The best architecture is the one that ships. There's real elegance in recognizing when your careful abstractions are getting in the way and just solving the actual problem.
Sometimes "just use a bigger machine" isn't lazy engineering, sometimes it's the sophisticated approach.
The Real Lesson
This isn't a story about finding the "right" solution. It's about recognizing when your assumptions break down and being willing to completely change approaches.
Approach 1 failed because we tried to apply normal database patterns to abnormal scale
Approach 2 failed because we assumed cloud services could handle anything
Approach 3 succeeded because we stopped trying to be clever and just solved the problem
The most elegant engineering isn't always in the architecture - sometimes it's in knowing when to stop being architectural.
Now our data warehouse hums along, ingesting new files automatically via Snowpipe, and our 500 million historical files are exactly where they need to be. Sometimes brute force is the sophisticated solution.
Want to work on problems like this? We’re hiring widely at Koodos. If this kind of scale, reframing, and pragmatism excites you, drop us a line at team@koodos.com.