Instagram • Apr 10, 2012

Keeping Instagram Up with Over a Million New Users in Twelve Hours

Article Summary

Instagram's Android launch brought in more than 1 million new users in 12 hours. Here's how its infrastructure team kept the service up during that growth.

The Instagram engineering team shares their battle-tested playbook for handling explosive traffic spikes. This 2012 post reveals the monitoring tools and database strategies that prevented downtime during their Android app launch.

Key Takeaways

statsd provided near-realtime stats, delayed by only about 10 seconds, for fast diagnosis
Memcached boxes hit 50k req/s, becoming the main bottleneck
New Redis read-slaves deployed in under 20 minutes during traffic spikes
pgFouine analyzed PostgreSQL logs to identify heavy queries, which the team then cached
Open sourced node2dm after delivering 5 million push notifications
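
As a rough illustration of the statsd takeaway above, the sketch below sends counters and timers to a statsd daemon over UDP using the plain-text statsd wire format (`name:value|c` for counters, `name:value|ms` for timers). The class name `StatsClient`, the helper `format_metric`, and the metric names are assumptions made for this example; this is not Instagram's client code.

```python
import socket


def format_metric(name, value, kind):
    """Build one statsd datagram, e.g. 'signups:1|c'."""
    return f"{name}:{value}|{kind}"


class StatsClient:
    """Hypothetical fire-and-forget UDP client for a statsd daemon."""

    def __init__(self, host="127.0.0.1", port=8125):  # 8125 is statsd's default port
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def _send(self, payload):
        try:
            self.sock.sendto(payload.encode("ascii"), self.addr)
        except OSError:
            pass  # a failed metric write should never break a request

    def incr(self, name, count=1):
        """Count an event, such as a signup or a like."""
        self._send(format_metric(name, count, "c"))

    def timing(self, name, millis):
        """Record how long an operation took, in milliseconds."""
        self._send(format_metric(name, millis, "ms"))
```

A call such as `StatsClient().incr("signups")` costs a single UDP send, which is why this style of instrumentation can be left in hot code paths.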

Critical Insight

Instagram scaled to handle massive user growth by combining realtime monitoring, rapid read-slave deployment, and targeted query optimization.

The team also reveals a simple Fabric script that makes database performance analysis a 30-second task instead of hours of manual work.
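
The Fabric script itself is not reproduced in this summary. To show the kind of analysis it automates, here is a minimal sketch that ranks queries by total time from PostgreSQL log lines of the form emitted when `log_min_duration_statement` is enabled. It is neither pgFouine nor the team's script; the function names and the sample `photos` queries are assumptions for illustration only.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   LOG:  duration: 120.5 ms  statement: SELECT ...
LINE = re.compile(r"duration: ([\d.]+) ms\s+statement: (.*)")


def normalize(sql):
    """Collapse literals so queries differing only in values group together."""
    sql = re.sub(r"'[^']*'", "?", sql)   # string literals
    sql = re.sub(r"\b\d+\b", "?", sql)   # standalone integers
    return re.sub(r"\s+", " ", sql).strip()


def heaviest_queries(lines, top=5):
    """Return (total_ms, count, query) tuples, heaviest total first."""
    totals = defaultdict(lambda: [0.0, 0])
    for line in lines:
        match = LINE.search(line)
        if match:
            entry = totals[normalize(match.group(2))]
            entry[0] += float(match.group(1))
            entry[1] += 1
    ranked = sorted(((t, c, q) for q, (t, c) in totals.items()), reverse=True)
    return ranked[:top]


# Hypothetical sample log lines.
sample = [
    "LOG:  duration: 120.5 ms  statement: SELECT * FROM photos WHERE user_id = 42",
    "LOG:  duration: 80.0 ms  statement: SELECT * FROM photos WHERE user_id = 7",
    "LOG:  duration: 3.2 ms  statement: SELECT 1",
]
report = heaviest_queries(sample)
```

Queries at the top of such a ranking are the natural candidates for caching, since they account for the most database time overall.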

About This Article

Problem

When Instagram launched on Android, their application servers ran into trouble with memcached operations. Processes were taking over 1.5 seconds to return responses while memcached boxes were handling 50k requests per second.

Solution

Instagram Engineering used Dogslow, a Django middleware written by Bitbucket that watches running processes and writes a snapshot of any slow one to disk. This let them identify memcached bottlenecks as they happened.
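
To make the watchdog idea concrete, here is a minimal sketch using only the Python standard library. It is not Dogslow's code or configuration: the `watchdog` context manager, the `sink` callback, and `slow_cache_call` are hypothetical names, and the snapshot goes to an in-memory list here rather than to a file on disk.

```python
import sys
import threading
import time
import traceback
from contextlib import contextmanager


@contextmanager
def watchdog(threshold_s, sink):
    """Call sink(stack_text) if the wrapped block runs longer than threshold_s."""
    target_id = threading.get_ident()  # the thread doing the real work

    def snapshot():
        # Runs on the timer thread and inspects the still-running worker.
        frame = sys._current_frames().get(target_id)
        if frame is not None:
            sink("".join(traceback.format_stack(frame)))

    timer = threading.Timer(threshold_s, snapshot)
    timer.daemon = True
    timer.start()
    try:
        yield
    finally:
        timer.cancel()  # blocks that finish in time never produce a snapshot


def slow_cache_call():
    time.sleep(0.2)  # stand-in for a request stuck on a cache operation


snapshots = []
with watchdog(0.05, snapshots.append):
    slow_cache_call()
```

The captured stack names the function the thread was blocked in when the threshold passed, which is the kind of evidence that points at a specific layer such as memcached.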

Impact

Once they found the exact infrastructure layer causing the delays, the team fixed the bottleneck quickly. Service stayed stable even during the 12-hour wave of new user registrations.