Keeping Instagram Up with Over a Million New Users in Twelve Hours
Article Summary
Instagram's Android launch brought more than 1 million new users in 12 hours. Here's how their infrastructure team kept the service running during hypergrowth.
The Instagram engineering team shares their battle-tested playbook for handling explosive traffic spikes. This 2012 post reveals the monitoring tools and database strategies that prevented downtime during their Android app launch.
Key Takeaways
- statsd provided near-realtime stats (about 10 seconds delayed) for fast diagnosis
- Memcached boxes hit 50k req/s, becoming the main bottleneck
- New Redis read-slaves deployed in under 20 minutes during traffic spikes
- pgFouine analyzed PostgreSQL logs to identify heavy queries, which were then cached
- node2dm was open-sourced after delivering 5 million push notifications
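To make the statsd takeaway concrete, here is a minimal sketch of how an application might emit counters and timers using the plain-text statsd line format over UDP. The metric names (`signups.android`, `feed.render`) and the helper functions are hypothetical and do not come from the original post; the host and port assume a statsd daemon listening locally on its usual default port.

```python
import socket


def format_metric(name, value, metric_type, sample_rate=None):
    """Build one statsd line: <name>:<value>|<type>[|@<sample_rate>]."""
    line = f"{name}:{value}|{metric_type}"
    if sample_rate is not None:
        line += f"|@{sample_rate}"
    return line


def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; assumes a statsd daemon at host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Hypothetical metric names, for illustration only.
signup_line = format_metric("signups.android", 1, "c")    # counter
render_line = format_metric("feed.render", 320, "ms")     # timer, in ms
sampled_line = format_metric("feed.views", 1, "c", 0.1)   # counter sampled at 10%
```

Because the send is a single UDP datagram with no reply, instrumenting a hot code path this way adds very little latency, which is part of why counters can be sprinkled liberally through an application.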
Instagram scaled to handle massive user growth by combining realtime monitoring, rapid read-slave deployment, and targeted query optimization.
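The "identify and cache heavy queries" step can be illustrated with a cache-aside wrapper. This is only a sketch under stated assumptions: `TTLCache` is a hypothetical in-memory stand-in for memcached, and `cached_query`, the key name, and the TTL are illustrative rather than anything taken from Instagram's code.

```python
import time


class TTLCache:
    """Hypothetical in-memory stand-in for a memcached client."""

    def __init__(self, clock=time.monotonic):
        self._store = {}
        self._clock = clock

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._store[key]  # entry expired, treat as a miss
            return None
        return value

    def set(self, key, value, ttl):
        self._store[key] = (value, self._clock() + ttl)


def cached_query(cache, key, ttl, run_query):
    """Cache-aside: try the cache first, fall back to the database on a miss.

    Note: a query result of None is never cached in this sketch.
    """
    value = cache.get(key)
    if value is None:
        value = run_query()
        cache.set(key, value, ttl)
    return value
```

A query that a log analyzer flags as expensive would be wrapped as `cached_query(cache, "popular_photos", 60, run_expensive_query)`, so the database sees it at most once per TTL window per key.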
About This Article
When Instagram launched on Android, their application servers ran into trouble with memcached operations. Processes were taking over 1.5 seconds to return responses while memcached boxes were handling 50k requests per second.
The Instagram engineering team used Dogslow, a Django middleware that watches for slow requests and writes a snapshot of what the process is doing to disk. This let them identify memcached bottlenecks as they happened.
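The general idea behind a slow-request watchdog can be sketched in a few lines. This is not Dogslow's actual code: `SlowCallWatchdog`, `fake_cache_get_many`, and the timeout values are hypothetical, and the sketch collects reports in a list where a real tool would write them to disk or a log.

```python
import sys
import threading
import time
import traceback


class SlowCallWatchdog:
    """Capture the calling thread's stack if a block runs longer than `timeout` seconds."""

    def __init__(self, timeout, sink):
        self.timeout = timeout
        self.sink = sink  # callable that receives the formatted report
        self._timer = None

    def _dump(self, thread_id, started):
        # Runs on the timer thread while the watched thread is still busy.
        frame = sys._current_frames().get(thread_id)
        if frame is None:
            return
        stack = "".join(traceback.format_stack(frame))
        elapsed = time.monotonic() - started
        self.sink(f"slow call, still running after {elapsed:.2f}s\n{stack}")

    def __enter__(self):
        self._timer = threading.Timer(
            self.timeout, self._dump,
            args=(threading.get_ident(), time.monotonic()),
        )
        self._timer.daemon = True
        self._timer.start()
        return self

    def __exit__(self, exc_type, exc, tb):
        self._timer.cancel()  # fast calls never produce a report
        return False


def fake_cache_get_many(delay):
    """Hypothetical stand-in for a slow cache round trip."""
    time.sleep(delay)
    return {}


reports = []
with SlowCallWatchdog(timeout=0.05, sink=reports.append):
    fake_cache_get_many(0.2)
```

The captured stack names the function the thread was stuck in, which is the kind of evidence that points at a specific layer, such as a cache call, instead of a generic "the request was slow".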
Once they found the exact infrastructure layer causing the delays, the team fixed the bottleneck quickly. Service stayed stable even during the 12-hour wave of new user registrations.