Most Memory Leaks Are Good
Article Summary
Shopify Engineering shares a war story: their app servers were hitting 16GB+ memory usage, crashing with ENOMEM errors, and requiring constant restarts. The culprit? Not where they expected.
This 2011 Shopify Engineering post walks through a production memory leak crisis that stumped the team for weeks. Despite using Ruby profiling tools like Memprof and analyzing 2+ million live objects, the leak remained elusive until old-fashioned debugging revealed the truth.
Key Takeaways
- The leak pushed server memory past 16GB, forcing periodic reboots to keep production alive
- Memprof profiling increased response times by 1000%, making production analysis nearly impossible
- The leak was in a C extension (memcached client), invisible to Ruby-level tools
- Reproducibility plus teamwork cracked it: simulating bad memcached connections exposed the pattern
The memory leak turned out to be in a C extension that Ruby profiling tools couldn't detect, solved only by reproducing production conditions locally and trusting hunches.
About This Article
Shopify's production app servers kept running out of memory, with usage growing past the machines' 16GB of physical RAM. The team resorted to rebooting servers periodically to keep things running, but couldn't figure out what was causing the growth.
They reproduced the problem locally by simulating a bad memcached connection and hammering the cache with `loop { Rails.cache.write(rand(10**10).to_s, rand(10**10).to_s) }`. Once the leak was traced to the C extension, they switched to a different memcached client library.
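The shape of that reproduction loop can be sketched outside of Rails. This is a hypothetical stand-in, not the article's actual test: a plain Hash replaces `Rails.cache` (which in production was backed by the leaking memcached client), and the loop is bounded so it terminates.

```ruby
# Simplified sketch of the stress pattern from the article's repro loop.
# Assumption: a plain Hash stands in for Rails.cache; the real test wrote
# through the memcached C extension over a deliberately bad connection.
cache = {}

1_000.times do
  key   = rand(10**10).to_s  # random ~10-digit numeric string, as in the article
  value = rand(10**10).to_s
  cache[key] = value          # every iteration allocates fresh key/value strings
end

puts "wrote #{cache.size} entries"
```

The point of the random keys is that every iteration allocates brand-new strings and forces a fresh write, so a client that leaks per-write memory grows without bound while Ruby-level heap profilers see nothing unusual.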
Memory usage stabilized after the switch, and production servers stopped needing constant restarts. The Errno::ENOMEM crashes that had been affecting the infrastructure went away.