Most Memory Leaks Are Good
Article Summary
Shopify Engineering shares a war story: their app servers were hitting 16GB+ memory usage, crashing with ENOMEM errors, and requiring constant restarts. The culprit? Not where they expected.
This 2011 Shopify Engineering post walks through a production memory leak crisis that stumped the team for weeks. Despite using Ruby profiling tools like Memprof and analyzing 2+ million live objects, the leak remained elusive until old-fashioned debugging revealed the truth.
Key Takeaways
- The leak pushed server memory past 16GB, forcing periodic reboots to keep production alive
- Memprof profiling increased response times by 1000%, making production analysis nearly impossible
- The leak was in a C extension (memcached client), invisible to Ruby-level tools
- Reproducibility plus teamwork cracked it: simulating bad memcached connections exposed the pattern
The memory leak turned out to be in a C extension that Ruby profiling tools couldn't detect, solved only by reproducing production conditions locally and trusting hunches.
About This Article
Shopify's production app servers kept running out of memory, with usage growing past the machines' 16GB of physical RAM. The team resorted to rebooting servers periodically to keep things running, but couldn't figure out what was causing the growth.
They reproduced the problem locally by simulating a bad memcached connection and hammering the cache with `loop { Rails.cache.write(rand(10**10).to_s, rand(10**10).to_s) }`. Once the leak was traced to the C extension, they switched to a different memcached client library.
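The shape of that reproduction loop can be sketched outside of Rails. This is a hypothetical stand-in, not the article's actual test: a plain Hash replaces `Rails.cache` (which in production was backed by the leaking memcached client), and the loop is bounded so it terminates.

```ruby
# Simplified sketch of the stress pattern from the article's repro loop.
# Assumption: a plain Hash stands in for Rails.cache; the real test wrote
# through the memcached C extension over a deliberately bad connection.
cache = {}

1_000.times do
  key   = rand(10**10).to_s  # random ~10-digit numeric string, as in the article
  value = rand(10**10).to_s
  cache[key] = value          # every iteration allocates fresh key/value strings
end

puts "wrote #{cache.size} entries"
```

The point of the random keys is that every iteration allocates brand-new strings and forces a fresh write, so a client that leaks per-write memory grows without bound while Ruby-level heap profilers see nothing unusual.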
Memory usage stabilized after the switch, and production servers stopped needing constant restarts. The Errno::ENOMEM crashes that had been affecting the infrastructure went away.