Detecting Android Memory Leaks in Production
Article Summary
Lyft discovered memory leaks that affected only 1% of users, issues that would never show up in local testing. Here is how the team built a production monitoring system to catch what profiling tools miss.
Lyft's Android team needed visibility into memory behavior across millions of real devices and edge cases. They built a runtime monitoring system that tracks memory metrics during A/B experiments, comparing treatment vs control groups to detect regressions before full rollout.
Key Takeaways
- RSS (resident set size) chosen over PSS (proportional set size): faster to collect, an acceptable tradeoff for comparative analysis
- Memory snapshots captured on screen close and every 60 seconds during long sessions
- Percentile distribution reveals edge case leaks that average values completely hide
- In one experiment the averages were identical, but the 99th percentile exposed a critical memory leak
- System catches regressions from native C/C++ code, which is the hardest to profile locally
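The collection rules in the takeaways can be sketched in a few lines. This is a minimal illustration in Python, not Lyft's implementation: it assumes RSS is taken from the `VmRSS` field of the Linux `/proc/<pid>/status` file (the summary does not say how RSS is read), and the names `parse_rss_kb` and `should_snapshot` are hypothetical. On a device the same logic would live in the app's own Kotlin or Java code.

```python
import re

# Interval taken from the takeaways above: one snapshot every 60 seconds
# during long sessions.
SNAPSHOT_INTERVAL_S = 60

# /proc/<pid>/status reports resident set size as e.g. "VmRSS:   204800 kB".
_VMRSS = re.compile(r"^VmRSS:\s+(\d+)\s+kB$", re.MULTILINE)


def parse_rss_kb(status_text):
    """Return VmRSS in kB from the text of /proc/<pid>/status, or None if absent."""
    match = _VMRSS.search(status_text)
    return int(match.group(1)) if match else None


def should_snapshot(screen_closed, seconds_since_last):
    """Snapshot when a screen closes, or when the periodic interval has elapsed."""
    return screen_closed or seconds_since_last >= SNAPSHOT_INTERVAL_S


# Hypothetical excerpt of a status file, used only to exercise the parser.
sample_status = "Name:\tapp\nVmRSS:\t  204800 kB\nThreads:\t12\n"
rss_kb = parse_rss_kb(sample_status)
```

Each snapshot would then be reported together with the user's experiment group, so the treatment and control distributions can be compared later.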
Critical Insight
Production memory monitoring caught edge case leaks affecting the 99th percentile that local profiling would have missed entirely.
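To see why averages can hide such a leak, consider a small sketch of the treatment vs control comparison. The samples below are synthetic and chosen so that both groups have the same mean; the nearest-rank percentile method, the 10% tolerance, and the function names are assumptions for illustration, not details from Lyft's system.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def regressed_percentiles(control, treatment, percentiles=(50, 90, 99), tolerance=0.10):
    """Return the percentiles where treatment exceeds control by more than tolerance."""
    flagged = []
    for p in percentiles:
        baseline = percentile(control, p)
        candidate = percentile(treatment, p)
        if candidate > baseline * (1 + tolerance):
            flagged.append(p)
    return flagged


def mean(samples):
    return sum(samples) / len(samples)


# Synthetic RSS samples in MB (made up for illustration, not Lyft data).
# Most treatment sessions use slightly less memory, but 2% of them leak.
control = [200] * 1000
treatment = [198] * 980 + [298] * 20

control_mean = mean(control)      # 200.0
treatment_mean = mean(treatment)  # 200.0, identical to control
flagged = regressed_percentiles(control, treatment)  # [99]
```

The means match exactly, yet the 99th percentile jumps from 200 MB to 298 MB, so only the percentile comparison flags the regression.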