Detecting Android Memory Leaks in Production
Article Summary
Lyft discovered memory leaks affecting only 1% of users, issues that would never show up in local testing. Here's how they built a production monitoring system to catch what profiling tools miss.
Lyft's Android team needed visibility into memory behavior across millions of real devices and edge cases. They built a runtime monitoring system that tracks memory metrics during A/B experiments, comparing treatment vs control groups to detect regressions before full rollout.
Key Takeaways
- RSS (resident set size) chosen over PSS (proportional set size): faster to collect, and an acceptable tradeoff for comparative analysis
- Memory snapshots captured on screen close and every 60 seconds during long sessions
- Percentile distribution reveals edge case leaks that average values completely hide
- One experiment showed identical averages, but the 99th percentile exposed a critical memory leak
- System catches regressions from native C/C++ code that's hardest to profile locally
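The snapshot cadence in the takeaways (on screen close, plus every 60 seconds during long sessions) can be sketched as a small timing function. This is only an illustration: the function name `snapshot_times` is hypothetical, and measuring the 60-second interval from the moment the screen opens is an assumption, not something the article specifies.

```python
def snapshot_times(screen_open_s, screen_close_s, interval_s=60.0):
    """Return the timestamps (in seconds) at which one screen session
    would record a memory snapshot.

    Assumption: the periodic timer starts when the screen opens.
    """
    if screen_close_s < screen_open_s:
        raise ValueError("screen closed before it opened")

    times = []
    t = screen_open_s + interval_s
    # Periodic snapshots while the screen is still open.
    while t < screen_close_s:
        times.append(t)
        t += interval_s
    # Every session ends with a snapshot on screen close.
    times.append(screen_close_s)
    return times


short_session = snapshot_times(0, 20)    # only the on-close snapshot
long_session = snapshot_times(0, 150)    # two periodic snapshots, then on-close
```

Under these assumptions a short visit yields a single data point, while a long session keeps reporting, so a leak that grows slowly over minutes still shows up in the data.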
Production memory monitoring caught edge case leaks affecting the 99th percentile that local profiling would have missed entirely.
About This Article
Lyft's Android team couldn't identify memory leaks in production using traditional profiling tools like Android Studio Memory Profiler and LeakCanary. These tools only work locally and miss edge cases that happen across millions of diverse devices in real-world conditions.
Pavlo Stavytskyi's team built runtime memory monitoring using RSS metrics from /proc/[pid]/statm. They collected snapshots when UI screens closed and every 60 seconds during longer sessions, then compared treatment and control groups in A/B experiments.
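For readers unfamiliar with the file format, here is a minimal sketch of deriving RSS from a `statm` line. On Linux, `/proc/[pid]/statm` holds space-separated counts measured in pages, and the second field is the resident set size. The function names are hypothetical, and this Python version only illustrates the parsing; it is not the on-device code Lyft ships.

```python
import os


def rss_bytes_from_statm(statm_line, page_size):
    """Convert the contents of /proc/[pid]/statm into RSS in bytes.

    Fields are page counts: size, resident, shared, text, lib, data, dt.
    """
    fields = statm_line.split()
    if len(fields) < 2:
        raise ValueError("unexpected statm format: %r" % statm_line)
    resident_pages = int(fields[1])
    return resident_pages * page_size


def read_own_rss_bytes():
    """Read the current process's RSS (Linux only)."""
    with open("/proc/self/statm") as f:
        return rss_bytes_from_statm(f.read(), os.sysconf("SC_PAGE_SIZE"))


# Sample line with made-up numbers, assuming 4 KiB pages.
sample_rss = rss_bytes_from_statm("12345 678 90 1 0 234 0", 4096)
```

Reading one small text file and multiplying by the page size is cheap, which fits the article's point that RSS is faster to collect than PSS.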
The system caught a memory leak that showed up only at the 99th percentile of memory usage, roughly the top 1% of users. Local profiling would have missed this regression entirely. Because the comparison runs during the A/B experiment, Lyft can now catch such regressions before a full production rollout.
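To see why averages can hide this kind of leak, consider a small synthetic comparison. The numbers below are invented for illustration and are not Lyft's data; the nearest-rank percentile definition and the `percentile` helper are likewise assumptions.

```python
def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least
    p percent of samples at or below it (p is an integer 1..100)."""
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # integer ceil(p * n / 100)
    return ordered[max(rank, 1) - 1]


def mean(values):
    return sum(values) / len(values)


# Synthetic RSS samples in MB, one per user.
control = [200] * 100
# Treatment: most users use slightly less memory, two users leak heavily.
treatment = [196] * 98 + [396] * 2

for name, group in (("control", control), ("treatment", treatment)):
    print(name,
          "mean =", mean(group),
          "p50 =", percentile(group, 50),
          "p99 =", percentile(group, 99))
```

Both groups average 200 MB, so a mean-only dashboard would report no change, while the treatment group's 99th percentile is nearly double the control's.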