Lyft Pavlo Stavytskyi Jan 17, 2023

Detecting Android Memory Leaks in Production

Article Summary

Lyft discovered memory leaks that affected only about 1% of users, issues that would never show up in local testing. Here's how the team built a production monitoring system to catch what local profiling tools miss.

Lyft's Android team needed visibility into memory behavior across millions of real devices and their edge cases. They built a runtime monitoring system that tracks memory metrics during A/B experiments and compares the treatment group against the control group to detect regressions before full rollout.

Key Takeaways

Critical Insight

Production memory monitoring caught edge-case leaks that surfaced only at the 99th percentile and that local profiling would have missed entirely.

The article explains why the team rejected the standard PSS (proportional set size) metric even though Android Studio uses it, and shares the formulas for calculating memory footprint from Linux system files.

About This Article

Problem

Lyft's Android team couldn't identify memory leaks in production using traditional profiling tools such as Android Studio Memory Profiler and LeakCanary. These tools only work locally, so they miss edge cases that occur across millions of diverse devices in real-world conditions.

Solution

Pavlo Stavytskyi's team built runtime memory monitoring around the RSS (resident set size) metric read from /proc/[pid]/statm. They collected a snapshot whenever a UI screen closed and every 60 seconds during longer sessions, then compared the treatment and control groups in A/B experiments.
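
To make the RSS reading concrete, here is a minimal sketch in Python rather than Lyft's Android code. It relies on the documented Linux layout of statm, where the second field is the number of resident pages, and multiplies that by the page size. The function names and the sample line are hypothetical and chosen for illustration only.

```python
import os


def rss_bytes_from_statm(statm_text: str, page_size: int) -> int:
    """Return the resident set size in bytes from the contents of a statm file.

    statm holds whitespace-separated counts measured in pages:
    size resident shared text lib data dt
    """
    fields = statm_text.split()
    if len(fields) < 2:
        raise ValueError("expected at least two fields in statm contents")
    resident_pages = int(fields[1])  # second field: resident pages
    return resident_pages * page_size


def read_own_rss_bytes() -> int:
    """Read RSS for the current process. Works on Linux only."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    with open("/proc/self/statm") as f:
        return rss_bytes_from_statm(f.read(), page_size)


# Made-up sample line, assuming a 4096-byte page size.
sample = "250000 51200 3000 1 0 90000 0"
print(rss_bytes_from_statm(sample, 4096) // (1024 * 1024), "MiB")  # prints "200 MiB"
```

Keeping the parsing separate from the file read means the arithmetic can be checked on any machine, while `read_own_rss_bytes` is the only part that needs a Linux /proc filesystem.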

Impact

The system caught a memory leak that appeared only at the 99th percentile of users, a regression that local profiling would have missed entirely. With this monitoring in place, Lyft can catch memory regressions before a feature reaches full production rollout.
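
A small sketch shows why a leak like this can hide from averages and medians but show up at a high percentile. It is written in Python with entirely made-up memory samples, and the nearest-rank percentile method and the `compare_groups` helper are assumptions for illustration, not the analysis Lyft describes.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def compare_groups(control, treatment, percentiles=(50, 90, 99)):
    """Return {percentile: (control_value, treatment_value, delta)}."""
    report = {}
    for p in percentiles:
        c = percentile(control, p)
        t = percentile(treatment, p)
        report[p] = (c, t, t - c)
    return report


# Synthetic data: 2 of 100 treatment sessions leak heavily, the rest match control.
control_mb = [200] * 100
treatment_mb = [200] * 98 + [900] * 2

for p, (c, t, delta) in compare_groups(control_mb, treatment_mb).items():
    print(f"p{p}: control={c} MB treatment={t} MB delta={delta:+d} MB")
```

With this synthetic data the p50 and p90 deltas are zero and only p99 shows a difference, which is the kind of pattern that comparing treatment against control at several percentiles is meant to expose.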