Lyft Michael Rebello Oct 17, 2022

Recovering from Crashes with Safe Mode

Article Summary

Lyft engineers faced a nightmare scenario: feature flags causing infinite crash loops on app launch, requiring emergency hotfixes and losing revenue. They built Safe Mode to break the cycle.

Michael Rebello from Lyft Engineering describes how his team built an automated recovery system that detects crash loops caused by bad configuration changes and keeps users from getting stuck in an unusable app.
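The summary does not show Lyft's implementation, so the following is only a minimal sketch of the general idea: count launches that never reach a healthy state, and fall back to built-in defaults once that count hits a threshold. Python is used purely for illustration. The names `LaunchTracker`, `load_config`, and `CRASH_LOOP_THRESHOLD`, the threshold value, and the JSON state file are all assumptions, not details from the article.

```python
import json
from pathlib import Path

# Hypothetical value: the launch on which Safe Mode engages if no earlier
# launch in the streak was ever marked stable.
CRASH_LOOP_THRESHOLD = 3


class LaunchTracker:
    """Persists a count of consecutive launches that never became stable."""

    def __init__(self, state_file: Path, threshold: int = CRASH_LOOP_THRESHOLD):
        self.state_file = state_file
        self.threshold = threshold

    def _read(self) -> int:
        try:
            return int(json.loads(self.state_file.read_text())["unstable_launches"])
        except (FileNotFoundError, KeyError, TypeError, ValueError):
            # Missing or corrupt state is treated as "no recent crashes".
            return 0

    def _write(self, count: int) -> None:
        self.state_file.write_text(json.dumps({"unstable_launches": count}))

    def begin_launch(self) -> bool:
        """Record a launch attempt; return True if Safe Mode should engage."""
        count = self._read() + 1
        self._write(count)
        return count >= self.threshold

    def mark_stable(self) -> None:
        """Call once the app has run long enough to be considered healthy."""
        self._write(0)


def load_config(tracker: LaunchTracker, remote_config: dict, defaults: dict) -> dict:
    """Ignore remote flag values when a crash loop is suspected."""
    if tracker.begin_launch():
        return dict(defaults)
    return {**defaults, **remote_config}
```

Because the counter is written before any remote configuration is applied, a crash during startup leaves it incremented. Only a launch that survives long enough to call `mark_stable()` resets it.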

Key Takeaways

Critical Insight

Lyft's Safe Mode automatically recovers from configuration-induced crash loops, avoiding hotfixes while keeping the app usable for affected users during incident resolution.

The team discovered an unexpected benefit: Safe Mode exposed a hidden thread-safety bug that had been lurking in their networking layer for years.

About This Article

Problem

Safe Mode was built because misconfigured feature flags were crashing Lyft's app on launch. In production, however, it was also being triggered at low volumes throughout the day. The root cause of those triggers was a thread-safety bug in how network request headers were handled, which surfaced during app startup while data was being refreshed.

Solution

Michael Rebello's team found the issue by correlating Safe Mode events with crash reports and examining the stack traces. They then fixed the thread-safety bug, which had been causing race conditions during app initialization.
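The summary does not say how the header bug was fixed. One common remedy for this class of race is to guard the shared header map with a lock and hand readers a copy. The sketch below illustrates that pattern only; `RequestHeaders` and its methods are hypothetical names, and Python stands in for whatever language the app actually uses.

```python
import threading


class RequestHeaders:
    """Header store that is safe to update on one thread and read on another."""

    def __init__(self):
        self._lock = threading.Lock()
        self._headers = {}

    def set(self, name: str, value: str) -> None:
        with self._lock:
            self._headers[name] = value

    def update(self, new_headers: dict) -> None:
        # Apply a batch of headers atomically, e.g. after a startup data refresh.
        with self._lock:
            self._headers.update(new_headers)

    def snapshot(self) -> dict:
        # Return a copy so callers never iterate a dict that is being mutated.
        with self._lock:
            return dict(self._headers)
```

A request builder would call `snapshot()` once per request and attach the returned copy, so a refresh running on another thread cannot change the headers midway through building the request.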

Impact

The fix reduced false-positive Safe Mode triggers and brought the monitoring graphs back to baseline. Engineers could now distinguish actual configuration problems from unrelated crashes.