Lyft Jingwei Hao Dec 17, 2020

Introducing Pulse: Envoy Mobile's stats library

Article Summary

Jingwei Hao from Lyft reveals how real-time stats APIs caught a production crash spike at 9:55am, enabling engineers to ship a hotfix before most users even noticed the problem.

Lyft built Pulse, a stats library for Envoy Mobile that brings server-side observability practices (counters, gauges, histograms) to mobile apps. Unlike traditional crash reporting, which lags by minutes, or analytics events, which take longer still, Pulse reports time-series data in real time and integrates with PagerDuty and dashboards just as backend services do.

Key Takeaways

Critical Insight

Pulse brings time-series metrics to mobile development, giving mobile engineers the same real-time observability that backend engineers take for granted.

The article details exactly how Lyft's stats flow from mobile clients to their observability systems, plus what's coming with histogram support and stat tagging.

About This Article

Problem

Mobile teams traditionally used crash reporting tools like Crashlytics, which had minutes-level latency, or analytics systems with even longer delays. This made real-time anomaly detection and rapid incident response nearly impossible.

Solution

Jingwei Hao's team at Lyft built Pulse with Counter and Gauge APIs that send time-series data to a gRPC service based on StatsD. The stats are serialized as Prometheus MetricFamily messages, so they work with existing backend observability systems.
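To make the counter and gauge distinction concrete, here is a minimal Python sketch. It is not Pulse's actual API: the class names, the method names (`increment`, `set`, `add`, `sub`), and the `snapshot` helper are all hypothetical. The dictionaries it produces only loosely mirror the field names of Prometheus's MetricFamily message (`name`, `type`, `metric`); a real client would build protobuf messages and stream them over gRPC.

```python
import threading


class Counter:
    """A value that only increases, e.g. the number of crashes seen."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._value = 0
        self._lock = threading.Lock()

    def increment(self, count: int = 1) -> None:
        if count < 0:
            raise ValueError("a counter can only increase")
        with self._lock:
            self._value += count

    def value(self) -> int:
        with self._lock:
            return self._value


class Gauge:
    """A value that can move up or down, e.g. open connections."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._value = 0
        self._lock = threading.Lock()

    def set(self, value: int) -> None:
        with self._lock:
            self._value = value

    def add(self, amount: int) -> None:
        with self._lock:
            self._value += amount

    def sub(self, amount: int) -> None:
        with self._lock:
            self._value -= amount

    def value(self) -> int:
        with self._lock:
            return self._value


def snapshot(counters, gauges):
    """Return MetricFamily-shaped dicts (assumed layout, not real protobuf)."""
    families = []
    for counter in counters:
        families.append({
            "name": counter.name,
            "type": "COUNTER",
            "metric": [{"counter": {"value": float(counter.value())}}],
        })
    for gauge in gauges:
        families.append({
            "name": gauge.name,
            "type": "GAUGE",
            "metric": [{"gauge": {"value": float(gauge.value())}}],
        })
    return families
```

In this sketch a crash handler would call `increment()` on a counter, while a gauge tracks a level that rises and falls. Reporting cumulative counter values, rather than deltas since the last flush, is an assumption made here for simplicity.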

Impact

Mobile on-call engineers now get immediate PagerDuty alerts when metrics spike. This lets them identify problems and deploy hotfixes before users are widely affected. When the app crash metric spiked recently, the team caught and fixed it the same morning.