Performance as a Core Product Feature
Article Summary
Snapchat treats performance as a core product feature, not just a requirement. Their custom tracing system catches regressions that off-the-shelf tools completely miss.
Snapchat's engineering team built a production tracing system from scratch to keep the critical 'open-to-camera' flow instant for all users, not just the median user. The team focuses on protecting p90 tail latency across device types and network conditions, which surfaces performance issues that affect only a small percentage of users but still cause real frustration.
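To see why a median can hide the kind of regression described here, the following minimal sketch compares the median and the p90 of two sets of latencies. The numbers, the nearest-rank percentile method, the function names, and the 10% tolerance are all assumptions made for illustration, not details from Snapchat's system.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def tail_regressed(baseline_ms, candidate_ms, pct=90, tolerance=0.10):
    """True if the candidate's percentile exceeds the baseline's by more than `tolerance`."""
    return percentile(candidate_ms, pct) > percentile(baseline_ms, pct) * (1 + tolerance)


# Hypothetical open-to-camera latencies in milliseconds (made-up numbers).
baseline = [200, 210, 220, 230, 240, 250, 260, 270, 300, 320]
candidate = [200, 210, 220, 230, 240, 250, 260, 270, 900, 950]

print(percentile(baseline, 50), percentile(candidate, 50))  # 240 240
print(percentile(baseline, 90), percentile(candidate, 90))  # 300 900
print(tail_regressed(baseline, candidate, pct=50))          # False
print(tail_regressed(baseline, candidate, pct=90))          # True
```

In this made-up data the median is identical in both builds, so a median-based check passes, while the p90 triples and the tail check fails.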
Key Takeaways
- Guards p90 tail latency instead of median to catch critical outliers
- Bounded buffer design keeps tracing overhead low and predictable in production
- Smart sampling uses tokens to target specific sessions for deep traces
- Caught disk contention, priority inversion, and language interop bottlenecks
- Retroactive spans capture the full cold-start story, starting from process load
By building custom tracing infrastructure optimized for mobile constraints, Snapchat can debug complex thread interactions and gate rollouts when tail latencies regress.
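The token-based sampling mentioned in the takeaways is not described in detail here, so the following is only one plausible shape for it. It assumes a backend grants a token to a specific session and the client enables deep tracing while that token is valid. The names `TraceToken` and `DeepTraceSampler`, and the expiry field, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceToken:
    """Hypothetical grant that marks one session for deep tracing."""
    session_id: str
    expires_at: float  # seconds on whatever clock the caller uses (an assumption)


class DeepTraceSampler:
    """Decides whether a session should record a deep trace, based on granted tokens."""

    def __init__(self):
        self._tokens = {}

    def grant(self, token):
        # Store the token under its session; a newer token replaces an older one.
        self._tokens[token.session_id] = token

    def should_deep_trace(self, session_id, now):
        token = self._tokens.get(session_id)
        if token is None:
            return False  # untargeted sessions are not deep traced
        if now >= token.expires_at:
            del self._tokens[session_id]  # expired tokens are discarded
            return False
        return True


sampler = DeepTraceSampler()
sampler.grant(TraceToken(session_id="session-a", expires_at=100.0))
print(sampler.should_deep_trace("session-a", now=50.0))   # True
print(sampler.should_deep_trace("session-b", now=50.0))   # False
print(sampler.should_deep_trace("session-a", now=100.0))  # False
```

Targeting by token keeps the expensive deep traces limited to the sessions someone actually wants to inspect, rather than paying that cost on every device.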
About This Article
Snapchat's engineers had a hard time tracking down performance problems that standard profiling tools couldn't catch. The issues included unexpected IPC activity on the main thread during Keychain operations and heavy concurrency that created contention in the Objective-C runtime during dynamic class lookups.
Snap built a three-stage tracing system. It has a Tracer API for emitting Sync/Async Spans and Counters, a bounded in-memory Session Container, and a Protobuf-based Publish Pipeline. The pipeline converts session data for backend aggregation while keeping runtime overhead minimal.
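The three stages can be sketched roughly as follows. This is a minimal Python illustration, not Snap's implementation: every class and method name is hypothetical, the container is assumed to drop its oldest event when full (the drop policy is not stated here), and the publish step emits plain dictionaries as a stand-in for Protobuf messages.

```python
import time
from collections import deque
from contextlib import contextmanager
from dataclasses import asdict, dataclass


@dataclass
class Event:
    kind: str  # "sync_span", "async_span", or "counter"
    name: str
    start_ns: int
    end_ns: int
    value: float = 0.0


class SessionContainer:
    """Bounded in-memory store. Assumption: when full, the oldest event is dropped."""

    def __init__(self, capacity):
        self._events = deque(maxlen=capacity)
        self.dropped = 0

    def add(self, event):
        if len(self._events) == self._events.maxlen:
            self.dropped += 1  # the append below evicts the oldest event
        self._events.append(event)

    def drain(self):
        events = list(self._events)
        self._events.clear()
        return events


class Tracer:
    """Emits sync spans, async spans, and counters into a container."""

    def __init__(self, container, clock=time.monotonic_ns):
        self._container = container
        self._clock = clock
        self._open_async = {}

    @contextmanager
    def sync_span(self, name):
        start = self._clock()
        try:
            yield
        finally:
            self._container.add(Event("sync_span", name, start, self._clock()))

    def begin_async(self, name, span_id):
        self._open_async[span_id] = (name, self._clock())

    def end_async(self, span_id):
        name, start = self._open_async.pop(span_id)
        self._container.add(Event("async_span", name, start, self._clock()))

    def counter(self, name, value):
        now = self._clock()
        self._container.add(Event("counter", name, now, now, value))


def publish(container):
    """Stand-in for the publish stage: plain dicts instead of Protobuf messages."""
    return {"dropped": container.dropped,
            "events": [asdict(e) for e in container.drain()]}


container = SessionContainer(capacity=3)
tracer = Tracer(container)
with tracer.sync_span("camera.init"):
    pass
tracer.begin_async("lens.download", span_id=1)
tracer.end_async(1)
tracer.counter("frames.dropped", 2)
tracer.counter("frames.dropped", 3)
payload = publish(container)
print(payload["dropped"], [e["kind"] for e in payload["events"]])
# 1 ['async_span', 'counter', 'counter']
```

Because the container has a fixed capacity, memory use stays bounded no matter how many events a session emits; the cost is that a busy session loses some events, which the sketch records in `dropped`.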
With this custom infrastructure in place, Snapchat found and fixed blocking system calls, language interop bottlenecks, and priority inversion issues. These problems had been causing UI stalls and stuttering that users experienced but the team couldn't explain.