Fixing Performance Regressions Before They Happen
Article Summary
Angus Croll from Netflix reveals how his team slashed false performance alerts by 90% while catching more real regressions. The secret? They stopped using static thresholds entirely.
Netflix's TVUI team runs performance tests on 1,700+ device types serving 222 million members. Their old approach with static memory thresholds created constant false alarms and missed subtle regressions. They needed a smarter way to detect performance issues before code shipped to production.
Key Takeaways
- Reduced alerts from 100+ to 10 per month with 90% fewer false positives
- Anomaly detection flags values more than 4 standard deviations above the mean of the most recent 40 runs
- Changepoint detection uses the e-divisive algorithm to spot shifts in the distribution of test results
- Running each test 3 times and keeping the minimum value effectively filters out device noise
- Dynamic thresholds adapt automatically, eliminating manual threshold adjustments
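The min-of-3 filtering and the 4-sigma rule from the takeaways above can be sketched in a few lines. This is a minimal illustration with hypothetical memory readings, not Netflix's actual implementation; the function names and data are invented for the example.

```python
from statistics import mean, stdev

def min_of_runs(samples):
    """Filter device noise by running a test several times (the summary
    says 3x) and keeping only the minimum observed value."""
    return min(samples)

def is_anomaly(history, value, window=40, n_sigma=4):
    """Flag a value more than n_sigma standard deviations above the
    mean of the most recent `window` runs (dynamic, not static, threshold)."""
    recent = history[-window:]
    return value > mean(recent) + n_sigma * stdev(recent)

# Hypothetical memory readings in MB: a stable history, then a spike.
history = [100 + (i % 5) for i in range(40)]   # cycles through 100..104
candidate = min_of_runs([130, 128, 126])       # min-of-3 filtering -> 126

print(is_anomaly(history, candidate))          # the spike clears the 4-sigma bar
print(is_anomaly(history, 104))                # an in-range value does not
```

Because the threshold is recomputed from the trailing window on every run, it adapts automatically as baseline performance drifts, which is what removes the manual threshold-tuning burden.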
Critical Insight
By replacing static thresholds with statistical anomaly and changepoint detection, Netflix now catches genuine performance regressions earlier with 90% fewer false alerts.
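The e-divisive method mentioned above recursively splits a series using an energy-distance statistic and a permutation test for significance. The full algorithm is involved, but the core statistic is simple enough to sketch. Below is a simplified single-changepoint version (no recursion, no permutation test) on invented data, to show the idea of scoring every split point and picking the one where the two sides' distributions diverge most:

```python
def energy_divergence(x, y, alpha=1.0):
    """Energy-distance-style divergence between two segments: large when
    the segments come from different distributions."""
    n, m = len(x), len(y)
    between = 2.0 / (n * m) * sum(abs(a - b) ** alpha for a in x for b in y)
    within_x = (2.0 / (n * (n - 1)) * sum(abs(x[i] - x[j]) ** alpha
                for i in range(n) for j in range(i + 1, n))) if n > 1 else 0.0
    within_y = (2.0 / (m * (m - 1)) * sum(abs(y[i] - y[j]) ** alpha
                for i in range(m) for j in range(i + 1, m))) if m > 1 else 0.0
    return (n * m / (n + m)) * (between - within_x - within_y)

def best_changepoint(series, min_size=5):
    """Scan all split points and return the index with maximal divergence."""
    best_t, best_score = None, float("-inf")
    for t in range(min_size, len(series) - min_size + 1):
        score = energy_divergence(series[:t], series[t:])
        if score > best_score:
            best_t, best_score = t, score
    return best_t

# Hypothetical metric that shifts from ~10 to ~15 at run 20.
series = [10.0] * 20 + [15.0] * 20
print(best_changepoint(series))  # -> 20
```

Unlike a point-wise sigma check, this catches gradual or step-wise shifts in the whole distribution of results, which is why the two detectors are complementary.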