The Quest to Understand Metric Movements
Article Summary
Your key metric just tanked 20%. Was it a code change? An OS update? A data pipeline bug? Pinterest built a platform to answer this question at scale.
Pinterest's engineering team shares how they built a root-cause analysis (RCA) platform to diagnose metric movements across their system. The platform combines three complementary approaches to narrow down why metrics shift, from performance regressions to engagement drops.
Key Takeaways
- Slice and Dice: Tree-based segmentation inspired by LinkedIn's ThirdEye algorithm
- General Similarity: Four correlation factors to find metrics moving together
- Experiment Effects: Reverse A/B testing across 2,000+ metrics dynamically
- Finding: Statistical signals surfaced a link between content shifts and latency
Pinterest's RCA platform combines dimensional analysis, correlation detection, and experiment impact analysis to systematically diagnose metric movements across thousands of metrics.
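To make the slice-and-dice idea concrete, here is a minimal sketch of a single level of segmentation: it ranks the values of one dimension by how much of the total metric change each accounts for. It is not ThirdEye's algorithm or Pinterest's implementation. The function name, the row layout, and the sample numbers are all hypothetical, and a tree-based version would repeat this step on further dimensions inside the top segments.

```python
from collections import defaultdict


def segment_contributions(baseline, current, dimension):
    """Rank the values of one dimension by their share of the total change."""
    base_totals = defaultdict(float)
    curr_totals = defaultdict(float)
    for row in baseline:
        base_totals[row[dimension]] += row["value"]
    for row in current:
        curr_totals[row[dimension]] += row["value"]

    total_change = sum(curr_totals.values()) - sum(base_totals.values())
    results = []
    for key in set(base_totals) | set(curr_totals):
        change = curr_totals[key] - base_totals[key]
        share = change / total_change if total_change else 0.0
        results.append((key, change, share))

    # Segments with the largest absolute change come first.
    results.sort(key=lambda r: abs(r[1]), reverse=True)
    return results


# Made-up sample data: the metric drops from 1000 to 800 overall.
baseline = [
    {"platform": "ios", "value": 500.0},
    {"platform": "android", "value": 300.0},
    {"platform": "web", "value": 200.0},
]
current = [
    {"platform": "ios", "value": 490.0},
    {"platform": "android", "value": 120.0},
    {"platform": "web", "value": 190.0},
]

ranked = segment_contributions(baseline, current, "platform")
top_segment, top_change, top_share = ranked[0]
```

In this made-up data the Android segment accounts for most of the drop, which is the kind of signal that tells an investigator where to drill down next.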
About This Article
When metrics moved unexpectedly at Pinterest, engineers had a hard time figuring out what caused the change. The culprit could be anything from an OS upgrade to a logging error to a traffic spike. Investigating manually across multidimensional metrics was inefficient and time-consuming.
Pinterest built an RCA platform that segments metrics with a tree-based analysis using customizable significance factors. It combines Pearson and Spearman correlations to find metrics that move together. Welch's t-tests, filtered by harmonic mean p-values, help reduce noise and false signals.
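The statistical pieces named above can be sketched with the standard library alone. This is an illustrative sketch, not the platform's code: the function names are hypothetical, and how Pinterest wires these statistics together is not described here. The Welch helper returns only the t statistic and the Welch-Satterthwaite degrees of freedom. Turning those into a p-value needs a t-distribution, which a library call such as SciPy's `scipy.stats.ttest_ind(a, b, equal_var=False)` provides.

```python
import math
from statistics import mean, variance


def pearson(x, y):
    """Pearson correlation: strength of the linear relationship."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def welch_t(a, b):
    """Welch's t statistic and degrees of freedom (unequal variances)."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def harmonic_mean_p(p_values):
    """Equal-weight harmonic mean of a list of p-values."""
    return len(p_values) / sum(1.0 / p for p in p_values)


x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]  # monotonic in x but not linear
rho_linear = pearson(x, y)
rho_rank = spearman(x, y)
```

Using both coefficients is useful because Spearman picks up monotonic relationships that are not linear: here `rho_rank` is 1 while `rho_linear` is slightly below 1.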
The platform can analyze experiment impacts across more than 2,000 metrics on demand, without precomputing results, so teams can diagnose issues ad hoc when they need to. Application-level caching and query optimizations keep computational costs down.