Pinterest Feb 11, 2025

The Quest to Understand Metric Movements

Article Summary

Your key metric just tanked 20%. Was it a code change? An OS update? A data pipeline bug? Pinterest built a platform to answer this question at scale.

Pinterest's engineering team shares how they built a root-cause analysis (RCA) platform to diagnose metric movements across their system. The platform combines three complementary approaches to narrow down why metrics shift, from performance regressions to engagement drops.

Key Takeaways

Critical Insight

Pinterest's RCA platform combines dimensional analysis, correlation detection, and experiment impact analysis to systematically diagnose metric movements across thousands of metrics.
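To make the dimensional-analysis idea concrete, here is a minimal sketch, not Pinterest's implementation. It assumes an additive metric broken down by a single dimension, and it treats the "significance factor" as a hypothetical threshold on each segment's share of the total change. The function name and the sample data are illustrative only.

```python
def segment_contributions(before, after, significance_factor=0.2):
    """Return segments whose share of the total metric change is at least
    `significance_factor` (a hypothetical threshold, chosen for illustration).

    `before` and `after` map segment name -> value of an additive metric.
    """
    total_delta = sum(after.values()) - sum(before.values())
    if total_delta == 0:
        return []  # nothing moved overall, so there is nothing to attribute

    flagged = []
    for segment in sorted(set(before) | set(after)):
        delta = after.get(segment, 0.0) - before.get(segment, 0.0)
        # A positive share means the segment moved in the same direction
        # as the overall metric.
        share = delta / total_delta
        if share >= significance_factor:
            flagged.append((segment, delta, share))

    # Largest contributors first.
    return sorted(flagged, key=lambda row: row[2], reverse=True)


# Illustrative data: the overall metric drops by 50, mostly on one platform.
before = {"ios": 100, "android": 100, "web": 50}
after = {"ios": 60, "android": 95, "web": 45}
top_segments = segment_contributions(before, after)
```

In this toy input only the `ios` segment crosses the threshold, because it accounts for 40 of the 50-unit drop. A real system would repeat this across many dimensions and combinations of dimensions.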

The team is also exploring causal discovery algorithms to move beyond correlation and provide stronger evidence of causality.

About This Article

Problem

When metrics moved unexpectedly at Pinterest, engineers had a hard time figuring out what caused the change. The culprit could be anything from an OS upgrade to a logging error to a traffic spike. Investigating manually across multidimensional metrics was inefficient and time-consuming.

Solution

Pinterest built an RCA platform that covers all three approaches. For dimensional analysis, it uses segment tree analysis with customizable significance factors. For correlation detection, it combines Pearson and Spearman correlations to find relationships in the data. For experiment impact, it runs Welch's t-tests and filters the results by harmonic mean p-value to reduce noise and false signals.
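The statistics named above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the platform's code: all function names are hypothetical, the harmonic mean p-value uses equal weights, and the Welch helper returns only the t statistic and the Welch-Satterthwaite degrees of freedom. Turning those into a p-value needs a Student's t distribution, which the standard library does not provide.

```python
import math
from statistics import harmonic_mean, mean, variance


def pearson(x, y):
    """Pearson correlation: strength of the linear relationship."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson on ranks, so it captures any
    monotonic relationship, not only linear ones."""
    return pearson(average_ranks(x), average_ranks(y))


def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two samples
    that may have unequal variances."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def harmonic_mean_p(p_values):
    """Equal-weight harmonic mean of p-values: n / sum(1 / p_i)."""
    return harmonic_mean(p_values)
```

Using both correlations is a reasonable pairing because Pearson only measures linear association, while Spearman also picks up monotonic but non-linear ones. For example, `spearman([1, 2, 3, 4], [1, 4, 9, 16])` is 1 while the Pearson value for the same data is below 1. The harmonic mean is dominated by its smallest inputs, so a single strongly significant result is not washed out by several unremarkable ones.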

Impact

The platform can analyze experiment impacts across 2,000 metrics on demand, without pre-computing results, so teams can diagnose issues ad hoc when they need to. Application-level caching and query optimizations keep computational costs down.
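One way to picture on-demand analysis with application-level caching is to memoize results by query parameters, so a repeated request skips the expensive computation. The sketch below is an assumption-laden illustration: `run_expensive_query`, `experiment_impact`, the cache size, and the argument names are all hypothetical, and the article summary does not describe Pinterest's actual cache design.

```python
from functools import lru_cache

# Counts how often the (simulated) expensive query actually runs.
QUERY_CALLS = {"count": 0}


def run_expensive_query(experiment_id, metric, start, end):
    """Hypothetical stand-in for a costly warehouse query."""
    QUERY_CALLS["count"] += 1
    return (experiment_id, metric, start, end)


@lru_cache(maxsize=4096)
def experiment_impact(experiment_id, metric, start, end):
    """Compute on first request, then serve repeats from an in-process cache.

    All arguments must be hashable because they form the cache key.
    """
    return run_expensive_query(experiment_id, metric, start, end)


# The second identical request is a cache hit and does not re-run the query.
first = experiment_impact("exp_1", "sessions", "2025-01-01", "2025-01-07")
second = experiment_impact("exp_1", "sessions", "2025-01-01", "2025-01-07")
```

An in-process cache like this is the simplest form. A shared cache with expiry would be the natural next step when results must be reused across servers or refreshed as new data arrives.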