Sentry · Edward Gou · Jan 17, 2024

How We Improved Performance Score Accuracy

Article Summary

Edward Gou from Sentry reveals why their performance scores were lying to developers. A single slow pageload could tank your entire app's score, even when 99% of users had fast experiences.

Sentry's Performance Score condenses multiple Web Vitals (LCP, FCP, FID, TTFB, CLS) into a 0-100 rating based on real user data. But the original calculation method had a fatal flaw: it aggregated each metric across pageloads first, then scored the aggregates. That let outliers completely misrepresent actual user experience.

Key Takeaways

Critical Insight

By scoring individual pageloads and then averaging, instead of aggregating metrics first, Sentry stopped outliers from unfairly tanking performance scores that should have reflected mostly positive user experiences.

The mathematical function they use (Complementary Log-Normal CDF) and the specific weight distribution across Web Vitals reveal interesting priorities about what matters most for perceived performance.

About This Article

Problem

Sentry's Performance Scores used static weighting across five Web Vitals (LCP 30%, FID 30%, CLS 15%, FCP 15%, TTFB 10%). Because the weights never changed, LCP carried its full 30% weight even when it rested on a single Chrome sample while the other metrics drew on 100 Safari pageloads, a browser that doesn't report LCP at all.
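A minimal sketch of the imbalance, using made-up sub-scores rather than Sentry data: under static weighting, a lone LCP sample still controls 30% of the total, while renormalizing the weights over the metrics that are actually present tells a very different story.

```python
# Illustrative only: sub-scores and traffic mix are hypothetical, not Sentry data.
STATIC_WEIGHTS = {"lcp": 0.30, "fid": 0.30, "cls": 0.15, "fcp": 0.15, "ttfb": 0.10}

# Hypothetical per-vital sub-scores (0-1) for a mostly-Safari app:
# the LCP sub-score comes from one slow Chrome pageload, the rest
# from ~100 fast Safari pageloads (Safari doesn't report LCP).
sub_scores = {"lcp": 0.10, "fid": 0.95, "cls": 0.90, "fcp": 0.92, "ttfb": 0.88}

# Static weighting: the single LCP sample still claims its full 30%.
static_total = sum(STATIC_WEIGHTS[k] * sub_scores[k] for k in sub_scores)

# Renormalized weighting: drop the missing/unreliable vital and rescale the rest.
present = {k: w for k, w in STATIC_WEIGHTS.items() if k != "lcp"}
dynamic_total = sum(present[k] * sub_scores[k] for k in present) / sum(present.values())
```

With these numbers, the lone sample drags the static total to roughly 0.68, while renormalizing over the reliably measured vitals yields roughly 0.92.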

Solution

Sentry changed how it calculates Performance Scores. Instead of aggregating Web Vitals first and then scoring, it now scores each individual pageload: the pageload's Web Vitals are each run through the Complementary Log-Normal CDF function and combined with dynamically adjusted weights that skip any Web Vital missing from that pageload. The resulting per-pageload scores, bounded between 0 and 100, are then averaged together.
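The per-pageload calculation can be sketched as follows. This is a hedged approximation, not Sentry's exact implementation: the p10/p50 reference values below are invented for illustration, and the curve is parameterized (an assumption) so that each Web Vital scores 0.9 at its p10 value and 0.5 at its p50 value.

```python
import math

# Hypothetical (p10, p50) reference values per Web Vital, in ms except CLS.
# These parameterize one log-normal curve per vital; real values are assumptions.
P10_P50 = {
    "lcp": (1200.0, 2400.0),
    "fcp": (900.0, 1600.0),
    "fid": (100.0, 300.0),
    "cls": (0.1, 0.25),
    "ttfb": (200.0, 400.0),
}

WEIGHTS = {"lcp": 0.30, "fid": 0.30, "cls": 0.15, "fcp": 0.15, "ttfb": 0.10}

def vital_score(value, p10, p50):
    """Complementary Log-Normal CDF: 1 near 0, 0.9 at p10, 0.5 at p50."""
    if value <= 0:
        return 1.0
    mu = math.log(p50)
    # Choose sigma so the CDF at p10 equals 0.1 (i.e. score 0.9):
    # Phi((ln p10 - mu) / sigma) = 0.1  =>  sigma = (ln p10 - mu) / z_0.1
    z_10 = -1.2815515655446004  # 10th-percentile z-score of the standard normal
    sigma = (math.log(p10) - mu) / z_10
    z = (math.log(value) - mu) / sigma
    phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2)))  # standard normal CDF
    return 1.0 - phi  # complementary CDF

def pageload_score(vitals):
    """Score one pageload, renormalizing weights over the vitals present."""
    present = {k: v for k, v in vitals.items() if v is not None}
    total_weight = sum(WEIGHTS[k] for k in present)
    if total_weight == 0:
        return None
    weighted = sum(WEIGHTS[k] * vital_score(v, *P10_P50[k]) for k, v in present.items())
    return 100.0 * weighted / total_weight
```

A Safari pageload with no LCP or FID simply redistributes those weights across FCP, CLS, and TTFB, so missing vitals never drag the score toward zero.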

Impact

Because individual pageload scores cap at 100, outliers have less impact on the final average. This makes scores consistent and accurate whether you're looking at app-level data, drilling down to a specific page, or examining a single pageload.
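The capping effect is easy to demonstrate. The sketch below uses a single made-up metric and an illustrative scoring curve, not Sentry's actual parameters: one catastrophic outlier among 100 pageloads wrecks the score-of-the-average but barely dents the average-of-scores.

```python
import math

def score(ms, p50=2500.0, sigma=0.9):
    """Illustrative complementary log-normal scoring curve, 0-100."""
    z = (math.log(ms) - math.log(p50)) / sigma
    return 100.0 * (1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2))))

fast = [800.0] * 99          # 99 fast pageloads
outlier = [300000.0]         # one five-minute pageload

# Old approach: aggregate first, then score. The outlier drags the mean
# duration to ~3800 ms and the score down with it.
score_of_mean = score(sum(fast + outlier) / 100)

# New approach: score each pageload (each capped at 100), then average.
# The outlier can contribute at worst a single 0 out of 100 samples.
mean_of_scores = sum(score(v) for v in fast + outlier) / 100
```

With this curve, the old method reports a score in the low 30s for an app where 99% of pageloads were fast, while the new method stays near 90.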