Skyscanner’s journey to effective observability
Article Summary
Skyscanner was drowning in observability chaos: multiple vendors, fragmented tools, and engineers losing confidence in their ability to debug production issues.
During COVID-19, Skyscanner's platform team seized the opportunity to completely overhaul their observability stack. They migrated 300+ microservices from a patchwork of specialized vendors and internal systems to a unified approach built on open standards.
Key Takeaways
- Standardized on OpenTelemetry and New Relic to eliminate context switching across tools
- Migrated 300+ microservices in weeks using automated PRs via Turbolift
- Cut telemetry costs by over 90% by applying smart sampling to 2M spans/second
- Created an Observability Ambassadors program to drive cultural adoption across teams
- Shifted SLOs from API metrics to actual user experience signals
Skyscanner transformed observability from a technical burden into a sociotechnical tool that connects 110M travelers to 1,200+ partners with data-driven confidence.
About This Article
Skyscanner's monitoring setup was scattered across different vendors for RUM, tracing, and synthetics, plus internal systems running OpenTSDB, Prometheus, and ELK. Engineers couldn't easily connect signals across services, which made production issues harder to debug.
Skyscanner moved to OpenTelemetry APIs and semantic conventions for traces, metrics, logs, and baggage. They set up a centralized Collector Gateway that sends data via OTLP (the OpenTelemetry Protocol) to New Relic as their single backend.
With smart sampling strategies on distributed traces, Skyscanner now stores just 4% of the 2M spans and 80K traces it produces per second while keeping full debugging capability. This cut their telemetry costs by over 90%.
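The figures above can be sanity-checked with a little arithmetic, and the idea behind "smart" sampling can be illustrated with a small sketch. The article does not describe Skyscanner's actual sampling policies, so everything below is an assumption for illustration: the `Span` type, the `keep_trace` and `baseline_keep` helpers, the 1000 ms slow threshold, and the 1% baseline ratio are all hypothetical. The sketch shows one common approach, keeping every trace that contains an error or a slow span and only a small deterministic fraction of the rest. In a real deployment a decision like this would more likely live in the collector layer than in application code.

```python
import hashlib
from dataclasses import dataclass
from typing import Sequence

# Figures quoted in the article.
SPANS_PER_SECOND = 2_000_000
TRACES_PER_SECOND = 80_000
STORED_PERCENT = 4

# Derived values: roughly 25 spans per trace, and about 80K spans stored per second.
AVG_SPANS_PER_TRACE = SPANS_PER_SECOND // TRACES_PER_SECOND
STORED_SPANS_PER_SECOND = SPANS_PER_SECOND * STORED_PERCENT // 100


@dataclass
class Span:
    """Hypothetical, minimal span record used only for this sketch."""
    trace_id: str
    duration_ms: float
    is_error: bool


def baseline_keep(trace_id: str, ratio: float) -> bool:
    """Deterministic per-trace decision: hash the trace ID into a 64-bit bucket.

    The same trace ID always yields the same answer, so every component
    that sees the trace agrees on whether to keep it.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(ratio * 2**64)


def keep_trace(
    spans: Sequence[Span],
    slow_ms: float = 1000.0,       # assumed threshold, not from the article
    baseline_ratio: float = 0.01,  # assumed ratio, not from the article
) -> bool:
    """Decide whether to store a completed trace (a tail-sampling style check)."""
    if not spans:
        return False
    # Always keep traces that are interesting for debugging.
    if any(s.is_error for s in spans):
        return True
    if any(s.duration_ms >= slow_ms for s in spans):
        return True
    # Otherwise keep only a small, deterministic fraction of healthy traces.
    return baseline_keep(spans[0].trace_id, baseline_ratio)
```

Under these assumptions, a trace with a failed span is always stored, while a fast, healthy trace is stored only if its ID hashes into the baseline fraction. This is how a low overall storage rate can coexist with keeping the traces engineers are most likely to need.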