Skyscanner • Mar 14, 2025

Skyscanner’s journey to effective observability

Article Summary

Skyscanner was drowning in observability chaos: multiple vendors, fragmented tools, and engineers losing confidence in their ability to debug production issues.

During COVID-19, Skyscanner's platform team seized the opportunity to completely overhaul their observability stack. They migrated 300+ microservices from a patchwork of specialized vendors and internal systems to a unified approach built on open standards.

Key Takeaways

Standardized on OpenTelemetry and New Relic to eliminate context switching across tools
Migrated 300+ microservices in weeks using automated PRs via Turbolift
Reduced telemetry costs by over 90% using smart sampling on 2M spans/second
Created Observability Ambassadors program to drive cultural adoption across teams
Shifted SLOs from API metrics to actual user experience signals

Critical Insight

Skyscanner transformed observability from a technical burden into a sociotechnical tool that connects 110M travelers to 1,200+ partners with data-driven confidence.

The secret weapon that made adoption stick wasn't technical at all: it was gamifying system debugging with the OpenTelemetry Demo.

About This Article

Problem

Skyscanner's monitoring setup was scattered across different vendors for RUM, tracing, and synthetics, plus internal systems running OpenTSDB, Prometheus, and ELK. Engineers couldn't easily connect signals across services, which made production issues harder to debug.

Solution

Skyscanner moved to OpenTelemetry APIs and semantic conventions for traces, metrics, logs, and baggage. They set up a centralized Collector Gateway that sends data via OTLP, the standard OpenTelemetry protocol, to New Relic as their single backend.

Impact

With smart sampling strategies on distributed traces, Skyscanner now stores just 4% of the 2M spans and 80K traces it generates per second while keeping full debugging capability. This cut telemetry costs by over 90%.
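
To make these figures concrete, the sketch below works through the arithmetic and shows one simple way a 4% keep rate can be applied consistently across services. The article does not say which sampling strategies Skyscanner uses, so the hash-based trace-ID ratio rule here is an assumption chosen for illustration, and the names stored_per_second and keep_trace are hypothetical.

```python
import hashlib

# Figures quoted in the article.
SPANS_PER_SECOND = 2_000_000
TRACES_PER_SECOND = 80_000
KEEP_RATIO = 0.04  # "stores just 4%"


def stored_per_second(total: int, ratio: float = KEEP_RATIO) -> int:
    """Return how many items per second survive sampling at the given ratio."""
    return round(total * ratio)


def keep_trace(trace_id: str, ratio: float = KEEP_RATIO) -> bool:
    """Decide whether to keep a trace, based only on its ID.

    Assumption: a plain trace-ID ratio rule. Hashing the ID maps it to a
    value in [0, 1); every service that sees the same trace ID reaches the
    same decision, so a trace is kept or dropped as a whole.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < ratio


if __name__ == "__main__":
    print("average spans per trace:", SPANS_PER_SECOND // TRACES_PER_SECOND)
    print("spans stored per second:", stored_per_second(SPANS_PER_SECOND))
    print("traces stored per second:", stored_per_second(TRACES_PER_SECOND))
    print("keep example trace?", keep_trace("4bf92f3577b34da6a3ce929d0e0e4736"))
```

A rule like this decides up front and cannot favor traces that contain errors or unusual latency. Strategies that wait for a trace to complete before deciding can do that, at the cost of buffering spans in the collection pipeline.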