How Pinterest Leverages Honeycomb to Enhance CI Observability and Improve CI Build Stability
Article Summary
Pinterest sends 1 million CI build events to Honeycomb daily. Here's how they turned that data into a competitive advantage for mobile development.
Pinterest's Mobile Builds team shares how they've used Honeycomb since 2021 to transform CI observability. Staff Engineer Oliver Koo walks through real examples of debugging build failures, identifying bottlenecks, and automating incident response.
Key Takeaways
- Most queries complete in under 1 second despite 1M daily events
- Trace view breaks builds into granular segments like agent wait time and script execution
- Correlation feature overlays metrics to pinpoint root causes (like long agent wait times)
- Error categorization routes alerts to the right team automatically, reducing noise
- Custom instrumentation tracks specific Bazel target build times within jobs
Pinterest uses Honeycomb's query speed and trace visualization to diagnose CI issues in real time, moving from reactive firefighting to proactive optimization.
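The error categorization takeaway can be pictured as a small rule table that maps a failure message to a category and an owning team. The sketch below is illustrative only: the patterns, category names, and team names are hypothetical and are not taken from Pinterest's setup.

```python
import re
from typing import Tuple

# Hypothetical rules: (category, owning team, pattern). None of these come
# from Pinterest's actual configuration; they only illustrate the idea.
RULES = [
    ("infra", "ci-infra-team",
     re.compile(r"agent lost|no space left on device|connection reset", re.I)),
    ("test_failure", "feature-owners",
     re.compile(r"tests? failed|assertion", re.I)),
    ("build_error", "mobile-builds-team",
     re.compile(r"compilation failed|undefined symbol", re.I)),
]

# Fallback used when no rule matches (also hypothetical).
DEFAULT = ("unknown", "mobile-builds-team")


def categorize(message: str) -> Tuple[str, str]:
    """Return (category, team) for the first rule whose pattern matches."""
    for category, team, pattern in RULES:
        if pattern.search(message):
            return category, team
    return DEFAULT
```

Rules are checked in order, so the more specific infrastructure patterns come first. Attaching the returned category to each build event would let an alert fire only for the team that owns that category.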
About This Article
Pinterest's mobile CI infrastructure collected plenty of observability data, but the team could not see which specific build jobs were driving spikes in p95 latency. It was also hard to tell whether slower builds reflected a normal increase in load or a genuine performance problem.
Oliver Koo's team represented Buildkite jobs as child spans in Honeycomb traces, which broke each job down into smaller segments such as agent wait time and script execution. They also used derived columns to compute new metrics at query time for finer-grained analysis.
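As a rough illustration of that span breakdown, the sketch below turns one job's timestamps into a job span with two children, one for agent wait and one for script execution. It is a minimal sketch under assumptions, not Pinterest's instrumentation: the input keys (`runnable_at`, `started_at`, `finished_at`), the segment names, and the output field names (`trace.trace_id`, `trace.span_id`, `trace.parent_id`, `duration_ms`) are assumed for the example, and nothing is actually sent to Honeycomb.

```python
import uuid
from datetime import datetime
from typing import Dict, List, Optional


def _ms(start: datetime, end: datetime) -> float:
    """Elapsed time between two datetimes in milliseconds."""
    return (end - start).total_seconds() * 1000.0


def _span(trace_id: str, name: str, parent_id: str,
          start: datetime, end: datetime,
          span_id: Optional[str] = None) -> Dict[str, object]:
    # Field names are assumed conventions for trace events.
    return {
        "trace.trace_id": trace_id,
        "trace.span_id": span_id or uuid.uuid4().hex[:16],
        "trace.parent_id": parent_id,
        "name": name,
        "timestamp": start.isoformat(),
        "duration_ms": _ms(start, end),
    }


def job_to_spans(trace_id: str, build_span_id: str,
                 job: Dict[str, str]) -> List[Dict[str, object]]:
    """Split one CI job into a job span plus two child segments."""
    # Assumed input: ISO 8601 timestamps with an explicit UTC offset.
    runnable = datetime.fromisoformat(job["runnable_at"])
    started = datetime.fromisoformat(job["started_at"])
    finished = datetime.fromisoformat(job["finished_at"])

    job_span_id = uuid.uuid4().hex[:16]
    return [
        # The job itself is a child of the build's root span.
        _span(trace_id, job["name"], build_span_id, runnable, finished, job_span_id),
        # Time spent queued before an agent picked the job up.
        _span(trace_id, "agent_wait", job_span_id, runnable, started),
        # Time spent actually running the job's script.
        _span(trace_id, "script_execution", job_span_id, started, finished),
    ]
```

With the wait and execution segments stored as separate spans, a slow build can be attributed to queueing or to the script itself without reading the job's logs.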
The team found that the 'super secretive tests' job was driving up p95 build times. They also discovered that CI agent wait times spiked when the cluster was saturated. With both findings in hand, they could make targeted infrastructure improvements instead of scaling blindly.
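Finding the job behind a p95 regression amounts to grouping span durations by job name and comparing their 95th percentiles. The sketch below does this offline over a list of event dictionaries, using the nearest-rank percentile definition. The `name` and `duration_ms` keys are assumed field names, and this is only an approximation of what a grouped P95 query would do, not Honeycomb's implementation.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def p95(values: List[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def slowest_jobs(events: Iterable[Dict[str, object]]) -> List[Tuple[str, float]]:
    """Return (job name, p95 duration in ms), slowest first."""
    by_job: Dict[str, List[float]] = defaultdict(list)
    for event in events:
        by_job[str(event["name"])].append(float(event["duration_ms"]))
    ranked = [(name, p95(durations)) for name, durations in by_job.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)
```

Sorting by p95 rather than by mean keeps the focus on tail latency, which is where a single slow job tends to show up first.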