Pinterest Mar 10, 2021

How Pinterest Leverages Honeycomb to Enhance CI Observability and Improve CI Build Stability

Article Summary

Pinterest sends 1 million CI build events to Honeycomb daily. Here's how they turned that data into a competitive advantage for mobile development.

Pinterest's Mobile Builds team shares how they've used Honeycomb since 2021 to transform CI observability. Staff Engineer Oliver Koo walks through real examples of debugging build failures, identifying bottlenecks, and automating incident response.

Key Takeaways

Critical Insight

Pinterest uses Honeycomb's query speed and trace visualization to diagnose CI issues in real time, moving from reactive firefighting to proactive optimization.

The article also describes Pinterest's architecture for automated error categorization, which is changing how the team handles on-call duties.

About This Article

Problem

Pinterest's mobile CI infrastructure collected large volumes of observability data, but the team couldn't see which specific build jobs were causing p95 latency spikes. It was hard to tell whether slower builds reflected normal load increases or actual performance problems.

Solution

Oliver Koo's team modeled Buildkite jobs as child spans in Honeycomb traces. This broke each job down into smaller segments such as agent wait time and script execution. They used derived columns to compute metrics at query time for better analysis.
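
As a rough illustration of the span breakdown described above, here is a minimal Python sketch that turns one job record into a job span plus two child spans (agent wait and script execution). It is not Pinterest's implementation. The job dictionary keys (`name`, `runnable_at`, `started_at`, `finished_at`), the helper names, and the sample values are assumptions made for this example; the `trace.trace_id`, `trace.span_id`, `trace.parent_id`, and `duration_ms` field names follow the conventions Honeycomb commonly uses for trace events. The sketch only builds the events and does not send them anywhere.

```python
import uuid
from datetime import datetime


def ms_between(start, end):
    """Milliseconds between two ISO 8601 timestamps."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() * 1000.0


def job_to_spans(trace_id, build_span_id, job):
    """Build a job span plus agent-wait and script-execution child spans.

    `job` is a hypothetical dict with `name`, `runnable_at`, `started_at`,
    and `finished_at` keys (assumed shape, not a confirmed schema).
    """
    job_span_id = uuid.uuid4().hex[:16]

    def span(name, parent_id, start, end, span_id=None):
        return {
            "trace.trace_id": trace_id,
            "trace.span_id": span_id or uuid.uuid4().hex[:16],
            "trace.parent_id": parent_id,
            "name": name,
            "job_name": job["name"],
            "timestamp": start,
            "duration_ms": ms_between(start, end),
        }

    return [
        # The whole job, as a child of the build's root span.
        span(job["name"], build_span_id, job["runnable_at"],
             job["finished_at"], span_id=job_span_id),
        # Time spent waiting for a CI agent to pick the job up.
        span("agent_wait", job_span_id, job["runnable_at"], job["started_at"]),
        # Time spent actually running the job's script.
        span("script_execution", job_span_id, job["started_at"],
             job["finished_at"]),
    ]


# Hypothetical job record with made-up timestamps.
example_job = {
    "name": "example_job",
    "runnable_at": "2021-03-10T12:00:00+00:00",
    "started_at": "2021-03-10T12:00:30+00:00",
    "finished_at": "2021-03-10T12:05:30+00:00",
}
example_spans = job_to_spans("trace-1", "build-root", example_job)
```

Splitting wait time from execution time at the span level means a query can group or filter on `name` alone, so a slow build caused by agent queuing looks different from one caused by a slow script.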

Impact

The team found that the 'super secretive tests' job was driving up p95 build times. They also discovered that CI agent wait times spiked when the cluster was saturated. These findings let them make targeted infrastructure improvements instead of scaling blindly.
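
The kind of per-job p95 comparison described above can be sketched offline in a few lines of Python. This is only an illustration of the idea, not the query Pinterest ran in Honeycomb. The event shape (`job_name`, `duration_ms`), the function names, and the sample durations are all invented for the example, and the percentile uses the simple nearest-rank method, which may differ from how a given observability tool computes it.

```python
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile; `pct` is an integer from 1 to 100."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding float rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def p95_by_job(events):
    """Group span events by job name and return each job's p95 duration."""
    durations = defaultdict(list)
    for event in events:
        durations[event["job_name"]].append(event["duration_ms"])
    return {name: percentile(vals, 95) for name, vals in durations.items()}


# Made-up events for two hypothetical jobs.
sample_events = [
    {"job_name": "job_a", "duration_ms": 100.0},
    {"job_name": "job_a", "duration_ms": 200.0},
    {"job_name": "job_a", "duration_ms": 300.0},
    {"job_name": "job_b", "duration_ms": 50.0},
]
p95s = p95_by_job(sample_events)
# The job with the highest p95 is the first candidate to investigate.
slowest_job = max(p95s, key=p95s.get)
```

Ranking jobs by p95 rather than by mean keeps attention on the tail, where a single slow job can dominate overall build time even if its typical run looks fine.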