Tracing at Slack: Thinking in Causal Graphs
Article Summary
Slack processes 8.5 billion spans daily, but their tracing system looks nothing like traditional distributed tracing. Here's why they rebuilt it from scratch.
Suman Karumuri, Sr. Staff Engineer on Slack's Observability team, explains how traditional tracing tools (Zipkin, Jaeger) failed to meet their needs across mobile apps, shell scripts, and backend services. Their solution: model traces as "Causal Graphs" built from a simplified SpanEvent structure that's designed for human consumption and SQL queries.
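To make the "spans as flat rows" idea concrete, here is a minimal sketch of what such a SpanEvent table could look like; the column names and Presto-style types are illustrative assumptions, not Slack's actual schema:

```sql
-- Illustrative sketch only: column names are assumptions, not Slack's schema.
-- Each row is one SpanEvent; the causal structure lives in the ID columns,
-- so there is no nesting and any SQL engine can scan or join the rows directly.
CREATE TABLE span_events (
    trace_id     VARCHAR,                -- groups all events of one causal graph
    span_id      VARCHAR,                -- unique ID of this event
    parent_id    VARCHAR,                -- causal link to the parent event (NULL for roots)
    name         VARCHAR,                -- human-readable operation name
    start_ts_ms  BIGINT,                 -- wall-clock start time in milliseconds
    duration_ms  BIGINT,                 -- duration stored directly, ready for aggregation
    tags         MAP(VARCHAR, VARCHAR)   -- arbitrary key/value metadata
);
```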
Key Takeaways
- Traditional tracing APIs are too request-centric for mobile apps and CI/CD pipelines
- SpanEvents use a flat structure with duration fields, queryable via SQL in under 5 seconds (see the example query after this list)
- The system traces 1% of requests, producing roughly 310M traces and 2TB of data per day
- Engineers can trace specific users on demand via a slash command, without sampling limits
- Real-time store has sub-5-second latency; data warehouse enables complex historical analysis
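Because duration is an ordinary column, questions like "which operations are slowest?" reduce to plain aggregation. A hypothetical example in Presto-style SQL against the table sketched above:

```sql
-- Hypothetical query: find the slowest operations over the last day,
-- using nothing more than GROUP BY over flat SpanEvent rows.
SELECT
    name,
    count(*)                             AS calls,
    approx_percentile(duration_ms, 0.99) AS p99_ms,
    avg(duration_ms)                     AS avg_ms
FROM span_events
WHERE start_ts_ms >= to_unixtime(now() - INTERVAL '1' DAY) * 1000
GROUP BY name
ORDER BY p99_ms DESC
LIMIT 20;
```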
Critical Insight
By treating spans as queryable rows instead of nested objects, Slack made tracing simple enough to adopt across their entire stack, from mobile clients to Jenkins builds.