Slack Aug 28, 2020

Tracing at Slack: Thinking in Causal Graphs

Article Summary

Slack processes 8.5 billion spans daily, but their tracing system looks nothing like traditional distributed tracing. Here's why they rebuilt it from scratch.

Suman Karumuri, Sr. Staff Engineer on Slack's Observability team, explains how traditional tracing tools (Zipkin, Jaeger) failed to meet Slack's needs across mobile apps, shell scripts, and backend services. The team's solution: model traces as "Causal Graphs" built from a simplified SpanEvent structure that is designed for human consumption and SQL queries.
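As a rough illustration of the idea, the sketch below models a flat span record and rebuilds the parent-to-child edges of a causal graph from a list of such records. The field names (`start_micros`, `duration_micros`, `tags`, and so on), the `children_by_parent` helper, and the sample data are assumptions made for this example, not Slack's actual SpanEvent schema.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SpanEvent:
    # Hypothetical flat span record; field names are illustrative only.
    id: str
    trace_id: str
    name: str
    start_micros: int
    duration_micros: int
    parent_id: Optional[str] = None  # None marks a root span
    tags: Dict[str, str] = field(default_factory=dict)


def children_by_parent(spans: List[SpanEvent]) -> Dict[Optional[str], List[str]]:
    """Group span ids by their parent id; the None key holds root spans."""
    graph: Dict[Optional[str], List[str]] = defaultdict(list)
    for span in spans:
        graph[span.parent_id].append(span.id)
    return dict(graph)


# Made-up spans for a single trace of a CI build.
spans = [
    SpanEvent("a", "t1", "jenkins.build", 0, 900),
    SpanEvent("b", "t1", "git.checkout", 10, 200, parent_id="a"),
    SpanEvent("c", "t1", "run.tests", 220, 600, parent_id="a"),
]
graph = children_by_parent(spans)
```

Because each record carries only its own id and its parent's id, any client that can emit a flat record (a shell script, a mobile app) can contribute to the graph without building nested objects.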

Key Takeaways

Critical Insight

By treating spans as queryable rows instead of nested objects, Slack made tracing simple enough to adopt across their entire stack, from mobile clients to Jenkins builds.
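To make "spans as queryable rows" concrete, here is a minimal sketch that stores a few spans in a table and aggregates them with plain SQL. SQLite is used only as a convenient stand-in for a SQL engine; the table layout, column names, and sample rows are assumptions for illustration, not Slack's storage schema.

```python
import sqlite3

# In-memory database standing in for the real trace store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE spans (id TEXT, parent_id TEXT, trace_id TEXT, name TEXT, "
    "duration_micros INTEGER, platform TEXT)"
)

# Made-up spans from different parts of a stack: mobile, backend, CI.
rows = [
    ("a", None, "t1", "api.call", 1200, "ios"),
    ("b", "a", "t1", "db.query", 400, "backend"),
    ("c", None, "t2", "api.call", 800, "android"),
    ("d", None, "t3", "jenkins.build", 90000, "ci"),
]
conn.executemany("INSERT INTO spans VALUES (?, ?, ?, ?, ?, ?)", rows)


def avg_duration_by_name(conn):
    """Average span duration per span name, using an ordinary GROUP BY."""
    query = (
        "SELECT name, AVG(duration_micros) FROM spans "
        "GROUP BY name ORDER BY name"
    )
    return conn.execute(query).fetchall()


result = avg_duration_by_name(conn)
```

Since every span is one row, questions such as "which operation is slowest on a given platform" reduce to a filter and an aggregate rather than a traversal of nested trace objects.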

The team reveals how they're consolidating logs, traces, and events into a unified system, and why they added a special slash command for customer support investigations.
