Tracing at Slack: Thinking in Causal Graphs
Article Summary
Slack processes 8.5 billion spans daily, but their tracing system looks nothing like traditional distributed tracing. Here's why they rebuilt it from scratch.
Suman Karumuri, Sr. Staff Engineer on Slack's Observability team, explains how traditional tracing tools (Zipkin, Jaeger) failed to meet their needs across mobile apps, shell scripts, and backend services. Their solution: model traces as "Causal Graphs" built from a simplified SpanEvent structure that's designed for human consumption and SQL queries.
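To make the "spans as flat rows" idea concrete, here is a minimal sketch of what such a SpanEvent table could look like; the column names and Presto-style types are illustrative assumptions, not Slack's actual schema:

```sql
-- Illustrative sketch only: column names are assumptions, not Slack's schema.
-- Each row is one SpanEvent; the causal structure lives in the ID columns,
-- so there is no nesting and any SQL engine can scan or join the rows directly.
CREATE TABLE span_events (
    trace_id     VARCHAR,                -- groups all events of one causal graph
    span_id      VARCHAR,                -- unique ID of this event
    parent_id    VARCHAR,                -- causal link to the parent event (NULL for roots)
    name         VARCHAR,                -- human-readable operation name
    start_ts_ms  BIGINT,                 -- wall-clock start time in milliseconds
    duration_ms  BIGINT,                 -- duration stored directly, ready for aggregation
    tags         MAP(VARCHAR, VARCHAR)   -- arbitrary key/value metadata
);
```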
Key Takeaways
- Traditional tracing APIs are too request-centric for mobile apps and CI/CD pipelines
- SpanEvents use a flat structure with duration fields, queryable via SQL in under 5 seconds (see the example query after this list)
- The system traces 1% of requests, producing roughly 310M traces and 2TB of data per day
- Engineers can trace specific users on demand via a slash command, without sampling limits
- Real-time store has sub-5-second latency; data warehouse enables complex historical analysis
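Because duration is an ordinary column, questions like "which operations are slowest?" reduce to plain aggregation. A hypothetical example in Presto-style SQL against the table sketched above:

```sql
-- Hypothetical query: find the slowest operations over the last day,
-- using nothing more than GROUP BY over flat SpanEvent rows.
SELECT
    name,
    count(*)                             AS calls,
    approx_percentile(duration_ms, 0.99) AS p99_ms,
    avg(duration_ms)                     AS avg_ms
FROM span_events
WHERE start_ts_ms >= to_unixtime(now() - INTERVAL '1' DAY) * 1000
GROUP BY name
ORDER BY p99_ms DESC
LIMIT 20;
```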
Critical Insight
By treating spans as queryable rows instead of nested objects, Slack made tracing simple enough to adopt across their entire stack, from mobile clients to Jenkins builds.