Continuous Load Testing
Article Summary
Slack's engineering team ran into a classic performance testing problem: spinning up load tests was so time-consuming that teams avoided doing it. Their solution? Never stop testing.
Slack's Performance Infrastructure team built Koi Pond, a load testing platform that simulates hundreds of thousands of users. This article details how they evolved it into a continuous system that runs 24/7, testing against an organization 4x larger than their biggest customer.
Key Takeaways
- Automated token generation replaced an error-prone manual process that took hours
- An Automatic Shutdown service monitors metrics and kills any test whose success rate drops below 95% (a sketch of this check follows the takeaways)
- Continuous testing caught performance bugs before they reached production, with zero off-hours pages
- Feature coverage increased 10% as product teams adopted always-on testing
- Cost stayed low despite running 500,000 simulated clients continuously
By running load tests continuously instead of on-demand, Slack eliminated setup friction, caught regressions in deploy pipelines, and built performance testing into their engineering culture.
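The article doesn't describe how the Automatic Shutdown service is implemented, but the core idea reduces to a small control loop: poll a success-rate metric and stop the test when it falls below the threshold. Here is a minimal sketch in Go; the `fetchSuccessRate` and `stopLoadTest` helpers are hypothetical stand-ins for Slack's real metrics and test-control APIs, and only the 95% threshold comes from the article.

```go
package main

import (
	"fmt"
	"log"
	"time"
)

const (
	successThreshold = 0.95             // shut down when success rate drops below 95%
	pollInterval     = 30 * time.Second // how often to check the metric
)

// fetchSuccessRate is a hypothetical helper: in a real system it would query a
// metrics backend for the ratio of successful requests over a recent window.
func fetchSuccessRate() (float64, error) {
	// ... query metrics backend ...
	return 0.99, nil // stubbed value for the sketch
}

// stopLoadTest is a hypothetical helper: it would tell the load-test
// coordinator to ramp simulated clients down to zero.
func stopLoadTest(reason string) error {
	log.Printf("stopping load test: %s", reason)
	return nil
}

func main() {
	ticker := time.NewTicker(pollInterval)
	defer ticker.Stop()

	for range ticker.C {
		rate, err := fetchSuccessRate()
		if err != nil {
			log.Printf("metrics check failed: %v", err)
			continue // don't kill the test on a transient metrics error
		}
		if rate < successThreshold {
			reason := fmt.Sprintf("success rate %.2f%% is below threshold", rate*100)
			if err := stopLoadTest(reason); err != nil {
				log.Printf("failed to stop load test: %v", err)
			}
			return
		}
	}
}
```

Tolerating transient metrics errors instead of shutting down immediately keeps a flaky monitoring query from killing an otherwise healthy always-on test.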
About This Article
Slack's load testing infrastructure originally stored all test state in memory on Kubernetes pods. Whenever the Keeper pod restarted for updates or security patches, that state disappeared, making continuous tests impossible.
Slack's team added AWS DynamoDB as a persistent store behind Koi Pond, so state now survives pod restarts. DynamoDB's flexible, schemaless data model lets them store tokens, configuration files, and behavior data without constant migrations.
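The article doesn't show how Koi Pond talks to DynamoDB, but the basic pattern is straightforward: write each piece of state as an item keyed by type and ID, so a restarted Keeper pod can read it back. Below is a rough sketch using the AWS SDK for Go v2; the `koi-pond-state` table name and attribute layout are illustrative, not Slack's actual schema.

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

const tableName = "koi-pond-state" // illustrative table name

// saveToken persists a generated user token so it survives pod restarts.
func saveToken(ctx context.Context, client *dynamodb.Client, userID, token string) error {
	_, err := client.PutItem(ctx, &dynamodb.PutItemInput{
		TableName: aws.String(tableName),
		Item: map[string]types.AttributeValue{
			"pk":    &types.AttributeValueMemberS{Value: "token#" + userID},
			"token": &types.AttributeValueMemberS{Value: token},
		},
	})
	return err
}

// loadToken reads the token back after a restart.
func loadToken(ctx context.Context, client *dynamodb.Client, userID string) (string, error) {
	out, err := client.GetItem(ctx, &dynamodb.GetItemInput{
		TableName: aws.String(tableName),
		Key: map[string]types.AttributeValue{
			"pk": &types.AttributeValueMemberS{Value: "token#" + userID},
		},
	})
	if err != nil {
		return "", err
	}
	if v, ok := out.Item["token"].(*types.AttributeValueMemberS); ok {
		return v.Value, nil
	}
	return "", nil // no token stored for this user
}

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatalf("load AWS config: %v", err)
	}
	client := dynamodb.NewFromConfig(cfg)

	if err := saveToken(ctx, client, "U123", "example-token"); err != nil {
		log.Fatalf("save token: %v", err)
	}
	token, err := loadToken(ctx, client, "U123")
	if err != nil {
		log.Fatalf("load token: %v", err)
	}
	log.Printf("recovered token: %s", token)
}
```

Because DynamoDB items don't require a fixed schema, new kinds of state (configuration blobs, behavior definitions) can be added as new item types in the same table without a migration.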
With database persistence, Slack could recreate a complex performance incident in one hour instead of a week, and teams could verify their fixes before customers hit the same problems again.