Continuous Load Testing
Article Summary
Slack's engineering team ran into a classic performance testing problem: spinning up load tests was so time-consuming that teams avoided doing it. Their solution? Never stop testing.
Slack's Performance Infrastructure team built Koi Pond, a load testing platform that simulates hundreds of thousands of users. This article details how they evolved it into a continuous system that runs 24/7, testing against an organization 4x larger than their biggest customer.
Key Takeaways
- Automated token generation replaced an error-prone manual process that took hours
- An Automatic Shutdown service monitors metrics and kills any test whose success rate drops below 95% (a sketch of this check follows the takeaways)
- Continuous testing caught performance bugs before they reached production, with zero off-hours pages
- Feature coverage increased 10% as product teams adopted always-on testing
- Cost stayed low despite running 500,000 simulated clients continuously
By running load tests continuously instead of on-demand, Slack eliminated setup friction, caught regressions in deploy pipelines, and built performance testing into their engineering culture.
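The article doesn't describe how the Automatic Shutdown service is implemented, but the core idea reduces to a small control loop: poll a success-rate metric and stop the test when it falls below the threshold. Here is a minimal sketch in Go; the `fetchSuccessRate` and `stopLoadTest` helpers are hypothetical stand-ins for Slack's real metrics and test-control APIs, and only the 95% threshold comes from the article.

```go
package main

import (
	"fmt"
	"log"
	"time"
)

const (
	successThreshold = 0.95             // shut down when success rate drops below 95%
	pollInterval     = 30 * time.Second // how often to check the metric
)

// fetchSuccessRate is a hypothetical helper: in a real system it would query a
// metrics backend for the ratio of successful requests over a recent window.
func fetchSuccessRate() (float64, error) {
	// ... query metrics backend ...
	return 0.99, nil // stubbed value for the sketch
}

// stopLoadTest is a hypothetical helper: it would tell the load-test
// coordinator to ramp simulated clients down to zero.
func stopLoadTest(reason string) error {
	log.Printf("stopping load test: %s", reason)
	return nil
}

func main() {
	ticker := time.NewTicker(pollInterval)
	defer ticker.Stop()

	for range ticker.C {
		rate, err := fetchSuccessRate()
		if err != nil {
			log.Printf("metrics check failed: %v", err)
			continue // don't kill the test on a transient metrics error
		}
		if rate < successThreshold {
			reason := fmt.Sprintf("success rate %.2f%% is below threshold", rate*100)
			if err := stopLoadTest(reason); err != nil {
				log.Printf("failed to stop load test: %v", err)
			}
			return
		}
	}
}
```

Tolerating transient metrics errors instead of shutting down immediately keeps a flaky monitoring query from killing an otherwise healthy always-on test.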
About This Article
Slack's load testing infrastructure originally stored all test state in memory on Kubernetes pods. Whenever the Keeper pod restarted for updates or security patches, that state disappeared, making continuous tests impossible.
Slack's team added AWS DynamoDB as a persistent store behind Koi Pond, so state now survives pod restarts. DynamoDB's flexible, schemaless data model lets them store tokens, configuration files, and behavior data without constant migrations.
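The article doesn't show how Koi Pond talks to DynamoDB, but the basic pattern is straightforward: write each piece of state as an item keyed by type and ID, so a restarted Keeper pod can read it back. Below is a rough sketch using the AWS SDK for Go v2; the `koi-pond-state` table name and attribute layout are illustrative, not Slack's actual schema.

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

const tableName = "koi-pond-state" // illustrative table name

// saveToken persists a generated user token so it survives pod restarts.
func saveToken(ctx context.Context, client *dynamodb.Client, userID, token string) error {
	_, err := client.PutItem(ctx, &dynamodb.PutItemInput{
		TableName: aws.String(tableName),
		Item: map[string]types.AttributeValue{
			"pk":    &types.AttributeValueMemberS{Value: "token#" + userID},
			"token": &types.AttributeValueMemberS{Value: token},
		},
	})
	return err
}

// loadToken reads the token back after a restart.
func loadToken(ctx context.Context, client *dynamodb.Client, userID string) (string, error) {
	out, err := client.GetItem(ctx, &dynamodb.GetItemInput{
		TableName: aws.String(tableName),
		Key: map[string]types.AttributeValue{
			"pk": &types.AttributeValueMemberS{Value: "token#" + userID},
		},
	})
	if err != nil {
		return "", err
	}
	if v, ok := out.Item["token"].(*types.AttributeValueMemberS); ok {
		return v.Value, nil
	}
	return "", nil // no token stored for this user
}

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatalf("load AWS config: %v", err)
	}
	client := dynamodb.NewFromConfig(cfg)

	if err := saveToken(ctx, client, "U123", "example-token"); err != nil {
		log.Fatalf("save token: %v", err)
	}
	token, err := loadToken(ctx, client, "U123")
	if err != nil {
		log.Fatalf("load token: %v", err)
	}
	log.Printf("recovered token: %s", token)
}
```

Because DynamoDB items don't require a fixed schema, new kinds of state (configuration blobs, behavior definitions) can be added as new item types in the same table without a migration.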
With database persistence, Slack could recreate a complex performance incident in one hour instead of a week, and teams could verify their fixes before customers hit the same problems again.