Slack Apr 28, 2022

Continuous Load Testing

Article Summary

Slack's engineering team ran into a classic performance testing problem: spinning up load tests was so time-consuming that teams avoided doing it. Their solution? Never stop testing.

Slack's Performance Infrastructure team built Koi Pond, a load testing platform that simulates hundreds of thousands of users. This article details how they evolved it into a continuous system that runs 24/7, testing against an organization 4x larger than their biggest customer.

Key Takeaways

Critical Insight

By running load tests continuously instead of on-demand, Slack eliminated setup friction, caught regressions in deploy pipelines, and built performance testing into their engineering culture.

The team shares specific architectural decisions around DynamoDB and Kubernetes pod resilience, along with the communication strategy that kept their rollout uneventful.

About This Article

Problem

Slack's load testing infrastructure stored everything in memory on Kubernetes pods. Whenever the Keeper pod restarted for updates or security patches, all of the test state disappeared. A continuous test has to outlive any single pod, so this made continuous testing impossible.

Solution

Slack's team added AWS DynamoDB as a persistent database behind Koi Pond. Now the state survives pod restarts. The flexible schema lets them store tokens, configuration files, and behavior data without constant schema changes.
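
The idea of keeping state outside the pod can be illustrated with a minimal sketch. This is not Slack's implementation: `FakeTable` is a hypothetical in-memory stand-in for a DynamoDB table (it does not mirror the boto3 API), and `Keeper`, `update`, `simulate_restart`, and the item fields are names assumed here for illustration only. The point it shows is that a coordinator which writes every change through to an external store can be replaced by a fresh instance that reloads the same state.

```python
from typing import Any, Dict, Optional


class FakeTable:
    """Hypothetical in-memory stand-in for a DynamoDB table (not the boto3 API)."""

    def __init__(self) -> None:
        self._items: Dict[str, Dict[str, Any]] = {}

    def put_item(self, key: str, item: Dict[str, Any]) -> None:
        # Items are free-form dicts, loosely mimicking a flexible schema.
        self._items[key] = dict(item)

    def get_item(self, key: str) -> Optional[Dict[str, Any]]:
        item = self._items.get(key)
        return dict(item) if item is not None else None


class Keeper:
    """Hypothetical coordinator that writes through to the table on every change."""

    def __init__(self, table: FakeTable, test_id: str) -> None:
        self.table = table
        self.test_id = test_id
        # On startup, recover whatever a previous instance persisted.
        self.state: Dict[str, Any] = table.get_item(test_id) or {}

    def update(self, **fields: Any) -> None:
        self.state.update(fields)
        self.table.put_item(self.test_id, self.state)


def simulate_restart() -> Dict[str, Any]:
    table = FakeTable()  # the store outlives any single Keeper instance
    first = Keeper(table, "continuous-test")
    first.update(tokens=["tok-a", "tok-b"], behavior="default")
    del first  # the pod goes away, along with its in-memory state
    second = Keeper(table, "continuous-test")  # replacement pod starts up
    return second.state
```

In a real deployment the stand-in table would be replaced by calls to the actual database client, and the write-through on every update would need to be weighed against write cost and latency.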

Impact

With database persistence, Slack could recreate a complex performance incident in one hour instead of spending a week on it. Teams could also verify their fixes before customers ran into the same problems again.
