Designing Resilient Systems Beyond Retries (Part 1: Rate Limiting)
Article Summary
Grab's engineering team learned the hard way: retries and circuit breakers aren't enough when you're running hundreds of microservices at scale.
This is part one of a three-part series from Grab's engineering team on building resilient distributed systems. Michael Cartmell digs into why rate limiting is your critical second line of defense when retry storms threaten to take down your backend.
Key Takeaways
- Retry storms can overwhelm servers during failures, making problems worse instead of better
- Four-tier rate limiting: per-client-per-endpoint, per-client, per-endpoint, and server-wide
- Global rate limiting beats local: a per-instance limit multiplies with every autoscaled replica, so only a shared global limit reliably caps total downstream load
- Client-side limiting reduces server overhead but adds complexity to SDK implementations
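To make the four-tier idea concrete, here is a minimal sketch using token buckets, one bucket per tier key. The class names, keys, and numeric limits are illustrative assumptions, not Grab's actual implementation; a request is admitted only if every tier has a token available.

```python
import time


class TokenBucket:
    """Refills `rate` tokens per second, capped at `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class TieredRateLimiter:
    """Checks tiers from most to least specific; all must pass.

    Limits below are hypothetical (rate/sec, burst capacity). Note this
    sketch consumes tokens from earlier tiers even when a later tier
    denies the request -- a common simplification.
    """

    def __init__(self):
        self.buckets = {}
        self.limits = {
            "client_endpoint": (5, 10),
            "client": (20, 40),
            "endpoint": (50, 100),
            "server": (200, 400),
        }

    def _bucket(self, tier: str, key) -> TokenBucket:
        if (tier, key) not in self.buckets:
            rate, cap = self.limits[tier]
            self.buckets[(tier, key)] = TokenBucket(rate, cap)
        return self.buckets[(tier, key)]

    def allow(self, client: str, endpoint: str) -> bool:
        checks = [
            ("client_endpoint", (client, endpoint)),
            ("client", client),
            ("endpoint", endpoint),
            ("server", "global"),
        ]
        return all(self._bucket(tier, key).allow() for tier, key in checks)
```

A denied request would typically be answered with HTTP 429 so well-behaved clients can back off rather than retry immediately.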
Critical Insight
Rate limiting protects your servers when client-side circuit breakers fail or are misconfigured, preventing cascading failures across your microservices architecture.