Designing Resilient Systems Beyond Retries (Part 1: Rate Limiting)
Article Summary
Grab's engineering team learned the hard way: retries and circuit breakers aren't enough when you're running hundreds of microservices at scale.
This is part one of a three-part series from Grab's engineering team on building resilient distributed systems. Michael Cartmell digs into why rate limiting is your critical second line of defense when retry storms threaten to take down your backend.
Key Takeaways
- Retry storms can overwhelm servers during failures, making problems worse instead of better
- Four-tier rate limiting: per-client-per-endpoint, per-client, per-endpoint, and server-wide
- Global rate limiting beats local: a per-instance limit multiplies with every autoscaled replica, so only a shared global limit reliably caps total downstream load
- Client-side limiting reduces server overhead but adds complexity to SDK implementations
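To make the four-tier idea concrete, here is a minimal sketch using token buckets, one bucket per tier key. The class names, keys, and numeric limits are illustrative assumptions, not Grab's actual implementation; a request is admitted only if every tier has a token available.

```python
import time


class TokenBucket:
    """Refills `rate` tokens per second, capped at `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class TieredRateLimiter:
    """Checks tiers from most to least specific; all must pass.

    Limits below are hypothetical (rate/sec, burst capacity). Note this
    sketch consumes tokens from earlier tiers even when a later tier
    denies the request -- a common simplification.
    """

    def __init__(self):
        self.buckets = {}
        self.limits = {
            "client_endpoint": (5, 10),
            "client": (20, 40),
            "endpoint": (50, 100),
            "server": (200, 400),
        }

    def _bucket(self, tier: str, key) -> TokenBucket:
        if (tier, key) not in self.buckets:
            rate, cap = self.limits[tier]
            self.buckets[(tier, key)] = TokenBucket(rate, cap)
        return self.buckets[(tier, key)]

    def allow(self, client: str, endpoint: str) -> bool:
        checks = [
            ("client_endpoint", (client, endpoint)),
            ("client", client),
            ("endpoint", endpoint),
            ("server", "global"),
        ]
        return all(self._bucket(tier, key).allow() for tier, key in checks)
```

A denied request would typically be answered with HTTP 429 so well-behaved clients can back off rather than retry immediately.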
Critical Insight
Rate limiting protects your servers when client-side circuit breakers fail or are misconfigured, preventing cascading failures across your microservices architecture.