Preventing Pipeline Calls from Crashing Redis Clusters
Article Summary
Grab lost 95% of ride bookings for a full minute when a single Redis slave node failed. Their highly available cluster with multiple replicas somehow became a single point of failure.
Grab's engineering team dissects a production outage in Apollo, their critical driver state machine service. The post-mortem reveals how Redis Cluster pipelining and the Go-Redis client created an unexpected vulnerability despite redundant infrastructure.
Key Takeaways
- A single slave failure caused a 95%+ failure rate despite 3 shards with 2 slaves each
- The Go-Redis client refreshes cluster node state only every 60 seconds, prolonging outages
- Pipeline calls fail entirely if any single command in the batch hits an unreachable node
- The ReadOnly flag routes all reads to slaves, creating a hidden dependency on every slave node (see the sketch below)
A seemingly resilient Redis Cluster became brittle because pipeline error handling and client-side state refresh delays turned one slave failure into a cascading system outage.
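To make that hidden dependency concrete, here is a minimal go-redis sketch of a cluster client configured with the ReadOnly flag; the addresses, key, and client version (v8) are assumptions, not Grab's actual setup.

```go
package main

import (
	"context"
	"fmt"

	"github.com/go-redis/redis/v8"
)

func main() {
	ctx := context.Background()

	// ReadOnly tells go-redis to serve read commands from slave nodes,
	// so every slave in the cluster becomes a hidden dependency for reads.
	client := redis.NewClusterClient(&redis.ClusterOptions{
		Addrs:    []string{"redis-a:6379", "redis-b:6379", "redis-c:6379"}, // placeholders
		ReadOnly: true,
	})
	defer client.Close()

	// This read is routed to a slave holding the key's hash slot; if that
	// slave is down, the command fails even though its master is healthy.
	fmt.Println(client.Get(ctx, "driver:42:state").Val())
}
```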
About This Article
Apollo's pipeline implementation in Go-Redis sent each HMGET command to the specific node owning its key's hash slot. When any slave node went down, the entire batch failed, even though the Redis Cluster kept running with 3 shards and 2 slaves per shard.
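A sketch of what that read path can look like, reusing the client and context from the snippet above; the function name, key pattern, and field names are hypothetical. The failure mode is that Exec surfaces a single aggregate error, and treating it as fatal discards the whole batch.

```go
// fetchDriverStates issues one HMGET per driver through a single pipeline.
func fetchDriverStates(ctx context.Context, client *redis.ClusterClient,
	driverIDs []string) ([]redis.Cmder, error) {

	pipe := client.Pipeline()
	for _, id := range driverIDs {
		// Each HMGET is routed to the node owning its key's hash slot.
		pipe.HMGet(ctx, "driver:"+id, "state", "city")
	}

	// Exec returns every queued command plus the first error it hit.
	// Propagating that error wholesale is what turned one dead slave
	// into a failed batch.
	cmds, err := pipe.Exec(ctx)
	if err != nil {
		return nil, err
	}
	return cmds, nil
}
```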
Grab's engineering team set up a separate Redis Cluster client with RouteByLatency enabled for pipelining. It routes reads through master nodes when slave latency exceeds 1ms, so queries keep working as long as the majority partition stays up.
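A sketch of that configuration under the same assumptions as above. In go-redis, RouteByLatency sends each read-only command to the lowest-latency reachable node and implicitly enables ReadOnly, which matches the routing behavior described; the 1ms behavior is not a tunable client option, so it is not shown here.

```go
// A second cluster client reserved for pipelined reads. RouteByLatency
// makes go-redis pick the lowest-latency node, master or slave, for each
// read, so a dead or slow slave quickly stops attracting traffic.
pipelineClient := redis.NewClusterClient(&redis.ClusterOptions{
	Addrs:          []string{"redis-a:6379", "redis-b:6379", "redis-c:6379"}, // placeholders
	RouteByLatency: true, // implies ReadOnly in go-redis
})
defer pipelineClient.Close()
```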
Grab added per-query error checking in pipeline responses instead of failing the whole batch on any error. This isolated failures to individual commands and eliminated the 95% failure rate seen during single-node outages.
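A sketch of per-command error handling on the same hypothetical batch; the logging is illustrative, the driverIDs slice carries over from the earlier snippet, and pipeline results come back in the order the commands were queued.

```go
// Run the batch but handle errors per command rather than per batch.
cmds, err := pipe.Exec(ctx)
if err != nil {
	// Exec still returns every command; the aggregate error only means
	// at least one of them failed, so keep going.
	log.Printf("pipeline saw at least one error: %v", err)
}

for i, cmd := range cmds {
	if cmd.Err() != nil {
		// Only this query failed; skip or retry it individually instead
		// of discarding the rest of the batch.
		log.Printf("HMGET for driver %s failed: %v", driverIDs[i], cmd.Err())
		continue
	}
	fields := cmd.(*redis.SliceCmd).Val() // field values for one driver
	_ = fields
}
```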