Grab · Aug 19, 2021

Preventing Pipeline Calls from Crashing Redis Clusters

Article Summary

Grab lost 95% of ride bookings for a full minute when a single Redis slave node failed. Their highly available cluster, with multiple replicas per shard, had effectively become a single point of failure.

Grab's engineering team dissects a production outage in Apollo, their critical driver state machine service. The post-mortem reveals how Redis Cluster pipelining and the Go-Redis client created an unexpected vulnerability despite redundant infrastructure.

Key Takeaways

Critical Insight

A seemingly resilient Redis Cluster became brittle because pipeline error handling and client-side state refresh delays turned one slave failure into a cascading system outage.

The fix involves a counterintuitive configuration change that increases load on the master nodes but prevents cluster-wide failures.

About This Article

Problem

Apollo's pipeline implementation in Go-Redis routed each HMGET command to a specific node based on its key's hash slot. When any slave node went down, the client reported the entire batch as failed, even though the Redis Cluster itself, with 3 shards and 2 slaves per shard, kept running.
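A minimal Go sketch of that failure mode, assuming a go-redis v8 cluster client; the addresses, keys, and fields are illustrative, not Apollo's actual code:

```go
package main

import (
	"context"
	"fmt"

	"github.com/go-redis/redis/v8"
)

func main() {
	ctx := context.Background()

	// Cluster client roughly as described in the post; addresses are illustrative.
	rdb := redis.NewClusterClient(&redis.ClusterOptions{
		Addrs:    []string{"redis-node-1:6379", "redis-node-2:6379", "redis-node-3:6379"},
		ReadOnly: true, // send read commands to slave replicas
	})

	// Queue one HMGET per driver. go-redis groups queued commands by hash
	// slot and sends each group to the node that owns that slot, so the
	// batch fans out across different nodes.
	pipe := rdb.Pipeline()
	cmds := make([]*redis.SliceCmd, 0, 3)
	for _, key := range []string{"driver:1", "driver:2", "driver:3"} {
		cmds = append(cmds, pipe.HMGet(ctx, key, "state", "updated_at"))
	}

	// The fragile pattern: Exec returns a non-nil error if any queued
	// command failed, so one unreachable slave fails the whole batch.
	if _, err := pipe.Exec(ctx); err != nil {
		fmt.Println("entire batch treated as failed:", err)
		return
	}
	for _, cmd := range cmds {
		fmt.Println(cmd.Val())
	}
}
```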

Solution

Grab's engineering team set up a separate Redis Cluster client, dedicated to pipelining, with RouteByLatency enabled. It routes reads to the master nodes when slave latency rises above 1ms, so queries keep succeeding as long as the majority partition stays up.
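A minimal sketch of that configuration, again assuming go-redis v8; the package name, function name, and addresses are illustrative:

```go
package apollo

import "github.com/go-redis/redis/v8"

// NewPipelineClient builds the separate cluster client used only for
// pipelined reads. RouteByLatency automatically enables ReadOnly and routes
// each read command to the lowest-latency node, so traffic shifts to the
// master when a slave becomes slow or unreachable. Addresses are illustrative.
func NewPipelineClient() *redis.ClusterClient {
	return redis.NewClusterClient(&redis.ClusterOptions{
		Addrs:          []string{"redis-node-1:6379", "redis-node-2:6379", "redis-node-3:6379"},
		RouteByLatency: true,
	})
}
```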

Impact

Grab also added per-command error checking in pipeline responses instead of failing the whole batch on any error. This isolated failures to individual commands and eliminated the 95% booking failure rate seen during single-node outages.
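A hedged sketch of per-command error checking with go-redis v8; the function shape, field names, and logging are illustrative:

```go
package apollo

import (
	"context"
	"log"

	"github.com/go-redis/redis/v8"
)

// FetchDriverStates checks each pipelined command individually instead of
// discarding the whole batch when Exec reports an error.
func FetchDriverStates(ctx context.Context, rdb *redis.ClusterClient, driverIDs []string) map[string][]interface{} {
	pipe := rdb.Pipeline()
	cmds := make(map[string]*redis.SliceCmd, len(driverIDs))
	for _, id := range driverIDs {
		cmds[id] = pipe.HMGet(ctx, id, "state", "updated_at")
	}

	// Ignore Exec's aggregate error: each command carries its own result and error.
	_, _ = pipe.Exec(ctx)

	states := make(map[string][]interface{}, len(driverIDs))
	for id, cmd := range cmds {
		if err := cmd.Err(); err != nil && err != redis.Nil {
			// Only this driver's lookup failed; the rest of the batch is still usable.
			log.Printf("skipping %s: %v", id, err)
			continue
		}
		states[id] = cmd.Val()
	}
	return states
}
```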
