Preventing Pipeline Calls from Crashing Redis Clusters
Article Summary
Grab lost 95% of ride bookings for a full minute when a single Redis slave node failed. Their highly available cluster with multiple replicas somehow became a single point of failure.
Grab's engineering team dissects a production outage in Apollo, their critical driver state machine service. The post-mortem reveals how Redis Cluster pipelining and the Go-Redis client created an unexpected vulnerability despite redundant infrastructure.
Key Takeaways
- A single slave failure caused a 95%+ failure rate despite 3 shards with 2 slaves each
- The Go-Redis client refreshes cluster node state only every 60 seconds, prolonging outages
- Pipeline calls fail entirely if any single command in the batch hits an unreachable node
- The ReadOnly flag routes all reads to slaves, creating a hidden dependency on every slave node (see the sketch below)
A seemingly resilient Redis Cluster became brittle because pipeline error handling and client-side state refresh delays turned one slave failure into a cascading system outage.
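To make that hidden dependency concrete, here is a minimal go-redis sketch of a cluster client configured with the ReadOnly flag; the addresses, key, and client version (v8) are assumptions, not Grab's actual setup.

```go
package main

import (
	"context"
	"fmt"

	"github.com/go-redis/redis/v8"
)

func main() {
	ctx := context.Background()

	// ReadOnly tells go-redis to serve read commands from slave nodes,
	// so every slave in the cluster becomes a hidden dependency for reads.
	client := redis.NewClusterClient(&redis.ClusterOptions{
		Addrs:    []string{"redis-a:6379", "redis-b:6379", "redis-c:6379"}, // placeholders
		ReadOnly: true,
	})
	defer client.Close()

	// This read is routed to a slave holding the key's hash slot; if that
	// slave is down, the command fails even though its master is healthy.
	fmt.Println(client.Get(ctx, "driver:42:state").Val())
}
```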
About This Article
Apollo's pipeline implementation in Go-Redis sent each HMGET command to the specific node owning its key's hash slot. When any slave node went down, the entire batch failed, even though the Redis Cluster kept running with 3 shards and 2 slaves per shard.
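A sketch of what that read path can look like, reusing the client and context from the snippet above; the function name, key pattern, and field names are hypothetical. The failure mode is that Exec surfaces a single aggregate error, and treating it as fatal discards the whole batch.

```go
// fetchDriverStates issues one HMGET per driver through a single pipeline.
func fetchDriverStates(ctx context.Context, client *redis.ClusterClient,
	driverIDs []string) ([]redis.Cmder, error) {

	pipe := client.Pipeline()
	for _, id := range driverIDs {
		// Each HMGET is routed to the node owning its key's hash slot.
		pipe.HMGet(ctx, "driver:"+id, "state", "city")
	}

	// Exec returns every queued command plus the first error it hit.
	// Propagating that error wholesale is what turned one dead slave
	// into a failed batch.
	cmds, err := pipe.Exec(ctx)
	if err != nil {
		return nil, err
	}
	return cmds, nil
}
```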
Grab's engineering team set up a separate Redis Cluster client with RouteByLatency enabled for pipelining. It routes reads through master nodes when slave latency exceeds 1ms, so queries keep working as long as the majority partition stays up.
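A sketch of that configuration under the same assumptions as above. In go-redis, RouteByLatency sends each read-only command to the lowest-latency reachable node and implicitly enables ReadOnly, which matches the routing behavior described; the 1ms behavior is not a tunable client option, so it is not shown here.

```go
// A second cluster client reserved for pipelined reads. RouteByLatency
// makes go-redis pick the lowest-latency node, master or slave, for each
// read, so a dead or slow slave quickly stops attracting traffic.
pipelineClient := redis.NewClusterClient(&redis.ClusterOptions{
	Addrs:          []string{"redis-a:6379", "redis-b:6379", "redis-c:6379"}, // placeholders
	RouteByLatency: true, // implies ReadOnly in go-redis
})
defer pipelineClient.Close()
```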
Grab added per-query error checking in pipeline responses instead of failing the whole batch on any error. This isolated failures to individual commands and eliminated the 95% failure rate seen during single-node outages.
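A sketch of per-command error handling on the same hypothetical batch; the logging is illustrative, the driverIDs slice carries over from the earlier snippet, and pipeline results come back in the order the commands were queued.

```go
// Run the batch but handle errors per command rather than per batch.
cmds, err := pipe.Exec(ctx)
if err != nil {
	// Exec still returns every command; the aggregate error only means
	// at least one of them failed, so keep going.
	log.Printf("pipeline saw at least one error: %v", err)
}

for i, cmd := range cmds {
	if cmd.Err() != nil {
		// Only this query failed; skip or retry it individually instead
		// of discarding the rest of the batch.
		log.Printf("HMGET for driver %s failed: %v", driverIDs[i], cmd.Err())
		continue
	}
	fields := cmd.(*redis.SliceCmd).Val() // field values for one driver
	_ = fields
}
```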