Consistent Caching Mechanism in Titus Gateway
Article Summary
Netflix's Titus container platform hit a wall: their singleton leader couldn't handle the API query load. Here's how they scaled horizontally without breaking consistency guarantees.
Netflix engineers Tomasz Bak and Fabio Kung share how they redesigned Titus Gateway's architecture. They moved from a single leader handling all queries to a distributed caching system that maintains strict read consistency across multiple gateway nodes.
Key Takeaways
- Median latency increased slightly, but 99th-percentile latency dropped roughly 90% (from 292 ms to 30 ms)
- The system now handles 8K queries per second, double the ~4K limit at which the single leader previously collapsed
- A custom keep-alive protocol ensures clients never read stale data across gateways
- Cache synchronization, driven by monotonic timestamps, adds an average delay of about 4 ms
- Horizontal scaling was achieved without changing the API contract or requiring client migration
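The consistency idea behind these takeaways can be sketched in miniature: each gateway node applies events from the leader's replication stream and records the monotonic timestamp of the newest event it has applied; a read that must observe a client's own write blocks until the local cache has caught up to that write's checkpoint timestamp. The class and method names below are illustrative assumptions, not Titus's actual API:

```python
import threading
import time

class ConsistentCache:
    """Minimal sketch of a gateway-local cache that serves a read only
    after it has observed all events up to a checkpoint timestamp.
    (Hypothetical names; not the actual Titus Gateway implementation.)"""

    def __init__(self):
        self._cond = threading.Condition()
        self._last_applied = 0  # monotonic timestamp of newest applied event
        self._data = {}

    def apply_event(self, ts, key, value):
        # Events arrive from the leader's replication stream in timestamp order.
        with self._cond:
            self._data[key] = value
            self._last_applied = ts
            self._cond.notify_all()  # wake readers waiting on this checkpoint

    def read(self, key, min_ts, timeout=1.0):
        # Block until the cache has caught up to min_ts (the checkpoint the
        # client obtained from its last write), giving read-your-writes.
        deadline = time.monotonic() + timeout
        with self._cond:
            while self._last_applied < min_ts:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    raise TimeoutError("cache did not catch up to checkpoint")
                self._cond.wait(remaining)
            return self._data.get(key)
```

A client that just wrote at checkpoint `ts` passes that `ts` as `min_ts` on its next read; the few-millisecond wait while the cache catches up is what shows up as the ~4 ms average synchronization delay cited above.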
Critical Insight
Netflix doubled its query capacity while cutting tail latencies by 90%, using consistent caching that guarantees read-your-writes semantics across distributed gateway nodes.