Consistent Caching Mechanism in Titus Gateway
Article Summary
Netflix's Titus container platform hit a wall: their singleton leader couldn't handle the API query load. Here's how they scaled horizontally without breaking consistency guarantees.
Netflix engineers Tomasz Bak and Fabio Kung share how they redesigned Titus Gateway's architecture. They moved from a single leader handling all queries to a distributed caching system that maintains strict read consistency across multiple gateway nodes.
Key Takeaways
- Median latency increased slightly, but the 99th percentile dropped 90% (292ms to 30ms)
- System now handles 8K queries per second, versus the previous 4K limit before latency collapse
- A custom keep-alive protocol ensures clients never see stale data, regardless of which gateway serves them
- Cache synchronization adds an average 4ms delay, using monotonic timestamps
- Horizontal scaling was achieved without changing the API contract or forcing client migration
Netflix doubled its query capacity while improving tail latencies by 90%, through consistent caching that guarantees read-your-write semantics across distributed gateway nodes.
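To make the read-your-write guarantee concrete, here is a minimal sketch of the idea: a gateway serves a query from its local cache only once the cache has caught up to the timestamp of the client's last acknowledged write. All names and the API shape are hypothetical; this is not Titus's actual implementation.

```python
import threading

class CachedView:
    """Illustrative model of a gateway's local cache. A read carrying a
    client's last-write timestamp blocks until the replication stream has
    advanced at least that far, so the client never observes stale data."""

    def __init__(self):
        self._cond = threading.Condition()
        self._cache_ts = 0   # timestamp of the last event applied locally
        self._data = {}

    def apply_event(self, ts, key, value):
        # Called as events arrive from the leader's replication stream.
        with self._cond:
            self._data[key] = value
            self._cache_ts = ts
            self._cond.notify_all()

    def read(self, key, min_ts, timeout=1.0):
        # Block until the local cache is at least as fresh as min_ts,
        # which is what yields read-your-write semantics.
        with self._cond:
            if not self._cond.wait_for(lambda: self._cache_ts >= min_ts,
                                       timeout=timeout):
                raise TimeoutError("cache lagging behind requested timestamp")
            return self._data.get(key)
```

A client that just wrote at timestamp `t` would issue `read(key, min_ts=t)`; any gateway node can answer once its cache watermark reaches `t`.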
About This Article
The Titus Job Coordinator, a singleton leader process, became a bottleneck when handling all query requests across Netflix's container platform. Response latencies spiked dangerously as load approached 4.5K queries per second.
Tomasz Bak and Fabio Kung built a keep-alive synchronization protocol using high-resolution monotonic timestamps. This lets Titus Gateway nodes maintain consistent local caches without polling the leader for every request.
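A protocol like this needs timestamps that are strictly increasing, so that "the cache has caught up to timestamp t" is unambiguous. A minimal sketch of such a timestamp source follows; it is an illustration of the general technique, not Titus's actual clock implementation.

```python
import time

class MonotonicStamper:
    """Hypothetical high-resolution timestamp source: combines the OS
    monotonic clock with a tie-breaker so every event gets a strictly
    increasing value, even if two events land in the same nanosecond."""

    def __init__(self):
        self._last = 0

    def next(self):
        now = time.monotonic_ns()
        # Never go backwards, and never hand out the same value twice.
        self._last = max(self._last + 1, now)
        return self._last
```

Each event the leader emits would carry one of these stamps, and gateway caches track the highest stamp they have applied.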
The system now scales linearly to 8K queries per second. The 80th percentile latency dropped from 336ms to 22ms. Dummy messages inserted into event streams guarantee keep-alive acknowledgment within a bounded 2ms interval.