Netflix | Tomasz Bak and Fabio Kung | Nov 13, 2018

Consistent Caching Mechanism in Titus Gateway

Article Summary

Netflix's Titus container platform hit a wall: its singleton leader couldn't handle the API query load. Here's how they scaled horizontally without breaking consistency guarantees.

Netflix engineers Tomasz Bak and Fabio Kung share how they redesigned Titus Gateway's architecture. They moved from a single leader handling all queries to a distributed caching system that maintains strict read consistency across multiple gateway nodes.

Key Takeaways

Critical Insight

Netflix nearly doubled its query capacity while improving tail latencies by more than 90%, using consistent caching that guarantees read-your-write semantics across distributed gateway nodes.

The article reveals how they use dummy messages and logical timestamps to solve a tricky distributed systems problem that most caching solutions ignore.

About This Article

Problem

The Titus Job Coordinator, a singleton leader process, became a bottleneck when handling all query requests across Netflix's container platform. Response latencies spiked dangerously as load approached 4.5K queries per second.

Solution

Tomasz Bak and Fabio Kung built a keep-alive synchronization protocol using high-resolution monotonic timestamps. This lets Titus Gateway nodes maintain consistent local caches without polling the leader for every request.
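The core idea can be sketched as follows: each gateway keeps a local cache fed by the leader's event stream, tracks the highest logical timestamp it has applied, and delays a read until the cache has caught up to the timestamp associated with the caller's write. This is an illustrative sketch; the class and method names are invented here, not taken from the Titus codebase.

```python
import threading

class ConsistentCache:
    """Gateway-side cache that serves a read only after it has observed
    events up to a caller-supplied logical timestamp (read-your-write).
    Illustrative sketch; names do not come from the Titus codebase."""

    def __init__(self):
        self._cond = threading.Condition()
        self._checkpoint = 0   # highest logical timestamp applied so far
        self._jobs = {}        # job_id -> job record

    def apply_event(self, timestamp, job_id=None, record=None):
        """Apply an update (or a timestamp-only keep-alive) from the leader."""
        with self._cond:
            if job_id is not None:
                self._jobs[job_id] = record
            self._checkpoint = max(self._checkpoint, timestamp)
            self._cond.notify_all()

    def read(self, job_id, min_timestamp, timeout=1.0):
        """Block until the cache is at least as fresh as min_timestamp,
        then serve the read locally without polling the leader."""
        with self._cond:
            caught_up = self._cond.wait_for(
                lambda: self._checkpoint >= min_timestamp, timeout)
            if not caught_up:
                raise TimeoutError("cache has not caught up to the write")
            return self._jobs.get(job_id)
```

A client that just created a job would pass along the logical timestamp returned by its write, so a subsequent read never observes stale data even if it lands on a different gateway node.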

Impact

The system now scales linearly to 8K queries per second. The 80th percentile latency dropped from 336ms to 22ms. Dummy messages inserted into event streams guarantee keep-alive acknowledgment within a bounded 2ms interval.
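The dummy-message mechanism might look roughly like this: when the leader has no real updates to forward, it emits a no-op event carrying only a fresh timestamp, so subscribers' checkpoints keep advancing and keep-alive acknowledgments stay within a bounded interval even on idle streams. The function and field names below are assumptions for illustration, not Titus internals.

```python
import queue

def pump_events(updates, out_stream, clock, keepalive_interval=0.002,
                max_events=10):
    """Forward real updates to the event stream; when none arrive within
    keepalive_interval (the article cites a 2ms bound), emit a DUMMY
    event so logical timestamps keep advancing. Illustrative sketch."""
    for _ in range(max_events):
        try:
            event = dict(updates.get(timeout=keepalive_interval))
        except queue.Empty:
            event = {"type": "DUMMY"}  # no state change, timestamp only
        clock += 1
        event["timestamp"] = clock
        out_stream.put(event)
    return clock
```

Subscribers treat DUMMY events exactly like real updates for checkpointing purposes, which is what bounds how far a gateway's cache can lag behind the leader when the stream is otherwise quiet.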