Network Latency Issues in Mobile Apps
Article Summary
Anatole Beuzon and Bowen Chen from Datadog turned what seemed like a simple deployment alert into a months-long debugging odyssey. What they uncovered reveals how deceptively complex network latency issues can be.
Datadog's usage estimation service started triggering high-latency alerts on every deployment, regardless of code changes. The team traced the problem through four distinct bottlenecks spanning their entire network stack, from application layer down to the Linux kernel.
Key Takeaways
- Remote cache latency spiked from 100ms to over 1 second during every deployment
- Linux kernel bug forced all traffic through one transmit queue instead of eight
- AWS bandwidth limits silently dropped packets at the hypervisor level
- Fixes enabled 6x scale-down of remote cache, saving hundreds of thousands annually
Critical Insight
What appeared to be a simple network issue required fixing an Envoy CPU bottleneck, patching a Linux kernel bug, migrating EC2 instances, and implementing graceful pod shutdown hooks.