Network Latency Issues Across the Stack
Article Summary
Anatole Beuzon and Bowen Chen from Datadog turned what seemed like a simple deployment alert into a months-long debugging odyssey. What they uncovered reveals how deceptively complex network latency issues can be.
Datadog's usage estimation service started triggering high-latency alerts on every deployment, regardless of code changes. The team traced the problem through four distinct bottlenecks spanning their entire network stack, from the application layer down to the Linux kernel.
Key Takeaways
- Remote cache latency spiked from 100ms to over 1 second during every deployment
- Linux kernel bug forced all traffic through one transmit queue instead of eight
- AWS bandwidth limits silently dropped packets at the hypervisor level
- Fixes enabled 6x scale-down of remote cache, saving hundreds of thousands of dollars annually
What appeared to be a simple network issue required fixing an Envoy CPU bottleneck, patching a Linux kernel bug, migrating EC2 instances, and implementing graceful pod shutdown hooks.
About This Article
During rollouts, Datadog's counter application saw p99 remote cache latency swing between 300ms and 1 second even after allocating more CPU to Envoy. The team realized the issue wasn't just about scaling: multiple bottlenecks were stacked across the network stack.
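For reference, p99 means the 99th-percentile request latency: 99% of requests complete at or below that value. A minimal nearest-rank computation, sketched in Python (illustrative only; Datadog's metrics pipeline aggregates percentiles differently, and the sample values are made up):

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n)
    in the sorted sample list."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]

# Hypothetical request latencies in milliseconds
latencies_ms = [120, 95, 300, 110, 1000, 105, 98, 102, 250, 99]
print(percentile(latencies_ms, 99))  # with 10 samples, rank 10 -> the max (1000)
```

With so few samples the p99 is just the maximum; at production request volumes it isolates the slowest 1% of requests, which is why it surfaces deployment-time spikes that averages hide.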
The team used Datadog's Network Performance Monitoring to dig into each network component. They found and patched a Linux kernel bug that was forcing ENA traffic onto a single transmit queue, then migrated to network-optimized EC2 instances that offered more bandwidth.
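A single-transmit-queue symptom like this can be spotted by counting the queues the kernel actually exposes for an interface under sysfs. Below is a minimal Python sketch, assuming a Linux host; the interface names are illustrative, and this is not Datadog's own tooling:

```python
import os

def count_tx_queues(iface: str) -> int:
    """Count transmit queues the kernel exposes for a network interface.

    Each tx-N directory under /sys/class/net/<iface>/queues is one
    transmit queue; a multi-queue NIC such as AWS ENA normally shows
    several, so seeing only tx-0 on such hardware is a red flag.
    """
    queues_dir = f"/sys/class/net/{iface}/queues"
    return sum(1 for name in os.listdir(queues_dir) if name.startswith("tx-"))

if __name__ == "__main__":
    # "lo" (loopback) exists on any Linux host; a real diagnosis would
    # target the ENA interface, e.g. "eth0" or "ens5".
    print(count_tx_queues("lo"))
```

If the count is lower than the instance type's advertised queue count, all traffic is being serialized through fewer queues than the hardware supports, which matches the contention the team observed.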
Once they added preStop hooks for graceful pod shutdown, p99 remote cache latency dropped below 100ms consistently. This let them shrink the remote cache infrastructure by 6x and cut hundreds of thousands of dollars from annual costs.
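The preStop hook mentioned above is a standard Kubernetes lifecycle mechanism; a minimal pod-spec sketch is shown below (the container name, drain duration, and grace period are illustrative assumptions, not Datadog's actual values):

```yaml
# Pod spec fragment: delay SIGTERM so in-flight cache requests can drain
# and load balancers can deregister the pod before it stops serving.
spec:
  containers:
    - name: counter-app                          # illustrative name
      lifecycle:
        preStop:
          exec:
            command: ["sh", "-c", "sleep 15"]    # drain window; tune per service
  terminationGracePeriodSeconds: 30              # must exceed the drain window
```

Without such a hook, pods terminating during a rollout can drop or slow in-flight requests, which shows up as exactly the kind of deployment-correlated p99 spikes described here.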