Datadog Anatole Beuzon May 26, 2023

Network Latency Issues in Mobile Apps

Article Summary

For Anatole Beuzon and Bowen Chen of Datadog, what seemed like a simple deployment alert became a months-long debugging effort. Their investigation shows how deceptively complex network latency issues can be.

Datadog's usage estimation service started triggering high-latency alerts on every deployment, regardless of code changes. The team traced the problem through four distinct bottlenecks spanning their entire network stack, from the application layer down to the Linux kernel.

Key Takeaways

Critical Insight

What appeared to be a simple network issue required four separate fixes: resolving an Envoy CPU bottleneck, patching a Linux kernel bug, migrating to network-optimized EC2 instances, and implementing graceful pod shutdown hooks.

The team shares specific AWS ENA metrics and Kubernetes configurations that could have caught these issues earlier and saved dozens of debugging hours.
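As a rough illustration of the kind of check such metrics allow, the sketch below parses the output of `ethtool -S <interface>` and reports any ENA allowance counters that are above zero. This summary does not name the metrics the team used, so the counter names here are an assumption: they are the allowance-exceeded counters documented for the AWS ENA driver. The sample output and its values are invented for the example.

```python
import re

# Allowance counters documented for the AWS ENA driver. Assumption: the
# summary does not say which ENA metrics the team relied on.
ALLOWANCE_COUNTERS = (
    "bw_in_allowance_exceeded",
    "bw_out_allowance_exceeded",
    "pps_allowance_exceeded",
    "conntrack_allowance_exceeded",
    "linklocal_allowance_exceeded",
)


def parse_ethtool_stats(text):
    """Parse `ethtool -S <iface>` output into a {counter_name: value} dict."""
    stats = {}
    for line in text.splitlines():
        # Each statistic line looks like "     counter_name: 123".
        match = re.match(r"^\s*([\w.\-\[\]]+):\s*(\d+)\s*$", line)
        if match:
            stats[match.group(1)] = int(match.group(2))
    return stats


def exceeded_allowances(text):
    """Return only the allowance counters whose value is greater than zero."""
    stats = parse_ethtool_stats(text)
    return {
        name: stats[name]
        for name in ALLOWANCE_COUNTERS
        if stats.get(name, 0) > 0
    }


# Illustrative sample only; these numbers are made up.
SAMPLE = """NIC statistics:
     tx_timeout: 0
     bw_in_allowance_exceeded: 0
     bw_out_allowance_exceeded: 1523
     pps_allowance_exceeded: 0
     conntrack_allowance_exceeded: 0
     linklocal_allowance_exceeded: 7
     queue_0_tx_cnt: 90211
"""

if __name__ == "__main__":
    print(exceeded_allowances(SAMPLE))
```

A nonzero allowance counter means the instance shaped or dropped traffic because a per-instance limit was hit, which is one signal that the bottleneck sits below the application.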

About This Article

Problem

During rollouts, p99 remote cache latency in Datadog's counter application still swung between 300ms and 1 second, even after the team allocated more CPU to Envoy. That showed the issue wasn't just about scaling: multiple bottlenecks were stacked up in the network layer.

Solution

The team used Datadog's Network Performance Monitoring to dig into each network component. They found a Linux kernel bug that was limiting ENA transmit queues and patched it. Then they switched to network-optimized EC2 instances that offered more bandwidth.

Impact

Once they added preStop hooks for graceful pod shutdown, p99 remote cache latency dropped below 100ms consistently. This let them shrink the remote cache infrastructure by 6x and cut hundreds of thousands of dollars from annual costs.
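The summary does not say what the team's preStop hooks actually ran, so the following is only a minimal sketch of one common pattern: a preStop hook that sleeps for a short drain period, giving in-flight requests time to finish before the container receives SIGTERM. The function name, container name, image, and durations are all hypothetical. `lifecycle.preStop.exec.command` and `terminationGracePeriodSeconds` are standard Kubernetes pod spec fields.

```python
import json


def graceful_shutdown_spec(name, image, drain_seconds=15, exit_margin_seconds=15):
    """Build a pod spec fragment with a sleep-based preStop hook.

    Hypothetical helper for illustration; not the configuration from the article.
    """
    if drain_seconds <= 0 or exit_margin_seconds <= 0:
        raise ValueError("durations must be positive")
    return {
        # The grace period countdown includes the time spent in the preStop
        # hook, so it must be longer than the drain sleep or the container
        # is killed before it can exit cleanly.
        "terminationGracePeriodSeconds": drain_seconds + exit_margin_seconds,
        "containers": [
            {
                "name": name,
                "image": image,
                "lifecycle": {
                    "preStop": {
                        # Assumes the image ships a `sleep` binary.
                        "exec": {"command": ["sleep", str(drain_seconds)]}
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder name and image, chosen only for the example.
    spec = graceful_shutdown_spec("counter", "example/counter:latest", 20, 10)
    print(json.dumps(spec, indent=2))
```

The printed JSON can be placed under `spec.template.spec` of a Deployment. The right drain time depends on how long load balancers and peers take to stop routing traffic to the terminating pod.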