6 Lessons Learned from Optimizing the Performance of a Node.js Service
Article Summary
Klarna's A/B testing platform needed single-digit millisecond response times at the 99.9th percentile. Their Node.js service was instead spiking to multi-second latencies under load.
The team built a performance testing pipeline to catch issues before production. Load testing revealed hidden bottlenecks that standard monitoring completely missed.
Key Takeaways
- DNS resolution created tens of thousands of queued requests from the StatsD client
- Batching Kafka messages every second eliminated multi-second response time spikes
- Event loop metrics (Active Requests/Handles) exposed problems CPU/memory didn't show
- Extended 10-minute tests revealed issues that 2-minute tests completely missed
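The Kafka batching takeaway above can be sketched as a simple buffer that is flushed on a one-second timer, so each flush produces one batched send instead of one produce call per message. This is a minimal illustration, not Klarna's implementation; `sendBatch` is a hypothetical stand-in for whatever Kafka producer call the service actually uses.

```javascript
// Minimal sketch: buffer outgoing messages and flush once per second,
// so the producer sees one batch instead of a burst of single sends.
class MessageBatcher {
  constructor(sendBatch, flushIntervalMs = 1000) {
    this.sendBatch = sendBatch; // stand-in for the real Kafka produce call
    this.buffer = [];
    this.timer = setInterval(() => this.flush(), flushIntervalMs);
  }

  add(message) {
    // Cheap in-memory append on the hot path; no network I/O here.
    this.buffer.push(message);
  }

  flush() {
    if (this.buffer.length === 0) return;
    const batch = this.buffer;
    this.buffer = [];
    this.sendBatch(batch); // one produce call for the whole interval
  }

  stop() {
    clearInterval(this.timer);
    this.flush(); // drain anything still buffered on shutdown
  }
}
```

The trade-off is bounded: messages wait at most one flush interval, in exchange for far fewer synchronous produce calls blocking the event loop during load spikes.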
Six optimization lessons transformed a Node.js service from unpredictable multi-second spikes to consistent sub-millisecond performance under sustained load.
About This Article
Itamar B's team found that the StatsD client was resolving the hostname for every outgoing message. This created tens of thousands of queued UV_GETADDRINFO requests that overwhelmed the event loop, even though CPU and memory usage stayed low.
They fixed it by adding DNS caching outside the client: monkey patching Node.js's dns module so lookups respect record TTL values. This avoided the StatsD client's built-in indefinite caching, which would keep serving stale addresses after a load balancer redeployment.
The DNS caching fix sharply cut the number of queued active requests, removing the bottleneck that had caused response times to spike by several seconds during sustained load testing of Klarna's A/B testing platform.