Grab Oct 20, 2022

Debugging High Latency in the Market Store

Article Summary

Grab's ML feature store saw latency climb from 200 ms to 2 seconds. The culprit? A single line of code in an async library.

Grab's engineering team debugged a critical performance issue in Market-Store, their real-time ML feature serving system that powers dynamic pricing and consumer experience. What looked like a simple latency spike turned into a deep dive into Go memory management.

Key Takeaways

Critical Insight

A well-intentioned library update that switched context types without proper cancellation caused a memory leak that degraded API latency by 10x.

The fix was simple once found, but the debugging journey reveals why Go context cancellation is non-negotiable at scale.

About This Article

Problem

Market-Store's memory usage kept growing over 12 hours even as system load dropped. The leak recurred after service restarts and got worse when traffic spiked.

Solution

Grab's team ran pprof heap profiling and found that child contexts from the Async Library weren't being cancelled after tasks finished. A recent merge request had changed how contexts were initialized without adding the corresponding cleanup code.

Impact

The team learned that contexts need to be explicitly cancelled by calling their CancelFunc so the garbage collector can reclaim them; an uncancelled child context remains reachable from its parent. This became an important lesson for writing concurrent Go applications without memory leaks.
