Debugging High Latency in the Market Store
Article Summary
Grab's ML feature store saw its latency climb from 200 ms to 2 seconds. The culprit? A single line of code in an async library.
Grab's engineering team debugged a critical performance issue in Market-Store, their real-time ML feature serving system that powers dynamic pricing and consumer experience. What looked like a simple latency spike turned into a deep dive into Go memory management.
Key Takeaways
- P99 latency jumped 10x, from 200 ms to 2 seconds, during peak traffic
- pprof heap profiling revealed that contexts weren't being cleaned up after tasks completed
- Memory leak traced to a single merge request that switched background contexts to task contexts
- Root cause: uncancelled child contexts prevented garbage collection even though the work had completed
- Restarting the service only temporarily relieved the problem; memory kept growing even as load decreased
A well-intentioned library update that switched context types without proper cancellation caused a memory leak that degraded API latency by 10x.
About This Article
Market-Store's memory usage kept growing over a 12-hour window even while system load dropped. The leak returned after service restarts and worsened when traffic spiked.
Grab's team ran pprof heap profiling and found that child contexts created by the Async Library weren't being cancelled after tasks finished. A recent merge request had changed how contexts were initialized without adding the corresponding cleanup code.
The team learned that derived contexts must be explicitly cancelled via their CancelFunc; until then, the parent keeps a reference to the child, so the garbage collector can never reclaim it. This became an important lesson for writing concurrent Go applications without memory leaks.