Surviving Production: The Performance Gauntlet
Article Summary
Berkay Özdemir from Getir's Market Discovery Team cut category page load times from 8 seconds to 2 seconds. Then production traffic hit, and everything started crashing.
This is Part 2 of Getir's architecture transformation story. After rebuilding their discovery platform with event-driven architecture and denormalized MongoDB, the team faced a brutal reality check: OOMKills, pod crashes, connection pool storms, and failed deployments under real-world load.
Key Takeaways
- OOMKills and CPU throttling caused pods to crash during traffic spikes
- Readiness probes marked pods healthy before JVM warm-up completed
- Connection pool exhaustion triggered cascading timeouts and latency spikes
- Concurrent pagination reduced heap usage and dropped GC pressure significantly
- Isolated liveness probes prevented Kubernetes from killing saturated but healthy pods
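The two probe-related takeaways can be sketched as a Kubernetes deployment fragment. This is an illustrative configuration, not Getir's actual manifest: a startup probe holds off the other probes until JVM warm-up completes, readiness gates traffic, and liveness hits a lightweight in-process endpoint isolated from downstream dependencies so a saturated but healthy pod is not restarted.

```yaml
# Illustrative probe setup (endpoint paths and thresholds are assumptions)
containers:
  - name: discovery-api
    startupProbe:            # delays readiness/liveness until JVM warm-up
      httpGet:
        path: /health/startup
        port: 8080
      failureThreshold: 30   # allows up to 30 * 5s = 150s of warm-up
      periodSeconds: 5
    readinessProbe:          # gates traffic; may check downstream health
      httpGet:
        path: /health/ready
        port: 8080
      periodSeconds: 10
    livenessProbe:           # isolated, in-process check only, so a pod
      httpGet:               # that is merely saturated is never killed
        path: /health/live
        port: 8080
      periodSeconds: 10
      failureThreshold: 3
```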
Category pages now load in 2 seconds (P90) with zero cascading failures, but only after systematically fixing probe configurations, connection pools, memory allocation, and CPU-intensive serialization.
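The connection-pool fix comes down to bounding concurrency and failing fast instead of letting waiters queue up into cascading timeouts. A minimal stand-in using a stdlib semaphore (illustrative only, not the MongoDB driver's actual pool API):

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Sketch of fail-fast pool acquisition: callers wait a bounded time for a
// connection permit; when the pool is exhausted they fail within the wait
// budget instead of piling up and amplifying latency upstream.
public class BoundedPool {
    private final Semaphore permits;
    private final long maxWaitMillis;

    public BoundedPool(int maxConnections, long maxWaitMillis) {
        this.permits = new Semaphore(maxConnections);
        this.maxWaitMillis = maxWaitMillis;
    }

    // Returns true if a connection permit was obtained within the wait budget.
    public boolean acquire() throws InterruptedException {
        return permits.tryAcquire(maxWaitMillis, TimeUnit.MILLISECONDS);
    }

    public void release() {
        permits.release();
    }
}
```

Real drivers expose the same knobs directly (maximum pool size and a maximum wait time), so the fix is usually configuration rather than new code.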
About This Article
Getir's Discovery platform had memory problems because category detail endpoints loaded all products at once instead of using pagination. This caused frequent garbage collection cycles and Stop-The-World pauses, which increased the risk of OOMKill failures when the system was under heavy load.
The team added server-side pagination at the subcategory level. Instead of loading all products into memory, a category detail request now fans out into multiple concurrent paginated queries whose results the orchestrator assembles into a single response.
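The fan-out-and-assemble pattern described above can be sketched with `CompletableFuture`. All names here (`fetchPage`, `loadCategory`, `PAGE_SIZE`) are illustrative, not Getir's actual API; the in-memory `fetchPage` stands in for a paginated MongoDB query:

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.stream.*;

public class PaginatedCategoryFetch {
    static final int PAGE_SIZE = 50;

    // Stand-in for one paginated query: returns a single page of product ids
    // for a subcategory (empty list when past the end).
    static List<String> fetchPage(String subcategory, int page, int total) {
        int from = page * PAGE_SIZE;
        if (from >= total) return List.of();
        int to = Math.min(from + PAGE_SIZE, total);
        return IntStream.range(from, to)
                .mapToObj(i -> subcategory + "-product-" + i)
                .collect(Collectors.toList());
    }

    // Orchestrator: issue the page queries for each subcategory concurrently,
    // then assemble the partial results. Each task holds only one page, so
    // peak heap stays bounded instead of holding the whole category at once.
    static Map<String, List<String>> loadCategory(Map<String, Integer> subcategorySizes,
                                                  ExecutorService pool) {
        Map<String, CompletableFuture<List<String>>> futures = new LinkedHashMap<>();
        subcategorySizes.forEach((sub, total) -> {
            int pages = (total + PAGE_SIZE - 1) / PAGE_SIZE;
            List<CompletableFuture<List<String>>> pageFutures =
                IntStream.range(0, pages)
                    .mapToObj(p -> CompletableFuture.supplyAsync(
                        () -> fetchPage(sub, p, total), pool))
                    .collect(Collectors.toList());
            futures.put(sub, CompletableFuture
                .allOf(pageFutures.toArray(new CompletableFuture[0]))
                .thenApply(v -> pageFutures.stream()
                    .flatMap(f -> f.join().stream())
                    .collect(Collectors.toList())));
        });
        Map<String, List<String>> assembled = new LinkedHashMap<>();
        futures.forEach((sub, f) -> assembled.put(sub, f.join()));
        return assembled;
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        Map<String, Integer> sizes = new LinkedHashMap<>();
        sizes.put("dairy", 120);
        sizes.put("snacks", 75);
        Map<String, List<String>> category = loadCategory(sizes, pool);
        System.out.println(category.get("dairy").size());   // 120
        System.out.println(category.get("snacks").size());  // 75
        pool.shutdown();
    }
}
```

The key property is that memory scales with page size times concurrency, not with the total product count of the category.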
Heap utilization dropped noticeably and garbage collection pressure fell significantly. The platform held its 2-second P90 response times while handling much higher throughput, and reliability improved: upstream outages no longer triggered cascading failures.