Surviving Production: The Performance Gauntlet
Article Summary
Berkay Özdemir from Getir's Market Discovery Team cut category page load times from 8 seconds to 2 seconds. Then production traffic hit, and everything started crashing.
This is Part 2 of Getir's architecture transformation story. After rebuilding their discovery platform with event-driven architecture and denormalized MongoDB, the team faced a brutal reality check: OOMKills, pod crashes, connection pool storms, and failed deployments under real-world load.
Key Takeaways
- OOMKills and CPU throttling caused pods to crash during traffic spikes
- Readiness probes marked pods healthy before JVM warm-up completed
- Connection pool exhaustion triggered cascading timeouts and latency spikes
- Concurrent pagination reduced heap usage and dropped GC pressure significantly
- Isolated liveness probes prevented Kubernetes from killing saturated but healthy pods
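The two probe-related takeaways can be sketched as a Kubernetes deployment fragment. This is an illustrative configuration, not Getir's actual manifest: a startup probe holds off the other probes until JVM warm-up completes, readiness gates traffic, and liveness hits a lightweight in-process endpoint isolated from downstream dependencies so a saturated but healthy pod is not restarted.

```yaml
# Illustrative probe setup (endpoint paths and thresholds are assumptions)
containers:
  - name: discovery-api
    startupProbe:            # delays readiness/liveness until JVM warm-up
      httpGet:
        path: /health/startup
        port: 8080
      failureThreshold: 30   # allows up to 30 * 5s = 150s of warm-up
      periodSeconds: 5
    readinessProbe:          # gates traffic; may check downstream health
      httpGet:
        path: /health/ready
        port: 8080
      periodSeconds: 10
    livenessProbe:           # isolated, in-process check only, so a pod
      httpGet:               # that is merely saturated is never killed
        path: /health/live
        port: 8080
      periodSeconds: 10
      failureThreshold: 3
```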
Category pages now load in 2 seconds (P90) with zero cascading failures, but only after systematically fixing probe configurations, connection pools, memory allocation, and CPU-intensive serialization.
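The connection-pool fix comes down to bounding concurrency and failing fast instead of letting waiters queue up into cascading timeouts. A minimal stand-in using a stdlib semaphore (illustrative only, not the MongoDB driver's actual pool API):

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Sketch of fail-fast pool acquisition: callers wait a bounded time for a
// connection permit; when the pool is exhausted they fail within the wait
// budget instead of piling up and amplifying latency upstream.
public class BoundedPool {
    private final Semaphore permits;
    private final long maxWaitMillis;

    public BoundedPool(int maxConnections, long maxWaitMillis) {
        this.permits = new Semaphore(maxConnections);
        this.maxWaitMillis = maxWaitMillis;
    }

    // Returns true if a connection permit was obtained within the wait budget.
    public boolean acquire() throws InterruptedException {
        return permits.tryAcquire(maxWaitMillis, TimeUnit.MILLISECONDS);
    }

    public void release() {
        permits.release();
    }
}
```

Real drivers expose the same knobs directly (maximum pool size and a maximum wait time), so the fix is usually configuration rather than new code.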
About This Article
Getir's Discovery platform had memory problems because category detail endpoints loaded all products at once instead of using pagination. This caused frequent garbage collection cycles and Stop-The-World pauses, which increased the risk of OOMKill failures when the system was under heavy load.
The team added server-side pagination at the subcategory level. Instead of loading all products into memory, a category detail request now fans out into multiple concurrent paginated queries whose results the orchestrator assembles into a single response.
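The fan-out-and-assemble pattern described above can be sketched with `CompletableFuture`. All names here (`fetchPage`, `loadCategory`, `PAGE_SIZE`) are illustrative, not Getir's actual API; the in-memory `fetchPage` stands in for a paginated MongoDB query:

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.stream.*;

public class PaginatedCategoryFetch {
    static final int PAGE_SIZE = 50;

    // Stand-in for one paginated query: returns a single page of product ids
    // for a subcategory (empty list when past the end).
    static List<String> fetchPage(String subcategory, int page, int total) {
        int from = page * PAGE_SIZE;
        if (from >= total) return List.of();
        int to = Math.min(from + PAGE_SIZE, total);
        return IntStream.range(from, to)
                .mapToObj(i -> subcategory + "-product-" + i)
                .collect(Collectors.toList());
    }

    // Orchestrator: issue the page queries for each subcategory concurrently,
    // then assemble the partial results. Each task holds only one page, so
    // peak heap stays bounded instead of holding the whole category at once.
    static Map<String, List<String>> loadCategory(Map<String, Integer> subcategorySizes,
                                                  ExecutorService pool) {
        Map<String, CompletableFuture<List<String>>> futures = new LinkedHashMap<>();
        subcategorySizes.forEach((sub, total) -> {
            int pages = (total + PAGE_SIZE - 1) / PAGE_SIZE;
            List<CompletableFuture<List<String>>> pageFutures =
                IntStream.range(0, pages)
                    .mapToObj(p -> CompletableFuture.supplyAsync(
                        () -> fetchPage(sub, p, total), pool))
                    .collect(Collectors.toList());
            futures.put(sub, CompletableFuture
                .allOf(pageFutures.toArray(new CompletableFuture[0]))
                .thenApply(v -> pageFutures.stream()
                    .flatMap(f -> f.join().stream())
                    .collect(Collectors.toList())));
        });
        Map<String, List<String>> assembled = new LinkedHashMap<>();
        futures.forEach((sub, f) -> assembled.put(sub, f.join()));
        return assembled;
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        Map<String, Integer> sizes = new LinkedHashMap<>();
        sizes.put("dairy", 120);
        sizes.put("snacks", 75);
        Map<String, List<String>> category = loadCategory(sizes, pool);
        System.out.println(category.get("dairy").size());   // 120
        System.out.println(category.get("snacks").size());  // 75
        pool.shutdown();
    }
}
```

The key property is that memory scales with page size times concurrency, not with the total product count of the category.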
Heap utilization dropped noticeably and garbage collection pressure fell significantly. The platform held its 2-second P90 response times while handling much higher throughput, and reliability improved: upstream outages no longer triggered cascading failures.