Grab Oct 30, 2024

How we reduced peak memory and CPU usage of the product configuration management SDK

Article Summary

Grab's configuration management SDK was hurting service performance with memory spikes and CPU throttling. An analysis showed that 98% of services were loading at least 100 times more data than they needed.

The Grab engineering team set out to reduce the resource consumption of the GrabX SDK, which was blocking wider adoption across Grab's superapp platform spanning 700+ cities. They analyzed how services read configuration data and found that the single monolithic JSON file was highly inefficient.

Key Takeaways

Critical Insight

By shifting from a single monolithic JSON file to service-partitioned configs with changelogs, Grab cut SDK memory utilization by up to 70% and maximum CPU utilization by more than 50%.

The article shows how a simple analysis of read patterns revealed that 98% of services were holding configuration data in memory of which more than 99% was data they did not need.

About This Article

Problem

The GrabX SDK required clients to download and decode a single JSON file of more than 100MB every minute. This file contained configurations for all services, but 98% of services needed less than 1% of the data. Downloading and decoding such a large file caused CPU throttling, which raised P99 latency.
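
To put the figures above in perspective, here is a back-of-the-envelope calculation for a single SDK instance. It is only a sketch: it assumes the file is exactly 100MB (the summary says "over 100MB") and that a service needs exactly 1% of it (the summary says "less than 1%"), so the real gap is at least this large. The constant and function names are illustrative, not taken from the GrabX SDK.

```python
# Assumed round numbers based on the figures in the summary.
FULL_FILE_MB = 100        # lower bound: the file was "over 100MB"
PULLS_PER_DAY = 60 * 24   # one download per minute
NEEDED_FRACTION = 0.01    # upper bound: most services needed "less than 1%"


def daily_download_mb(file_mb: float, pulls_per_day: int = PULLS_PER_DAY) -> float:
    """Megabytes downloaded and decoded per day by one SDK instance."""
    return file_mb * pulls_per_day


full_mb = daily_download_mb(FULL_FILE_MB)
needed_mb = daily_download_mb(FULL_FILE_MB * NEEDED_FRACTION)

print(f"monolithic file: {full_mb / 1000:.0f} GB per day per instance")
print(f"data actually needed: {needed_mb / 1000:.2f} GB per day per instance")
```

Under these assumptions each instance pulls and decodes about 144 GB per day, while the data it actually uses accounts for about 1.44 GB of that.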

Solution

Grab split the configuration data into separate JSON files organized by service. They also added service-level changelog files so SDKs could subscribe to only the services they needed and pull incremental updates instead of downloading the full file each time.
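
The idea can be sketched as a small in-memory model. This is not the GrabX SDK's actual API or file format: the `Change`, `ServiceStore`, and `ConfigClient` names, the version-numbered changelog entries, and the snapshot layout are all assumptions made for illustration. The sketch shows a client that loads data only for the services it subscribes to and then applies incremental changelog entries instead of reloading everything.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Change:
    """One hypothetical changelog entry: a key set or deleted at a version."""
    version: int
    key: str
    value: Optional[Any] = None
    deleted: bool = False


@dataclass
class ServiceStore:
    """Stand-in for one service's remote files: a snapshot plus a changelog."""
    snapshot: Dict[str, Any] = field(default_factory=dict)
    snapshot_version: int = 0
    changelog: List[Change] = field(default_factory=list)

    def publish(self, key: str, value: Any = None, deleted: bool = False) -> None:
        """Append a change with the next version number."""
        latest = self.changelog[-1].version if self.changelog else self.snapshot_version
        self.changelog.append(Change(latest + 1, key, value, deleted))

    def changes_since(self, version: int) -> List[Change]:
        """Return only the entries newer than the caller's version."""
        return [c for c in self.changelog if c.version > version]


class ConfigClient:
    """Holds configs only for subscribed services and updates them incrementally."""

    def __init__(self, stores: Dict[str, ServiceStore], subscriptions: List[str]):
        self._stores = stores
        self._configs: Dict[str, Dict[str, Any]] = {}
        self._versions: Dict[str, int] = {}
        for name in subscriptions:
            store = stores[name]
            self._configs[name] = dict(store.snapshot)  # initial load, one service only
            self._versions[name] = store.snapshot_version

    def poll(self) -> int:
        """Apply new changelog entries for subscribed services; return the count applied."""
        applied = 0
        for name, config in self._configs.items():
            for change in self._stores[name].changes_since(self._versions[name]):
                if change.deleted:
                    config.pop(change.key, None)
                else:
                    config[change.key] = change.value
                self._versions[name] = change.version
                applied += 1
        return applied

    def get(self, service: str, key: str, default: Any = None) -> Any:
        return self._configs[service].get(key, default)

    def subscribed(self) -> List[str]:
        return sorted(self._configs)


# Usage: the client subscribes to "food" only, so the "transport" change is never fetched.
stores = {
    "food": ServiceStore(snapshot={"surge_enabled": False}),
    "transport": ServiceStore(snapshot={"max_radius_km": 5}),
}
client = ConfigClient(stores, subscriptions=["food"])
stores["food"].publish("surge_enabled", True)
stores["transport"].publish("max_radius_km", 8)
applied = client.poll()
```

In this model, memory holds one service's keys rather than every service's, and each poll costs work proportional to the number of new changes rather than the size of the full dataset, which is the property the partitioned design relies on.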

Impact

Memory utilization dropped by up to 70%. Maximum CPU utilization fell by more than 50%. The partitioned layout on S3 supports 5,500 read requests per second per configuration prefix.