How Meetup Scales Notification Queue Consumers
Article Summary
Meetup sends 8-10 million notifications daily. Its queue kept backing up, delivering messages late or, in some cases, not at all.
Elle Mundy, an SRE at Meetup, shares how her team debugged its AWS SQS autoscaling strategy across three iterations. What started as a simple metric swap turned into a calculus problem that exposed deeper architectural issues.
Key Takeaways
- First attempt: scaling on the age of the oldest message wasted money, since a single stuck message kept the metric high and the fleet scaled up with nothing to process
- Second try: queue depth worked better, but it only rises after a backlog has formed, so scaling kicked in too late during spikes
- Final solution: a custom Lambda calculates a load-to-capacity ratio every minute (see the sketch below)
- New metric revealed they needed 3x more consumer tasks than expected
- Exposed hidden bottleneck: scaling up exhausted the lock keys used for deduplication
By creating a custom metric that divides messages sent by messages received, Meetup now scales proactively before queues back up instead of reacting after notifications are already delayed.
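The article doesn't include Meetup's implementation, but the idea is straightforward to sketch. Below is a minimal, hypothetical version of such a Lambda in Python with boto3; the queue name, metric namespace, and metric name are placeholders rather than Meetup's actual values, and the one-minute window matches the cadence described above.

```python
# Sketch of a Lambda that computes a load-to-capacity ratio from SQS metrics
# and republishes it as a custom CloudWatch metric. Names are illustrative.
import datetime

import boto3

cloudwatch = boto3.client("cloudwatch")

QUEUE_NAME = "notification-queue"  # placeholder queue name


def get_sum(metric_name, start, end):
    """Sum an AWS/SQS metric for the queue over the given window."""
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/SQS",
        MetricName=metric_name,
        Dimensions=[{"Name": "QueueName", "Value": QUEUE_NAME}],
        StartTime=start,
        EndTime=end,
        Period=60,
        Statistics=["Sum"],
    )
    return sum(dp["Sum"] for dp in resp["Datapoints"])


def handler(event, context):
    end = datetime.datetime.now(datetime.timezone.utc)
    start = end - datetime.timedelta(minutes=1)

    sent = get_sum("NumberOfMessagesSent", start, end)          # load: messages produced
    received = get_sum("NumberOfMessagesReceived", start, end)  # capacity: messages consumed

    # A ratio above 1 means producers are outpacing consumers. Guard against
    # dividing by zero when consumers received nothing in the window.
    if received > 0:
        ratio = sent / received
    else:
        ratio = float("inf") if sent > 0 else 1.0

    cloudwatch.put_metric_data(
        Namespace="Custom/Notifications",  # placeholder namespace
        MetricData=[{
            "MetricName": "LoadToCapacityRatio",  # placeholder metric name
            "Value": min(ratio, 100.0),  # cap so CloudWatch never sees inf
            "Unit": "None",
            "Timestamp": end,
        }],
    )
```

Because the ratio compares arrival rate to consumption rate directly, it rises the moment producers outpace consumers, rather than waiting for a backlog to accumulate the way queue depth does.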
About This Article
Meetup's notification system couldn't keep up when cron jobs suddenly spiked traffic. The autoscaling system reacted too slowly, and messages piled up in queues faster than they could be processed. This resulted in late or missing notifications across millions of daily messages.
Elle Mundy's team built a custom Lambda function that tracks a load-to-capacity ratio by dividing NumberOfMessagesSent by NumberOfMessagesReceived every minute. This metric let them scale proactively instead of reactively, with step adjustments that got more aggressive for extreme spikes.
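The article doesn't show Meetup's policy configuration, but AWS's step-scaling API maps naturally onto "more aggressive for extreme spikes." Here is a hedged sketch, assuming a CloudWatch alarm on the custom ratio metric from the earlier sketch triggers the policy; the resource ID, bounds, and step sizes are illustrative assumptions.

```python
# Sketch of a step-scaling policy whose scale-out steps grow with the size of
# the breach. Service names, thresholds, and percentages are placeholders.
import boto3

autoscaling = boto3.client("application-autoscaling")

autoscaling.put_scaling_policy(
    PolicyName="ratio-step-scale-out",
    ServiceNamespace="ecs",
    ResourceId="service/notifications-cluster/consumer-service",  # placeholder
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="StepScaling",
    StepScalingPolicyConfiguration={
        "AdjustmentType": "PercentChangeInCapacity",
        "Cooldown": 60,
        # Bounds are offsets from the alarm threshold on the ratio metric:
        # the further the ratio overshoots, the larger the scale-out step.
        "StepAdjustments": [
            {"MetricIntervalLowerBound": 0.0, "MetricIntervalUpperBound": 1.0,
             "ScalingAdjustment": 25},    # mild spike: +25% tasks
            {"MetricIntervalLowerBound": 1.0, "MetricIntervalUpperBound": 3.0,
             "ScalingAdjustment": 50},    # bigger spike: +50% tasks
            {"MetricIntervalLowerBound": 3.0,
             "ScalingAdjustment": 100},   # extreme spike: double the tasks
        ],
    },
)
```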
The new metric showed that Meetup actually needed three times as many consumer tasks as it had been running. It also exposed an architectural bottleneck: once the system scaled up, it ran out of lock keys used for deduplication, and the team had to expand the cluster to fix it.
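The article doesn't say what the deduplication locks are built on, so the following is purely illustrative: a Redis-style SET NX lock per message ID, a common deduplication pattern, which shows why tripling the consumer count multiplies pressure on the lock keyspace and the cluster behind it.

```python
# Hypothetical per-message dedup lock, assuming a Redis-style store. Every
# consumer racing on the same message contends on one key; more consumers
# means more keys held at once and more load on the lock cluster.
import redis

r = redis.Redis(host="localhost", port=6379)  # placeholder connection


def try_claim(message_id, ttl_seconds=300):
    """Atomically claim a dedup lock; False means another consumer holds it."""
    # SET key value NX EX ttl: succeeds only if the key doesn't already exist.
    return bool(r.set(f"dedup:{message_id}", "1", nx=True, ex=ttl_seconds))


def handle(message_id, body):
    if not try_claim(message_id):
        return  # another task already processed (or is processing) this message
    deliver_notification(body)


def deliver_notification(body):
    # Hypothetical downstream send.
    print("sending:", body)
```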