Meetup Elle Mundy Oct 18, 2022

How Meetup Scales Notification Queue Consumers

Article Summary

Meetup sends 8-10 million notifications daily. Its notification queue kept backing up, so messages went out late or not at all.

Elle Mundy, an SRE at Meetup, shares how her team debugged their AWS SQS autoscaling strategy through three iterations. What started as a simple metric swap turned into a calculus problem that exposed deeper architectural issues.

Key Takeaways

Critical Insight

By creating a custom metric that divides messages sent by messages received, Meetup now scales proactively before queues back up instead of reacting after notifications are already delayed.

The solution worked so well it immediately broke something else in their system (and forced them to rethink their entire ECS cluster strategy).

About This Article

Problem

Meetup's notification system couldn't keep up when cron jobs caused sudden traffic spikes. The autoscaling system reacted too slowly, and messages piled up in the queues faster than consumers could process them. The result was late or missing notifications across millions of daily messages.

Solution

Elle Mundy's team built a custom Lambda function that tracks a load-to-capacity ratio by dividing NumberOfMessagesSent by NumberOfMessagesReceived every minute. This metric let them scale proactively instead of reactively, with step adjustments that got more aggressive for extreme spikes.
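To make the ratio and the step adjustments concrete, here is a minimal Python sketch of the arithmetic described above. It is not Meetup's code: the function names, the thresholds, and the task counts in STEP_ADJUSTMENTS are hypothetical values chosen for illustration, since the summary does not give the real ones. In an actual Lambda function, the two inputs would presumably be read from the SQS metrics NumberOfMessagesSent and NumberOfMessagesReceived in CloudWatch, and the result published back as a custom metric.

```python
from typing import List, Tuple

# Hypothetical step thresholds: (minimum ratio, consumer tasks to add).
# Ordered from most to least aggressive. The real values are not in the article.
STEP_ADJUSTMENTS: List[Tuple[float, int]] = [
    (4.0, 8),  # extreme spike: scale out hard
    (2.0, 4),  # load is at least double the consumption rate
    (1.2, 1),  # load slightly exceeds the consumption rate
]


def load_to_capacity_ratio(sent: float, received: float) -> float:
    """Divide messages sent by messages received over the same one-minute window.

    A ratio above 1.0 means producers are outpacing consumers.
    """
    if received <= 0:
        # Nothing was consumed: treat any incoming traffic as unbounded load.
        return float("inf") if sent > 0 else 0.0
    return sent / received


def tasks_to_add(ratio: float) -> int:
    """Return how many consumer tasks to add for the given ratio."""
    for minimum_ratio, step in STEP_ADJUSTMENTS:
        if ratio >= minimum_ratio:
            return step
    return 0  # consumers are keeping up; no scale-out needed


if __name__ == "__main__":
    ratio = load_to_capacity_ratio(sent=300, received=100)
    print(ratio, tasks_to_add(ratio))
```

Because the ratio compares incoming traffic to consumption rate rather than measuring queue depth, it rises as soon as producers outpace consumers, which is what lets scaling start before a backlog builds.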

Impact

The new metric showed that Meetup actually needed three times more consumer tasks than they thought. It also exposed a bottleneck in the architecture. When the system scaled up, it ran out of lock keys for deduplication. They had to expand the cluster to fix it.