Distributed Job Scheduler: Zero to 20k Concurrent Jobs
Article Summary
Glance built a distributed job scheduler that went from zero to handling 20,000+ concurrent jobs. Here's how they did it with Redis and smart architecture.
Ricky Mondal, Senior Engineer at Glance, shares the technical journey of building a fault-tolerant job scheduler to fetch and process real-time content from multiple publishing partners. The system needed to handle millions of tasks across distributed nodes while maintaining high availability.
Key Takeaways
- Used Redis sorted sets, pub/sub, and Lua scripts for atomic state transitions
- Broke workflows into dedicated queues: XML parsing, content creation, asset upload, moderation
- Implemented polling workers that check for due jobs every second, plus real-time notifications via Redis pub/sub
- Built retry mechanisms with exponential backoff and distributed locking to prevent duplicate execution
- Achieved parallel and sequential processing across multiple worker instances for scalability
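The sorted-set scheduling and per-second polling described above can be sketched in a few lines. This is a minimal in-process illustration, not Glance's actual code: a Python stdlib heap stands in for a Redis sorted set, and the comments note the Redis commands (ZADD, ZRANGEBYSCORE) that would replace it in a real deployment.

```python
import heapq

class DelayedQueue:
    """In-memory stand-in for a Redis sorted set keyed by run time."""

    def __init__(self):
        self._heap = []  # entries are (run_at, job_id) pairs

    def schedule(self, job_id, run_at):
        # Redis equivalent: ZADD jobs <run_at> <job_id>
        heapq.heappush(self._heap, (run_at, job_id))

    def pop_due(self, now):
        # Redis equivalent: ZRANGEBYSCORE jobs -inf <now>, followed by ZREM.
        # A polling worker would call this roughly once per second.
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[1])
        return due

q = DelayedQueue()
q.schedule("job-1", run_at=100)
q.schedule("job-2", run_at=50)
print(q.pop_due(now=60))  # → ['job-2'] (job-1 is not yet due)
```

Scoring jobs by their scheduled timestamp is what lets a single sorted set serve both immediate and delayed work: a worker only ever asks for members whose score is at or below the current time.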
The team scaled from basic queue processing to 20k+ concurrent jobs by combining Redis primitives, dedicated worker queues, and robust fault tolerance mechanisms.
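One of the fault-tolerance mechanisms mentioned above, retry with exponential backoff, is simple enough to show directly. The function below is a generic sketch of the pattern (with full jitter, a common variant), not the article's implementation; the base delay and cap are illustrative values.

```python
import random

def backoff_delay(attempt, base=1.0, cap=60.0):
    """Delay before retry `attempt` (0-indexed): grows 1s, 2s, 4s, ...
    capped at `cap`, with full jitter to avoid synchronized retries."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

# Upper bound of the delay window per attempt:
delays = [min(60.0, 1.0 * 2 ** a) for a in range(7)]
print(delays)  # → [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0]
```

The jitter matters at 20k+ concurrent jobs: without it, a burst of failures would have every worker retrying at the same instant, hammering the downstream partner API in waves.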
About This Article
Glance had a problem where multiple workers in their distributed system could pick up the same job at the same time, leading to duplicate processing and inconsistent data.
Ricky Mondal's team added a distributed lock with an expiry time that a worker must acquire before taking on a job. They also used Lua scripts to apply state transitions atomically on the Redis server, which eliminated race conditions between workers.
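The lock-with-expiry idea maps onto Redis's `SET key value NX PX ttl`: the write succeeds only if the key is absent or expired, so exactly one worker wins. The sketch below simulates that semantic with a small in-memory class (the article does not show its code, so names like `try_claim` are illustrative); in production the same call would go to a real Redis instance.

```python
import threading
import time

class FakeRedis:
    """Tiny in-memory stand-in for the one Redis call this pattern needs."""

    def __init__(self):
        self._data = {}           # key -> (value, expiry_deadline)
        self._lock = threading.Lock()

    def set_nx_px(self, key, value, ttl_ms):
        # Redis equivalent: SET key value NX PX ttl_ms
        # Succeeds only if the key is absent or its TTL has lapsed.
        with self._lock:
            now = time.monotonic()
            entry = self._data.get(key)
            if entry is None or entry[1] <= now:
                self._data[key] = (value, now + ttl_ms / 1000)
                return True
            return False

def try_claim(r, job_id, worker_id, ttl_ms=30_000):
    """Return True if this worker won the job lock; losers skip the job.
    The TTL ensures a crashed worker's lock frees itself automatically."""
    return r.set_nx_px(f"lock:{job_id}", worker_id, ttl_ms)

r = FakeRedis()
print(try_claim(r, "job-42", "worker-a"))  # → True: first claim wins
print(try_claim(r, "job-42", "worker-b"))  # → False: lock already held
```

The expiry is the fault-tolerance half of the design: if the winning worker dies mid-job, the lock lapses on its own and another worker can retry, so no job is stranded behind a dead owner.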
With the locks and Lua scripts in place, duplicate job execution was eliminated and job state stayed consistent. This let the system scale to 20,000+ concurrent jobs without data corruption or conflicts.