Scaling Cron Scripts: A Case Study from Slack

Slack is a messaging platform for efficient team collaboration. Its success depends on the right message reaching the right person on time, so notifications are critical to them. This also means that a lot of their functionality relies on cron scripts. These scripts ensure:

  • Timely reminders.

  • Email notifications.

  • Message notifications.

  • Database clean-up.

Cron jobs are used to schedule and automate repetitive tasks. They ensure that specific scripts or commands run at predefined intervals without any manual intervention.
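As a refresher, a classic crontab entry pairs a five-field schedule with a command; the script path below is purely illustrative:

```
# ┌───────── minute (0-59)
# │ ┌─────── hour (0-23)
# │ │ ┌───── day of month (1-31)
# │ │ │ ┌─── month (1-12)
# │ │ │ │ ┌─ day of week (0-6, Sunday = 0)
# │ │ │ │ │
  0 9 * * 1 /usr/local/bin/send-weekly-digest.sh   # 09:00 every Monday
```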

As the platform expanded, so did the number of cron scripts, along with the amount of data those scripts processed. This caused a dip in the reliability of the overall execution environment.

The Issues

Below are the issues that Slack was facing:

  • A single node executed all the scripts locally. It kept a copy of the scripts and one crontab file with the schedules. At scale, this solution wasn't easy to maintain.

  • Vertically scaling the node by adding more CPU and RAM to support a growing number of cron scripts was not cost-effective.

  • The individual node was a single point of failure. Any configuration issues could bring down critical Slack functionality.

To solve these issues, the team decided to build a more reliable and scalable cron execution service.

The System Components

The new cron service had three main components:

  1. Scheduled Job Conductor.

This new service was written in Go and deployed on Bedrock (Slack's in-house wrapper around Kubernetes).

It essentially mimics cron behaviour using a Go-based cron library. Deploying it on Bedrock allows the team to scale up multiple pods easily. Only one pod is tasked with scheduling while the others remain in standby mode.
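Slack's post doesn't name the library, but github.com/robfig/cron is a widely used Go option. Here is a minimal sketch of what the conductor's scheduling loop might look like; the job name and the enqueueJob helper are hypothetical:

```go
package main

import (
	"log"

	"github.com/robfig/cron/v3"
)

// enqueueJob is a hypothetical stand-in for handing a job off to
// Slack's Job Queue; the conductor never runs the script itself.
func enqueueJob(name string) {
	log.Printf("enqueued %s", name)
}

func main() {
	c := cron.New() // parses standard 5-field cron expressions

	// Register each script's schedule, much like a crontab line.
	if _, err := c.AddFunc("0 9 * * 1", func() { enqueueJob("send-weekly-digest") }); err != nil {
		log.Fatal(err)
	}

	c.Start() // the scheduler runs in its own goroutine
	select {} // block forever; a real service would handle shutdown signals
}
```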

While this may feel like intentionally building in a single point of failure, the Slack team felt that synchronizing schedules across multiple active nodes would be a bigger headache. Two additional points supported this choice (see the leader-election sketch after this list):

  • Pods can switch leaders easily if the current leader goes down, which makes downtime quite unlikely.

  • They offload all the memory- and CPU-intensive work of actually running the scripts to Slack's Job Queue; the leader pod only handles scheduling.
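The post doesn't describe the election mechanism itself. On Kubernetes, one common way to get this "one active pod, others on standby" behaviour is Lease-based leader election via client-go; the sketch below assumes that approach, and the startScheduler hook stands in for the cron loop above:

```go
package main

import (
	"context"
	"log"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// startScheduler is a hypothetical hook that would start the cron
// scheduling loop and block until ctx is cancelled.
func startScheduler(ctx context.Context) {
	log.Println("became leader, scheduling cron jobs")
	<-ctx.Done()
}

func main() {
	cfg, err := rest.InClusterConfig() // running inside the cluster
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	id, _ := os.Hostname() // the pod name works as a unique identity

	// All conductor pods compete for the same Lease object; the
	// holder is the active scheduler, the rest stay on standby.
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "cron-conductor", Namespace: "default"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: startScheduler,
			OnStoppedLeading: func() {
				// Lost the lease: exit so we never double-schedule;
				// Kubernetes restarts the pod as a fresh standby.
				log.Println("lost leadership, exiting")
				os.Exit(0)
			},
		},
	})
}
```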


  2. Slack's Job Queue.

The Job Queue is an existing component that serves many requirements at Slack. It is an asynchronous compute platform that runs about 9 billion jobs per day and consists of multiple "queues".

These queues are logical pathways that move jobs through Kafka into Redis, where the job metadata is stored. From Redis, the job is finally handed over to a job worker: a node that actually executes the cron script.
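The Job Queue's internal API isn't public, so the following is only a sketch of what the enqueue side of a Kafka-fronted pipeline could look like, using the segmentio/kafka-go client; the broker address, topic name, and payload shape are all assumptions:

```go
package main

import (
	"context"
	"encoding/json"
	"log"
	"time"

	"github.com/segmentio/kafka-go"
)

// JobRequest is an assumed payload; Slack's actual job metadata
// schema isn't described in the post.
type JobRequest struct {
	Name       string    `json:"name"`
	EnqueuedAt time.Time `json:"enqueued_at"`
}

func main() {
	// The conductor only produces to Kafka; a separate relay drains
	// the topic into Redis, and workers pick jobs up from there.
	w := &kafka.Writer{
		Addr:     kafka.TCP("kafka:9092"), // assumed broker address
		Topic:    "cron-jobs",             // assumed topic name
		Balancer: &kafka.Hash{},           // same job name -> same partition
	}
	defer w.Close()

	payload, _ := json.Marshal(JobRequest{Name: "send-weekly-digest", EnqueuedAt: time.Now()})
	if err := w.WriteMessages(context.Background(), kafka.Message{
		Key:   []byte("send-weekly-digest"),
		Value: payload,
	}); err != nil {
		log.Fatal(err)
	}
}
```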


Since this system already existed and could handle the compute and memory load, it was easy for the team to adapt it to handle cron jobs as well.

  3. Vitess database table.

Lastly, they employed a Vitess database table to handle the job data, particularly for two purposes:

  • Handling deduplication.

  • Job tracking and reporting for internal users.

For those of you who may not be aware, Vitess is a scalable, MySQL-compatible, cloud-native database.

In the new system, each job execution is recorded as a new row in a table, and the job's status is updated as it moves through the various stages (enqueued, in progress, done).

Before starting a new run of a job, the system checks whether another instance of that job is already running. The same table also backs a web page that displays cron script execution information, letting internal users look up the state of their script runs and any errors they encountered.
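Since Vitess speaks the MySQL wire protocol, a deduplication check like this can be written with plain database/sql in Go. This is only a sketch under assumptions: the table name, columns, and DSN are invented, and the check-then-insert pattern is race-prone in general but tolerable here because a single leader pod performs all scheduling:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql" // Vitess is MySQL-compatible
)

// Assumed schema, one row per job execution:
//   CREATE TABLE job_runs (
//     id       BIGINT AUTO_INCREMENT PRIMARY KEY,
//     job_name VARCHAR(255) NOT NULL,
//     status   ENUM('enqueued','in_progress','done','failed') NOT NULL,
//     run_at   DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP
//   );

func tryEnqueue(db *sql.DB, jobName string) (int64, bool, error) {
	// Deduplication: skip this run if another instance is still active.
	var active int
	err := db.QueryRow(
		`SELECT COUNT(*) FROM job_runs
		 WHERE job_name = ? AND status IN ('enqueued', 'in_progress')`,
		jobName).Scan(&active)
	if err != nil || active > 0 {
		return 0, false, err
	}

	// Record the new run; a worker later flips the status to
	// 'in_progress' and finally to 'done' or 'failed'.
	res, err := db.Exec(
		`INSERT INTO job_runs (job_name, status) VALUES (?, 'enqueued')`, jobName)
	if err != nil {
		return 0, false, err
	}
	id, _ := res.LastInsertId()
	return id, true, nil
}

func main() {
	db, err := sql.Open("mysql", "user:pass@tcp(vitess:3306)/jobs") // assumed DSN
	if err != nil {
		log.Fatal(err)
	}
	if id, ok, err := tryEnqueue(db, "send-weekly-digest"); err != nil {
		log.Fatal(err)
	} else if ok {
		log.Printf("run %d enqueued", id)
	} else {
		log.Println("skipped: a previous run is still active")
	}
}
```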

Thank you and see you in the next one.

Special credits to the official Slack blog and Saurabh.