Engineering Blog

We’re the ZocDoc engineering team, and we love sweet code, dogs, jetpacks, TF2, robots, and lasers... all at once. Read about our adventures here, or apply for a career at ZocDoc!

Zocron

It seems like every web application needs to have some facility for running tasks periodically. Whether you’re maintaining caches, asynchronously consuming queues, or doing dataset analysis, automating work on a schedule is pretty much ubiquitous.

ZocDoc is no exception! We have hundreds of jobs that run throughout the day, so having a service to run these reliably is important. There are plenty of job scheduler options available. Cron is a popular choice, but problematic for us since we mostly use Microsoft technologies. SQL Server Jobs are reliable and proven, but getting SQL to interact with our CLR codebase is not realistic. Windows Scheduled Tasks are hard to configure if you don’t have production credentials. So, for a variety of reasons, we made our own job scheduler which we’ve affectionately named “Zocron”.

Design Philosophy

ZocDoc’s requirements in a job scheduler were not peculiar, but they drove our decision-making when implementing Zocron. These requirements were:

Suitable for a Variety of Tasks

With Zocron, we wanted to put an end to the various, sometimes one-off, job schedulers that had cropped up at ZocDoc. Bringing these all under one roof would give us a single system to test and would let every job benefit from enhancements as they're introduced. The benefits of unifying the scheduling infrastructure are obvious, but we also wanted to avoid making all jobs suffer from cruft introduced by a few. For example, we didn't want jobs that simply ping an HTTP endpoint periodically to carry the overhead required by a heavy-duty data processing job.

Consistently Available

ZocDoc has an aggressive code push schedule, meaning we can deploy new code as frequently as once a day. On the website, we need to provide the same functionality to our patients even when code is deploying and services are shut down. The same attitude prevailed for our job scheduler: the services depending on jobs running reliably should not have to worry about Zocron downtime.

Expressive in Scheduling

What good is a job scheduler if you can’t schedule a job to run a fortnight after the first full moon preceding the vernal equinox? Well, maybe that’s too expressive. But we certainly needed something more sophisticated than “run every n minutes”.
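To make the point concrete, here is a minimal sketch of how a cron-style expression can say things like "minute 30 of every other hour, weekdays only," which "run every n minutes" cannot. This is an illustrative toy matcher, not Zocron's actual parser, and it uses Python's day-of-week convention (0 = Monday) rather than cron's:

```python
from datetime import datetime

def field_matches(field: str, value: int) -> bool:
    """Match one cron field: '*', '*/n', 'a-b', single values, or comma lists."""
    for part in field.split(","):
        if part == "*":
            return True
        if part.startswith("*/"):
            if value % int(part[2:]) == 0:
                return True
        elif "-" in part:
            lo, hi = map(int, part.split("-"))
            if lo <= value <= hi:
                return True
        elif int(part) == value:
            return True
    return False

def matches(expr: str, t: datetime) -> bool:
    """Fields: minute, hour, day-of-month, month, day-of-week.
    NOTE: day-of-week here uses Python's convention (0 = Monday), not cron's."""
    minute, hour, dom, month, dow = expr.split()
    return (field_matches(minute, t.minute) and field_matches(hour, t.hour)
            and field_matches(dom, t.day) and field_matches(month, t.month)
            and field_matches(dow, t.weekday()))

# "Minute 30 of every other hour, on weekdays" -- richer than "every n minutes".
print(matches("30 */2 * * 0-4", datetime(2013, 6, 5, 14, 30)))  # a Wednesday afternoon
```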

Easy to Monitor

A system is only as good as the tools that visualize its activity. Zocron needed to report what it was doing and when, all in an unambiguous way.

With those needs in mind, we went to work developing the infrastructure for running jobs. While developing, we tried to stick to two simple ideals. Firstly, the scheduling infrastructure should be agnostic about the nature of the tasks it runs. Zocron shouldn’t behave differently for long-running jobs and short-lived ones, and it shouldn’t have special cases. Secondly, tasks themselves should be ignorant about what else might be running at the same time. We feel that, by sticking to these ideals, Zocron has given us a platform to do scheduled work that will serve us for a long while.

Figure 1: Real-time job monitoring

Architectural Overview

The Zocron application itself is a .NET application that is installed as a Windows Service. We’ve deployed it to several machines in our production environment, and we push new code for it every day along with our web servers. Each instance of the Zocron application is called an “agent.” Some agents run on heavy-duty machines and maintain large in-memory caches for important datasets, while other agents are more minimally provisioned. Together, this pool of Zocron agents shares the workload of our scheduled jobs.

Configuring Zocron jobs is a matter of updating values in a normalized, relational database. Jobs are defined in a ZocronJob table. Each job is composed of one or more tasks, themselves defined in a ZocronTask table and related by a ZocronJobStep table. And, since we love the expressiveness of Cron expressions, we store those expressions in a ZocronJobSchedule table. The relational database is what gives Zocron much of its flexibility: a shared initialization task can be reused in many jobs, multiple Cron expressions can be combined to provide a rich schedule, and resource-demanding jobs can be assigned to more powerful agents. There is plenty of other minutiae that can be controlled in these tables, and it is all editable from a web interface that maintains an auditable change log.
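As a rough illustration of how the tables compose, here is a sketch with Python dictionaries standing in for the actual tables (the job names, task names, and cron expressions are all made up):

```python
# Hypothetical in-memory mirror of the tables described above.
zocron_task = {1: "warm cache", 2: "process queue", 3: "send report"}
zocron_job = {10: "nightly-report", 11: "queue-consumer"}

# ZocronJobStep: (job id, step order) -> task id.
# Note the shared "warm cache" task is reused by both jobs.
zocron_job_step = {(10, 1): 1, (10, 2): 3, (11, 1): 1, (11, 2): 2}

# ZocronJobSchedule: a single job may carry several cron expressions.
zocron_job_schedule = {
    10: ["0 2 * * *"],                                 # nightly at 2:00
    11: ["*/5 9-17 * * *", "0 0-8,18-23 * * *"],       # busy by day, hourly at night
}

def tasks_for(job_id):
    """Resolve a job's ordered task list through the job-step mapping."""
    steps = sorted(key for key in zocron_job_step if key[0] == job_id)
    return [zocron_task[zocron_job_step[key]] for key in steps]

print(tasks_for(10))  # ['warm cache', 'send report']
```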

The database is also where we log execution information. Each time a job is locked by an agent, its ZocronJobId and scheduled time are inserted into a ZocronJobRunLog table. Similarly, as tasks start and end, information is recorded in a ZocronTaskRunLog table. This gives us details about when something starts, when it finishes, and what the result was. We then expose this information to the dev team through web tools, making it easy to monitor the status of our jobs in real time.

Synchronization

When a Zocron agent starts, it executes a stored procedure in the database to lock some number of eligible jobs. The stored procedure combines information about a job’s schedule (ZocronJobSchedule table) with its history (ZocronJobRunLog table). The query is basically asking, “Are there any jobs that are scheduled to run now that aren’t already running?” It does this with the help of a clever cross apply, a CLR function, and some covering indexes. The magic happens in the code snippet below.

SELECT zj.ZocronJobId,
       ISNULL(CAST(x.ScheduledDateEt as datetime), @minDateThresholdEt) as LastRunScheduledDateEt
INTO #jobHistory
FROM ZocronJob zj
outer apply (
    SELECT top 1 zjrl.ScheduledDateEt, zjrl.EndDateUtc
    FROM ZocronJobRunLog zjrl with (updlock, holdlock)
    WHERE zj.ZocronJobId = zjrl.ZocronJobId
    ORDER BY zjrl.ScheduledDateEt DESC
) x

INSERT INTO ZocronJobRunLog (ZocronJobId, ScheduledDateEt, ZocronAgentId, StartDateUtc)
SELECT top (@numberToLock)
       x.ZocronJobId,
       x.ScheduleToLock as ScheduledDateEt,
       @zocronAgentId as ZocronAgentId,
       @nowUtc as StartDateUtc
FROM (
    SELECT zjs.ZocronJobId,
           csLast.Occurrence as ScheduleToLock,
           ROW_NUMBER() over (
               PARTITION BY zjs.ZocronJobId
               ORDER BY csLast.Occurrence DESC
           ) as rn
    FROM #jobHistory jh
    inner join ZocronJobSchedule zjs on jh.ZocronJobId = zjs.ZocronJobId
    cross apply CrontabLastOccurrence_fn(zjs.CronExpression, jh.LastRunScheduledDateEt, @nowEt) csLast
) x
WHERE x.rn = 1

The first statement gets the last run for each job and stores it in a temp table. The second statement then gets the Cron expression schedules for each job and parses them with a CLR function. This function returns the last occurrence of the given expression, if one exists within the given date range. Because we use a cross apply, jobs that aren’t scheduled to run right now get filtered out. The ROW_NUMBER() function is there because a single job can have multiple schedules, and we should select at most one row per job. Finally, we insert the jobs we want to lock into the run log table.

The ZocronJobRunLog table serves two important purposes here. First, it is a record of which jobs are executed and when (which comes in handy for our monitoring tools). Second, it indicates that a job scheduled to run at a certain time has been assigned to a specific agent. Once a job instance is assigned to an agent, no other agent will attempt to grab that same scheduled instance. Because we can guarantee this, each agent can execute the stored procedure on its own and jobs get dealt out among them. SQL Server’s transactions and locks make it possible to interleave stored procedure calls without any additional synchronization: the updlock hint on the first statement acquires an update lock, while the holdlock hint holds that lock until the transaction commits. Therefore, if two agents execute the stored procedure at the same time, one will block the other.
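Outside of SQL, the selection logic can be sketched roughly like this. In this Python stand-in, precomputed occurrence lists play the role of CrontabLastOccurrence_fn, and all names and dates are invented:

```python
from datetime import datetime

# Hypothetical data: each job's last locked occurrence, plus per-schedule
# occurrence lists standing in for what CrontabLastOccurrence_fn derives
# from a cron expression.
last_run = {1: datetime(2013, 6, 5, 9, 0), 2: datetime(2013, 6, 5, 9, 30)}
schedule_occurrences = {
    1: [[datetime(2013, 6, 5, 9, 0), datetime(2013, 6, 5, 10, 0)]],     # one schedule
    2: [[datetime(2013, 6, 5, 9, 30)], [datetime(2013, 6, 5, 9, 45)]],  # two schedules
}

def jobs_to_lock(now):
    locked = []
    for job_id, schedules in schedule_occurrences.items():
        # cross apply: occurrences after the job's last run, up to now
        candidates = [occ for sched in schedules for occ in sched
                      if last_run[job_id] < occ <= now]
        if candidates:
            # ROW_NUMBER() ... WHERE rn = 1: at most one row per job, latest wins
            locked.append((job_id, max(candidates)))
    return locked

print(jobs_to_lock(datetime(2013, 6, 5, 10, 0)))
```

Jobs with no new occurrence simply produce no candidates and are filtered out, mirroring how the cross apply drops jobs that aren't due.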

Learnings

ZocDoc has been using Zocron for about a year now. During that time, we’ve learned a lot about job scheduling and made modest enhancements to Zocron.

We learned early that exposing the proper monitoring signals demystifies how Zocron is working. We’ve got a dashboard that lists the jobs that are currently running along with jobs that are scheduled to run but haven’t yet. We look for blocking chains to make sure no agent is holding database locks for long periods of time. And comparing job run durations against previous runs helps us identify both gradual and sudden slowdowns.
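The duration comparison can be as simple as flagging any job whose latest run exceeds some multiple of its historical average. This is a sketch of the idea, not our actual dashboard code, and the job names and thresholds are invented:

```python
def flag_slowdowns(history, current, threshold=1.5):
    """Flag jobs whose latest run took more than `threshold` times their
    average historical duration (durations in seconds)."""
    flagged = []
    for job, durations in history.items():
        avg = sum(durations) / len(durations)
        if current[job] > threshold * avg:
            flagged.append(job)
    return flagged

history = {"rebuild-cache": [60, 65, 70], "send-digest": [10, 12, 11]}
current = {"rebuild-cache": 300, "send-digest": 12}
print(flag_slowdowns(history, current))  # ['rebuild-cache']
```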

Also, we’ve come to really appreciate the flexibility database-backed configuration gives us. We can open up a web page, change job settings, and see the effects immediately, all without recycling processes or redeploying binaries. If something is not working as expected, that behavior is brought to our attention immediately and we can turn jobs off. Though we can’t foresee everything that might happen when new code is pushed, having everything in the database gives us options to respond to almost any situation.

Lastly, one of the biggest problems that arose for us was maintaining good quality of service when a single job started consuming large amounts of system resources. Our expectation was that jobs running concurrently would be isolated from each other, but it turned out that if one job loaded a dataset large enough to page to disk, all jobs on that agent would suffer huge performance penalties. Our engineers took a good long look at the problematic jobs and applied these good citizenship strategies:

Consume from Queues

When you have a job scheduler, it is tempting to schedule something for midnight that loads every appointment ever and calculates something about each of them. However, that kind of naïve dataset processing can’t scale forever, especially when you are going nationwide. Our best jobs, therefore, consume from a change queue. For any dataset, when an entity is added or changed, a row is inserted into a change log table. Each change is assigned an auto-incrementing identifier. Zocron jobs can then track the value of the last change identifier processed. On the next run of the job, Zocron will pull back the changes with identifier values greater than that of the last processed one. It is a simple but reliable way of detecting which entities have changed and should be reprocessed.
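A minimal sketch of this change-queue pattern, assuming a change log of (id, entity) pairs and a persisted high-water mark (the names and data are illustrative; in practice the mark lives in the database, not a global):

```python
# Hypothetical change log: (auto-incrementing change id, changed entity).
change_log = [(1, "appt-100"), (2, "appt-101"), (3, "appt-100"), (4, "appt-102")]
last_processed_id = 2  # persisted between runs

def run_job():
    """One scheduled run: reprocess only entities changed since the last run."""
    global last_processed_id
    pending = [(cid, ent) for cid, ent in change_log if cid > last_processed_id]
    processed = []
    for change_id, entity in pending:
        processed.append(entity)       # do the real reprocessing work here
        last_processed_id = change_id  # advance the high-water mark
    return processed

print(run_job())  # ['appt-100', 'appt-102']
```

A second run immediately afterwards finds nothing past the mark and does no work, which is exactly the cheap steady state we want.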

Bound the Working Set Size

Even when tracking changes, though, it is possible to be overwhelmed. If a burst of changes comes in, it may not be safe to load all of them up and reprocess them. Therefore, whenever a Zocron job is getting items to work on, we always cap the number that is pulled back. So, if there have been 100,000 changed entities since the last time a Zocron job ran, it will only process the first 10,000 of them. A subsequent run of the job can take care of the next 10,000. The job will just keep restarting until all the changes have been consumed. By batching up work like this, not only can we allow garbage collection to free up the memory from the earlier changes, but we also save our work incrementally, making it easier to recover when failures arise.
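The batching loop might look like this sketch (the batch size and data are invented, and the real jobs persist the high-water mark between runs rather than looping in-process):

```python
def run_batched(changes, last_id, batch_size=3):
    """One Zocron-style run: process at most batch_size changes past last_id,
    returning the new high-water mark so a later run can continue."""
    batch = [change for change in changes if change > last_id][:batch_size]
    for change in batch:
        pass  # do the real work here, checkpointing as we go
    return batch[-1] if batch else last_id

changes = list(range(1, 11))   # ten pending change ids
mark, runs = 0, 0
while mark < changes[-1]:      # "the job will just keep restarting"
    mark = run_batched(changes, mark)
    runs += 1
print(runs)  # 4 runs of at most 3 changes each
```

Because each run returns its mark before taking on more work, a crash mid-backlog loses at most one batch, not the whole run.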

More Frequent, Small Runs

But if you limit the amount of work a job can do, will it ever finish doing what it is supposed to do? It will if you schedule it frequently enough. We’ve found that if a job is scheduled to run with small batches frequently throughout the day, it can easily process everything it needs to. Plus, you avoid having a slew of heavy-duty tasks all kicking off at midnight.

After changing our larger jobs to be friendlier to the health of the agent, the reliability of all jobs improved. Today, we’re looking at ways to further isolate jobs from each other, including wild ideas like running jobs in separate app domains.