Rails has added jitter to ActiveJob::Exceptions.retry_on to prevent the thundering herd effect.
Before Rails 6.1
Thundering herd effect in the context of ActiveJob job failures
When a job fails, ActiveJob::Exceptions.retry_on catches the exception and reschedules the job after a specified delay. The delay can be a number of seconds (an ActiveSupport::Duration is also allowed), :exponentially_longer, or a block that receives the number of executions as its argument.
Consider a situation where a lot of jobs fail around the same time. The cause could be anything, such as a network error or heavy load on the database. The exception handler catches the error for each job and reschedules it with a static wait time.
The problem with a static wait time is that jobs which failed around the same time are also rescheduled around the same time, so they are all retried together once the interval elapses. Many failed jobs retried at once cause the thundering herd effect, which can easily bring the system down and make the jobs fail again. They are then retried after the same static interval, with similar results, until the retry limit is exhausted.
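The pile-up is easy to see in a small simulation. The sketch below is plain Ruby and does not use Active Job; the job count, the one-second failure window, and the 5-second wait are made-up numbers chosen for illustration.

```ruby
# Hypothetical scenario: 1,000 jobs fail within the same second and each
# one is rescheduled after the same fixed wait.
failed_jobs = 1_000
wait = 5 # seconds

# Failure times spread evenly over one second (0.0 up to 0.999).
failure_times = Array.new(failed_jobs) { |i| i.to_f / failed_jobs }

# A static wait shifts every failure by the same amount, so the retries
# stay packed into a single one-second window.
retry_times = failure_times.map { |t| t + wait }

# Count retries per whole second to expose the load spike.
retries_per_second = retry_times.group_by(&:floor).transform_values(&:size)

# All 1,000 retries are counted under second 5.
puts retries_per_second.inspect
```

Nothing about the retry relieves the original pressure: the system sees the same burst again, only 5 seconds later.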
With Rails 6.1
The thundering herd effect can be taken care of with a custom wait argument block. But since many users are not aware that a situation like this can happen, jitter was introduced by default to help randomize the wait time. The wait time is randomized with jitter depending on the type of wait argument provided. Essentially, a slight random delay calculated from the jitter value is added to the wait time. Note that jitter is applied only when an Integer, an ActiveSupport::Duration, or :exponentially_longer is passed as the wait argument to retry_on. The jitter value defaults to 15% (represented as 0.15), but can be overridden by passing it as an argument:
retry_on CustomAppException, wait: 5.seconds, attempts: 3, jitter: 0.30
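To make the effect of the jitter value concrete, the calculation can be sketched in plain Ruby. This is an approximation, not the Active Job source: delay_with_jitter is a hypothetical helper, and it assumes the extra delay is a random fraction of wait × jitter.

```ruby
# Hypothetical helper (not part of the Active Job API): returns the wait
# time plus a random extra of at most `wait_seconds * jitter`.
def delay_with_jitter(wait_seconds, jitter: 0.15, random: Random.new)
  jitter = jitter ? jitter.to_f : 0.0 # false or nil disables jitter
  wait_seconds + random.rand * wait_seconds * jitter
end

# With wait: 5 and jitter: 0.30, each delay lies between 5.0 and 6.5 seconds.
delays = Array.new(5) { delay_with_jitter(5, jitter: 0.30) }
puts delays.map { |d| d.round(2) }.inspect

# With jitter disabled, the delay is exactly the wait time.
puts delay_with_jitter(5, jitter: false)
```

Because every job draws its own random extra, jobs that failed together are no longer retried in the same instant.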
A default jitter value can also be set at the config level (it defaults to 0.15):
config.active_job.retry_jitter = 0.45
If you do not wish to add jitter, it can be disabled by providing a zero/false value:
retry_on CustomAppException, wait: 5.seconds, attempts: 3, jitter: 0 # or false
Or at the config level:
config.active_job.retry_jitter = false # or 0
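With the built-in jitter switched off, or with a wait block (which the built-in jitter does not cover), randomness can still be added by hand. A minimal sketch, in which the linear backoff, the 25% spread, and the name custom_wait are all illustrative assumptions:

```ruby
# Hypothetical wait block: linear backoff plus hand-rolled randomness.
# In a job class the same lambda would be given inline as the `wait:`
# option of retry_on.
custom_wait = lambda do |executions|
  base = executions * 10        # 10s, 20s, 30s, ...
  base + rand(0.0..base * 0.25) # add up to 25% extra, chosen at random
end

(1..3).each do |executions|
  puts "attempt #{executions}: retry in #{custom_wait.call(executions).round(1)}s"
end
```

The block receives the number of executions so far, so the delay can grow with each attempt while still differing from job to job.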