After the timezone change from BST to GMT at 2am BST/1am GMT on Sunday 30th October, we saw intermittent issues with some of our regularly-scheduled tasks failing to run. The most prominent impact was that scheduled bulk email/text message sends may not have gone out exactly on time, and instead went out 15 or 30 minutes late.
We use a piece of software called celery to schedule and run our regular tasks, such as processing bulk notification jobs. There are some open issues on celery, which we weren’t previously aware of, around jobs failing to be scheduled when a timezone change (such as a daylight saving time change) happens. As the issues have not been fixed in celery yet, we cannot simply upgrade the dependency to resolve the problem for the future, so we need to explore other options.
To fix the immediate problem of tasks not being scheduled, celery just needs to be restarted. Jobs will then start to be scheduled and run on time again. Our fix during the incident was to re-deploy our celery application, and we will make sure that happens automatically after future timezone changes until the issue is fixed permanently.
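One way the automatic restart could be triggered is by checking whether the local UTC offset has changed recently. The following is a minimal sketch of that check only, not our actual deployment tooling. The function names, the one-hour lookback window and the use of the `Europe/London` zone are assumptions made for illustration, and the code relies on Python's standard `zoneinfo` module with timezone data available on the host.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

# Assumption: the scheduler's local timezone is Europe/London (BST/GMT).
LONDON = ZoneInfo("Europe/London")


def offset_changed_between(start_utc, end_utc, tz=LONDON):
    """Return True if the zone's UTC offset differs at the two instants.

    Only the two endpoints are compared, so the window should be kept
    shorter than the gap between consecutive clock changes.
    """
    start_offset = start_utc.astimezone(tz).utcoffset()
    end_offset = end_utc.astimezone(tz).utcoffset()
    return start_offset != end_offset


def should_restart_scheduler(now_utc, lookback=timedelta(hours=1)):
    """Hypothetical helper: True if a clock change happened within `lookback`."""
    return offset_changed_between(now_utc - lookback, now_utc)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    if should_restart_scheduler(now):
        # The restart itself (for example a re-deploy) is outside this sketch.
        print("UTC offset changed in the last hour: restart the scheduler")
    else:
        print("No recent timezone change detected")
```

A check like this could run hourly from something that does not itself depend on celery's scheduler, so that it still runs when the scheduler is the component that has stalled.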
We will also improve our monitoring and alerting around scheduled tasks (specifically bulk email/text message sends) running on time, so that we can proactively handle the situation in the future before users notice any significant impact.
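As an illustration of what an "on time" check might look like, here is a rough sketch that flags scheduled tasks whose most recent run is older than expected. It is not our monitoring implementation. The task names, the five-minute grace period and the idea of keeping a last-run timestamp per task are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def overdue_tasks(last_run_at, expected_interval, now, grace=timedelta(minutes=5)):
    """Return the sorted names of tasks whose last run is too old.

    last_run_at:       task name -> datetime of the most recent run (UTC)
    expected_interval: task name -> timedelta between scheduled runs
    grace:             extra slack allowed before a task counts as overdue
    """
    overdue = []
    for name, last_run in last_run_at.items():
        allowed_age = expected_interval[name] + grace
        if now - last_run > allowed_age:
            overdue.append(name)
    return sorted(overdue)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Hypothetical task names and timings, for illustration only.
    last_runs = {
        "process-bulk-jobs": now - timedelta(minutes=20),
        "nightly-cleanup": now - timedelta(hours=2),
    }
    intervals = {
        "process-bulk-jobs": timedelta(minutes=1),
        "nightly-cleanup": timedelta(days=1),
    }
    for task in overdue_tasks(last_runs, intervals, now):
        print(f"ALERT: {task} has not run on schedule")
```

Storing timestamps in UTC keeps the comparison unaffected by the clock change itself, which matters because the check needs to keep working at the moment the scheduler is most likely to stall.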