Firstly, I'm very sorry for the impact this disruption had on your team and the users of your services. We take our performance and resilience extremely seriously and are obviously very disappointed with this outage.
The incidents over the days preceding this outage were all related and, we believe, are now fully resolved. Here's a brief overview of the context and the remediation we undertook.
We’d been battling incidents over the prior few days as email volumes stepped up to north of 7 million a day, from a typical 1 million.
Ultimately the issue was with a check we ran on daily sending totals to ensure the safety limits set per service hadn't been exceeded. The total was usually a cached value: we incremented the count in the cache as each message passed through, returning only occasionally to the database to count rows and confirm it was accurate.
Due to the huge volume in the messages table for our largest user (our main database table had quadrupled in size), that count was timing out when it did go back to the database. This meant that once it had timed out for the first time, every subsequent message sent would also try to count the daily total from the database (and continue timing out).
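For anyone who would like to see that failure mode concretely, here is a minimal Python sketch of the pattern described above. It is an illustration under assumptions, not the application's actual code: the class, the function names, the limit value and the in-memory dictionary standing in for the cache are all hypothetical.

```python
class SlowCountTimeout(Exception):
    """Stands in for a database query timeout (hypothetical)."""


class DailyLimitCheck:
    """Cached daily total per service, with a row count as the fallback."""

    def __init__(self, count_rows, limit):
        self.count_rows = count_rows  # hypothetical callable: service_id -> int
        self.limit = limit
        self.cache = {}               # service_id -> running total for today
        self.db_calls = 0             # how often we fell back to the database

    def allow(self, service_id):
        total = self.cache.get(service_id)
        if total is None:
            # Cache miss: count rows in the database. If this raises, the
            # cache is never filled, so the next message misses again.
            self.db_calls += 1
            total = self.count_rows(service_id)
        total += 1
        self.cache[service_id] = total
        return total <= self.limit


def healthy_count(service_id):
    return 10


def timing_out_count(service_id):
    raise SlowCountTimeout()


def simulate(count_rows, messages=3):
    check = DailyLimitCheck(count_rows, limit=1000)
    for _ in range(messages):
        try:
            check.allow("big-service")
        except SlowCountTimeout:
            pass  # the message is still processed; the count simply failed
    return check.db_calls


healthy_calls = simulate(healthy_count)
failing_calls = simulate(timing_out_count)
```

With a healthy count the database is queried once for the three messages. When the count times out it is queried three times, once per message, which is how the load kept compounding.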
Eventually this led to full CPU and memory usage and an outage. Restarting the database fixed it for a while, but the problem would ultimately return.
During the outage we added some enhanced query monitoring to the database and were then able to identify the cause.
We have removed this safety check from the application.
The problem was very hard to catch: our soak testing had never covered a scenario in which a 7-fold increase in daily platform traffic came from a single service, and that concentration exacerbated the issue.
Since we fixed it we have seen stable service, sending over 50 million messages under huge load with no further issues, so we are confident everything is resolved and significant additional capacity remains available.
We continue to work very hard to maintain and increase performance and capacity to meet this new level of demand - and help you keep your users informed.
Pete
GOV.UK Notify