Intermittent errors being returned by API and admin interface

Incident Report for GOV.UK Notify

Postmortem

Firstly, I'm very sorry for the impact this disruption had on your team and the users of your services. We take our performance and resilience extremely seriously and are obviously very disappointed with this outage.

The issues over the days preceding this outage were all related and, we believe, are now fully resolved. Here's a brief overview of the context and the remediation we undertook.

We’d been battling incidents over the prior few days as email volumes stepped up to north of 7 million a day, from a typical 1 million.

Ultimately the issue was with a check we did on daily sending totals to ensure the safety limits set per service hadn't been exceeded. It was usually a cached value and we incremented the count in the cache as each message passed through, returning only occasionally to the database to count rows to ensure it was accurate.
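As a rough illustration of the kind of check described above, here is a minimal sketch in Python. It is not our actual code: the class name, the method names and the in-memory dictionary standing in for the cache are all assumptions made for the example, and the database count is passed in as a plain function.

```python
from typing import Callable, Dict


class DailyLimitChecker:
    """Hypothetical sketch of a cached daily sending total per service."""

    def __init__(self, count_rows: Callable[[str], int]) -> None:
        # count_rows stands in for the database query that counts
        # today's messages for a service (assumed, not the real query).
        self._count_rows = count_rows
        # A plain dict stands in for the cache.
        self._cache: Dict[str, int] = {}

    def record_and_check(self, service_id: str, daily_limit: int) -> bool:
        """Count one more message and report whether the limit still holds."""
        total = self._cache.get(service_id)
        if total is None:
            # Cache miss: return to the database for an accurate count.
            total = self._count_rows(service_id)
        total += 1
        self._cache[service_id] = total
        return total <= daily_limit

    def expire(self, service_id: str) -> None:
        """Drop the cached value so the next message recounts from the database."""
        self._cache.pop(service_id, None)
```

In this sketch the database is only consulted when the cached value is missing, which is why the check is cheap for almost every message.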

Due to the huge volume in the messages table for our largest user (our main database table had quadrupled in size), that count was timing out when it did go back to the database. This meant that once it timed out for the first time, every subsequent message sent would also try to count the daily total from the database (and continue timing out).
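The failure mode can be sketched in a few lines of Python. This is a simplified model rather than our real code: the function names and the CountTimeout exception are hypothetical, and it assumes the cache is only written after a successful count, so a count that times out leaves the cache empty for the next message.

```python
from typing import Callable, Dict


class CountTimeout(Exception):
    """Hypothetical stand-in for a database query timing out."""


def check_daily_total(cache: Dict[str, int], service_id: str,
                      count_rows: Callable[[str], int]) -> int:
    """Return the new daily total, counting from the database on a cache miss."""
    total = cache.get(service_id)
    if total is None:
        # If this raises, the cache is never populated (assumed behaviour).
        total = count_rows(service_id)
    cache[service_id] = total + 1
    return total + 1


def simulate(messages: int) -> int:
    """Return how many database counts are attempted while sending `messages`."""
    attempts = 0

    def slow_count(service_id: str) -> int:
        nonlocal attempts
        attempts += 1
        raise CountTimeout(service_id)  # the table is too large to count in time

    cache: Dict[str, int] = {}
    for _ in range(messages):
        try:
            check_daily_total(cache, "largest-service", slow_count)
        except CountTimeout:
            pass  # the cache stays empty, so the next message tries again
    return attempts
```

Under these assumptions simulate(n) attempts one expensive count per message, instead of one count in total, which is the pattern that loaded the database.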

Eventually this led to full CPU and memory usage on the database and an outage. Restarting the database fixed it for a short time, but the problem would ultimately return.

During the outage we added some enhanced query monitoring to the database and were then able to identify the cause.

We have removed this safety check from the application.

The problem was very hard to catch: our soak testing never played out a scenario in which a 7-fold increase in daily platform traffic came from a single service, and that concentration of traffic is what exacerbated the issue.

Since we fixed it we have seen stable service, sending over 50 million messages under huge load with no further issues, so we are confident everything is resolved and significant additional capacity remains available.

We continue to work very hard to maintain and increase performance and capacity to meet this new level of demand - and help you keep your users informed.

Pete

GOV.UK Notify

Posted Mar 27, 2020 - 08:47 GMT

Resolved

Service continues to be stable under load. This incident has been resolved. We'll share an incident review shortly.

Posted Mar 18, 2020 - 08:31 GMT

Monitoring

We believe we have fixed the issue and GOV.UK Notify is operational again.

We are aware that during this time many requests to send notifications failed. All requests for which we returned a 201 HTTP response have now been sent.

We are currently processing a small backlog of delivery receipts for notifications which we expect to be completed by 4:30PM.

We apologise for this outage. It was caused by a combination of a bug in our caching layer and unprecedented traffic.

We’ve taken action to mitigate this. We will continue to monitor the situation and will update with a full description of the incident in the coming days.

Thanks,
Leo
GOV.UK Notify

Posted Mar 17, 2020 - 16:13 GMT

Update

We are continuing to work on the issue. There is no update at this point; we will post more once we know more.

Posted Mar 17, 2020 - 15:01 GMT

Identified

The issue has been identified and a fix is being implemented.

Posted Mar 17, 2020 - 14:02 GMT

Monitoring

We've identified the cause and are working on restoring service. Intermittent errors continue. This is a recurrence of the issue that occurred last night. We will update further shortly as we learn more.

Posted Mar 17, 2020 - 13:59 GMT

Investigating

We are currently investigating this issue and will update here shortly once we know more.

Posted Mar 17, 2020 - 13:32 GMT

This incident affected: API, File uploads, Text message sending, Text message delivery receipts, Text message receiving, Email sending, and Email delivery receipts.