On Wednesday 5 December 2018 between 5:06pm and 6:11pm most users couldn’t connect to the GOV.UK Notify API or use the web interface.
What happened
A developer ran the ‘vacuum’ command on the database table which stores an audit of every message that Notify has sent. Vacuum frees up unused space, but needs an ‘exclusive lock’ on the table. An exclusive lock means that nothing can add new rows to the table until the command finishes.
Because this table has over 200 million rows, the command ran very slowly. While it was running, no new messages could be sent, because sending a message means adding a row to the audit table. Each attempt to add a new row to the table opened a new database connection and waited for 30 seconds before giving up. As a result, the number of open connections rose until it reached the limit.
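To illustrate why the connections ran out so quickly, here is a rough sketch of the arithmetic. The 30-second wait comes from the description above. The request rate and the connection limit are made-up numbers for illustration, not Notify's real figures, and the function names are hypothetical.

```python
def connections_in_use(requests_per_second, wait_seconds, elapsed_seconds):
    """Estimate how many blocked connections are open at a given moment.

    Assumes every attempt to add a row holds its connection for the
    full wait before giving up, as happened while the table was locked.
    """
    # Only attempts started within the last `wait_seconds` still hold a connection.
    window = min(elapsed_seconds, wait_seconds)
    return requests_per_second * window


def seconds_until_limit(requests_per_second, wait_seconds, connection_limit):
    """Return how long it takes to reach the limit, or None if it is never reached."""
    if requests_per_second <= 0:
        return None
    # The most connections that can ever be waiting at once.
    peak = requests_per_second * wait_seconds
    if peak < connection_limit:
        return None
    return connection_limit / requests_per_second


# Illustrative numbers only: 10 messages a second and a limit of 100 connections.
print(seconds_until_limit(10, 30, 100))  # 10.0
```

With these example numbers the limit is reached in 10 seconds, well within the 30 seconds each attempt waits. A shorter wait or a lower request rate would keep the peak below the limit.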
The developer ran the vacuum command because they thought they were logged in to the staging database. Staging is a separate environment where we deploy changes to test that they work before we put them live. They were running the vacuum command to reduce the size of the database, to make it easier to move it to a smaller, cheaper server. Running the vacuum command on the staging database wouldn’t have affected live traffic.
How we responded
5:06pm: We were alerted to ‘QueuePool limit’ errors, meaning our applications were having trouble connecting to the database.
5:31pm: We restarted our applications, which helped for a short time because it freed up some database connections.
5:46pm: We reverted the most recent change to our code, in case it had caused the problem. This made no difference.
6:11pm: We restarted the database. This stopped the vacuum operation, resolving the issue.
Steps taken to prevent this happening again
We are working on:
• Making the permission to run vacuum (and other commands that could degrade performance) off by default on the live database
• Updating our tools to make it clearer which database a developer is connected to
• Getting direct access to the database logs (at the moment we have to speak to the GOV.UK Platform as a Service team)
• A better way of storing our audit data than in one large table
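As an illustration of the tooling item in the list above, here is a minimal sketch of a check that labels which database a connection string points at and refuses risky commands on live. It is not Notify's actual tooling. The hostnames, the list of risky commands and the function names are all hypothetical, and a check like this in a developer's tool would sit alongside permissions set in the database itself, not replace them.

```python
from urllib.parse import urlsplit

# Hypothetical list of commands that could degrade performance on a busy table.
RISKY_COMMANDS = ("VACUUM", "REINDEX", "CLUSTER")


def environment_label(database_url, live_hosts):
    """Return a label a tool could show in its prompt for this connection."""
    host = urlsplit(database_url).hostname
    return "LIVE" if host in live_hosts else "non-live"


def check_command(sql, database_url, live_hosts):
    """Raise PermissionError before a risky command is sent to a live database."""
    label = environment_label(database_url, live_hosts)
    # Look only at the first word of the statement, ignoring a trailing semicolon.
    words = sql.strip().rstrip(";").split(None, 1)
    first_word = words[0].upper() if words else ""
    if label == "LIVE" and first_word in RISKY_COMMANDS:
        raise PermissionError(f"{first_word} is switched off on the live database")
    return label


# Hypothetical hostnames, for illustration only.
LIVE_HOSTS = {"live-db.example.internal"}
staging_url = "postgresql://dev@staging-db.example.internal:5432/notify"
print(check_command("VACUUM FULL audit_table;", staging_url, LIVE_HOSTS))  # non-live
```

The same command against a URL whose host is in `LIVE_HOSTS` raises `PermissionError` instead of returning a label.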
We know that Notify’s reliability is its most important feature. This problem caused the longest downtime we’ve had in 2½ years of operation. We take any interruption to our service seriously, and we’re working on the above mitigations now. We’re really sorry for the disruption this has caused to you and your users. If you have any concerns please get in touch with us through our support page.
Chris Hill-Scott
GOV.UK Notify team