Overview
On April 15 at 07:15 UTC, we declared an incident in the us-2 region. Multiple projects became inaccessible through the Console UI, and project environments were unable to serve traffic.
What Happened
At 05:25 UTC, our internal monitoring detected a spike in disk activity on our storage layer, which resulted in stuck processes. We immediately began investigating and started remediating the underlying cause.
Resolution
At approximately 08:11 UTC, to bring our storage service back to normal operating levels, we began replacing existing host machines in the region with newly provisioned hosts. Not all host machines were affected; we refreshed the region out of an abundance of caution to restore stability. This operation concluded at 09:14 UTC.
We identified the root cause as a bug in our storage layer. We deployed a patch to fix it and completed the rollout to all our regions on April 19.
Impact
The incident lasted a total of 4 hours, with the longest detected project outage lasting 2 hours and 53 minutes. Most projects recovered in under 1 hour.
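As a rough cross-check of the timeline, the timestamps in this report can be turned into durations. The sketch below is illustrative only: the variable names and the `minutes_between` helper are hypothetical, and it assumes the incident window runs from first detection (05:25 UTC) to the end of the host rotation (09:14 UTC), which this report does not state explicitly.

```python
from datetime import datetime

FMT = "%H:%M"

# Timestamps taken from this report (all UTC, April 15).
detected = datetime.strptime("05:25", FMT)        # monitoring detected the disk activity spike
declared = datetime.strptime("07:15", FMT)        # incident declared
rotation_start = datetime.strptime("08:11", FMT)  # host rotation began (approximate)
rotation_end = datetime.strptime("09:14", FMT)    # host rotation concluded


def minutes_between(start: datetime, end: datetime) -> int:
    """Return the whole number of minutes from start to end."""
    return int((end - start).total_seconds() // 60)


def as_hours_minutes(total_minutes: int) -> str:
    """Format a minute count as 'Hh MMm'."""
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes:02d}m"


# Assumption: the incident window is detection -> end of rotation.
incident_window = minutes_between(detected, rotation_end)
detection_to_declaration = minutes_between(detected, declared)
rotation_duration = minutes_between(rotation_start, rotation_end)

print("Detection to end of rotation:", as_hours_minutes(incident_window))
print("Detection to declaration:   ", as_hours_minutes(detection_to_declaration))
print("Host rotation:              ", as_hours_minutes(rotation_duration))
```

Under that assumption the window comes to 3 hours 49 minutes, which is consistent with the roughly 4 hours reported above.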