Outage on us-2.platform.sh

Incident Report for Platform.sh

Postmortem

Overview

On April 15, 2021, at 07:15 UTC, we declared an incident in the us-2 region. Multiple projects became inaccessible through the Console UI, and project environments were unable to serve traffic.

What Happened

At 05:25 UTC, our internal monitoring detected a spike in disk activity on our storage layer, which resulted in stuck processes. We immediately began investigating and started remediating the underlying cause.

Resolution

At approximately 08:11 UTC, to bring our storage service back to normal operating levels, we began replacing the existing host machines in the region with newly provisioned hosts. Not all host machines were affected; we replaced them across the region out of an abundance of caution, to restore stability. This operation concluded at 09:14 UTC.

We identified the root cause as a bug in our storage layer. We deployed a patch to fix it and finished rolling it out to all of our regions on April 19.

Impact

The incident lasted approximately 4 hours in total, and the longest detected project outage lasted 2 hours and 53 minutes. Most projects recovered in under 1 hour.

Posted Apr 21, 2021 - 23:53 UTC

Resolved

This incident has been resolved.

Posted Apr 15, 2021 - 09:48 UTC

Monitoring

The corrective actions are complete and most projects have recovered. We're monitoring the issue closely to ensure everything is working as expected.

Posted Apr 15, 2021 - 09:03 UTC

Update

The corrective actions are still ongoing. Some projects have recovered and are working as expected.

Posted Apr 15, 2021 - 08:43 UTC

Update

The corrective actions to restore service availability are still in progress.

Posted Apr 15, 2021 - 07:58 UTC

Identified

We have detected an issue affecting service in the us-2.platform.sh region. We are currently working to restore service.
This issue affects multiple production sites as well as development environments. Access to your site and project UI, as well as Git and SSH access, may be affected.
This outage does not affect Dedicated Enterprise Clusters.
Our operations and engineering teams are currently working to resolve this outage.

Posted Apr 15, 2021 - 07:15 UTC

This incident affected: USA-2 (East 2) (us-2.platform.sh).