OVH FR2 storage issue causing outages and lack of access to the console

Incident Report for Platform.sh

Postmortem

On Friday, April 9, 2021, at approximately 05:48 UTC, production and development environments began serving 502 errors and access to project consoles became unavailable. At the same time, our internal monitoring systems began alerting Platform.sh personnel to health errors on the disks that make up the storage underlying the region.

What went wrong, and what actions were taken?

Following the alerts, our operations team identified several unresponsive components in the region's storage layer. The storage architecture has redundancy built in, but in this case a number of elements were affected simultaneously. Our operations team added more resources and restarted services to alleviate the issue.

What will we do better?

Beyond the extra storage provisioned as part of the incident response, we are increasing the sensitivity of our internal monitoring so that issues similar to the one that caused this outage are detected before they reach a critical stage.

Posted Apr 14, 2021 - 14:52 UTC

Resolved

This incident has been resolved.

Posted Apr 09, 2021 - 12:14 UTC

Monitoring

All affected environments are back online and we are currently monitoring.

Posted Apr 09, 2021 - 11:07 UTC

Investigating

Our operations team are currently recovering the affected environments, as the underlying issue has been resolved and is being monitored.

Posted Apr 09, 2021 - 10:23 UTC

Update

We are continuing to work on a fix for this issue.

Posted Apr 09, 2021 - 08:51 UTC

Identified

The issue has been identified and a fix is being implemented.

Posted Apr 09, 2021 - 07:59 UTC