On Friday, April 9, 2021 at approximately 05:48 UTC production and development environments began serving 502 errors and access to project consoles became unavailable. At the same time our internal monitoring systems began alerting Platform.sh personnel of health errors on the disks that make up the storage underlying the region.
What went wrong, and what actions were taken?
Following the alerts our operations team identified several unresponsive components in the region's storage layer. The storage architecture has redundancy built in, but in this case a number of elements were simultaneously affected. Our operations team added more resources and restarted services to alleviate the issue.
What will we do better?
In addition to the additional storage provisioned as part of the incident response, we are increasing the sensitivity of our internal monitoring to detect issues similar to the one that caused this outage before they reach a critical stage.