EU.platform.sh: issues with new deployments

Incident Report for Platform.sh

Postmortem

On November 17th, the Europe West Professional region suffered downtime from 08:23 to 09:05 UTC. Enterprise websites were unaffected.

On November 16th, after noticing slowness in the orchestration database, we applied a configuration change. When the database was restarted, the data on one of the nodes was corrupted. This led to an outage of deployments (websites were not affected) until the database came back up after fetching the data from the other nodes.

On November 17th, following the configuration change, we found that the orchestration layer was low on memory and not fully functional, so we allocated more computing resources to the service. While the service was restarting to pick up the additional resources, the gateway layer crashed. The gateway layer recovered automatically right afterwards, but it could not retrieve the containers' network configuration from the orchestration layer, which was still restarting. As a result, the connection between the gateway layer and the containers could not be established. The containers were running, but unreachable from the outside world.

Once the orchestration layer was fully recovered after the restart, the gateway layer picked up the network configuration and everything returned to normal.
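
The recovery behaviour described above, where the gateway only becomes useful again once the orchestration layer can answer, can be illustrated with a small retry loop. This is a minimal sketch and not our actual gateway code: the names `OrchestratorUnavailable`, `wait_for_network_config` and `make_restarting_orchestrator`, the container name, and the IP address are all hypothetical and exist only for illustration.

```python
import time
from typing import Callable, Dict, Optional


class OrchestratorUnavailable(Exception):
    """Hypothetical error: the orchestration layer cannot answer (e.g. restarting)."""


def wait_for_network_config(
    fetch: Callable[[], Dict[str, str]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[Dict[str, str]]:
    """Retry `fetch` with exponential backoff until it succeeds.

    Returns the container network configuration, or None if the
    orchestration layer never answered within `max_attempts` tries.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except OrchestratorUnavailable:
            if attempt == max_attempts:
                return None
            sleep(delay)  # wait before asking again
            delay *= 2    # back off: 1s, 2s, 4s, ...
    return None


def make_restarting_orchestrator(
    failures: int, config: Dict[str, str]
) -> Callable[[], Dict[str, str]]:
    """Hypothetical stand-in for an orchestrator that fails `failures` times, then answers."""
    remaining = {"count": failures}

    def fetch() -> Dict[str, str]:
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise OrchestratorUnavailable("orchestration layer is restarting")
        return config

    return fetch
```

With `make_restarting_orchestrator(2, {"app-1": "10.0.0.5"})`, the first two calls fail and the third returns the configuration, which mirrors the window in which the containers were running but unreachable.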

Posted Nov 20, 2017 - 08:52 UTC

Resolved

This incident has been resolved.

Posted Nov 17, 2017 - 12:54 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Nov 17, 2017 - 09:06 UTC

Update

The outstanding issue in the orchestration software is causing connectivity issues to containers running on all hosts. We're sorry for this critical service interruption.

Posted Nov 17, 2017 - 07:34 UTC

Identified

The operations team has identified issues in the orchestration software. New deployments in the region are currently blocked.

Posted Nov 17, 2017 - 07:34 UTC