EU-2 Orchestration system

Incident Report for Platform.sh

Postmortem

This incident was caused by resource contention in our orchestration layer. One of the components had a memory leak which resulted in a host using large amounts of swap which put additional stress on the other services run on the host. As a result, several hosts rebooted as they lost their connection to the orchestration layer. This incident affected one third of the EU-2 region hosts.

Fixes deployed during maintenance on Wednesday November 21st:

Split orchestration components on dedicated hosts
Fix for the memory leak in the coordinator process.
Fix for a bug in an upstream library used in the orchestration software.
Fix for a networking bug preventing project recovery after reallocation.

‌

We are continuing to monitor the situation closely to ensure the issue does not present again.

Posted Nov 22, 2018 - 19:34 UTC

Resolved

This incident has been resolved.

Posted Nov 20, 2018 - 03:16 UTC

Update

We are continuing to monitor services in the EU-2 region. Please open a support ticket if you are still experiencing troubles.

Posted Nov 20, 2018 - 02:48 UTC

Monitoring

All services have been restored. If your site is not back up, please redeploy or create a support ticket.

Posted Nov 20, 2018 - 02:11 UTC

Identified

Services are recovering

Posted Nov 20, 2018 - 01:55 UTC

Investigating

We are currently investigating an outage with the Orchestration system on Platform EU-2 region.
Live and dev sites are affected.

Posted Nov 20, 2018 - 01:45 UTC

This incident affected: Europe (West 2) (eu-2.platform.sh).