de-1.platform.sh: issues with deployments

Incident Report for Platform.sh

Postmortem

Incident report

Incident type: Service Disruption in DE-1 region

Date of incident: February 12, 2018

Summary

At 08:05 (UTC), Platform.sh monitoring detected slow deployments in the DE-1 region, which is hosted on Microsoft Azure infrastructure. We began monitoring the situation more closely and found that deployments were starting to take an inordinate amount of time.

The identified root cause was the orchestration database (ZooKeeper) running out of memory and slowing down. Fixing the issue required a global reboot of all grid hosts, which took longer than usual to recover because of a bug in a custom component of our coordination software. The hosts were rebooted at 11:05 (UTC), which resulted in downtime for all projects hosted in the region. Automatic recovery kicked in correctly and all projects came back up with no data corruption. By 12:25 (UTC) most projects had recovered, with some projects taking longer. The last development projects recovered at 17:00 (UTC).

The incident affected our professional offering; no triple-redundant Enterprise clusters were affected.

Mitigation

We have added monitoring of internal Java memory usage to help ensure this issue does not recur. Additionally, we are expediting the upgrade of our coordination software to a version not affected by the aforementioned component bug.
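
As an illustration only, the kind of Java memory check described above could be sketched as follows. This is not the monitoring Platform.sh deployed: the `HeapSample` type, the `heap_status` function, and the 75% / 90% thresholds are all hypothetical, and the sketch assumes heap figures have already been collected from the JVM by some other means (for example over JMX).

```python
from dataclasses import dataclass


@dataclass
class HeapSample:
    """One heap reading from a JVM process (hypothetical structure)."""
    used_bytes: int
    max_bytes: int


def heap_status(sample: HeapSample,
                warn_ratio: float = 0.75,
                crit_ratio: float = 0.90) -> str:
    """Classify a heap reading as 'ok', 'warning', 'critical' or 'unknown'.

    The thresholds are illustrative assumptions, not values from the report.
    """
    # A JVM can report an undefined maximum; no ratio can be computed then.
    if sample.max_bytes <= 0:
        return "unknown"
    ratio = sample.used_bytes / sample.max_bytes
    if ratio >= crit_ratio:
        return "critical"
    if ratio >= warn_ratio:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # Example reading: 1.9 GiB used out of a 2 GiB heap.
    reading = HeapSample(used_bytes=1900 * 1024**2, max_bytes=2048 * 1024**2)
    print(heap_status(reading))
```

Alerting at a warning level well below exhaustion is meant to leave time to act before the process slows down, which is the failure mode described in the summary.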

Posted Feb 21, 2018 - 15:04 UTC

Resolved

This incident has been resolved.

Posted Feb 12, 2018 - 21:42 UTC

Investigating

The last update did not fix the performance issue. The investigation is still ongoing.

Posted Feb 12, 2018 - 11:48 UTC

Monitoring

The orchestration layer requires more computing resources to function normally. The operations team has applied changes to the configuration.

Posted Feb 12, 2018 - 10:35 UTC

Update

The team is still looking for the root cause of the performance issue in the coordination layer. We cannot offer an ETA until the cause is identified.

Posted Feb 12, 2018 - 10:04 UTC

Investigating

The operations team has confirmed slowness in deployments and in the orchestration system.
They are looking into the cause of the performance issue.

Posted Feb 12, 2018 - 09:04 UTC