Incident report
Incident type: Service Disruption in DE-1 region
Date of incident: February 12, 2018
Summary
At 08:05 (UTC), Platform.sh monitoring detected slow deployments in the DE-1 region, which is hosted on Microsoft Azure infrastructure. We began monitoring the situation more closely and found that deployments were starting to take an inordinate amount of time.
The identified root cause was the orchestration database (ZooKeeper) running out of memory and slowing down. Fixing the issue required a global reboot of all grid hosts, which took longer than usual to recover due to a bug in a custom component of our coordination software. The hosts were rebooted at 11:05 (UTC), which resulted in downtime for all projects hosted in the region. Automatic recovery kicked in correctly, and all projects came back up automatically with no data corruption. By 12:25 (UTC) most projects had recovered, though some took longer. The last development projects recovered at 17:00 (UTC).
The incident affected our professional offering; no triple-redundant Enterprise clusters were affected.
Mitigation
We have added monitoring of the internal Java memory usage so that this issue is detected before it can recur. Additionally, we are expediting the upgrade of our coordination software to a version not affected by the aforementioned component bug.
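As a rough illustration of the kind of memory check described above, the following minimal sketch classifies a Java heap reading against warning and critical thresholds. It is not our actual monitoring code: the function name, the threshold values, and the assumption that used and maximum heap sizes are already available in bytes are all hypothetical.

```python
def heap_alert_level(used_bytes, max_bytes, warn_ratio=0.80, crit_ratio=0.90):
    """Classify a heap reading as "ok", "warning", or "critical".

    Hypothetical helper: the thresholds are illustrative defaults,
    not values taken from the incident.
    """
    if max_bytes <= 0:
        raise ValueError("max_bytes must be positive")
    ratio = used_bytes / max_bytes  # fraction of the heap currently in use
    if ratio >= crit_ratio:
        return "critical"
    if ratio >= warn_ratio:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # Example reading: 3.7 GiB used out of a 4 GiB heap.
    gib = 1024 ** 3
    print(heap_alert_level(int(3.7 * gib), 4 * gib))
```

Alerting at a warning level well below exhaustion is meant to leave time to act before the process slows down.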