Outages in Germany and Netherlands regions

Incident Report for Platform.sh

Postmortem

We would like to give you more detailed information regarding the service outages that occurred in two large Platform.sh regions, Germany and Netherlands, on the evening of August 7th, 2017. These outages happened during the planned maintenance windows for those regions, and the maintenance we performed directly caused them. Security-related updates are performed as soon as they are available, once our security team has reviewed them and concluded how they might affect our systems or our customers. Regular package upgrades are left to be performed during these maintenance windows.

The latest upgrade of one of our internal software packages changed the Systemd unit of our network configuration monitor and restarted it. That change by itself would not have been sufficient to cause any issue had it not been coupled with an additional, more sinister and hidden one. The Ceph monitor unit has an indirect requirement on the Systemd unit that restarted, and therefore it restarted automatically as well. At that point the real issue surfaced. Our team observed storage issues in both regions and discovered that the upstream Ceph package provided unit files that include options available only in the newest Systemd versions, without properly signaling that in the package requirements. As a result, the Ceph monitor daemons refused to start, which caused a lot of storage-related issues in the affected regions.
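The kind of mismatch described above (a unit file using options that the installed Systemd does not know) can be caught before a restart. The following is a minimal sketch in Python, not the check we actually run: the function name, the sample unit, the allowlist, and the option `HypotheticalNewOption` are all assumptions made for illustration, and the real set of supported options depends on the Systemd version in use.

```python
def unsupported_directives(unit_text, supported):
    """Return (section, key) pairs whose key is not in supported[section].

    Simplification: backslash line continuations are not handled.
    """
    found = []
    section = None
    for raw in unit_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments ('#' or ';' at line start).
        if not line or line.startswith(("#", ";")):
            continue
        # Section header such as [Unit] or [Service].
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        if "=" in line and section is not None:
            key = line.split("=", 1)[0].strip()
            if key not in supported.get(section, set()):
                found.append((section, key))
    return found


# Hypothetical unit file; HypotheticalNewOption stands in for an option
# that only a newer Systemd would understand.
SAMPLE_UNIT = """\
[Unit]
Description=Example daemon
After=network.target

[Service]
ExecStart=/usr/bin/example-daemon
Restart=on-failure
HypotheticalNewOption=yes
"""

# Hypothetical allowlist for the installed Systemd version (an assumption).
SUPPORTED = {
    "Unit": {"Description", "After"},
    "Service": {"ExecStart", "Restart"},
}

problems = unsupported_directives(SAMPLE_UNIT, SUPPORTED)
for section, key in problems:
    print(f"unsupported option in [{section}]: {key}")
```

Running such a check against packaged unit files before an upgrade would flag the unit for review instead of letting the daemon fail at restart time.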

As soon as our team identified the actual issue, we quickly deployed a temporary fix and proceeded to recover the regions. Some services had accumulated a backlog of changes, which further slowed down recovery. After all the systems were up and running, we performed additional checks to make sure each and every environment was operational and to confirm that no data was corrupted or lost.

We are making several changes as a result of this operational event. Although deploying new features and fixing the bugs our customers help us find is a key operational practice at Platform.sh, it should not be done simultaneously in several regions. Therefore, we are making maintenance compartmentalization a requirement, and the experience from one maintenance will be carried into the next one. We have also put in place a more controlled way of managing how units start for these services, so further upstream changes will no longer have the power to affect running systems. Most importantly, we are giving the highest priority to refactoring how we perform upgrades during maintenance windows; you will be able to read more about that in the near future.
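To illustrate what compartmentalized maintenance could look like, here is a minimal sketch in Python, under the assumption that regions are upgraded one at a time and that a failed health check stops the rollout before the next region is touched. The names `run_maintenance`, `upgrade`, and `is_healthy`, as well as the region labels, are hypothetical and do not describe our actual tooling.

```python
def run_maintenance(regions, upgrade, is_healthy):
    """Upgrade regions sequentially; stop at the first unhealthy region.

    Returns (completed_regions, failed_region), where failed_region is
    None if every region passed its health check.
    """
    completed = []
    for region in regions:
        upgrade(region)
        if not is_healthy(region):
            # Halt here so the remaining regions are left untouched.
            return completed, region
        completed.append(region)
    return completed, None


# Hypothetical usage: the second region fails its health check.
upgraded = []

def fake_upgrade(region):
    upgraded.append(region)

def fake_health_check(region):
    return region != "region-b"

result = run_maintenance(
    ["region-a", "region-b", "region-c"], fake_upgrade, fake_health_check
)
print(result)
```

The point of the design is that a problem found in one region never reaches the regions that follow it in the sequence.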

Finally, we apologize for any negative consequences this incident may have caused our customers. We learned a lot from this incident and we will do everything we can to improve our stability even further.

Posted Aug 08, 2017 - 14:20 UTC

Resolved

During the last maintenance window (on 2017-08-07, 19-21h UTC), the Germany and Netherlands regions experienced site outages that ranged from 30 minutes for some sites to a full 3 hours of unavailability for others.

We have resolved the issues that caused this and will provide a full post-mortem of the incident shortly.

Posted Aug 08, 2017 - 12:52 UTC