Unexpected downtime on some projects in the AU region
Incident Report for Platform.sh
Postmortem

We encountered unexpected complications during our maintenance in the AU region on September 19th. We have since taken time to analyze the events and root causes, to understand what went wrong, and to trial new upgrade approaches in our testing region. Below we share a timeline of events and the root causes we identified. We apologize for the extended downtime you experienced, and we thank you for your patience throughout this process.

On September 19th at 14:25 UTC, Platform.sh began scheduled maintenance in the AU region. The goal was to enable cache tiering in the storage layer, an update that would significantly improve I/O performance across the region. This change had been tested in both our testing and internal regions with positive results, which is why we decided to move forward with the release to a public region.
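
For readers unfamiliar with the technique: a cache tier places a pool of fast storage in front of an existing persistent pool, and client I/O is transparently routed through the cache. The sketch below is illustrative only; it assumes a Ceph-style cluster and hypothetical pool names (cold-storage, hot-storage), since this report does not cover the exact tooling behind our storage layer.

    import subprocess

    def ceph(*args):
        """Run a ceph CLI command, raising if it fails."""
        subprocess.run(["ceph", *args], check=True)

    # Attach the fast pool as a writeback cache tier in front of the
    # existing persistent pool. Pool names here are hypothetical.
    ceph("osd", "tier", "add", "cold-storage", "hot-storage")
    ceph("osd", "tier", "cache-mode", "hot-storage", "writeback")

    # Route client I/O through the cache tier.
    ceph("osd", "tier", "set-overlay", "cold-storage", "hot-storage")

    # Bound the cache so flushing and eviction start well before the
    # fast instances run out of capacity.
    ceph("osd", "pool", "set", "hot-storage", "hit_set_type", "bloom")
    ceph("osd", "pool", "set", "hot-storage", "target_max_bytes", str(500 * 2**30))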

At 15:10 UTC, we executed the most critical part of the process: adding fast storage instances to the current persistent layer and promoting a subset of them to form the new cache tier. The process went smoothly until production traffic started hitting the empty cache storage. At that point the region experienced very high I/O due to the volume of traffic between the two layers. This drove up RAM consumption and resulted in multiple OOM (Out of Memory) errors on the cache instances, which had to be repeatedly rebooted. The reboots prevented data in the cache tier from being effectively persisted, which led some active environments into an inconsistent state. At 17:24 UTC, we decided to completely remove the new cache tier and roll back to the previous storage setup.
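
Removing a writeback cache tier safely requires draining it first, because dirty objects in the cache have not yet been written to the backing pool. Continuing the hypothetical Ceph-style sketch from above, a rollback might look roughly like this:

    import subprocess

    def run(*cmd):
        """Run a CLI command, raising if it fails."""
        subprocess.run(list(cmd), check=True)

    # Stop admitting new objects into the cache: proxy all reads and
    # writes straight through to the backing pool while it drains.
    run("ceph", "osd", "tier", "cache-mode", "hot-storage", "proxy")

    # Flush dirty objects to the backing pool and evict everything
    # from the cache tier.
    run("rados", "-p", "hot-storage", "cache-flush-evict-all")

    # Detach the cache tier and send client I/O back to the base pool.
    run("ceph", "osd", "tier", "remove-overlay", "cold-storage")
    run("ceph", "osd", "tier", "remove", "cold-storage", "hot-storage")

If cache instances are being OOM-killed mid-flush, dirty objects may never reach the backing pool, which is consistent with the inconsistent environments described above.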

During the rollback process, our Support team identified that some projects in the region had read-only filesystems and were not accessible. An incident was officially declared at 19:54 UTC and announced on our status page. Platform.sh started recovering all environments, giving highest priority to production environments and to customers with Enterprise support. The recovery involved a full database recovery and a backup-and-restore cycle for every project, which is why the process took several hours. At 01:37 UTC, we confirmed data loss on a small number of environments. Most of the affected data consisted of temporary database tables that could be regenerated. In those cases the tables were recreated and the customers duly notified.
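
To illustrate the prioritization described above (this is not actual Platform.sh tooling), a recovery queue of this kind can be expressed as a simple sort: production environments first, then Enterprise-supported projects, with everything else behind them.

    from dataclasses import dataclass

    @dataclass
    class Project:
        id: str
        is_production: bool
        enterprise_support: bool

    def recovery_order(projects):
        """Production first, then Enterprise support; False sorts before True."""
        return sorted(
            projects,
            key=lambda p: (not p.is_production, not p.enterprise_support),
        )

    # Hypothetical example: the production project is restored first.
    queue = recovery_order([
        Project("abc123", is_production=False, enterprise_support=True),
        Project("def456", is_production=True, enterprise_support=False),
    ])
    for project in queue:
        # A real restore step would run the full backup-and-restore
        # cycle described above; this placeholder only logs the order.
        print("restoring", project.id)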

The incident was officially resolved on September 20th at 14:34 UTC. In total, 32% of projects in the region were affected.

Once again, we deeply regret the trouble this outage caused our customers, and we thank you for your continued partnership and trust.

Posted Oct 03, 2019 - 16:14 UTC

Resolved
We apologize for the disruptions experienced during this event. At this time, all affected environments have been verified as recovered.

Should you see any issues, please notify us by opening a ticket or joining us in the public Slack channel.

We will continue to investigate the cause of this interruption to prevent it from happening in the future, and will provide updates as they become available.
Posted Sep 20, 2019 - 14:34 UTC
Monitoring
We apologize for the time it is taking to fully recover all environments. We are individually verifying each environment that has been identified.

If your environment has not yet been recovered, or you are seeing issues, please do not hesitate to open a ticket. As always, you can join us in our public Slack channel.

We are continuing to resolve any remaining issues, and will update as additional information becomes available.
Posted Sep 20, 2019 - 12:35 UTC
Update
A small number of production environments in this region are still affected and are currently the focus of our recovery efforts.
Posted Sep 20, 2019 - 07:12 UTC
Update
We are continuing to recover projects, prioritizing production environments.
Posted Sep 20, 2019 - 05:13 UTC
Update
We are continuing to recover projects, prioritizing production environments.
Posted Sep 20, 2019 - 03:40 UTC
Update
We are continuing to recover projects, prioritizing production environments.
Posted Sep 20, 2019 - 01:31 UTC
Update
We are continuing to recover projects, prioritizing production environments.
Posted Sep 20, 2019 - 00:27 UTC
Update
The operations team is currently working to restore affected projects to an operational state.
Posted Sep 19, 2019 - 23:27 UTC
Update
We've had a problem with our overnight maintenance. The underlying storage problem has been addressed, and the operations team is working on recovering any affected projects.
Posted Sep 19, 2019 - 21:54 UTC
Update
The storage layer has been stabilized, and we're working on recovering affected environments.
Posted Sep 19, 2019 - 21:42 UTC
Update
Our operations team is continuing to work on this issue.
Posted Sep 19, 2019 - 20:59 UTC
Identified
Some projects are experiencing unexpected downtime during the maintenance window in our AU region. Our operations team has identified the issue and is actively working on it.
Posted Sep 19, 2019 - 19:59 UTC
This incident affected: Australia (au.platform.sh).