Degraded Performance of Console and Deployment services in the AU and AU-2 regions

Incident Report for Platform.sh

Postmortem

What happened

A recent upgrade to the AU.platform.sh and AU-2.platform.sh regions failed to fully purge stale network state from our routing infrastructure.

As a result, client programs in the subsystems that handle deployments, the console, and the project API entered busy loops, retrying invalid TCP connections indefinitely and consuming all available CPU resources.
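For illustration only, the general safeguard against this failure mode is to bound retries and to wait between attempts, so that an unreachable endpoint costs idle time rather than CPU. The sketch below is a minimal example under stated assumptions: it is not the code used in the affected subsystems or in the emergency fix, and `connect_with_backoff`, its parameters, and the injected `connect`, `sleep`, and `jitter` callables are hypothetical names chosen for the example.

```python
import random
import time


def connect_with_backoff(connect, max_attempts=5, base_delay=0.5,
                         max_delay=30.0, sleep=time.sleep,
                         jitter=random.random):
    """Call connect() until it succeeds or max_attempts is exhausted.

    All names here are hypothetical. `connect` is any zero-argument
    callable that raises OSError (for example ConnectionRefusedError)
    when the connection cannot be established.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    for attempt in range(max_attempts):
        try:
            return connect()
        except OSError as exc:
            if attempt == max_attempts - 1:
                # Bounded retries: give up instead of looping forever.
                raise ConnectionError(
                    f"gave up after {max_attempts} attempts"
                ) from exc
            # Capped exponential backoff: base, 2x base, 4x base, ...
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Jitter scales the wait to between 50% and 100% of the delay
            # so that many clients do not retry in lockstep.
            sleep(delay * (0.5 + 0.5 * jitter()))
```

A caller might wrap a real connection attempt as `connect_with_backoff(lambda: socket.create_connection((host, port), timeout=5))`, where `host` and `port` are placeholders. Passing `sleep` and `jitter` as arguments keeps the retry policy testable without real waiting.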

This caused the console to freeze, project API functions to load slowly, and deployments and backups to be delayed.

After further investigation, the Operations team deployed an emergency fix to eliminate those busy-looping processes.

Our team has stopped the scheduled upgrades to other regions and is still working on a permanent fix.

Customer impact

From 2025-03-20 12:50 UTC to 2025-03-21 09:00 UTC, customers may have experienced delays when loading the console, triggering deployments, or taking backups for their projects in the AU.platform.sh and AU-2.platform.sh regions.

From 2025-03-21 07:27 UTC to 2025-03-21 07:55 UTC, there may have been issues with outgoing connections in AU.platform.sh environments, including live ones, caused by the deployment of our emergency fix.

What was done to resolve the incident

The Operations team terminated the busy-looping processes in the subsystems for deployments, the console, and project API functions, and deployed an emergency fix to the affected regions.

The incident has been resolved.

Posted Mar 23, 2025 - 08:34 UTC

Resolved

This incident has been resolved.
Posted Mar 23, 2025 - 05:06 UTC

Monitoring

After applying an emergency fix, we can see that the services are now operational for both AU and AU-2 regions.

Our Operations team is still actively monitoring the services and implementing a permanent fix for this issue.
Posted Mar 21, 2025 - 09:10 UTC

Update

AU is now fully recovered.

Our engineers are still implementing a permanent fix for this issue.
Posted Mar 21, 2025 - 08:51 UTC

Update

The outgoing connection issues have been corrected after deploying an emergency fix.
Posted Mar 21, 2025 - 06:39 UTC

Update

We are continuing to work on a fix for this issue.
Posted Mar 21, 2025 - 06:35 UTC

Update

We noticed that some customer environments are failing to make outgoing connections.
Our engineers are still actively working on the recovery.
Posted Mar 21, 2025 - 06:28 UTC

Update

AU-2 is now fully recovered.
Posted Mar 21, 2025 - 05:23 UTC

Identified

We have detected an issue affecting service in the AU and AU-2 regions. Our Operations team has been notified and is currently working to resolve the issue. Projects in affected regions may experience delays when loading the console, triggering deployments, or taking backups.
This issue does not affect the availability of any environments, only the API services for the project. There is no site downtime as a result of this issue.
Posted Mar 21, 2025 - 04:55 UTC
This incident affected: Australia (au.platform.sh) and Australia East (au-2.platform.sh).