Intermittent issues with Console, API and CLI
Incident Report for Platform.sh
Postmortem

Incident Summary

Our backend API response became unstable, rendering the Console and CLI unavailable for users. This instability disrupted services and impacted user experience.

What Happened

Our backend API experienced a combination of performance issues, including API gateway instability, Fastly performance degradation, increased incoming requests, and backend performance issues. These factors culminated in the disruption of API services, leading to the unavailability of the Console and CLI for users.

Customer Impact

The incident resulted in service disruptions for our users, preventing them from accessing critical functionalities through the Console and CLI.

Incident Resolution Steps

To address the incident and restore stability to the API, the following steps were taken:

  1. Temporary Mitigation: We promptly evacuated the API gateways from the miss behaving hosts as a temporary measure to mitigate the instability and restore basic functionality to the API.
  2. Monitoring and Analysis: Following the mitigation, we closely monitored the API performance to ensure stability and analyzed the impact of the incident on user experience.
  3. Root Cause Analysis: Subsequent to stabilizing the API, we conducted a comprehensive root cause analysis to identify the underlying factors contributing to the incident, including API gateway performance issues, Fastly performance degradation, increased incoming requests, and backend performance issues.
  4. Long-Term Remediation: Based on the root cause analysis, we documented long-term remediation, to improve overall system resilience and reliability.

Identified Root Cause and Impacts

The root cause of the incident is multifaceted, involving various components of our infrastructure:

  • API gateway performance issues
  • Fastly performance degradation
  • Increased incoming requests
  • Backend performance issues

These factors combined to create instability within the API, resulting in service disruptions and user inconvenience.

Posted Mar 25, 2024 - 09:55 UTC

Resolved
This incident has been resolved.
Posted Mar 25, 2024 - 08:16 UTC
Update
The underlying API performance is back within the expected threshold. The affected components are working as expected.
We are monitoring the availability of the API to ensure the performance stays within the expected thresholds.
Posted Mar 22, 2024 - 11:54 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Mar 21, 2024 - 08:32 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Mar 20, 2024 - 11:27 UTC
Investigating
We have detected an issue affecting the intermittent access to Console UI, API, and CLI on the UK 1 region.

Live sites are not impacted. We are actively working to restore service.
Posted Mar 20, 2024 - 11:21 UTC
This incident affected: United Kingdom (uk-1.platform.sh).