On November 17th, the Europe West Professional region suffered downtime from 08:23 to 09:05 UTC. Enterprise websites were unaffected.
On November 16th, we noticed slowness in the orchestration database and applied a configuration change. When the database was restarted, the data on one of the nodes was corrupted. This ultimately led to an outage of deployments (websites were not affected) that lasted until the database came back up after fetching the data from other nodes.
On November 17th, following the configuration change, we found that the orchestration layer was low on memory and not fully functional, so we allocated more computing resources to the service. While the service was restarting to pick up the additional resources, the gateway layer crashed. The gateway layer recovered automatically right afterwards, but it could not retrieve the containers' network configuration from the orchestration layer, which was still restarting. As a result, the connection between the gateway layer and the containers could not be established. The containers were running, but unreachable from the outside world.
Once the orchestration layer was fully recovered after the restart, the gateway layer picked up the network configuration and everything returned to normal.