As part of last night's scheduled maintenance, security updates were rolled out in the EU region. System reboots were necessary to complete the installation of these updates and were scheduled to begin at 19:00 UTC. At approximately 19:30 UTC, the operations team detected a problem with the gateway servers, which resulted in all projects being offline for several hours. The containers hosting the projects were unaffected but unreachable, so no data was lost during the downtime.
Engineering was brought in to investigate and correct the problem. The root cause was not immediately apparent and took several hours to identify. Once it was identified, the issue was resolved shortly afterwards by patching an open source project that is part of our infrastructure. Gateway servers were then brought back online and projects became accessible again.
Production sites on Enterprise infrastructure were not affected by this incident.
To prevent similar outages in the future, we have identified new metrics around resource allocation that can be monitored, and new monitoring is now in place. We will also contribute the patch to the library where the bug was identified so that the fix is available upstream.