RCA - Gcore Systems | Customer Portal, CDN, Cloud, Storage, DNS and Hosting
Summary: On 28-03-2024, a power outage at the ED datacenter disrupted several systems and services. The sudden loss of power halted normal operation of the services associated with the datacenter.
Timeline from ED Data Center:
[28.03.2024 08:30 UTC] - power from Feed A was turned off for planned upgrade/maintenance work, leaving only Feed B.
[28.03.2024 12:49 UTC] - smoke and a strong smell of burned material were detected in UPS room B.
[28.03.2024 12:57 UTC] - due to a high risk of fire or even explosion, the decision was taken to remove the load from the UPS by electrically isolating it (switching it to external bypass).
[28.03.2024 13:01 UTC] - generator DG.B1 was activated to take the entire feed load, which under these circumstances meant the entire site load.
[28.03.2024 13:30 UTC] - after the UPS B system was isolated, the 4 UPS modules were opened for a visual inspection. Several capacitors in UPS B modules 1 and 2 were found to be damaged.
[28.03.2024 13:46 UTC] - the temperature of the lubrication oil inside the generator increased to 97°C and then stabilized at 99-100°C.
[28.03.2024 14:32 UTC] - generator DG.B1 shut itself down because the oil temperature was too high (101-102°C).
Timeline for Gcore Services:
[28.03.2024 14:34 UTC]: The data center power outage caused Gcore services to shut down.
[28.03.2024 14:36-14:50 UTC]: Restoration of backup power initiated system restarts, including VMware storage and virtual machines. Multiple system failure alerts were triggered.
[28.03.2024 15:00-15:10 UTC]: Issues with physical Kubernetes worker nodes were addressed and resolved. Infrastructure access was re-established.
[28.03.2024 15:11 - 15:24 UTC]: Initiated infrastructure storage restoration. Identified failed storage nodes and requested manual reactivation.
[28.03.2024 15:20-15:33 UTC]: DNS API issues were reported and resolved by temporarily disabling DNSSEC.
[28.03.2024 15:47-16:00 UTC]: Repaired ClickHouse Analytics and DDoS clusters.
[28.03.2024 15:37-18:23 UTC]: Fixed several services including Hosting, Baremetal, S3, Streaming platform, and Cloud API.
[28.03.2024 16:02-17:14 UTC]: Identified and resolved issues with ClickHouse Rawlogs cluster and RabbitMQ Metrics.
[28.03.2024 16:20-17:40 UTC]: Completed replacement of network devices within storage nodes and initiated CEPH recovery.
[28.03.2024 17:40 - 19:05 UTC]: Completed recovery of the Cloud Control Panel.
[28.03.2024 18:00 - 19:00 UTC]: Initiated restoration of customer resources in Luxembourg-1 and Luxembourg-2.
[28.03.2024 19:00 UTC - 29.03.2024 15:04 UTC]: Continued restoration of customer resources.
[29.03.2024 15:09 - 16:48 UTC]: Initiated recovery of Object Storage nodes and completed CEPH synchronization. Completed service restoration.
Root Cause – ED Datacenter:
Root Cause - Gcore Services:
10. RabbitMQ Metrics DNS edit UI: An attempt to remove RabbitMQ cluster members from the DNS alias was unsuccessful due to a UI glitch. This led to sync errors and inconsistently updated DNS records.
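The inconsistency described in item 10 is the kind of state a simple comparison can surface: take the set of members the alias is supposed to contain and diff it against what each nameserver actually serves. The following is a minimal sketch only, not the tooling used during the incident. The function name, nameserver names, and addresses are all hypothetical, and the observed records are passed in as plain data rather than queried live.

```python
from typing import Dict, Set


def find_alias_drift(
    expected: Set[str], observed: Dict[str, Set[str]]
) -> Dict[str, Dict[str, Set[str]]]:
    """Return, per nameserver, records that are unexpectedly present or missing."""
    drift = {}
    for nameserver, records in observed.items():
        stale = records - expected    # members that should have been removed
        missing = expected - records  # members that should be present but are not
        if stale or missing:
            drift[nameserver] = {"stale": stale, "missing": missing}
    return drift


# Hypothetical data: two members were meant to be removed from the alias,
# but only one nameserver picked up the change.
expected_members = {"10.0.0.1"}
observed_records = {
    "ns1.example.net": {"10.0.0.1"},
    "ns2.example.net": {"10.0.0.1", "10.0.0.2", "10.0.0.3"},
}
drift = find_alias_drift(expected_members, observed_records)
```

An empty result means every nameserver agrees with the intended membership. Any entry names a server that still needs to be corrected.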
Action points: