Gcore Systems | Customer Portal, CDN, Cloud, Storage, DNS and Hosting incident
Incident Report for Gcore
Postmortem

RCA - Gcore Systems | Customer Portal, CDN, Cloud, Storage, DNS and Hosting

Summary: On 28-03-2024, a power outage at the ED datacenter caused a critical incident, disrupting several systems and services. The sudden interruption halted the normal operations of the services hosted at that datacenter.

Timeline from ED Data Center:

[28.03.2024 08:30 UTC] - Power from Feed A was turned off for planned upgrade/maintenance works, leaving only Feed B in operation.

[28.03.2024 12:49 UTC] - Smoke and a strong smell of burned material were detected in UPS room B.

[28.03.2024 12:57 UTC] – Due to a high risk of fire or even explosion, the decision was taken to remove the load from the UPS by electrically isolating it (switching it to external bypass).

[28.03.2024 13:01 UTC] – Generator DG.B1 was started to take the entire feed load, which under these circumstances meant the entire site load.

[28.03.2024 13:30 UTC] – After isolating the UPS B system, the four UPS modules were opened for visual inspection. In UPS B modules 1 and 2, several damaged capacitors were found.

[28.03.2024 13:46 UTC] – The temperature of the lubrication oil inside the generator increased to 97°C and then stabilized at 99-100°C.

[28.03.2024 14:32 UTC] – Generator DG.B1 shut itself down because the oil temperature was too high (101-102°C).

Timeline for Gcore Services:

[28.03.2024 14:34 UTC]: The data center power outage caused Gcore services to shut down.

[28.03.2024 14:36-14:50 UTC]: Restoration of backup power initiated system restarts, including VMware storage and virtual machines. Multiple system failure alerts were triggered.

[28.03.2024 15:00-15:10 UTC]: Physical Kubernetes worker nodes encountered issues, which were addressed and resolved. Infrastructure access was re-established.

[28.03.2024 15:11 - 15:24 UTC]: Initiated infrastructure storage restoration. Identified failed storage nodes and requested manual reactivation.

[28.03.2024 15:20-15:33 UTC]: DNS API issues were reported and resolved by temporarily disabling DNSSEC.

[28.03.2024 15:47-16:00 UTC]: Repaired ClickHouse Analytics and DDoS clusters.

[28.03.2024 15:37-18:23 UTC]: Fixed several services including Hosting, Baremetal, S3, Streaming platform, and Cloud API.

[28.03.2024 16:02-17:14 UTC]: Identified and resolved issues with the ClickHouse Rawlogs cluster and RabbitMQ Metrics.

[28.03.2024 16:20-17:40 UTC]: Completed replacement of network devices within storage nodes and initiated CEPH recovery.

[28.03.2024 17:40 - 19:05 UTC]: Completed recovery of the Cloud Control Panel.

[28.03.2024 18:00 - 19:00 UTC]: Initiated restoration of customer resources in Luxembourg-1 and Luxembourg-2.

[28.03.2024 19:00 UTC - 29.03.2024 15:04 UTC]: Continued restoration of customer resources.

[28.03.2024 15:09 - 16:48 UTC]: Initiated recovery of Object Storage nodes and completed CEPH synchronization. Completed service restoration.
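For illustration only, the following is a minimal sketch of how Ceph recovery progress of the kind described above can be polled until all placement groups return to the active+clean state. It is not Gcore's actual tooling: it assumes the standard ceph CLI is available on an admin node, the polling interval is an arbitrary choice, and the JSON field names follow recent Ceph releases and may differ between versions.

    # Minimal sketch: poll Ceph until all placement groups are active+clean.
    # Assumes the "ceph" CLI is reachable from this host; JSON field names
    # follow recent Ceph releases and may vary by version.
    import json
    import subprocess
    import time

    def pg_summary():
        """Return (clean_pgs, total_pgs) from `ceph status`."""
        out = subprocess.check_output(["ceph", "status", "--format", "json"])
        pgmap = json.loads(out)["pgmap"]
        clean = sum(s["count"] for s in pgmap.get("pgs_by_state", [])
                    if s["state_name"] == "active+clean")
        return clean, pgmap["num_pgs"]

    if __name__ == "__main__":
        while True:
            clean, total = pg_summary()
            print(f"{clean}/{total} PGs active+clean")
            if total and clean == total:
                break
            time.sleep(30)  # assumed polling interval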

Root Cause – ED Datacenter:

  1. The root cause of the capacitors' degradation was determined by the vendor Borri/Legrand: premature aging caused by excessively high reactive currents, which generated an abnormal increase in temperature inside the capacitors.
  2. Generator DG.B1 overheated because its internal glycol/oil plate heat exchanger became clogged with particles, which reduced the oil flow and therefore its ability to keep the oil temperature at normal levels under such a high load.

Root Cause - Gcore Services:

  1. Zabbix monitoring: Had no high-availability (HA) or redundancy support and was unavailable for 30 minutes because it was located in a single data center.
  2. Prometheus monitoring: Alerts were dropped due to a connectivity issue with Alertmanager. It was out of operation for about 20 minutes.
  3. LibreNMS monitoring: Auto-start of php-fpm failed due to a CentOS bug. Database connection issues were also experienced, leading to approximately 2 hours of downtime.
  4. K8s workers and ed-Prometheus servers: Did not boot automatically due to BIOS settings, leading to a boot delay of around 25 minutes.
  5. Clickhouse clusters: Experienced software and hardware issues requiring manual repair. They were unavailable for about 1-1.5 hours.
  6. DNS-API: Faced a 40-minute recovery delay due to connectivity issues with the DNSSEC vault cluster. DNSSEC functionality issues persisted for around 3 days.
  7. PostgreSQL databases: Replication failed to restore automatically after the master servers restarted. Also, the monitoring alert for broken replication had low priority (an illustrative check is sketched after this list).
  8. RabbitMQ Metrics HA Cluster: Outage due to a bug in the federation plugin, disabling the mirroring feature within the cluster and causing connection issues.
  9. RabbitMQ Metrics data loss: Data loss occurred due to the cluster's autoheal mechanism during network splits and the manual removal of the Mnesia database. The service was down for about 1.5 hours.
 10. RabbitMQ Metrics DNS edit UI: An attempt to remove RabbitMQ cluster members from the DNS alias was unsuccessful due to a UI glitch. This led to sync errors and inconsistently updated DNS records.
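Item 7 above describes streaming replication that did not resume automatically and an alert that was too low in priority to be acted on. As a hedged illustration only, not the production check, a minimal probe of pg_stat_replication on the primary can surface broken replication; the connection string and the expected number of standbys below are placeholders.

    # Minimal sketch: flag broken PostgreSQL streaming replication on a primary.
    # The DSN and EXPECTED_REPLICAS are hypothetical placeholders.
    import sys
    import psycopg2

    EXPECTED_REPLICAS = 2  # assumption: number of standbys that should be streaming

    def streaming_replicas(dsn):
        with psycopg2.connect(dsn) as conn:
            with conn.cursor() as cur:
                cur.execute(
                    "SELECT count(*) FROM pg_stat_replication "
                    "WHERE state = 'streaming'"
                )
                return cur.fetchone()[0]

    if __name__ == "__main__":
        count = streaming_replicas("host=pg-primary.example dbname=postgres user=monitor")
        if count < EXPECTED_REPLICAS:
            # A real setup would page with high priority (see action point 9).
            print(f"CRITICAL: only {count}/{EXPECTED_REPLICAS} standbys streaming")
            sys.exit(2)
        print(f"OK: {count} standbys streaming")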

Action points:

  1. Replace capacitors in all UPSs. Done
  2. Replace heat exchanger on overheated generator. Done
  3. Perform stress tests on the UPSs and generator. Done
  4. Implement a secondary standby cluster for CloudAPI in DRF (Frankfurt) with replication enabled. ETA: 1 September 2024
  5. Redesign Zabbix monitoring to include a high-availability setup. Add a third Zabbix MySQL cluster node in a third DC.
  6. Change BIOS settings for K8s workers and ed-prometheus physical servers so they power on automatically after a power loss (an illustrative approach is sketched after this list).
  7. Review cluster settings and tune the options for Clickhouse clusters.
  8. Add an additional cluster node to the DNSSEC vault cluster. Properly install the vault plugin. Done
  9. Change the replication failure alert priority and logic for PostgreSQL databases. Done
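Action point 6 addresses servers that stayed powered off once power returned. As an assumed sketch only, one common way to apply the equivalent of a BIOS "restore on AC power loss" setting out of band is to set the chassis power-restore policy via IPMI; the BMC host names, IPMI user, and password source below are placeholders, not Gcore's actual inventory or procedure.

    # Minimal sketch: set the power-restore policy to "always-on" on a list of
    # BMCs via ipmitool, so nodes boot automatically after a power loss.
    # BMC_HOSTS, the IPMI user, and the password variable are placeholders.
    import os
    import subprocess

    BMC_HOSTS = ["k8s-worker-01-bmc.example", "ed-prometheus-01-bmc.example"]

    def set_always_on(host):
        subprocess.run(
            ["ipmitool", "-I", "lanplus", "-H", host,
             "-U", "admin", "-P", os.environ["BMC_PASSWORD"],
             "chassis", "policy", "always-on"],
            check=True,
        )

    if __name__ == "__main__":
        for host in BMC_HOSTS:
            set_always_on(host)
            print(f"{host}: power-restore policy set to always-on")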
Posted May 07, 2024 - 12:39 UTC

Resolved
This incident has been resolved.
Posted Apr 02, 2024 - 06:42 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Mar 30, 2024 - 12:37 UTC
Update
We would like to inform you that Hosting servers in the Luxembourg location are now operational.
Posted Mar 28, 2024 - 19:12 UTC
Update
We've partially restored the VPS service in ED Hosting and are working on full restoration.
Posted Mar 28, 2024 - 18:30 UTC
Update
We'd like to inform you that the Cloud API has been restored. We're working on the remaining services.
Posted Mar 28, 2024 - 18:27 UTC
Update
We are continuing to work on a fix for this issue.
Posted Mar 28, 2024 - 18:25 UTC
Update
Cloud API and UI are available
Posted Mar 28, 2024 - 18:23 UTC
Update
Hosting - Luxembourg (ED)

We are currently experiencing a partial outage affecting our Virtual Private Servers (VPS) in the Luxembourg (ED) data center. Our team is actively investigating the issue to restore full service as quickly as possible.

Virtual Dedicated Servers (VDS) are fully functional.
Posted Mar 28, 2024 - 18:07 UTC
Update
We are continuing to work on a fix for this issue.
Posted Mar 28, 2024 - 17:25 UTC
Update
S3 is operational
Posted Mar 28, 2024 - 16:56 UTC
Update
Luxembourg-2 cloud is recovering.
Posted Mar 28, 2024 - 16:35 UTC
Update
Luxembourg CDN traffic is rerouted to nearby locations.
Posted Mar 28, 2024 - 15:56 UTC
Update
Baremetal servers are now up and running.
Posted Mar 28, 2024 - 15:48 UTC
Update
Hosting (ED) has been fixed and is now up and running.
Posted Mar 28, 2024 - 15:46 UTC
Update
DNS API has been fixed.
Posted Mar 28, 2024 - 15:37 UTC
Update
We are continuing to work on a fix for this issue.
Posted Mar 28, 2024 - 15:30 UTC
Update
Web Protection API has been fixed.
Posted Mar 28, 2024 - 15:29 UTC
Update
We'd like to inform you that Authorisation, Customer Portal, API, and Admin Portal are now operational. We will inform you once we have more updates.
Posted Mar 28, 2024 - 15:26 UTC
Update
We are continuing to work on a fix for this issue.
Posted Mar 28, 2024 - 15:24 UTC
Update
We are continuing to work on a fix for this issue.
Posted Mar 28, 2024 - 15:23 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Mar 28, 2024 - 15:18 UTC
Investigating
We are currently experiencing a degradation in performance, which may result in service unavailability. We apologize for any inconvenience this may cause and appreciate your patience and understanding during this time.

We will provide you with an update as soon as we have more information on the progress of the resolution. Thank you for your understanding and cooperation.
Posted Mar 28, 2024 - 14:46 UTC
This incident affected: Managed DNS (DNS API), Cloud | Luxembourg-1 (Compute - Instances, Baremetal, Networking - Public Network), Cloud | Systems (API, Customer Portal), Cloud | Luxembourg-2 (Compute - Instances, Networking - Public Network, Networking - Private Network, Baremetal), Web Security (API), Gcore Systems (Customer Portal, Authorization, API, Reseller Portal), Object Storage (S3 Luxembourg), Hosting (Luxembourg (ED)), and CDN (Luxembourg).