RCA – Luxembourg-2 (ED-16), Storage Issue (SSD Low-Latency / Linstor)
Summary
On 26-03-2024, a failed region test alert signalled the start of a sequence of critical issues that caused operational disruptions within the cloud environment, primarily involving Linstor storage anomalies. The official incident timeline commenced at 17:40 UTC with the initial alert and concluded at 21:51 UTC on 27-03-2024, following comprehensive recovery efforts that included node deactivations, manual file system repairs, and cluster synchronization procedures.
Timeline
· [26.03.2024 17:40 UTC]: The Cloud Operations Team received a notification of a failed region test alert.
· [26.03.2024 17:40 – 18:03 UTC]: The Cloud Operations Team investigated the issue, identified the root cause related to Linstor storage, and began implementing a workaround: disabling two storage nodes, since the underlying issue was disk skew at the LVM level observed on disks added after startup.
· [26.03.2024 18:40 – 19:18 UTC]: The workaround was applied, and internal tests were rerun successfully.
· [26.03.2024 20:31 UTC]: The Cloud Operations Team again received a notification of a failed region test alert.
· [26.03.2024 20:31 – 20:41 UTC]: The Cloud Operations Team confirmed the issue with Linstor storage and began investigating the root cause and potential solutions.
· [26.03.2024 20:41 – 21:52 UTC]: A significant redundancy load was discovered following the initiation of volume auto-evacuation/replication, with approximately 2000 resources migrating between nodes and adversely affecting the DRBD cluster.
· [26.03.2024 23:46 UTC]: The replication process was ongoing, at a lower than expected speed.
· [27.03.2024 02:02 UTC]: A major portion of the replication was still underway. It was decided to wait for the full replication process to complete, estimated to conclude by 05:30 UTC, exceeding the initial 3-hour expectation.
· [27.03.2024 05:30 – 06:21 UTC]: Following the completion of replication, the Cloud Team removed all failed volume creation tasks and stuck volumes.
· [27.03.2024 06:21 UTC]: SSD Low-Latency management was reopened for volume creation.
· [27.03.2024 06:52 UTC]: Reports were received of mass failures in instance booting due to file system failures.
· [27.03.2024 06:52 – 14:48 UTC]: A significant increase in failed tasks and volumes was observed, impacting storage performance and overall management. This was compounded by DRBD resources getting stuck in the D-state because the replication process overwhelmed the available network capacity.
· [27.03.2024 14:48 – 21:51 UTC]: To address the unpredictable storage performance and errors, the Cloud Operations Team decided to sequentially reboot each storage node with full cluster synchronization, for an overall maintenance duration of 7 hours. Maintenance included node-by-node reboots, synchronization, and clearing of failed tasks and stuck volumes.
· [27.03.2024 21:51 UTC]: The cluster was fully recovered and operational.
Root Cause
Data Skew Due to Disk Volume Variance: The initial problem stemmed from data skew caused by disks of differing sizes on the first two nodes within the region. This variance led to failures during disk creation attempts, with errors reported at the Logical Volume Manager (LVM) level.
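For illustration, a check like the minimal Python sketch below (run on each storage node) can surface this kind of skew before it causes creation failures: it shells out to the standard LVM `pvs` tool and flags physical volumes within one volume group whose sizes diverge. The per-VG grouping and the 1% tolerance are illustrative assumptions, not values taken from this incident.

    # pv_skew_check.py -- minimal sketch: flag size variance between physical
    # volumes of a volume group, using the standard LVM2 `pvs` tool.
    import subprocess
    from collections import defaultdict

    def pv_sizes_by_vg():
        """Return {vg_name: [(pv_name, size_bytes), ...]} parsed from `pvs`."""
        out = subprocess.run(
            ["pvs", "--noheadings", "--units", "b", "--nosuffix",
             "-o", "pv_name,vg_name,pv_size"],
            check=True, capture_output=True, text=True,
        ).stdout
        groups = defaultdict(list)
        for line in out.splitlines():
            fields = line.split()
            if len(fields) == 3:                 # skip PVs not assigned to a VG
                pv, vg, size = fields
                groups[vg].append((pv, int(float(size))))
        return groups

    if __name__ == "__main__":
        TOLERANCE = 0.01  # flag >1% size difference; threshold is illustrative
        for vg, pvs in pv_sizes_by_vg().items():
            sizes = [s for _, s in pvs]
            if len(sizes) > 1 and (max(sizes) - min(sizes)) / max(sizes) > TOLERANCE:
                print(f"WARNING: size skew in VG {vg}: {pvs}")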
Node Deactivation for Cluster Modification: To address the issue, two nodes were deactivated within the Linstor system, with the intention of removing one node from the cluster and re-adding it after a complete data clearance. During this process, all volumes that had replicas on both node 1 and node 2 were forced to operate with a single copy of data.
Automatic Linstor Evacuation: Linstor's automatic evacuation was triggered for the second node after it had been offline for one hour, in line with the default settings.
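For context, recent LINSTOR releases control this behaviour through properties such as DrbdOptions/AutoEvictAfterTime (documented default of 60 minutes, which matches the one-hour delay observed here) and DrbdOptions/AutoEvictAllowEviction. The exact property names should be verified against the deployed LINSTOR version; the Python sketch below merely shells out to the `linstor` client and lists any auto-evict related controller properties.

    # autoevict_report.py -- minimal sketch: show LINSTOR auto-evict related
    # controller properties. Assumes the `linstor` client is on PATH and the
    # property names used by recent LINSTOR releases (e.g.
    # DrbdOptions/AutoEvictAfterTime); verify against your version.
    import subprocess

    def controller_properties():
        out = subprocess.run(
            ["linstor", "controller", "list-properties"],
            check=True, capture_output=True, text=True,
        ).stdout
        return out.splitlines()

    if __name__ == "__main__":
        lines = [l for l in controller_properties() if "AutoEvict" in l]
        if not lines:
            print("No AutoEvict overrides set; built-in defaults apply "
                  "(documented default: eviction ~60 min after a node goes offline).")
        for l in lines:
            print(l)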
Data Transfer Storm: The evacuation process led to a data transfer storm that consumed all available network capacity.
Widespread Impact of the Data Storm: The storm severely impacted operations, including a constant stream of requests to the Distributed Replicated Block Device (DRBD) layer. Approximately 2000 synchronizing resources began to stall, as did the replication process itself.
Stabilization: Stabilization occurred once data replication completed and all DRBD resources returned to an up-to-date state.
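As an illustration of how that end state can be verified, the sketch below scans the output of `drbdadm status` on a storage node and reports resources whose local or peer disks are not yet UpToDate or whose replication is not Established. The text layout assumed here is the drbd-utils 9 format; treat the parsing details as assumptions to adjust for the deployed version.

    # drbd_health_check.py -- minimal sketch: scan `drbdadm status` output and
    # report resources that are not yet fully up to date. The assumed layout
    # (resource name at column 0, indented "disk:<state>" and
    # "replication:<state> peer-disk:<state>" lines) is the drbd-utils 9
    # format; verify against your version. Diskless peers will also be flagged.
    import subprocess

    def unhealthy_resources():
        out = subprocess.run(
            ["drbdadm", "status"], check=True, capture_output=True, text=True
        ).stdout
        bad, current = set(), None
        for line in out.splitlines():
            if line and not line[0].isspace():      # resource header line
                current = line.split()[0]
            elif current:
                for token in line.split():
                    key, _, value = token.partition(":")
                    if key in ("disk", "peer-disk") and value != "UpToDate":
                        bad.add(current)
                    if key == "replication" and value != "Established":
                        bad.add(current)
        return sorted(bad)

    if __name__ == "__main__":
        pending = unhealthy_resources()
        if pending:
            print(f"still syncing or degraded: {pending}")
        else:
            print("all resources up to date")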
Action points
• Standardization of Node Configuration Across Regions: Align the configuration of older nodes to a unified cluster geometry with uniform disk sizes in all regions, as identified in ED and ANX. This action aims to mitigate data skew issues resulting from disk volume variance. ETA: Q2 2024.
• Linstor Node Maintenance Instruction Update: Revise the Linstor node maintenance instructions to include deactivating the auto-evacuation and auto-recovery features during maintenance activities, as well as procedures for a controlled restoration of cluster redundancy (a sketch of such a procedure follows this list). This measure is intended to prevent the inadvertent triggering of data transfer storms and their associated impact. ETA: 14 April 2024.
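To make the intended instruction concrete, the sketch below shows what such a maintenance step could automate under stated assumptions: suppressing LINSTOR auto-eviction for the node under maintenance and re-enabling it only once the node has been returned to service and resynchronized. It assumes the `linstor` client and the DrbdOptions/AutoEvictAllowEviction node property documented for recent LINSTOR releases; the node name is a placeholder and the property name should be verified against the deployed version.

    # maintenance_mode.py -- minimal sketch of the intended maintenance flow:
    # suppress auto-eviction for one node, perform the maintenance, then
    # restore defaults so cluster redundancy is re-established deliberately.
    # Assumes the `linstor` client and the DrbdOptions/AutoEvictAllowEviction
    # node property of recent LINSTOR releases; verify against your version.
    import subprocess

    def linstor(*args):
        print("+ linstor", " ".join(args))
        subprocess.run(["linstor", *args], check=True)

    def enter_maintenance(node: str):
        # Keep LINSTOR from auto-evicting this node while it is rebooted.
        linstor("node", "set-property", node,
                "DrbdOptions/AutoEvictAllowEviction", "false")

    def leave_maintenance(node: str):
        # Re-enable eviction only after the node is back and resynchronized,
        # so redundancy is rebuilt as an explicit, operator-controlled step.
        linstor("node", "set-property", node,
                "DrbdOptions/AutoEvictAllowEviction", "true")

    if __name__ == "__main__":
        NODE = "storage-node-01"   # placeholder node name
        enter_maintenance(NODE)
        # ... perform the reboot / maintenance work here ...
        leave_maintenance(NODE)

Gating the re-enable step on a completed resynchronization keeps any redundancy rebuild under operator control rather than leaving it to automatic evacuation.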
These action points are designed to address the root causes identified in the RCA by enhancing cluster uniformity, refining maintenance protocols to prevent future incidents, and exploring file system alternatives for greater stability.
Finally, we want to apologize for the impact this event caused for you. We know how critical these services are to your customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further.