Cloud | Luxembourg-2 | Block Storage - Block Volume incident details
Incident Report for Gcore
Postmortem

RCA - Mass ed-c16-* disk latency issue (19.09.22 - 23.09.22)
(Luxembourg-2 region storage issue)

Summary:
The incident was caused by an expansion of the Ceph storage cluster in the Luxembourg-2 region that triggered unexpectedly aggressive internal key-value (RocksDB) usage, which led to massive OSD failures and major storage degradation for all virtual instances in Luxembourg-2. After upgrading Ceph and tuning its configuration, the cluster returned to a production state without data loss.

Timeline:

[19.09.2022 19:01 UTC] - As part of the Ceph cluster expansion, one of the Ceph nodes was rebooted.

[19.09.2022 19:36 UTC] - We received an incident report about high load average (LA) on virtual machines in ED-16 / Luxembourg-2. Started the investigation.

[19.09.2022 21:00 UTC] - Observed slow peering on several OSDs. Continued the investigation.

[19.09.2022 22:00 UTC] - Changed the storage scrub configuration, after which peering completed successfully.

[20.09.2022 00:30 UTC] - Received multiple alerts about the operation of the storage system and started investigating. The immediate issue was resolved. Found a configuration and sharding problem, analysed the Pacific release, and reviewed bug reports and release notes.

[20.09.2022 03:00 UTC] - The problem resumed on one of the nodes (313); latency to RocksDB was high. Continued the investigation.

[20.09.2022 06:00 UTC] - We stopped the problematic OSDs on node 313, while latency on the other nodes remained high.

[20.09.2022 06:00 UTC] - Started adding new storage nodes running the updated version 16.2.10 and initiated data recovery from node 313 to the new nodes.

[20.09.2022 09:00 UTC] - The new nodes had been added. Continued observing latency on the storage cluster.

[20.09.2022 10:50 UTC] - Returned node 313 to the cluster; data recovery from the node continued.

[20.09.2022 11:00 UTC] - We observed cluster degradation and noticed abnormal operation of node 315. Storage client latency was high. Continued the investigation.

[20.09.2022 16:18 UTC] - We added more nodes from the new pool and the situation stabilized on the graphs; we waited for data to be redistributed from the old nodes.

[20.09.2022 19:30 UTC] - The latency situation worsened; the engineering team changed the replication configuration and the replication situation stabilized.

[20.09.2022 21:40 UTC] - The latency situation worsened again and delays on another node increased; the team changed the recovery configuration.

[20.09.2022 23:30 UTC] - Delays stabilized; we continued monitoring the status of data replication.

[21.09.2022 01:00 UTC] - High delays caused by crashes of specific OSDs. Continued the investigation.

[21.09.2022 06:00 UTC] - Found that the crashes leading to inaccessible PGs and high delays occur during compaction: once the compaction process begins to slow down, the OSD goes down. We turned off the fast_shutdown setting, which unfortunately did not help (a configuration sketch follows the timeline).

[21.09.2022 06:00 UTC – 22.09.2022 10:00 UTC] - Over the course of the day we managed to reduce the number of OSD crashes; data replication continued.

[22.09.2022 10:00 UTC] - Received more tickets about virtual machines experiencing slow storage.

[22.09.2022 10:30 UTC] - Updated another old OSD node to 16.2.10 and restarted its OSDs.

[22.09.2022 13:00 UTC] - Provisioned, updated, and added node 313 back to the cluster; the situation stabilized and delays were at an acceptable level.

[22.09.2022 14:00 UTC] - Replication proceeded.

[22.09.2022 22:00 UTC] - During compaction we observed 25k IOPS on the LVM layer, which translated into 50k IOPS on the SSDs: overutilization during compactions.

[22.09.2022 23:00 UTC] - Restarted the frozen OSD; the cluster was at -2 replicas and had several inactive PGs (0.4%).

[23.09.2022 00:00 UTC] - Worked on recovering the inactive PGs.

[23.09.2022 01:20 UTC] - Restored data availability. Replication proceeded.

[23.09.2022 14:00 UTC] - Several nodes were restarted with new RocksDB parameters.

[23.09.2022 15:00 UTC] - The engineering team confirmed that the issue with the large amount of key-value metadata was fixed on these nodes.

[23.09.2022 16:00 UTC – 20:45 UTC] - Applied the new parameter to all OSD nodes in the cluster, then restarted the OSDs and waited for recovery (a RocksDB tuning sketch follows the timeline).

[23.09.2022 21:00 UTC] - The cluster was stabilized and back in a production state.
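
For reference, the recovery throttling, scrub changes, and fast_shutdown change mentioned in the timeline entries for 20.09 and 21.09 are applied with standard Ceph commands. The following is a minimal illustrative sketch (Python wrapping the ceph CLI); the option values shown are examples only, not the exact settings used during the incident, and osd_fast_shutdown is our reading of the "fast_shutdown setting" referenced above.

    # Illustrative sketch only: throttle backfill/recovery, pause scrubbing,
    # and disable fast shutdown via the standard "ceph config set" / "ceph osd set"
    # commands. Values are examples, not the settings applied during the incident.
    import subprocess

    def ceph_config_set(section: str, option: str, value: str) -> None:
        """Apply a config option cluster-wide via the ceph CLI (admin keyring required)."""
        subprocess.run(["ceph", "config", "set", section, option, value], check=True)

    # Throttle backfill/recovery so client I/O is not starved during rebalancing.
    ceph_config_set("osd", "osd_max_backfills", "1")
    ceph_config_set("osd", "osd_recovery_max_active", "1")
    ceph_config_set("osd", "osd_recovery_sleep", "0.1")

    # Pause scrubbing while the cluster recovers (revert later with "ceph osd unset").
    subprocess.run(["ceph", "osd", "set", "noscrub"], check=True)
    subprocess.run(["ceph", "osd", "set", "nodeep-scrub"], check=True)

    # Disable fast shutdown so an OSD shuts down cleanly instead of exiting quickly;
    # as noted in the timeline, this did not resolve the crashes in our case.
    ceph_config_set("osd", "osd_fast_shutdown", "false")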

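The RocksDB parameter change rolled out on 23.09 (and referenced in action point 1 below) is delivered through the OSDs' bluestore_rocksdb_options setting followed by an OSD restart, since that option is only read at OSD start. The sketch below shows the mechanism only: the option string is a hypothetical example and does not reproduce the tuning actually shipped to Luxembourg-2, and the restart loop assumes a cephadm-managed cluster with placeholder OSD IDs.

    # Illustrative sketch only: push a RocksDB options string to the OSDs and
    # restart them gradually. The option string and OSD IDs are examples.
    import subprocess

    ROCKSDB_OPTS = (
        "compression=kNoCompression,"
        "max_write_buffer_number=4,"
        "min_write_buffer_number_to_merge=1,"
        "compaction_readahead_size=2097152"
    )

    subprocess.run(
        ["ceph", "config", "set", "osd", "bluestore_rocksdb_options", ROCKSDB_OPTS],
        check=True,
    )

    # Restart OSDs a few at a time so only one failure domain recovers at once.
    # "ceph orch daemon restart" assumes a cephadm-managed cluster; otherwise the
    # OSD services would be restarted with systemctl on each host.
    for osd_id in ("1", "2", "3"):  # example OSD IDs
        subprocess.run(["ceph", "orch", "daemon", "restart", f"osd.{osd_id}"], check=True)
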
Root Cause:

  1. A bug in Ceph version 16.2.7 - https://tracker.ceph.com/issues/55442 and https://github.com/ceph/ceph/pull/46096. Any rebalancing on the OSDs holding OMAP data would lead to terrible performance until we compacted RocksDB.
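
Until the fix was rolled out, the practical mitigation was to compact RocksDB on the affected OSDs. A minimal sketch of how such an online compaction is triggered with the standard ceph CLI is shown below; the OSD IDs are placeholders.

    # Illustrative sketch only: trigger an online RocksDB compaction on selected
    # OSDs with "ceph tell osd.<id> compact". Compaction is I/O-heavy, so in
    # practice it is run on a few OSDs at a time.
    import subprocess

    def compact_osd(osd_id: int) -> None:
        """Ask a running OSD to compact its key-value store (RocksDB)."""
        subprocess.run(["ceph", "tell", f"osd.{osd_id}", "compact"], check=True)

    for osd_id in (0, 1, 2):  # placeholder OSD IDs
        compact_osd(osd_id)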

Action points:

  1. The configuration parameters responsible for key-value storage (RocksDB) were tuned and rolled out to the Luxembourg-2 cluster.
  2. Added alerts on key-value DB metadata size growth for a proactive monitoring response (an example check is sketched after this list).
  3. Audit the configuration of all clusters.
  4. Evaluate alternative storage solutions and launch a new type of storage based on Linstor.
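
As an example of the proactive check behind action point 2, per-OSD OMAP and metadata usage can be read from "ceph osd df" and alerted on once it crosses a threshold. The sketch below is illustrative only: the threshold is a placeholder, our production alerting lives in the monitoring stack rather than in a script, and the JSON field names shown match recent Ceph releases but may vary.

    # Illustrative sketch only: flag OSDs whose key-value (OMAP) or metadata
    # usage exceeds a threshold, using "ceph osd df --format json".
    import json
    import subprocess

    THRESHOLD_GIB = 30  # placeholder threshold, not our production value

    out = subprocess.run(
        ["ceph", "osd", "df", "--format", "json"],
        check=True, capture_output=True, text=True,
    ).stdout

    for node in json.loads(out).get("nodes", []):
        omap_gib = node.get("kb_used_omap", 0) / (1024 * 1024)  # KiB -> GiB
        meta_gib = node.get("kb_used_meta", 0) / (1024 * 1024)
        if omap_gib > THRESHOLD_GIB or meta_gib > THRESHOLD_GIB:
            print(f"osd.{node['id']}: omap={omap_gib:.1f} GiB, meta={meta_gib:.1f} GiB")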

Finally, we want to apologize for the impact this event had on you. We know how critical these services are to your customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further.

Best regards,

Vsevolod Vayner

GCORE Cloud Product Director

Posted Sep 30, 2022 - 10:30 UTC

Resolved
The issue has been resolved and we are monitoring performance closely. We are going to provide you with an RCA report in the following days.
We are very sorry for the inconvenience caused by the incident!
Thank you for bearing with us!
Posted Sep 24, 2022 - 09:34 UTC
Monitoring
Our engineering team has implemented a fix to resolve the issue.
We are still monitoring the situation and will post an update as soon as we gain stable performance.
We apologize for any inconvenience this may have caused and appreciate your patience.
Posted Sep 23, 2022 - 21:00 UTC
Update
Our Engineering team is aware of this issue and working on the fix.
We apologize for the inconvenience and will share an update once we have more information.
Posted Sep 23, 2022 - 01:45 UTC
Update
Our Engineering team is aware of this issue and working on the fix.
We apologize for the inconvenience and will share an update once we have more information.
Posted Sep 22, 2022 - 23:27 UTC
Update
Our Engineering team is aware of this issue and working on the fix.
We apologize for the inconvenience and will share an update once we have more information.
Posted Sep 20, 2022 - 18:00 UTC
Identified
Our Engineering team is aware of this issue and working on the fix.
We apologize for the inconvenience and will share an update once we have more information.
Posted Sep 20, 2022 - 10:54 UTC
Investigating
We are currently experiencing performance degradation.
During this time, users may experience slow or unusual functioning of the service.
We apologize for any inconvenience this may have caused and will share an update once we have more information.
Posted Sep 20, 2022 - 06:23 UTC
This incident affected: Cloud | Luxembourg-2 (Compute - Instances, Block Storage - Block Volume).