Cloud | Frankfurt am Main-1 | Networking - Public Network, Cloud | Frankfurt am Main-1 | Networking - BGP, Cloud | Frankfurt am Main-1 | Managed Kubernetes incident details
Incident Report for Gcore
Postmortem

Summary:
On November 20, 2024, from 11:10 to 12:10 CET, the addition of a new node to the storage network caused a network loop. The loop made the storage backend partially unavailable, impacting operations and overall system performance.

Timeline:
• 11:10 CET: Initial network disturbance noticed as latency began to rise slightly.
• 11:11 - 11:13 CET: Latency spike observed for both read and write operations, reaching up to approximately 8ms.
• 11:12 CET: An engineer received a critical alert.
• 11:12-11:20 CET: The node addition triggered a network loop and a resulting broadcast storm, degrading network conditions and making the storage backend partially unavailable.
• 11:21 CET: Node was identified as the cause, and its network connection was disabled.
• 11:25 CET: Storage services were restored, and latency returned to nominal levels.
• 11:25-12:10 CET: All customer services were checked and confirmed to be working as expected.

Root Cause:
The network cards in the node had been replaced, but the old network configuration was not removed, which created a network loop. The server's network ports do not generate or forward BPDU (Bridge Protocol Data Unit) packets, so the network could not detect the loop through Spanning Tree Protocol (STP). With the loop undetected, flooded traffic circulated repeatedly, producing unknown-unicast storms.
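For illustration only, the sketch below shows one way the symptom of such a storm can be spotted from an affected Linux host: it samples the per-interface receive counters under /sys/class/net and flags any interface whose packet rate spikes. The interval and threshold are example values and this is not the monitoring used during the incident.

```python
#!/usr/bin/env python3
"""Illustrative storm check: flag interfaces whose receive rate spikes.

Assumptions (for this sketch only): a Linux host exposing
/sys/class/net/<iface>/statistics/rx_packets, and a threshold chosen
purely as an example.
"""
import time
from pathlib import Path

SAMPLE_INTERVAL_S = 5
RX_PPS_THRESHOLD = 100_000  # example threshold; tune per environment


def rx_packets(iface: str) -> int:
    """Read the cumulative received-packet counter for one interface."""
    return int(Path(f"/sys/class/net/{iface}/statistics/rx_packets").read_text())


def interfaces() -> list[str]:
    """List physical/virtual interfaces, skipping loopback."""
    return [p.name for p in Path("/sys/class/net").iterdir() if p.name != "lo"]


def main() -> None:
    before = {i: rx_packets(i) for i in interfaces()}
    time.sleep(SAMPLE_INTERVAL_S)
    for iface, old in before.items():
        pps = (rx_packets(iface) - old) / SAMPLE_INTERVAL_S
        if pps > RX_PPS_THRESHOLD:
            print(f"possible storm on {iface}: {pps:.0f} packets/s received")


if __name__ == "__main__":
    main()
```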

Action Plan:
1. Immediate Actions: The problematic node was disconnected from the network, its configuration was corrected, and the node was returned to production.
2. Diagnostic Actions: Inspect the STP settings at both the server level and the network level to verify alignment with best practices.
3. Preventive Measures:

  • Configure port security, storm control, and enhanced STP settings so that any similar misconfiguration is contained before it can affect the network.
  • Perform an additional check of old and new network configurations on nodes to ensure consistency and prevent similar issues (a minimal consistency check is sketched below).
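
For illustration, the following sketch shows the kind of node-level check the preventive measures describe: it warns about Linux bridges with STP disabled and reports drift between the live interface list (via `ip -j link show`) and a stored baseline. The baseline path and the fields compared are assumptions for this example, not Gcore tooling.

```python
#!/usr/bin/env python3
"""Illustrative node network-config check: bridge STP state plus a
comparison of live interfaces against a stored baseline.

Assumptions (for this sketch only): a Linux node, iproute2 with JSON
output (`ip -j link show`), and a baseline file written by the same
command at provisioning time (hypothetical path below).
"""
import json
import subprocess
from pathlib import Path

BASELINE = Path("/etc/baseline-links.json")  # hypothetical baseline location


def live_links() -> dict[str, dict]:
    """Return live interface attributes keyed by interface name."""
    out = subprocess.run(["ip", "-j", "link", "show"],
                         capture_output=True, text=True, check=True).stdout
    return {link["ifname"]: link for link in json.loads(out)}


def check_bridge_stp() -> None:
    """Warn about Linux bridges that have STP disabled (stp_state == 0)."""
    for stp in Path("/sys/class/net").glob("*/bridge/stp_state"):
        if stp.read_text().strip() == "0":
            print(f"WARNING: STP disabled on bridge {stp.parent.parent.name}")


def check_drift() -> None:
    """Report interfaces added, removed, or with a changed MTU/state."""
    baseline = {link["ifname"]: link for link in json.loads(BASELINE.read_text())}
    live = live_links()
    for name in sorted(set(baseline) | set(live)):
        if name not in baseline:
            print(f"DRIFT: {name} present on node but not in baseline")
        elif name not in live:
            print(f"DRIFT: {name} in baseline but missing on node")
        elif (baseline[name].get("mtu"), baseline[name].get("operstate")) != \
             (live[name].get("mtu"), live[name].get("operstate")):
            print(f"DRIFT: {name} MTU or state differs from baseline")


if __name__ == "__main__":
    check_bridge_stp()
    check_drift()
```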
Posted Dec 03, 2024 - 09:52 UTC

Resolved
We'd like to inform you that the issue has been resolved, and we are closely monitoring the performance to ensure there are no further disruptions. We will provide a Root Cause Analysis (RCA) report in the coming days to help you understand what caused the incident and the steps we have taken to prevent it from happening again in the future.

We apologize for any inconvenience this may have caused you, and want to thank you for your patience and understanding throughout this process.
Posted Nov 20, 2024 - 13:27 UTC
Monitoring
We are pleased to inform you that our engineering team has implemented a fix to resolve the network issues affecting resources. However, we are still closely monitoring the situation to ensure stable performance.

We will provide you with an update as soon as we have confirmed that the issue has been completely resolved.
Posted Nov 20, 2024 - 10:33 UTC
Investigating
We are currently experiencing a partial outage affecting our service's performance, which may result in partial unavailability for some users. We apologize for any inconvenience this may cause and appreciate your patience and understanding during this time.

We will provide you with an update as soon as we have more information on the progress of the resolution. Thank you for your understanding and cooperation.
Posted Nov 20, 2024 - 10:20 UTC
This incident affected: Cloud | Frankfurt am Main-1 (Networking - Public Network, Networking - BGP, Managed Kubernetes).