Summary:
On November 20, 2024, from 11:10 to 12:10 CET, the addition of a new node to the storage network created a network loop. The loop made the storage backend partially unavailable, degrading operations and overall system performance.
Timeline:
• 11:10 CET: Initial network disturbance noticed as latency began to rise slightly.
• 11:11 - 11:13 CET: Latency spike observed for both read and write operations, peaking at approximately 8 ms.
• 11:12 CET: Engineer received a critical alert.
• 11:12-11:20 CET: The node addition triggered a network loop and a resulting broadcast storm, degrading network conditions and making the storage backend partially unavailable.
• 11:21 CET: Node was identified as the cause, and its network connection was disabled.
• 11:25 CET: Storage services were restored, and latency returned to nominal levels.
• 11:25-12:10 CET: All customer services were checked and confirmed to be working as expected.
Root Cause:
During the node addition, the network cards were changed but the old network configuration was not removed, which created a network loop. The server's network ports do not generate or forward BPDU (Bridge Protocol Data Unit) packets effectively, so Spanning Tree Protocol (STP) on the network could not detect the loop. The undetected loop then produced unknown-unicast storms.
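One way to verify this condition on the server side is sketched below: it checks whether any Linux kernel bridge on a node has STP enabled. This is an illustrative assumption, not the incident tooling; it presumes the node uses standard Linux kernel bridges exposed through sysfs, and interface names will vary per host.

```python
#!/usr/bin/env python3
"""Check whether Linux kernel bridges on this host have STP enabled.

Illustrative sketch only: assumes standard Linux kernel bridges and the
usual sysfs layout; bridge names are whatever exists on the node.
"""
from pathlib import Path

SYSFS_NET = Path("/sys/class/net")


def bridge_stp_state(bridge: Path) -> str:
    """Return the bridge's stp_state: '0' = disabled, '1'/'2' = enabled."""
    return (bridge / "bridge" / "stp_state").read_text().strip()


def main() -> None:
    for iface in sorted(SYSFS_NET.iterdir()):
        if not (iface / "bridge").is_dir():
            continue  # not a bridge device
        state = bridge_stp_state(iface)
        status = "enabled" if state != "0" else "DISABLED (loops will not be detected)"
        print(f"{iface.name}: stp_state={state} -> STP {status}")


if __name__ == "__main__":
    main()
```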
Action Plan:
1. Immediate Actions: The problematic node was disconnected from the network, its network configuration was corrected, and the node was returned to production.
2. Diagnostic Actions: Inspect the STP settings at both the server level and the network level to verify alignment with best practices (a server-side check is sketched after this list).
3. Preventive Measures:
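For the diagnostic step above, a possible server-side check is sketched here. It assumes the node exposes its bridge ports via sysfs (standard Linux kernel bridge); the attribute list is our assumption about which STP-related settings are worth comparing with the switch-side configuration, not a confirmed checklist.

```python
#!/usr/bin/env python3
"""Dump STP-related settings for every Linux bridge port on this host.

Sketch under the assumption of standard Linux kernel bridges; attributes
not exposed by a given kernel are reported as 'n/a'.
"""
from pathlib import Path

SYSFS_NET = Path("/sys/class/net")
PORT_ATTRS = ("state", "priority", "path_cost", "bpdu_guard", "root_block")


def read_attr(path: Path) -> str:
    try:
        return path.read_text().strip()
    except OSError:
        return "n/a"  # attribute not exposed by this kernel


def main() -> None:
    for bridge in sorted(SYSFS_NET.iterdir()):
        brif = bridge / "brif"
        if not brif.is_dir():
            continue  # not a bridge device
        print(f"bridge {bridge.name}: stp_state={read_attr(bridge / 'bridge' / 'stp_state')}")
        for port in sorted(brif.iterdir()):
            attrs = ", ".join(f"{a}={read_attr(port / a)}" for a in PORT_ATTRS)
            print(f"  port {port.name}: {attrs}")


if __name__ == "__main__":
    main()
```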