Cloud | Luxembourg-1 (Compute - Instances, Block Storage - Block Volume, Networking - Public Network)
Incident Report for Gcore
Postmortem

On September 6, 2024, a critical network incident occurred in the Luxembourg-1 (ED-9) data center, affecting Cloud Platform services. The issue arose from multiple instances of MAC address flapping across some of the production VLANs, which triggered the EVPN protocol to blacklist the affected MAC addresses, disrupting network traffic.

The root cause of the incident was the excessive movement of MAC addresses between network interfaces, likely caused by misconfigurations or instability in the network topology. As a result, several MAC addresses were blacklisted and subsequently recovered as the network stabilized. The service disruption lasted from approximately 12:46 to 14:10 UTC, affecting API availability and related network services.
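
For readers unfamiliar with the mechanism, the blacklisting described above can be illustrated with a minimal sketch. It assumes the duplicate-MAC detection scheme from RFC 7432 (a MAC address that moves N times within M seconds is flagged, with defaults of N = 5 and M = 180). The class name, method names, VLAN number, MAC address, and interface names are hypothetical, and the thresholds are the RFC defaults rather than the values configured in Luxembourg-1. Actual switch behavior may differ.

```python
from collections import defaultdict, deque

# Assumed defaults from RFC 7432 section 15.1 (N moves within M seconds),
# not the values used in the affected network.
MOVE_THRESHOLD = 5
WINDOW_SECONDS = 180.0


class MacMoveDetector:
    """Toy model of EVPN-style duplicate-MAC detection (illustrative only)."""

    def __init__(self, threshold=MOVE_THRESHOLD, window=WINDOW_SECONDS):
        self.threshold = threshold
        self.window = window
        self.location = {}               # (vlan, mac) -> last interface seen
        self.moves = defaultdict(deque)  # (vlan, mac) -> timestamps of recent moves
        self.blacklisted = set()

    def observe(self, timestamp, vlan, mac, interface):
        """Record one MAC learning event; return True if the MAC is blacklisted."""
        key = (vlan, mac)
        if key in self.blacklisted:
            return True
        previous = self.location.get(key)
        self.location[key] = interface
        if previous is None or previous == interface:
            return False  # first sighting, or the MAC did not move
        recent = self.moves[key]
        recent.append(timestamp)
        # Drop moves that have fallen out of the sliding window.
        while recent and timestamp - recent[0] > self.window:
            recent.popleft()
        if len(recent) >= self.threshold:
            self.blacklisted.add(key)
            return True
        return False

    def recover(self, vlan, mac):
        """Clear the blacklist entry, as happens once the flapping stops."""
        key = (vlan, mac)
        self.blacklisted.discard(key)
        self.moves.pop(key, None)


# Hypothetical example: one MAC alternating between two ports every second.
detector = MacMoveDetector()
for second in range(6):
    port = "Ethernet1" if second % 2 == 0 else "Ethernet2"
    flagged = detector.observe(float(second), 100, "00:11:22:33:44:55", port)
```

In this sketch the address is flagged on its fifth move, after which traffic for it would be dropped until the entry is cleared. This mirrors the sequence in the report: flapping, blacklisting, then recovery once the network stabilized.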

Immediate actions were taken to stop the flapping behavior, and automatic recovery of the blacklisted MAC addresses helped restore network traffic. Corrective actions are being implemented to prevent future occurrences, including removing a list of VLANs, improving network isolation, and working with Arista to ensure a stable EVPN configuration.

This incident highlights the need for enhanced network policy enforcement, stricter separation of production and development environments, and better monitoring to detect early signs of network instability.

Key Points:

  • Date: September 6, 2024

  • Impact: Cloud Platform services in Luxembourg-1 (ED-9) experienced disruption due to MAC address blacklisting.

  • Root Cause: MAC address flapping on VLANs, preliminarily attributed to switch misbehavior. To be confirmed or rejected by Arista.

  • Mitigation: Automatic recovery of blacklisted MAC addresses and subsequent network reconfiguration.

  • Next Steps: VLAN cleanup and network isolation to prevent recurrence.

Posted Sep 10, 2024 - 15:17 UTC

Resolved
We'd like to inform you that the issue has been resolved, and we are closely monitoring the performance to ensure there are no further disruptions. We will provide a Root Cause Analysis (RCA) report in the coming days to help you understand what caused the incident and the steps we have taken to prevent it from happening again in the future.

We apologize for any inconvenience this may have caused you, and want to thank you for your patience and understanding throughout this process.
Posted Sep 06, 2024 - 14:36 UTC
Update
Our team is still restoring the storage. Current ETA: 20-30 minutes.

Our team is now fully focused on rectifying the situation.
Posted Sep 06, 2024 - 14:03 UTC
Update
We'd like to let you know that our engineering team has successfully located the issue that is affecting Storage services in Luxembourg-1. The outage is caused by an internal network storm and is a fully isolated issue. The engineering team is working to restore storage and services.


Our team is now fully focused on rectifying the situation.
Posted Sep 06, 2024 - 13:40 UTC
Identified
We'd like to let you know that our engineering team has successfully located the issue that is affecting Block Storage. The partial outage is caused by an internal networking issue.

Our team is now fully focused on rectifying the situation.
Posted Sep 06, 2024 - 13:35 UTC
Monitoring
We are pleased to inform you that our engineering team has implemented a fix to resolve the major outage in the Luxembourg-1 network. However, we are still closely monitoring the situation to ensure stable performance.

We will provide you with an update as soon as we have confirmed that the issue has been completely resolved.
Posted Sep 06, 2024 - 13:23 UTC
Investigating
We are currently experiencing a degradation in performance, which may result in service unavailability. We apologize for any inconvenience this may cause and appreciate your patience and understanding during this time.

We will provide you with an update as soon as we have more information on the progress of the resolution. Thank you for your understanding and cooperation.
Posted Sep 06, 2024 - 12:52 UTC
This incident affected: Cloud | Luxembourg-1 (Compute - Instances, Block Storage - Block Volume, Networking - Public Network).