Cloud | Network Incident details

Postmortem

16 April 2025 at 07:39 GMT+0

We would like to extend our sincerest apologies for the inconvenience caused by the recent service disruptions. Our dedication to reliability and customer service remains a top priority, and we are sorry for any difficulties that our customers faced. Below is a detailed Root Cause Analysis (RCA) of the incident:

Issue:

A BGP route leak from Gcore Cleaning centers caused unavailability of the Cloud, Baremetal, and WAAP services.

Timeline:

11.04.2025 07:32 (UTC) - Rollout of the aggregated addresses feature started for one client. Hiera YAML changes committed.

11.04.2025 07:51 (UTC) - Puppet change merged to the master branch.

11.04.2025 08:11 (UTC) - Impact started; customers began reporting issues.

11.04.2025 08:18 (UTC) - Investigation started

11.04.2025 08:25 (UTC) - Initiation of emergency conference call; engineers and critical team members convened for investigation

11.04.2025 08:28 (UTC) - The root cause was identified and preparations for a rollback began.

11.04.2025 08:32 (UTC) - Mitigation started

11.04.2025 08:38 (UTC) - End of the impact

Root-cause:

• Cloud, Baremetal, and WAAP were partially unavailable because traffic towards them was blocked by the Threat Mitigation System (TMS).

• TMS servers started announcing Gcore's network prefixes.

• The TMS agent delivered wrong prefixes to the nodes' FRR configuration.

• TMS looked in the customers.<customer_id>.prefixes field and, for each defined prefix x.x.x.x/32, created an aggregate prefix_agg = x.x.x.0/24 and rendered the template.

◦ This logic picked customers' IPs (/32) from shared cloud networks and announced wider prefixes (/24) from TMS, which had no DDoS configuration for them, so the traffic was dropped (default policy).

• The faulty behavior was rolled out globally.

• The faulty behavior wasn't detected on the preprod network.

• We received no alert from the TMS nodes about dropped packets.

◦ The alert for the "Mellanox XDP Counters: Errors" counter was missing (it will not be added until 2025.04.17).
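
The aggregation step described above can be illustrated with a minimal Python sketch. It is not the TMS agent's code: the function names and the addresses (taken from the documentation range 203.0.113.0/24) are hypothetical, and the sketch only assumes what the root cause states, namely that each /32 was widened to its covering /24 and that traffic without a DDoS configuration was dropped by default.

```python
import ipaddress

def aggregate_to_24(host_prefix: str) -> ipaddress.IPv4Network:
    """Widen a host route (/32) to its covering /24 (hypothetical helper)."""
    host = ipaddress.ip_network(host_prefix)
    return host.supernet(new_prefix=24)

# Hypothetical input: one protected customer address in a shared cloud network.
customer_prefixes = ["203.0.113.10/32"]

# Prefixes with a DDoS configuration (the original /32s only).
protected = {ipaddress.ip_network(p) for p in customer_prefixes}
# Prefixes announced after aggregation (the wider /24s).
announced = {aggregate_to_24(p) for p in customer_prefixes}

def is_dropped(addr: str) -> bool:
    """True if traffic to addr is attracted by an announcement but has no config."""
    ip = ipaddress.ip_address(addr)
    attracted = any(ip in net for net in announced)
    configured = any(ip in net for net in protected)
    return attracted and not configured

print(sorted(str(n) for n in announced))
print(is_dropped("203.0.113.10"))   # the protected customer address
print(is_dropped("203.0.113.77"))   # a neighbour in the same shared /24
```

In this sketch the neighbouring address is attracted by the /24 announcement but matches no configured /32, which corresponds to the default-drop behavior the root cause describes.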

Impact:

Some customers and service cloud accounts were affected. The downtime lasted approximately 30 minutes.

Action items:

• Implement changes in the testing process

◦ [notification of an upcoming update] Update the TMS agent on prod and preprod, so everyone can see the update

◦ [check full test flow] Build a sandbox for BGP tests

• Canary Deployment

◦ [enabling functionality temporarily] Use feature flags for sifter-agent to enable functionality for 5-10 minutes and then look at dashboards (faster than through Puppet)

◦ [reducing impact] Roll out canary updates in a more limited way, not only by client_id but also in safer regions (e.g., the WA2 location)

◦ [increasing observability] Add commit annotations to traffic dashboards so it is visible when a change happened

• Improve procedures and policies

◦ [reducing impact] Update test procedures for sifter-agent.

◦ [reducing impact] Create an outbound prefix list that filters out all networks not included in the sifter configuration.

◦ [improving observability] Add the missing alert for the "Mellanox XDP Counters: Errors" counter.
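
The outbound prefix list from the action items can be sketched as follows. This is an illustration only, not the planned implementation: the function name and sample prefixes are hypothetical, and the sketch assumes an exact-match policy, in which a prefix may be announced only if it appears verbatim in the configuration.

```python
import ipaddress

def filter_outbound(candidates, configured):
    """Split candidate announcements into (permitted, rejected) by exact match."""
    allowed = {ipaddress.ip_network(p) for p in configured}
    permitted, rejected = [], []
    for prefix in candidates:
        net = ipaddress.ip_network(prefix)
        (permitted if net in allowed else rejected).append(str(net))
    return permitted, rejected

# Hypothetical data: the configured /32 and an unintended aggregate /24.
configured = ["203.0.113.10/32"]
candidates = ["203.0.113.10/32", "203.0.113.0/24"]

permitted, rejected = filter_outbound(candidates, configured)
print("permitted:", permitted)
print("rejected:", rejected)
```

Under that assumption, an aggregate such as the /24 would be rejected at announcement time even if the agent generated it by mistake.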

Once again, we apologize for any inconvenience this may have caused. We greatly appreciate your patience and understanding throughout this incident, and we thank you for your cooperation.

Should you require further assistance or have any concerns, please feel free to reach out to our support team at support@gcore.com.

Resolved

11 April 2025 at 09:09 GMT+0

We are happy to inform you that the network issue in our cloud service has been resolved. We will provide a Root Cause Analysis (RCA) report in the coming days to help you understand what caused the incident and the steps we have taken to prevent it from happening again in the future. However, if you continue to experience any issues, please do not hesitate to contact our support team. Our team will be happy to assist you and ensure that any further concerns are addressed promptly.

We appreciate your patience and understanding throughout this incident, and we thank you for your cooperation.

For further assistance, please contact our support team at support@gcore.com.

Monitoring

11 April 2025 at 08:44 GMT+0

We are pleased to inform you that our engineering team has implemented a fix to resolve the network issue in our cloud service. However, we are still closely monitoring the situation to ensure stable performance.

We will provide you with an update as soon as we have confirmed that the issue has been completely resolved.

Identified

11 April 2025 at 08:39 GMT+0

We have identified the root cause and continue working on resolving the issue.

Investigating

11 April 2025 at 08:29 GMT+0

We are currently experiencing a significant degradation of network performance in many locations, which may result in complete network unavailability. We sincerely apologise for any inconvenience this may cause, and greatly appreciate your patience and understanding during this critical time.

Our engineering team is actively working to identify the root cause and implement a resolution as quickly as possible. We will provide regular updates as we have more information on the progress of the resolution.

Thank you for your understanding and cooperation.
