We would like to extend our sincerest apologies for the inconvenience caused by the recent service disruptions. Our dedication to reliability and customer service remains a top priority, and we are sorry for any difficulties that our customers faced. Below is a detailed Root Cause Analysis (RCA) of the incident:
Issue:
A BGP route leak from Gcore cleaning centers caused unavailability of the Cloud, Baremetal, and WAAP services.
Timeline:
11.04.2025 07:32 (UTC) - Aggregated addresses feature roll out started for one client. Hiera-yaml changes committed.
11.04.2025 07:51 (UTC) - Puppet change merged to the master branch
11.04.2025 08:11 (UTC) - Impact started; customers began reporting issues.
11.04.2025 08:18 (UTC) - Investigation started
11.04.2025 08:25 (UTC) - Initiation of emergency conference call; engineers and critical team members convened for investigation
11.04.2025 08:28 (UTC) - The root cause was identified and preparations for a rollback began.
11.04.2025 08:32 (UTC) - Mitigation started
11.04.2025 08:38 (UTC) - End of the impact
Root-cause:
• Cloud, Baremetal and WAAP were partially unavailable because the traffic towards them was blocked by the Threat Mitigation System (TMS)
• TMS servers started announcing Gcore's network prefixes
• The TMS agent delivered wrong prefixes to the nodes' FRR configuration
• TMS looked in the customers.<customer_id>.prefixes field and, for each defined prefix x.x.x.x/32, created an aggregate prefix_agg = x.x.x.0/24 and rendered the template
◦ This logic picked customers' IPs (/32) from shared cloud networks and announced wider prefixes (/24) from TMS, which had no DDoS configuration for them, so the traffic was dropped (default policy)
• The faulty behavior was rolled out globally
• The faulty behavior wasn't detected on the preprod network
• We got no alert from TMS nodes about dropped packets
◦ The alert for the "Mellanox XDP Counters: Errors" counter was missing (it will only be added on 17.04.2025)
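The aggregation step described above can be illustrated with a minimal Python sketch. It is not the TMS agent's code: the function name aggregate_to_24 and the addresses (taken from the RFC 5737 documentation range) are assumptions made for illustration only. It shows how widening a single /32 to its covering /24 pulls in every other address of the shared network.

```python
import ipaddress


def aggregate_to_24(host_prefix: str) -> ipaddress.IPv4Network:
    """Widen a single host prefix (/32) to the /24 that contains it.

    Hypothetical helper mirroring the prefix_agg logic described in the report.
    """
    host = ipaddress.ip_network(host_prefix)
    return host.supernet(new_prefix=24)


# Hypothetical customer address inside a shared cloud network.
customer_ip = "192.0.2.57/32"
announced = aggregate_to_24(customer_ip)
print(announced)  # 192.0.2.0/24

# Another tenant's address in the same shared network is now covered by the
# announcement, even though no DDoS configuration exists for it.
neighbour = ipaddress.ip_address("192.0.2.200")
print(neighbour in announced)  # True
```

Traffic for the neighbouring address would therefore be attracted to the announcing node and, with no matching configuration, fall under the default drop policy.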
Impact:
Some customers and service cloud accounts were affected. The downtime lasted approximately 30 minutes.
Action items:
• Implement changes in the testing process
◦ [notification of an upcoming update] Update the TMS agent on prod and preprod, so everyone can see the update
◦ [check full test flow] Build a sandbox for BGP tests
• Canary Deployment
◦ [enabling functionality temporarily] Use feature flags for sifter-agent to enable functionality for 5-10 minutes and then look at dashboards (faster than through puppet)
◦ [reducing impact] Roll out canary updates in a more limited way, not only by client_id but also starting in safer regions (e.g. the WA2 location)
◦ [increasing observability] Add commit annotations to dashboards with traffic so you can see when something happened
• Improve procedures and policies
◦ [reducing impact] Update test procedures for sifter-agent.
◦ [reducing impact] Create an outbound prefix list which will filter all networks not included in the sifter configuration.
◦ [improving observability] Add the missing alert for the "Mellanox XDP Counters: Errors" counter
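The outbound prefix list from the action items can be sketched as a simple allow-list check. This is an illustrative Python model, not the actual FRR or sifter configuration: the function name filter_announcements, the rule that a candidate must be fully covered by a configured network, and the example addresses are all assumptions.

```python
import ipaddress


def filter_announcements(candidates, configured):
    """Return only the candidate prefixes fully covered by a configured network.

    Hypothetical model of an outbound prefix list: anything not contained in
    the configured set is filtered out before it can be announced.
    """
    allowed = [ipaddress.ip_network(p) for p in configured]
    kept = []
    for prefix in candidates:
        net = ipaddress.ip_network(prefix)
        # Compare only networks of the same IP version; subnet_of() raises
        # TypeError when IPv4 and IPv6 networks are mixed.
        if any(net.version == a.version and net.subnet_of(a) for a in allowed):
            kept.append(prefix)
    return kept


# Only the customer's /32 is configured; the aggregated /24 is not.
configured = ["192.0.2.57/32"]
candidates = ["192.0.2.57/32", "192.0.2.0/24"]
print(filter_announcements(candidates, configured))  # ['192.0.2.57/32']
```

Under these assumptions the wider /24 from this incident would have been rejected at the outbound filter, because it is not contained in any configured network.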
Once again, we apologize for any inconvenience this may have caused. We greatly appreciate your patience and understanding throughout this incident, and we thank you for your cooperation.
Should you require further assistance or have any concerns, please feel free to reach out to our support team at support@gcore.com.