Mass CDN outage details

Incident Report for Gcore

Postmortem

Issue

On October 23, 2023, we encountered an outage in our Content Delivery Network (CDN) that began at 12:42 UTC and persisted until 13:57 UTC. This outage was triggered by a configuration error, resulting in all CDN edges returning 500 errors and significantly impacting the delivery of CDN resources.

Timeline

October 23, 2023, 12:32 (UTC): A change was made to a system configuration file, which was later discovered to contain a syntax error.
October 23, 2023, 12:42 (UTC): The changes were gradually applied, and the first occurrence of a 500 error was observed.
October 23, 2023, 12:51 (UTC): The initial alert regarding data centre unavailability was received, initiating the investigation.
October 23, 2023, 13:05 (UTC): The incident was published on the Status Page, and a war room was established to coordinate the resolution.
October 23, 2023, 13:06 (UTC): The root cause of the incident was identified.
October 23, 2023, 13:07 (UTC): A bug fix was implemented, commencing the restoration process.
October 23, 2023, 13:10 (UTC): 75% of the edges were successfully restored.
October 23, 2023, 13:57 (UTC): The last occurrence of a 500 error was observed, marking the completion of the incident.

Root Cause

The root cause of the outage was an invalid system configuration file on the edges, caused by a syntax error. As a result, the Nginx configuration could not be loaded, which led to 500 response codes. The issue was compounded by a violation of deployment policies: a global change was committed without undergoing extensive testing.
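To illustrate how a global change can be kept away from the full fleet until it has proven healthy, here is a minimal sketch of a staged rollout gate. It is not a description of Gcore's tooling: the function `staged_rollout`, the callbacks `apply_change` and `error_rate`, the stage sizes, and the 5% threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class RolloutResult:
    updated: List[str] = field(default_factory=list)  # edges that received the change
    halted: bool = False                              # True if a stage failed its health check


def staged_rollout(
    edges: Sequence[str],
    apply_change: Callable[[str], None],      # hypothetical: pushes the change to one edge
    error_rate: Callable[[str], float],       # hypothetical: share of 5xx responses on one edge
    stage_percents: Sequence[int] = (1, 10, 50, 100),  # assumed stage sizes
    max_error_rate: float = 0.05,                      # assumed health threshold
) -> RolloutResult:
    """Apply a change stage by stage and stop at the first unhealthy stage."""
    result = RolloutResult()
    for percent in stage_percents:
        # Cumulative number of edges that should carry the change after this stage.
        target = min(len(edges), max(1, len(edges) * percent // 100))
        for edge in edges[len(result.updated):target]:
            apply_change(edge)
            result.updated.append(edge)
        # Stop before widening the rollout if any updated edge looks unhealthy.
        if any(error_rate(edge) > max_error_rate for edge in result.updated):
            result.halted = True
            break
    return result
```

With a change that breaks every edge it touches, this sketch stops after the first stage, so only the smallest batch is affected. It does not roll the change back; that step would have to be added separately.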

Impact

The outage resulted in all CDN edges returning 500 response codes, affecting the majority of CDN resources for 28 minutes (12:42 to 13:10 UTC). After the fix was introduced, 25% of the edges continued to return 500 response codes for approximately 47 more minutes, until 13:57 UTC.
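The durations can be cross-checked against the timeline. The short sketch below uses only the timestamps published above (first 500 error, 75% of edges restored, last 500 error); the variable names are illustrative. Because the timeline is given to the minute, the results are approximate.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Timestamps taken from the timeline (all UTC).
first_error = datetime.strptime("2023-10-23 12:42", FMT)
mostly_restored = datetime.strptime("2023-10-23 13:10", FMT)  # 75% of edges restored
last_error = datetime.strptime("2023-10-23 13:57", FMT)


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed between two timestamps."""
    return int((end - start).total_seconds() // 60)


full_outage = minutes_between(first_error, mostly_restored)    # 28
partial_outage = minutes_between(mostly_restored, last_error)  # 47
total_outage = minutes_between(first_error, last_error)        # 75
```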

Action Items

Reinforce the merging policy: Prohibit the merging of changes to the master branch without the approval of the lead engineer. This action item has already been implemented.
Strengthen the merging policy: Modify the merge flow of the configuration management tool so that all changes must pass through a pre-production environment and canary branches before reaching production.
Enhance validation for the system configuration file to identify configuration errors before they impact the production environment.
Rebuild the system configuration file reading and caching process to return the last working version when errors occur.
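The last two action items (validating the file and falling back to the last working version) can be sketched together. This is a minimal illustration, not the actual implementation: the class `LastKnownGoodConfig` is hypothetical, and JSON stands in for the real configuration format, which the report does not specify.

```python
import json
from typing import Any, Callable, Optional


class LastKnownGoodConfig:
    """Keeps the most recent configuration that parsed successfully."""

    def __init__(self, parse: Callable[[str], Any] = json.loads) -> None:
        # `parse` must raise ValueError on invalid input (json.loads does,
        # since json.JSONDecodeError is a subclass of ValueError).
        self._parse = parse
        self._last_good: Optional[Any] = None
        self.last_error: Optional[Exception] = None

    def load(self, raw_text: str) -> Any:
        """Return the parsed config, or the last working one if parsing fails."""
        try:
            candidate = self._parse(raw_text)
        except ValueError as exc:
            # Record the failure so it can be alerted on, then fall back.
            self.last_error = exc
            if self._last_good is None:
                raise  # nothing to fall back to yet
            return self._last_good
        self._last_good = candidate
        self.last_error = None
        return candidate
```

In this sketch a syntax error in a new version leaves the service running on the previous version, while `last_error` keeps the failure visible so that the broken file is noticed and not silently ignored.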

Finally, we want to apologize for the impact this event caused you. We know how critical these services are to you, your customers and your business. We will do everything we can to learn from this event and use it to improve our availability further.

Posted Oct 23, 2023 - 19:27 UTC

Resolved

We'd like to inform you that the issue has been resolved, and we are closely monitoring the performance to ensure there are no further disruptions. We will provide a Root Cause Analysis (RCA) report in the coming days to help you understand what caused the incident and the steps we have taken to prevent it from happening again in the future.

We apologize for any inconvenience this may have caused you, and want to thank you for your patience and understanding throughout this process.

Posted Oct 23, 2023 - 13:26 UTC

Monitoring

We are pleased to inform you that our Engineering team has implemented a fix to resolve the issue. However, we are still closely monitoring the situation to ensure stable performance. All services that were utilising CDN delivery were affected (Website, CDN as a service, and all APIs).

We will provide you with an update as soon as we have confirmed that the issue has been completely resolved.

Posted Oct 23, 2023 - 13:23 UTC

Update

We are continuing to investigate this issue.

Posted Oct 23, 2023 - 13:22 UTC

Investigating

We are experiencing a degradation in performance, which may result in the unavailability of CDN Content Delivery and of the Admin and Client Control Panels. We apologize for any inconvenience this may cause and appreciate your patience and understanding during this time.

We will provide you with an update as soon as we have more information on the progress of the resolution. Thank you for your understanding and cooperation.

Posted Oct 23, 2023 - 13:08 UTC

This incident affected: Gcore Systems (Website, Customer Portal, Authorization, API, Reseller Portal), Managed DNS (DNS API), Cloud | Systems (API), Web Security (API), Streaming (API), Object Storage (API), DDoS Protection (API), and CDN (API, Content Delivery).