Issue
On October 23, 2023, we encountered an outage in our Content Delivery Network (CDN) that began at 12:42 UTC and persisted until 13:57 UTC. This outage was triggered by a configuration error, resulting in all CDN edges returning 500 errors and significantly impacting the delivery of CDN resources.
Timeline
- October 23, 2023, 12:32 (UTC): A change was made to a system configuration file, which was later discovered to contain a syntax error.
- October 23, 2023, 12:42 (UTC): The changes were gradually applied, and the first occurrence of a 500 error was observed.
- October 23, 2023, 12:51 (UTC): The initial alert regarding data centre unavailability was received, initiating the investigation.
- October 23, 2023, 13:05 (UTC): The incident was published on the Status Page, and a war room was established to coordinate the resolution.
- October 23, 2023, 13:06 (UTC): The root cause of the incident was identified.
- October 23, 2023, 13:07 (UTC): A bug fix was implemented, commencing the restoration process.
- October 23, 2023, 13:10 (UTC): 75% of the edges were successfully restored.
- October 23, 2023, 13:57 (UTC): The last occurrence of a 500 error was observed, marking the completion of the incident.
Root Cause
The root cause of the outage was an invalid system configuration file on the edges due to a syntax error. Furthermore, the Nginx configuration could not be loaded which led to a 500 response code. This issue was compounded by a violation of deployment policies, particularly the global change that had been committed without undergoing extensive testing.
Impact
The outage resulted in all CDN edges returning 500 response codes, affecting the majority of CDN resources for a duration of 28 minutes. After the fix was introduced, 25% of the edges continued to return 500 response codes for an additional 45 minutes.
Action Items
- Reinforce the merging policy: Prohibit the merging of changes to the master branch without the approval of the lead engineer. This action item has already been implemented.
- Strengthen the merging policy: Modify the configuration management tool merge flow so that all changes must pass through a pre-production environment and canary branches before reaching production.
- Enhance validation for the system configuration file to identify configuration errors before they impact the production environment.
- Rebuild the system configuration file reading and caching process to return the last working version when errors occur.
Finally, we want to apologize for the impact this event caused you. We know how critical these services are to you, your customers and your business. We will do everything we can to learn from this event and use it to improve our availability further.