Issue
On October 23, 2023, we encountered an outage in our Content Delivery Network (CDN) that began at 12:42 UTC and persisted until 13:57 UTC. This outage was triggered by a configuration error, resulting in all CDN edges returning 500 errors and significantly impacting the delivery of CDN resources.
Timeline
Root Cause
The root cause of the outage was an invalid system configuration file on the edges due to a syntax error. Furthermore, the Nginx configuration could not be loaded which led to a 500 response code. This issue was compounded by a violation of deployment policies, particularly the global change that had been committed without undergoing extensive testing.
Impact
The outage resulted in all CDN edges returning 500 response codes, affecting the majority of CDN resources for a duration of 28 minutes. After the fix was introduced, 25% of the edges continued to return 500 response codes for an additional 45 minutes.
Action Items
Finally, we want to apologize for the impact this event caused you. We know how critical these services are to you, your customers and your business. We will do everything we can to learn from this event and use it to improve our availability further.