We apologize for any issues this degradation has caused you or your customers. As detailed below, we have performed a thorough analysis of the incident and are already working on action items to provide you with the most reliable service in the industry.
Issue
Our clients may have experienced HTTP 500 errors when requesting non-cached content for approximately 24 minutes.
Impact
~3.3% traffic drop on Nov 30 between 13:57:00 and 14:21:26 UTC due to issues with serving non-cached content.
Timeline
Nov 29 12:45:39 2021 UTC: new LUA code deployed on one host, no issues spotted
error "attempt to call method 'gcore_new' (a nil value)" at 12:48:51, 12:50:05, 12:50:08
Nov 29 14:03:03 2021 UTC: new LUA code deployed on larger canary env ab-test (13 hosts), no issues detected
Nov 30 13:56:16 2021 UTC: new LUA code deployed on all CDN edge nodes
Nov 30 14:11:00 2021 UTC: changes were rolled back
Nov 30 14:26:30 2021 UTC: count of HTTP 500 responses fell to baseline numbers
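The intervals implied by these timestamps can be checked with a short Python sketch. It uses only the times quoted in this report; the variable names and the helper fmt_duration are introduced here for illustration.

```python
from datetime import datetime, timedelta

# Timestamps copied from the Impact and Timeline sections (all UTC).
deployed_everywhere = datetime(2021, 11, 30, 13, 56, 16)
rolled_back = datetime(2021, 11, 30, 14, 11, 0)
back_to_baseline = datetime(2021, 11, 30, 14, 26, 30)
impact_start = datetime(2021, 11, 30, 13, 57, 0)
impact_end = datetime(2021, 11, 30, 14, 21, 26)


def fmt_duration(delta: timedelta) -> str:
    """Render a timedelta as minutes and seconds."""
    minutes, seconds = divmod(int(delta.total_seconds()), 60)
    return f"{minutes}m{seconds:02d}s"


print("deploy -> rollback:  ", fmt_duration(rolled_back - deployed_everywhere))       # 14m44s
print("rollback -> baseline:", fmt_duration(back_to_baseline - rolled_back))          # 15m30s
print("deploy -> baseline:  ", fmt_duration(back_to_baseline - deployed_everywhere))  # 30m14s
print("impact window:       ", fmt_duration(impact_end - impact_start))               # 24m26s
```

The impact window of 24m26s matches the "approximately 24 minutes" stated in the Issue section.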
Root cause
1. CDN edge nodes running cache servers started responding with HTTP 500 for uncached content because they could not fetch the content from the origin
a race condition likely arose during deployment of the new code version
some cache server workers did not have the new LUA code loaded, because the code on disk was changed in the window between the last reload and the first request served
2. We did not identify the issue during the canary release because the engineer judged the error rate to be insignificant when compared with the global error rate graph
3. Lack of backward compatibility for internal code
4. During pre-prod and then canary releases, no LUA code warning was spotted
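Root causes 1 and 3 can be illustrated with a minimal Python analogue. It is a sketch under stated assumptions, not the actual LUA code. Only the method name gcore_new comes from the error message in the timeline. The classes OldCacheModule and NewCacheModule, the method fetch_legacy and both caller functions are hypothetical. The old class stands in for a worker that still holds the previous code version in memory. The unguarded caller fails in the same way as "attempt to call method 'gcore_new' (a nil value)", while the guarded caller stays backward compatible.

```python
class OldCacheModule:
    """Hypothetical: the code version still loaded in a worker's memory."""

    def fetch_legacy(self, url: str) -> str:
        return f"legacy path: {url}"


class NewCacheModule(OldCacheModule):
    """Hypothetical: the new code version, which adds gcore_new."""

    def gcore_new(self, url: str) -> str:
        return f"new path: {url}"


def fetch_unguarded(module, url: str) -> str:
    # Assumes every worker already runs the new version.
    # On an old module this raises AttributeError.
    return module.gcore_new(url)


def fetch_guarded(module, url: str) -> str:
    # Backward-compatible call: use the new method only if it is present,
    # otherwise fall back to the path the old version supports.
    handler = getattr(module, "gcore_new", None)
    if callable(handler):
        return handler(url)
    return module.fetch_legacy(url)


stale_worker = OldCacheModule()
try:
    fetch_unguarded(stale_worker, "/video.mp4")
except AttributeError as exc:
    print("unguarded call failed:", exc)

print(fetch_guarded(stale_worker, "/video.mp4"))      # legacy path: /video.mp4
print(fetch_guarded(NewCacheModule(), "/video.mp4"))  # new path: /video.mp4
```

With a guard of this kind, a worker that has not yet picked up the new version keeps serving requests through the old path instead of returning an error.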
Action points
We will implement some new measures and quality gates, as well as improve current solutions.
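One possible shape for such a quality gate is sketched below in Python. It illustrates root cause 2 and is not a description of the measures actually being implemented. The names Window and canary_gate, the threshold value and all sample figures are assumptions made for this example. The idea is to compare the canary hosts with their own pre-deploy baseline instead of the global error rate graph, and to treat any previously unseen error message as a blocker regardless of volume.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Window:
    """Hypothetical metrics for the canary hosts over one observation window."""
    requests: int
    errors_5xx: int
    new_error_signatures: int  # error messages not seen before the deploy

    @property
    def error_rate(self) -> float:
        return self.errors_5xx / self.requests if self.requests else 0.0


def canary_gate(before: Window, after: Window,
                max_rate_increase: float = 0.001) -> Tuple[bool, str]:
    """Return (passed, reason). The threshold is an illustrative assumption."""
    # A brand-new error message blocks the rollout even if it is rare.
    if after.new_error_signatures > 0:
        return False, "new error signature on canary hosts"
    # Compare canary hosts against themselves, not against global traffic.
    if after.error_rate - before.error_rate > max_rate_increase:
        return False, "error rate rose on canary hosts"
    return True, "ok"


# Invented figures: a handful of new errors is invisible on a global graph
# but still fails the gate.
before = Window(requests=100_000, errors_5xx=50, new_error_signatures=0)
after = Window(requests=100_000, errors_5xx=53, new_error_signatures=3)
print(canary_gate(before, after))  # (False, 'new error signature on canary hosts')
```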