Timeline:
~10:53 UTC 17/10/23 — While changing the CDN resource rendering rules for a client, an engineer accidentally used the wrong parameter in the corresponding API call.
~11:03 UTC 17/10/23 — Traffic dropped for the majority of CDN resources, including internal ones such as the CDN API and the website.
~11:08 UTC 17/10/23 — The alert about https://gcore.com unavailability was escalated internally.
~11:11 UTC 17/10/23 — Investigation started.
~11:30 UTC 17/10/23 — Because api.gcore.com was unavailable, the internal API had to be reached directly through the ingress resource in the Kubernetes cluster (see the sketch after the timeline). The internal configuration tool was restarted on all servers.
~11:34 UTC 17/10/23 — The internal protection service blocked CDN edge IPs in multiple DCs.
~11:37 UTC 17/10/23 — Traffic started recovering. Both nginx and nginx-cache were restarted on the entire CDN. Other artefacts were identified, such as cache loss and an enormous number of requests blocked by the internal protection service, which slowed down full restoration.
~12:10 UTC 17/10/23 — CDN edge IPs were unblocked, and the incident was resolved.
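As a minimal sketch of the recovery path taken at ~11:30 — reaching the internal API through the Kubernetes ingress resource while api.gcore.com is down — the following uses the official Kubernetes Python client to read the ingress and list its backend services. The namespace and ingress names are assumptions for illustration, not values from this report.

```python
# Minimal sketch: inspect the ingress resource directly in the Kubernetes
# cluster when the public API endpoint (api.gcore.com) is unreachable.
# The namespace and ingress names below are assumptions, used only as an example.
from kubernetes import client, config

def find_internal_api_backends(namespace="cdn-api", ingress_name="api"):
    # Load credentials from the local kubeconfig (operator workstation or bastion).
    config.load_kube_config()
    networking = client.NetworkingV1Api()

    ingress = networking.read_namespaced_ingress(ingress_name, namespace)
    backends = []
    for rule in ingress.spec.rules or []:
        for path in (rule.http.paths if rule.http else []):
            svc = path.backend.service
            backends.append((rule.host, path.path, svc.name, svc.port.number))
    return backends

if __name__ == "__main__":
    for host, path, svc, port in find_internal_api_backends():
        print(f"{host}{path} -> service {svc}:{port}")
```

With the backend service and port identified, the internal API can then be reached from inside the cluster or via port-forwarding, bypassing the unavailable public endpoint.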
Root cause:
- All CDN resources, except those included in a new rendering group, were removed from all CDN edges due to the misconfiguration of a single parameter.
- It was technically possible to configure all CDN edges to render only a limited set of CDN resources.
- Alerts for the traffic drop were based on a new raw log pipeline that was still in testing mode. The alerts didn't trigger because the raw log delivery pipeline was broken.
- A CDN resource used for monitoring in external monitoring systems was part of the new rendering group, so it was still delivered to all edges and the external monitoring system didn't trigger.
- We couldn't easily determine the previous state of the API endpoint that triggered the incident, so the action causing it was not easily revertible.
- During the CDN restoration, the internal protection system blocked the IPs of shard members (edges) in some DCs, which led to the unavailability of some DCs and a partial cache drop.
- It took a long time to discover that some shards were unavailable because there were no shard unavailability alerts (see the sketch after this list).
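The missing shard availability signal can be approximated by a simple probe of every shard member. The sketch below is illustrative only; the shard layout, health-check path, and thresholds are assumptions rather than details from this report.

```python
# Sketch of the kind of shard availability check that was missing: probe each
# shard member and alert when a large share of a shard is down. Host lists,
# the health-check path, and the threshold are hypothetical placeholders.
import requests

SHARDS = {
    "dc-1-shard-a": ["10.0.1.10", "10.0.1.11"],
    "dc-2-shard-b": ["10.0.2.10", "10.0.2.11"],
}
HEALTH_PATH = "/healthz"      # hypothetical health endpoint on each edge
UNAVAILABLE_RATIO = 0.5       # alert if half the shard or more is down

def edge_is_up(ip, timeout=2.0):
    try:
        return requests.get(f"http://{ip}{HEALTH_PATH}", timeout=timeout).ok
    except requests.RequestException:
        return False

def check_shards():
    alerts = []
    for shard, members in SHARDS.items():
        down = [ip for ip in members if not edge_is_up(ip)]
        if len(down) / len(members) >= UNAVAILABLE_RATIO:
            alerts.append(f"shard {shard} unavailable: {down}")
    return alerts

if __name__ == "__main__":
    for alert in check_shards():
        print(alert)  # in production this would page on-call, not print
```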
Impact:
Unavailability of almost the entire CDN from ~11:03 UTC 17/10/23 to ~11:37 UTC 17/10/23, with traffic stability issues until ~12:10 UTC 17/10/23.
Action items:
- Use a pre-production environment for changes to CDN resource rendering options.
- Introduce dynamic CDN edge whitelisting for the internal protection system (see the sketch after this list).
- Change to a better data source for traffic monitoring alerts.
- Create additional CDN resources for monitoring with different rendering groups.
- Update the disaster recovery journey and introduce it to all team members.
- Configure a shard unavailability alert.
- Reconsider the architecture for the rendering options: they should be configured on the edge node itself in a declarative (GitOps) manner.
- New policy: infrastructure changes must be tested in the pre-production environment.
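One possible shape for the dynamic edge whitelisting action item is a periodic job that pushes the current edge IP list into the protection service, so edges are never treated as attacking clients during mass restarts. Both endpoints in the sketch below are hypothetical placeholders; the report does not describe the real inventory or protection-service APIs.

```python
# Sketch of dynamic CDN edge whitelisting for the internal protection service.
# The inventory source and the protection-service endpoint are hypothetical.
import requests

INVENTORY_URL = "https://inventory.internal/api/edges"                  # hypothetical
PROTECTION_WHITELIST_URL = "https://protection.internal/api/whitelist"  # hypothetical

def fetch_edge_ips():
    # Pull the current list of CDN edge IPs from the infrastructure inventory.
    resp = requests.get(INVENTORY_URL, timeout=10)
    resp.raise_for_status()
    return sorted({edge["ip"] for edge in resp.json()})

def sync_whitelist():
    # Replace the protection-service whitelist with the current edge IPs.
    ips = fetch_edge_ips()
    resp = requests.put(PROTECTION_WHITELIST_URL, json={"ips": ips}, timeout=10)
    resp.raise_for_status()
    return len(ips)

if __name__ == "__main__":
    print(f"whitelisted {sync_whitelist()} edge IPs")
```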