~10:53 UTC 17/10/23 — While changing CDN resource rendering rules for a client, an engineer accidentally passed a wrong parameter in the corresponding API call.
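For context only, here is a hypothetical sketch of the kind of API call involved; the endpoint path, payload fields, and parameter values below are illustrative assumptions, not the actual request that triggered the incident:

```python
import requests

API = "https://api.gcore.com"
TOKEN = "<api-token>"   # placeholder; real calls use a service token
RESOURCE_ID = 12345     # hypothetical ID of the client's CDN resource

# Hypothetical payload for updating a resource's rendering rules.
# A single wrong parameter value in a request like this was enough to
# change how traffic was served for most CDN resources.
payload = {
    "rules": [
        {"name": "client-rule", "rule": "/static/*", "originProtocol": "HTTPS"},
    ],
}

resp = requests.put(
    f"{API}/cdn/resources/{RESOURCE_ID}/rules",  # illustrative path, not the real endpoint
    json=payload,
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=10,
)
resp.raise_for_status()
```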
~11:03 UTC 17/10/23 — Traffic dropped for the majority of CDN resources, including internal ones such as the CDN API and the gcore.com website.
~11:08 UTC 17/10/23 — An alert about the unavailability of https://gcore.com was escalated internally.
~11:11 UTC 17/10/23 — Investigation started.
~11:30 UTC 17/10/23 — Because api.gcore.com was unavailable, the API had to be reached internally through the ingress resource in the Kubernetes cluster. The internal configuration tool was restarted on all servers.
~11:34 UTC 17/10/23 — The internal protection service blocked the IPs of CDN edges in multiple DCs.
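As an illustration of the ~11:30 step (reaching the API through the cluster ingress while the public hostname was down), here is a minimal sketch. The ingress address and Host-header routing shown are assumptions, not the exact procedure used during the incident:

```python
import requests

# Assumptions for illustration: the ingress controller is reachable at this
# internal address, and routing is done by the HTTP Host header.
INGRESS_IP = "10.0.0.10"
API_HOST = "api.gcore.com"

# Talk to the ingress directly, but keep the original Host header so the
# ingress routes the request to the API service as if it arrived via public DNS.
resp = requests.get(
    f"https://{INGRESS_IP}/",
    headers={"Host": API_HOST},
    verify=False,   # the certificate is issued for api.gcore.com, not the raw IP
    timeout=5,
)
print(resp.status_code)
```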
~11:37 UTC 17/10/23 — Traffic started recovering. Both nginx and nginx-cache were restarted across the entire CDN. Other artefacts were identified, such as cache loss and a large volume of requests blocked by the internal protection service, which slowed down full restoration.
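A restart of that scope is typically scripted. Below is a rough sketch, assuming a plain host inventory, SSH access to each edge, and systemd units named nginx and nginx-cache (all assumptions made for illustration):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Assumption: edge hosts are listed one per line in a plain inventory file.
with open("cdn_edges.txt") as f:
    edges = [line.strip() for line in f if line.strip()]

def restart(host: str) -> tuple[str, int]:
    # Restart both services in one SSH call; BatchMode avoids hanging on prompts.
    cmd = ["ssh", "-o", "BatchMode=yes", host,
           "sudo systemctl restart nginx nginx-cache"]
    return host, subprocess.run(cmd, capture_output=True).returncode

# Bounded parallelism so the whole CDN is not bounced at the same instant.
with ThreadPoolExecutor(max_workers=20) as pool:
    for host, code in pool.map(restart, edges):
        print(f"{host}: {'ok' if code == 0 else f'restart failed ({code})'}")
```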
~12:10 UTC 17/10/23 — The CDN edge IPs were unblocked, and the incident was resolved.
Unavailability of almost the entire CDN from ~11:03 UTC 17/10/23 to ~11:37 UTC 17/10/23, with traffic stability issues until ~12:10 UTC 17/10/23.