Timeline:
~10:53 UTC 17/10/23 — While changing the CDN resource rendering rules for a client, an engineer accidentally used the wrong parameter in the corresponding API call.
~11:03 UTC 17/10/23 — Traffic dropped for the majority of CDN resources, including internal ones such as the CDN API and the website.
~11:08 UTC 17/10/23 — The alert about https://gcore.com unavailability was escalated internally.
~11:11 UTC 17/10/23 — Investigation started.
~11:30 UTC 17/10/23 — Because api.gcore.com was unavailable, the internal API had to be reached directly through the ingress resource in the Kubernetes cluster (see the sketch after the timeline). The internal configuration tool was restarted on all servers.
~11:34 UTC 17/10/23 — The internal protection service blocked CDN edge IPs in multiple DCs.
~11:37 UTC 17/10/23 — Traffic started recovering. Both nginx and nginx-cache were restarted on the entire CDN. Other artefacts were identified, such as cache loss and an enormous number of requests blocked by the internal protection service, which slowed down full restoration.
~12:10 UTC 17/10/23 — CDN edge IPs were unblocked, and the incident was resolved.
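As a minimal sketch of the recovery path taken at ~11:30 — reaching the internal API through the Kubernetes ingress resource while api.gcore.com is down — the following uses the official Kubernetes Python client to read the ingress and list its backend services. The namespace and ingress names are assumptions for illustration, not values from this report.

```python
# Minimal sketch: inspect the ingress resource directly in the Kubernetes
# cluster when the public API endpoint (api.gcore.com) is unreachable.
# The namespace and ingress names below are assumptions, used only as an example.
from kubernetes import client, config

def find_internal_api_backends(namespace="cdn-api", ingress_name="api"):
    # Load credentials from the local kubeconfig (operator workstation or bastion).
    config.load_kube_config()
    networking = client.NetworkingV1Api()

    ingress = networking.read_namespaced_ingress(ingress_name, namespace)
    backends = []
    for rule in ingress.spec.rules or []:
        for path in (rule.http.paths if rule.http else []):
            svc = path.backend.service
            backends.append((rule.host, path.path, svc.name, svc.port.number))
    return backends

if __name__ == "__main__":
    for host, path, svc, port in find_internal_api_backends():
        print(f"{host}{path} -> service {svc}:{port}")
```

With the backend service and port identified, the internal API can then be reached from inside the cluster or via port-forwarding, bypassing the unavailable public endpoint.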
Root cause:
- All CDN resources, except those included in a new rendering group, were removed from all CDN edges due to the misconfiguration of a single parameter.
- It was technically possible to configure all CDN edges to render only a limited set of CDN resources.
- Alerts for the traffic drop were based on a new raw log pipeline that was still in testing mode. The alerts didn't trigger because the raw log delivery pipeline was broken.
- A CDN resource used for monitoring in external monitoring systems was part of the new rendering group, so it was still delivered to all edges and the external monitoring system didn't trigger.
- We couldn't easily determine the previous state of the API endpoint that triggered the incident, so the action causing it was not easily revertible.
- During the CDN restoration, the internal protection system blocked the IPs of shard members (edges) in some DCs, which led to the unavailability of some DCs and a partial cache drop.
- It took a long time to discover that some shards were unavailable because there were no shard unavailability alerts (see the sketch after this list).
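The missing shard availability signal can be approximated by a simple probe of every shard member. The sketch below is illustrative only; the shard layout, health-check path, and thresholds are assumptions rather than details from this report.

```python
# Sketch of the kind of shard availability check that was missing: probe each
# shard member and alert when a large share of a shard is down. Host lists,
# the health-check path, and the threshold are hypothetical placeholders.
import requests

SHARDS = {
    "dc-1-shard-a": ["10.0.1.10", "10.0.1.11"],
    "dc-2-shard-b": ["10.0.2.10", "10.0.2.11"],
}
HEALTH_PATH = "/healthz"      # hypothetical health endpoint on each edge
UNAVAILABLE_RATIO = 0.5       # alert if half the shard or more is down

def edge_is_up(ip, timeout=2.0):
    try:
        return requests.get(f"http://{ip}{HEALTH_PATH}", timeout=timeout).ok
    except requests.RequestException:
        return False

def check_shards():
    alerts = []
    for shard, members in SHARDS.items():
        down = [ip for ip in members if not edge_is_up(ip)]
        if len(down) / len(members) >= UNAVAILABLE_RATIO:
            alerts.append(f"shard {shard} unavailable: {down}")
    return alerts

if __name__ == "__main__":
    for alert in check_shards():
        print(alert)  # in production this would page on-call, not print
```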
Impact:
Unavailability of almost the entire CDN from ~11:03 UTC 17/10/23 to ~11:37 UTC 17/10/23, with traffic stability issues until ~12:10 UTC 17/10/23.
Action items:
- Use a pre-production environment for changes to CDN resource rendering options.
- Introduce dynamic CDN edge whitelisting for the internal protection system (see the sketch after this list).
- Change to a better data source for traffic monitoring alerts.
- Create additional CDN resources for monitoring with different rendering groups.
- Update the disaster recovery journey and introduce it to all team members.
- Configure a shard unavailability alert.
- Reconsider the architecture for the rendering options: they should be configured on the edge node itself in a declarative (GitOps) manner.
- New policy: infrastructure changes must be tested in the pre-production environment.
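One possible shape for the dynamic edge whitelisting action item is a periodic job that pushes the current edge IP list into the protection service, so edges are never treated as attacking clients during mass restarts. Both endpoints in the sketch below are hypothetical placeholders; the report does not describe the real inventory or protection-service APIs.

```python
# Sketch of dynamic CDN edge whitelisting for the internal protection service.
# The inventory source and the protection-service endpoint are hypothetical.
import requests

INVENTORY_URL = "https://inventory.internal/api/edges"                  # hypothetical
PROTECTION_WHITELIST_URL = "https://protection.internal/api/whitelist"  # hypothetical

def fetch_edge_ips():
    # Pull the current list of CDN edge IPs from the infrastructure inventory.
    resp = requests.get(INVENTORY_URL, timeout=10)
    resp.raise_for_status()
    return sorted({edge["ip"] for edge in resp.json()})

def sync_whitelist():
    # Replace the protection-service whitelist with the current edge IPs.
    ips = fetch_edge_ips()
    resp = requests.put(PROTECTION_WHITELIST_URL, json={"ips": ips}, timeout=10)
    resp.raise_for_status()
    return len(ips)

if __name__ == "__main__":
    print(f"whitelisted {sync_whitelist()} edge IPs")
```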