Postmortem: Incident on 24.07.2024
We would like to extend our sincere apologies for the recent service disruption that you may have experienced. Below, you will find a detailed explanation of the incident, its causes, and the steps we are taking to prevent future occurrences.
Incident Details:
– Downtime: 09:58:30 – 10:23:00 UTC (24.5 minutes)
– Affected Services: Live streaming services, API
Root Cause: A metadata conflict in the database cluster, triggered during a database migration operation, took multiple cluster nodes offline.
Timeline:
– 09:56:00 UTC: Deployment to production began. The deployment included a new service version with a database migration.
– 09:58:22 UTC: A metadata conflict in the MySQL replication protocol caused multiple nodes in the database cluster to go offline.
– 10:00:16 UTC: Alerts fired for 5xx errors from our API.
– 10:01:14 UTC: The team began investigating and found that many requests were being routed to unavailable database nodes. We determined that the cluster had to be bootstrapped safely to minimize data loss.
– 10:23:00 UTC: The cluster was restarted in single-node mode and became operational.
– 10:29:58 UTC: Health checks were performed and the team began verifying the live streams.
– 10:43:30 UTC: Most of the cluster nodes were restored to operational status.
– 10:53:30 UTC: The full database cluster was restored to operational status.
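For readers curious what "safely bootstrapping" the cluster involves: the timeline does not name our exact tooling, but for a Galera-style MySQL cluster (an assumption on our part; e.g. MariaDB Galera) the standard recovery procedure looks roughly like this sketch. The operator finds the node with the most advanced committed state and restarts the cluster from it, so no acknowledged writes are discarded:

```shell
# Sketch only -- assumes a Galera-based MySQL cluster (e.g. MariaDB Galera);
# our actual stack and paths may differ.

# 1. On each offline node, recover the last committed transaction position
#    (seqno) without joining the cluster; the position is printed to the
#    error log as "WSREP: Recovered position: <uuid>:<seqno>".
mysqld --wsrep-recover

# 2. On the node with the HIGHEST seqno, mark it safe to bootstrap:
sed -i 's/safe_to_bootstrap: 0/safe_to_bootstrap: 1/' /var/lib/mysql/grastate.dat

# 3. Start that node alone as a new single-node cluster:
galera_new_cluster

# 4. Start the remaining nodes normally; they rejoin via state transfer:
systemctl start mariadb
```

Bootstrapping from the wrong node (one with a lower seqno) is what causes data loss in this scenario, which is why the step is done deliberately rather than by blindly restarting everything.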
Actions Taken:
– Resolved the cluster outage by restarting the cluster in single-node mode.
– Restored all database cluster nodes to operational status.
Next Steps:
– Improve our migration mechanism and/or tune our database clustering configuration to mitigate similar issues in the future.
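One direction we are evaluating for safer migrations (a sketch, not a committed design, and again assuming a Galera-based cluster): running schema changes in Rolling Schema Upgrade (RSU) mode, which applies DDL to one node at a time instead of blocking the whole cluster in Total Order Isolation. The table and column names below are purely illustrative:

```shell
# Sketch only -- assumes a Galera-based MySQL cluster. RSU applies the DDL
# locally on each node in turn; the operator runs this per node while the
# node is temporarily desynced from the cluster.
mysql -e "
  SET SESSION wsrep_OSU_method = 'RSU';          -- per-session setting
  ALTER TABLE streams ADD COLUMN region VARCHAR(32);  -- hypothetical DDL
  SET SESSION wsrep_OSU_method = 'TOI';          -- restore the default
"
```

The trade-off is operational: RSU avoids cluster-wide blocking during the migration, but requires the schema change to be backward compatible while nodes briefly differ.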
We deeply regret any inconvenience this incident may have caused. Please rest assured that we are taking comprehensive measures to enhance the reliability of our services. Thank you for your understanding and continued support.
If you have any questions or require further information, please do not hesitate to contact our support team.