Postmortem: Incident on 24.07.2024
We would like to extend our sincere apologies for the recent service disruption that you may have experienced. Below, you will find a detailed explanation of the incident, its causes, and the steps we are taking to prevent future occurrences.
Incident Details:
– Downtime: 09:58:30 – 10:23:00 UTC (24.5 minutes)
– Affected Services: Live streaming services, API
Root Cause: A metadata conflict in the database cluster, triggered during a database migration operation, took multiple cluster nodes offline.
Timeline:
– 09:56:00 UTC: Deployment to production began. The deployment included a new service version with a database migration.
– 09:58:22 UTC: A metadata conflict in the MySQL replication protocol caused multiple nodes in the database cluster to go offline.
– 10:00:16 UTC: Alerts fired for 5xx errors from our API.
– 10:01:14 UTC: The team began investigating and found that many requests were being routed to unavailable database nodes. We determined that the cluster had to be bootstrapped safely to minimize data loss.
– 10:23:00 UTC: The cluster was restarted in single-node mode and became operational.
– 10:29:58 UTC: Health checks were performed and the team began verifying the live streams.
– 10:43:30 UTC: Most of the cluster nodes were restored to operational status.
– 10:53:30 UTC: The full database cluster was restored to operational status.
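For readers curious what "safely bootstrapping" the cluster involves: the timeline does not name our exact tooling, but for a Galera-style MySQL cluster (an assumption on our part; e.g. MariaDB Galera) the standard recovery procedure looks roughly like this sketch. The operator finds the node with the most advanced committed state and restarts the cluster from it, so no acknowledged writes are discarded:

```shell
# Sketch only -- assumes a Galera-based MySQL cluster (e.g. MariaDB Galera);
# our actual stack and paths may differ.

# 1. On each offline node, recover the last committed transaction position
#    (seqno) without joining the cluster; the position is printed to the
#    error log as "WSREP: Recovered position: <uuid>:<seqno>".
mysqld --wsrep-recover

# 2. On the node with the HIGHEST seqno, mark it safe to bootstrap:
sed -i 's/safe_to_bootstrap: 0/safe_to_bootstrap: 1/' /var/lib/mysql/grastate.dat

# 3. Start that node alone as a new single-node cluster:
galera_new_cluster

# 4. Start the remaining nodes normally; they rejoin via state transfer:
systemctl start mariadb
```

Bootstrapping from the wrong node (one with a lower seqno) is what causes data loss in this scenario, which is why the step is done deliberately rather than by blindly restarting everything.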
Actions Taken:
– Resolved the cluster outage by restarting the cluster in single-node mode.
– Restored all database cluster nodes to operational status.
Next Steps:
– Improve our migration mechanism and/or tune our database clustering configuration to mitigate similar issues in the future.
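One direction we are evaluating for safer migrations (a sketch, not a committed design, and again assuming a Galera-based cluster): running schema changes in Rolling Schema Upgrade (RSU) mode, which applies DDL to one node at a time instead of blocking the whole cluster in Total Order Isolation. The table and column names below are purely illustrative:

```shell
# Sketch only -- assumes a Galera-based MySQL cluster. RSU applies the DDL
# locally on each node in turn; the operator runs this per node while the
# node is temporarily desynced from the cluster.
mysql -e "
  SET SESSION wsrep_OSU_method = 'RSU';          -- per-session setting
  ALTER TABLE streams ADD COLUMN region VARCHAR(32);  -- hypothetical DDL
  SET SESSION wsrep_OSU_method = 'TOI';          -- restore the default
"
```

The trade-off is operational: RSU avoids cluster-wide blocking during the migration, but requires the schema change to be backward compatible while nodes briefly differ.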
We deeply regret any inconvenience this incident may have caused. Please rest assured that we are taking comprehensive measures to enhance the reliability of our services. Thank you for your understanding and continued support.
If you have any questions or require further information, please do not hesitate to contact our support team.