CDN | DNS incident details
Incident Report for Gcore
Postmortem

issue

Gcore authoritative DNS was impaired for ~33 minutes

timeline

14.11.2022 19:53 (UTC) authoritative resolution failed on a server in Turkey

14.11.2022 19:58 (UTC) the first monitoring alert about the DNS issue was received

14.11.2022 20:02 (UTC) the on-call engineer started work on the incident together with the DNS core team

14.11.2022 20:18 (UTC) the team identified the cause

14.11.2022 20:26 (UTC) workaround applied: all DNS servers were switched to the backup DNS zone propagation scheme

15.11.2022 13:59 (UTC) cache consistency was restored and we switched back to the primary delivery scheme

root-cause

  • a customer filed a ticket reporting that uppercase records were not working

    • we decided to fix it by storing the records in lowercase

      • we use Redis as the change propagation system, and Redis keys are case-sensitive. At this point we duplicated all the keys (one uppercase and one lowercase), then purged the uppercase duplicates, which also triggered invalidation of the DNS service cache.

        • the DNS server cache, however, was managed in a case-insensitive way, so the purge removed both the uppercase and the lowercase keys.

          • that led to failed DNS resolution (NOANSWER) until we stopped the Redis cluster in order to fall back to the backup delivery scheme (slow but stable)
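
The chain of events above can be illustrated with a toy model. This is a minimal sketch, not the actual Gcore code: a plain dict stands in for the case-sensitive Redis keyspace, and the hypothetical CaseInsensitiveCache class stands in for the DNS server cache. All names, keys, and addresses are assumptions made for illustration.

```python
class CaseInsensitiveCache:
    """Toy stand-in for the DNS server cache: keys are folded to lowercase."""

    def __init__(self):
        self._entries = {}

    def put(self, name, rdata):
        self._entries[name.lower()] = rdata

    def invalidate(self, name):
        # Folding the key means "WWW.Example.com" and "www.example.com"
        # refer to the same cache entry.
        self._entries.pop(name.lower(), None)

    def resolve(self, name):
        # None models an empty answer (NOANSWER in the report).
        return self._entries.get(name.lower())


def migrate_to_lowercase(store, cache):
    """Duplicate mixed-case keys as lowercase, then purge the originals."""
    mixed = [key for key in store if key != key.lower()]
    for key in mixed:
        store[key.lower()] = store[key]      # lowercase duplicate
        cache.put(key.lower(), store[key])
    for key in mixed:
        del store[key]                       # purge the uppercase duplicate
        cache.invalidate(key)                # also drops the lowercase entry


# Case-sensitive store (stand-in for Redis) with one mixed-case record.
store = {"WWW.Example.com": "192.0.2.10", "api.example.com": "192.0.2.20"}
cache = CaseInsensitiveCache()
for name, rdata in store.items():
    cache.put(name, rdata)

migrate_to_lowercase(store, cache)
```

After the migration in this sketch, the store still holds the lowercase key, but the cache no longer answers for that name because the purge of the uppercase key invalidated the shared, case-folded cache entry.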

impact

  • Up to 25% of requests to Gcore DNSaaS failed for ~33 minutes

action items

  • to improve ops runbooks so that we can switch to the backup data replication scheme faster
  • to introduce an additional code linter rule for letter case: all RRSET names must be in lowercase, and all record types must be in uppercase
  • to add a monitoring system alert for RRSET count difference between in-memory cache, edge database, and master database
  • to reduce DNS service restart time on the edges to minimize traffic loss
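
The linter rule and the RRSET count alert from the list above could look roughly like the following. This is a hedged sketch only: the function names lint_rrset and rrset_count_drift, the sample records, the counts, and the alert threshold are all hypothetical and are not taken from the Gcore codebase or monitoring setup.

```python
def lint_rrset(name, rtype):
    """Return a list of case-convention violations for one RRSET."""
    problems = []
    if name != name.lower():
        problems.append(f"RRSET name must be lowercase: {name!r}")
    if rtype != rtype.upper():
        problems.append(f"record type must be uppercase: {rtype!r}")
    return problems


def rrset_count_drift(counts):
    """Return the spread between the largest and smallest RRSET count."""
    return max(counts.values()) - min(counts.values())


# Hypothetical sample data: one valid record and one that breaks both rules.
records = [("www.example.com", "A"), ("Mail.example.com", "mx")]
violations = [p for name, rtype in records for p in lint_rrset(name, rtype)]

# Hypothetical RRSET counts reported by the three data sources.
counts = {"in_memory_cache": 750, "edge_database": 1000, "master_database": 1000}
drift = rrset_count_drift(counts)

ALERT_THRESHOLD = 0  # assumption: any difference between sources raises an alert
should_alert = drift > ALERT_THRESHOLD
```

In practice a real alert would probably tolerate a short propagation delay before firing; the zero threshold here only keeps the example simple.
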
Posted Nov 22, 2022 - 21:53 UTC

Resolved
The issue has been resolved and we are monitoring performance closely. We will provide an RCA report in the coming days.
We are very sorry for the inconvenience caused by the incident.
Thank you for bearing with us!
Posted Nov 14, 2022 - 20:48 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
The incident will be updated as soon as the issue is completely resolved.
Posted Nov 14, 2022 - 20:29 UTC
Investigating
We are currently experiencing performance degradation.
During this time, the service may be unavailable for users.
We apologize for any inconvenience this may cause. We will share an update once we have more information.
Posted Nov 14, 2022 - 20:06 UTC
This incident affected: CDN (DNS).