Gcore Systems | Billing System incident details
Incident Report for Gcore
Postmortem

Incident Summary:
On November 18th, 2024, a planned tariff change caused significant miscalculations in client expenses for the Free and Start tariffs, leading to service pauses and unscheduled charges for several clients.

Timeline:
2024-11-18
13:18-13:29 UTC: The Start and Free plans were updated. A key change replaced the "10,000 PC Unit" metric in the tariff plan with "CDN - Number of requests (per 10,000 requests)" and changed the default value from 1,000,000,000 units to 100,000 TTS.
13:29 UTC: The daily billing feature began generating expenses 10,000 times higher than expected due to a miscalculation. Payments for these amounts were attempted in error, and services were paused when the charges failed.
13:37 UTC: Customer support was contacted, prompting an immediate investigation.
14:50 UTC: The problematic tariff changes were reverted.
15:00 UTC: Incorrectly generated expenses were deleted.
15:15 UTC: Addendums were reactivated.
15:30 UTC: The Platform was re-activated and refunds for large payments were processed.
13:40-16:00 UTC: Additional payment refunds continued.
15:30-16:40 UTC: Fixes were applied to some clients' expenses.

Root Cause:
The incident was caused by expenses being calculated from values expressed in different units. The billing system relies on data from two database tables: statistics (consumed value) and plan items (price, unit_size, default_value). When plan items were manually changed from PC to TTS units without creating a new plan, consumed values that had already been pre-calculated in PC units were combined with TTS plan items, leading to incorrect expense calculations. Expense calculation and statistics collection run independently of each other, which allowed the inconsistent units to go undetected.
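The unit mismatch described above can be illustrated with a minimal sketch. It is not the actual billing code: the `Consumption` type, the function names, the price, and the reading of "PC" as a raw count and "TTS" as a unit of 10,000 requests are all assumptions made for illustration. Only the figures 10,000, 1,000,000,000 and 100,000 come from the report.

```python
from dataclasses import dataclass
from decimal import Decimal

REQUESTS_PER_TTS = 10_000  # from the report: one unit covers 10,000 requests


@dataclass(frozen=True)
class Consumption:
    """A consumed value together with the unit it was measured in (hypothetical type)."""
    value: int
    unit: str  # assumed labels: "PC" (raw count) or "TTS" (per 10,000 requests)


def naive_expense(consumed: Consumption, price_per_unit: Decimal) -> Decimal:
    """Multiplies value by price and ignores the unit (the failure mode)."""
    return consumed.value * price_per_unit


def checked_expense(consumed: Consumption, plan_unit: str,
                    price_per_unit: Decimal) -> Decimal:
    """Refuses to bill when the statistic and the plan item disagree on units."""
    if consumed.unit != plan_unit:
        raise ValueError(
            f"unit mismatch: statistics in {consumed.unit}, plan item in {plan_unit}"
        )
    return consumed.value * price_per_unit


def pc_to_tts(consumed: Consumption) -> Consumption:
    """Converts a raw count into units of 10,000 requests."""
    if consumed.unit != "PC":
        raise ValueError("expected a PC value")
    return Consumption(consumed.value // REQUESTS_PER_TTS, "TTS")


# Under the assumption above, the old and new defaults describe the same volume.
assert 1_000_000_000 // REQUESTS_PER_TTS == 100_000

price = Decimal("0.01")                  # made-up price per TTS unit
raw = Consumption(250_000, "PC")         # pre-calculated in the old unit
expected = checked_expense(pc_to_tts(raw), "TTS", price)
inflated = naive_expense(raw, price)     # PC value billed as if it were TTS
ratio = inflated / expected              # factor by which the charge is inflated
```

Under these assumptions the inflated charge is 10,000 times the expected one, which matches the factor reported in the timeline. Carrying the unit alongside the value lets the calculation fail loudly instead of producing a plausible-looking but wrong number.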

Impact:

  • Affected Customers/Resources: > 100 accounts in total;
  • The majority were Free tariff accounts;
  • Service was blocked for approximately 2 hours;
  • Incorrect charges were applied to several clients.

Steps Taken for Resolution:

  1. Immediate reversion of the tariff changes.
  2. Deletion of the incorrectly generated expenses.
  3. Reactivation of addendums and the affected platform services.
  4. Processing of refunds, prioritizing larger payments.
  5. Application of fixes to correct some clients' expenses.

Preventive Measures:
We have reviewed our processes to ensure that similar issues do not occur in the future. This includes evaluating and enhancing our billing and tariff update procedures, improving monitoring and alert systems, and ensuring thorough testing before deploying changes. This incident has underscored the importance of rigor in our operational workflows, and we are committed to applying the lessons learned to enhance the reliability of our services.
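As one illustration of what an improved alert could look like, the sketch below flags a newly generated expense that is far above an account's recent history before any payment is attempted. The function name, the threshold, and the use of a median baseline are assumptions, not a description of the monitoring that was actually deployed.

```python
from statistics import median


def is_suspicious(new_expense: float, recent_expenses: list[float],
                  max_ratio: float = 10.0) -> bool:
    """Flags an expense more than max_ratio times the recent median."""
    if not recent_expenses:
        return False  # no history to compare against
    baseline = median(recent_expenses)
    if baseline == 0:
        # An account that has paid nothing recently: any charge is worth a look.
        return new_expense > 0
    return new_expense / baseline > max_ratio


history = [0.25, 0.30, 0.20]               # made-up daily expenses
normal = is_suspicious(0.27, history)      # within the usual range
spike = is_suspicious(2500.0, history)     # a 10,000x jump
```

A check of this kind would hold flagged expenses for review rather than charge them, at the cost of occasional false positives when usage genuinely grows.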

We sincerely apologize for the disruption this incident has caused and appreciate your understanding as we work to prevent future occurrences.

Posted Dec 03, 2024 - 10:16 UTC

Resolved
We'd like to inform you that the issue has been resolved, and we are closely monitoring the performance to ensure there are no further disruptions. We will provide a Root Cause Analysis (RCA) report in the coming days to help you understand what caused the incident and the steps we have taken to prevent it from happening again in the future.

We apologize for any inconvenience this may have caused you, and want to thank you for your patience and understanding throughout this process.
Posted Nov 18, 2024 - 17:52 UTC
Monitoring
We are pleased to inform you that our engineering team has implemented a fix for an issue that caused incorrect charges for the CDN service in billing. However, we are still closely monitoring the situation to ensure stable performance.

We will provide you with an update as soon as we have confirmed that the issue has been completely resolved.
Posted Nov 18, 2024 - 16:00 UTC
Identified
We'd like to let you know that our engineering team has successfully located the issue that is affecting our billing system.

Some of you may have observed incorrect charges on your accounts. We will refund the amounts that were mistakenly charged; how soon the funds arrive depends on your bank's terms.

Thank you for your understanding.
Posted Nov 18, 2024 - 15:15 UTC
Investigating
We are currently experiencing a degradation in performance, which may result in service unavailability. We apologize for any inconvenience this may cause and appreciate your patience and understanding during this time.

We will provide you with an update as soon as we have more information on the progress of the resolution. Thank you for your understanding and cooperation.
Posted Nov 18, 2024 - 14:49 UTC
This incident affected: Gcore Systems (Billing System).