Incident Summary:
On November 18th, 2024, a planned tariff change caused significant miscalculations in client expenses for the Free and Start tariffs, leading to service pauses and unscheduled charges for several clients.
Timeline:
2024-11-18
13:18-13:29 UTC: The Start and Free plans were updated. The key change replaced the tariff plan metric "10,000 PC Unit" with "CDN - Number of requests (per 10,000 requests)" and changed the default value from 1,000,000,000 units to 100,000 TTS.
13:29 UTC: The daily billing feature began generating expenses 10,000 times higher than expected. Payments were attempted for these incorrect amounts, and services were paused when the charges failed.
13:37 UTC: Customer support was contacted, prompting an immediate investigation.
14:50 UTC: The problematic tariff changes were reverted.
15:00 UTC: Incorrectly generated expenses were deleted.
15:15 UTC: Addendums were reactivated.
15:30 UTC: The Platform was re-activated and refunds for large payments were processed.
13:40-16:00 UTC: Additional payment refunds continued.
15:30-16:40 UTC: Fixes were applied to some clients' expenses.
Root Cause:
The incident was caused by expenses being calculated from values expressed in different units. The billing system relies on data from two database tables: statistics (consumed value) and plan items (price, unit_size, default_value). When the plan items were manually changed from PC to TTS units without creating a new plan, consumption values already pre-calculated in PC units were combined with TTS-based plan items, producing incorrect expenses. Because expense calculation and statistics collection run independently, the unit inconsistency went undetected.
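For illustration only, the mismatch can be sketched in Python. The sketch assumes that a PC value counts single requests and that one TTS unit covers 10,000 requests, which is consistent with the default value change (1,000,000,000 / 10,000 = 100,000) but is not confirmed by the billing system itself. The names Statistic, PlanItem, naive_expense, checked_expense and to_tts are hypothetical, the expense formula is simplified to consumed value times price, and unit_size and default_value are left out.

```python
from dataclasses import dataclass

# Assumption: one TTS unit corresponds to 10,000 requests (PC counts single requests).
REQUESTS_PER_TTS = 10_000


@dataclass(frozen=True)
class Statistic:
    unit: str        # unit the consumed value was recorded in ("PC" or "TTS")
    consumed: float  # consumed amount, expressed in `unit`


@dataclass(frozen=True)
class PlanItem:
    unit: str     # unit the price is quoted in
    price: float  # price per one `unit`


def naive_expense(stat: Statistic, item: PlanItem) -> float:
    """Multiply consumption by price without checking that the units match."""
    return stat.consumed * item.price


def checked_expense(stat: Statistic, item: PlanItem) -> float:
    """Refuse to combine a statistic and a plan item recorded in different units."""
    if stat.unit != item.unit:
        raise ValueError(
            f"unit mismatch: statistics in {stat.unit}, plan item in {item.unit}"
        )
    return stat.consumed * item.price


def to_tts(stat: Statistic) -> Statistic:
    """Convert a PC statistic to TTS under the assumed 10,000:1 ratio."""
    if stat.unit == "TTS":
        return stat
    return Statistic(unit="TTS", consumed=stat.consumed / REQUESTS_PER_TTS)


if __name__ == "__main__":
    stat = Statistic(unit="PC", consumed=50_000)  # hypothetical consumption
    item = PlanItem(unit="TTS", price=0.5)        # hypothetical price per TTS

    wrong = naive_expense(stat, item)             # PC value priced as if it were TTS
    right = checked_expense(to_tts(stat), item)   # converted to TTS first
    print(wrong, right, wrong / right)
```

Under these assumptions the unchecked calculation overstates the expense by the conversion factor, while the checked version raises an error instead of producing a charge when the units differ.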
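Impact:
Clients on the Free and Start tariffs were billed daily expenses about 10,000 times higher than expected. Several clients had their services paused when the resulting charges failed, and several received unscheduled charges that were subsequently refunded.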
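Steps Taken for Resolution:
The problematic tariff changes were reverted, the incorrectly generated expenses were deleted, addendums and the Platform were reactivated, incorrect payments were refunded, and the remaining affected client expenses were corrected.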
Preventive Measures:
We have reviewed our processes to ensure that similar issues do not occur in the future. This includes evaluating and enhancing our billing and tariff update procedures, improving monitoring and alert systems, and ensuring thorough testing before deploying changes. This incident has underscored the importance of rigor in our operational workflows, and we are committed to applying the lessons learned to enhance the reliability of our services.
We sincerely apologize for the disruption this incident has caused and appreciate your understanding as we work to prevent future occurrences.