Root Cause Analysis (RCA) - Outage of S3 Object Storage in Chicago (DRC)
Incident Summary: An outage affected the external S3 endpoint in our Chicago Data Center (DRC). It was caused by the basic settings of our DDoS Protection service for the external S3 endpoint: incoming traffic reached the threshold of the DDoS RTBH (remotely triggered black hole) policy, and traffic to the endpoint was blackholed.
Timeline (Converted to UTC):
• 05:20 UTC: Incident onset.
• 05:23 UTC: Gcore Engineers received an alert regarding the unavailability of the service and commenced investigation.
• 05:47 UTC: Network Operations Center (NOC) and CyberOPS teams were engaged.
• 06:12 UTC: The root cause was identified as the blackholing of traffic at the port.
• 06:13 UTC: The port was removed from blackhole and a new DDoS configuration was applied.
• 06:30 UTC: Service was fully restored and access to the external S3 endpoint was re-established.
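The intervals between the timeline entries can be checked with a short calculation. This is a minimal sketch that uses only the timestamps listed above; the event labels and the function name are illustrative and not part of any internal tooling.

```python
from datetime import datetime

# Timestamps copied from the timeline above (all UTC, same day).
TIMELINE = [
    ("Incident onset", "05:20"),
    ("Alert received", "05:23"),
    ("NOC and CyberOPS engaged", "05:47"),
    ("Root cause identified", "06:12"),
    ("Blackhole removed", "06:13"),
    ("Service restored", "06:30"),
]

def minutes_since_onset(timeline):
    """Map each event label to the minutes elapsed since the first event."""
    onset = datetime.strptime(timeline[0][1], "%H:%M")
    return {
        label: int((datetime.strptime(stamp, "%H:%M") - onset).total_seconds() // 60)
        for label, stamp in timeline
    }

elapsed = minutes_since_onset(TIMELINE)
for label, minutes in elapsed.items():
    print(f"{minutes:3d} min  {label}")
```

Under these timestamps, the alert arrived 3 minutes after onset, the root cause was identified after 52 minutes, and full restoration took 70 minutes.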
Root Cause: The DDoS protection was configured at a basic level that blackholes any IP address whose incoming traffic reaches the DDoS RTBH policy threshold. This measure, while protective, was not calibrated to the volume of legitimate traffic that the S3 service was receiving. Consequently, legitimate traffic was mistaken for a DDoS attack, and the DDoS mitigation process incorrectly blackholed the IP, causing the service disruption.
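The failure mode described above can be illustrated with a simplified model of a volume-only blackhole rule. This is a sketch under stated assumptions, not the actual policy logic of the DDoS Protection service: the function name, the Mbps unit, and every number are hypothetical, since the real threshold and traffic figures are not given in this report.

```python
# All values are hypothetical; the report does not state the real threshold
# or the real traffic volumes.
BASIC_RTBH_THRESHOLD_MBPS = 500.0

def should_blackhole(observed_mbps: float, threshold_mbps: float) -> bool:
    """Return True when traffic to one destination IP reaches the RTBH threshold.

    The rule looks at volume only, so it cannot tell a legitimate peak
    from an attack.
    """
    return observed_mbps >= threshold_mbps

normal_load_mbps = 200.0       # hypothetical everyday S3 traffic
legitimate_peak_mbps = 800.0   # hypothetical legitimate peak

print(should_blackhole(normal_load_mbps, BASIC_RTBH_THRESHOLD_MBPS))
print(should_blackhole(legitimate_peak_mbps, BASIC_RTBH_THRESHOLD_MBPS))
```

In this model the legitimate peak triggers the blackhole exactly as an attack of the same volume would, which is why the threshold has to be set relative to the endpoint's real traffic profile.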
Contributing Factors:
• DDoS Settings: The threshold for the DDoS protection was set without sufficient consideration for peak traffic patterns.
Impact:
• A total of 1 hour and 10 minutes of downtime for the external S3 endpoint (05:20 to 06:30 UTC).
• Disruption of service for users relying on the Chicago DRC S3 object storage during the incident window.
Resolution and Recovery:
• The incorrect blackholing was identified and reversed.
• New configurations for DDoS protection were implemented to better reflect the traffic profile and prevent similar occurrences.
Action Points:
• Comprehensive Review: Perform a detailed assessment of all external endpoints and their associated DDoS Protection policies.
• Policy Adjustment: Apply enhanced DDoS Protection policies that tolerate higher volumes of legitimate traffic and distinguish more effectively between legitimate traffic and DDoS attacks.
Future Preventative Measures:
• Regularly update the DDoS protection configurations to adapt to the changing patterns of legitimate traffic and potential threat landscapes.
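One way to keep a threshold aligned with legitimate traffic is to derive it periodically from recent measurements. The following is a minimal sketch, assuming per-interval traffic samples are available; the nearest-rank percentile, the 99th-percentile default, the headroom factor of 2, and all sample values are illustrative assumptions rather than the configuration that was actually applied.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]

def suggest_threshold(samples_mbps, pct=99.0, headroom=2.0):
    """Suggest a blackhole threshold as a high percentile of observed
    legitimate traffic multiplied by a safety margin (both hypothetical)."""
    return percentile(samples_mbps, pct) * headroom

# Hypothetical traffic samples in Mbps, e.g. one per measurement interval.
recent_samples_mbps = [100, 200, 300, 400, 1000]

print(suggest_threshold(recent_samples_mbps))
```

Re-running such a calculation on a schedule, and reviewing the result before applying it, would let the threshold follow growth in legitimate traffic instead of staying at a fixed basic value.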
By addressing these action points and implementing the suggested preventative measures, we aim to fortify our defenses against DDoS attacks while maintaining service availability and reliability for our users.