Storage | S3 Chicago incident details
Incident Report for Gcore
Postmortem

Root Cause Analysis (RCA) - Outage of S3 Object Storage in Chicago (DRC)

Incident Summary: An outage affected the external S3 endpoint in our Chicago Data Center (DRC). It was caused by the basic DDoS Protection settings applied to the external S3 endpoint: incoming traffic reached the threshold of the DDoS RTBH (remotely triggered black hole) policy, and traffic to the endpoint was blackholed.

Timeline (Converted to UTC):
• 05:20 UTC: Incident onset.
• 05:23 UTC: Gcore Engineers received an alert regarding the unavailability of the service and commenced investigation.
• 05:47 UTC: Network Operations Center (NOC) and CyberOPS teams were engaged.
• 06:12 UTC: The root cause was identified as the blackholing of traffic at the port.
• 06:13 UTC: The port was removed from blackhole and a new DDoS configuration was applied.
• 06:30 UTC: Service was fully restored and access to the external S3 endpoint was re-established.

Root Cause: DDoS protection for the endpoint was configured at a basic level that blackholes any IP address whose incoming traffic reaches the DDoS RTBH policy threshold. This measure, while protective, was not calibrated to the volume of legitimate traffic that the S3 service was receiving. Consequently, legitimate traffic was mistaken for a DDoS attack, and the DDoS mitigation process incorrectly blackholed the IP, causing the service disruption.
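
The failure mode described above can be illustrated with a minimal sketch. It assumes the RTBH policy reduces to a single inbound-rate threshold per IP address; the class name, function name, and all numeric values are hypothetical, since this report does not state the actual policy parameters.

```python
from dataclasses import dataclass


@dataclass
class RtbhPolicy:
    # Hypothetical threshold in Mbps; real policy values are not given in this report.
    max_inbound_mbps: float


def should_blackhole(policy: RtbhPolicy, inbound_mbps: float) -> bool:
    """Return True when inbound traffic to an IP reaches the policy threshold."""
    return inbound_mbps >= policy.max_inbound_mbps


# Illustrative numbers only: a "basic" policy set below the legitimate peak,
# and a "tuned" policy set above it.
basic_policy = RtbhPolicy(max_inbound_mbps=1000.0)
tuned_policy = RtbhPolicy(max_inbound_mbps=5000.0)
legitimate_peak_mbps = 1800.0

blackholed_under_basic = should_blackhole(basic_policy, legitimate_peak_mbps)
blackholed_under_tuned = should_blackhole(tuned_policy, legitimate_peak_mbps)
```

With these assumed values, the basic policy blackholes the endpoint during a legitimate peak while the tuned policy does not. A rate threshold alone cannot tell legitimate traffic from an attack; it only sees volume.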

Contributing Factors: • DDoS Settings: The threshold for the DDoS protection was set without sufficient consideration for peak traffic patterns.

Impact: • A total of 1 hour and 10 minutes of downtime for the external S3 endpoint. • Disruption of service for users relying on the Chicago DRC S3 object storage during the incident window.
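
The downtime figure of 1 hour and 10 minutes follows from the timeline (onset at 05:20 UTC, full restoration at 06:30 UTC). A short sketch of the arithmetic is below; the calendar date is an assumption taken from the posting timestamps on this page, and the variable names are illustrative.

```python
from datetime import datetime, timezone

# Date assumed from the "Posted Nov 02, 2023" timestamps; times are from the timeline.
onset = datetime(2023, 11, 2, 5, 20, tzinfo=timezone.utc)
alerted = datetime(2023, 11, 2, 5, 23, tzinfo=timezone.utc)
root_cause_found = datetime(2023, 11, 2, 6, 12, tzinfo=timezone.utc)
restored = datetime(2023, 11, 2, 6, 30, tzinfo=timezone.utc)


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed between two timestamps."""
    return int((end - start).total_seconds() // 60)


downtime_minutes = minutes_between(onset, restored)           # total outage
detection_minutes = minutes_between(onset, alerted)           # onset to alert
diagnosis_minutes = minutes_between(onset, root_cause_found)  # onset to root cause
```

Most of the outage window was spent between the alert and the identification of the blackhole; once identified, the blackhole was lifted within a minute.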

Resolution and Recovery: • The incorrect blackholing was identified and reversed. • New configurations for DDoS protection were implemented to better reflect the traffic profile and prevent similar occurrences.

Action Points: • Comprehensive Review: Perform a detailed assessment of all external endpoints and associated DDoS Protection policies. • Policy Adjustment: Apply enhanced DDoS Protection policies that are tailored to withstand higher traffic thresholds and discern between legitimate traffic and DDoS attacks more effectively.

Future Preventative Measures: • Regularly update the DDoS protection configurations to adapt to the changing patterns of legitimate traffic and potential threat landscapes.
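
One way to keep a threshold aligned with legitimate traffic is to derive it from recently observed peaks plus headroom. The sketch below is an assumption for illustration, not a description of how Gcore calibrates its policies; the function name, the headroom factor, and the sample values are all hypothetical.

```python
from typing import Sequence


def calibrated_threshold_mbps(peak_samples_mbps: Sequence[float],
                              headroom: float = 2.0) -> float:
    """Return a threshold set to the highest observed legitimate peak times a headroom factor."""
    if not peak_samples_mbps:
        raise ValueError("at least one traffic sample is required")
    if headroom < 1.0:
        raise ValueError("headroom below 1.0 would place the threshold under observed peaks")
    return max(peak_samples_mbps) * headroom


# Hypothetical daily peak rates (Mbps) of legitimate traffic to an endpoint.
recent_peaks_mbps = [800.0, 1200.0, 1800.0]
new_threshold_mbps = calibrated_threshold_mbps(recent_peaks_mbps)
```

Recomputing the value on a schedule lets the threshold follow growth in legitimate traffic. The headroom factor is a trade-off: larger values reduce false positives but allow a bigger attack through before mitigation triggers.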

By addressing these action points and implementing the suggested preventative measures, we aim to fortify our defenses against DDoS attacks while maintaining service availability and reliability for our users.

Posted Nov 02, 2023 - 08:53 UTC

Resolved
We'd like to inform you that the issue has been resolved, and we are closely monitoring the performance to ensure there are no further disruptions. We will provide a Root Cause Analysis (RCA) report in the coming days to help you understand what caused the incident and the steps we have taken to prevent it from happening again in the future.

We apologize for any inconvenience this may have caused you, and want to thank you for your patience and understanding throughout this process.
Posted Nov 02, 2023 - 08:52 UTC
Monitoring
We are pleased to inform you that our Engineering team has implemented a fix to resolve the issue. However, we are still closely monitoring the situation to ensure stable performance.

We will provide you with an update as soon as we have confirmed that the issue has been completely resolved.
Posted Nov 02, 2023 - 07:36 UTC
Identified
We would like to inform you that our NOC and CyberOPS teams are actively working on resolving the issue with external endpoint access to S3 Storage. We understand that this issue may be causing inconvenience, and we apologize for any disruption to your experience. Please know that we are doing everything we can to fix it as soon as possible.
Posted Nov 02, 2023 - 07:26 UTC
Investigating
We are currently experiencing issues with object storage, which may result in service unavailability. We apologize for any inconvenience this may cause and appreciate your patience and understanding during this time.

We will provide you with an update as soon as we have more information on the progress of the resolution. Thank you for your understanding and cooperation.
Posted Nov 02, 2023 - 07:20 UTC
This incident affected: Object Storage (S3 Chicago).