Issue:
HTTP 5xx errors in multiple DCs
Timeline (UTC):
07.12.2021 12:50 UTC multiple datacenters received a very large number of established connections per server. Edge servers reached 100% CPU utilization in multiple locations: Frankfurt, Amsterdam, Hong Kong, Singapore, Tokyo, Osaka, Dhaka, Milan, Paris, Los Angeles, and Stockholm. The on-call engineer received multiple alerts about high request time, a high 5xx rate, and an overflowing SYN backlog.
07.12.2021 12:56 UTC excess traffic was rebalanced to other DCs.
07.12.2021 13:05 UTC affected CDN edge nodes had not returned to a normal state after the excess traffic was removed.
07.12.2021 13:06 UTC engineers started restarting the HTTP server process on each affected CDN server.
07.12.2021 13:40 UTC the HTTP server had been restarted on all affected edge servers.
07.12.2021 13:57 UTC a connection limit was applied to the CDN resources that had received a very large number of incoming connections.
Root cause:
Impact:
Users experienced increased latency and were unable to receive content from the affected CDN locations (Frankfurt, Amsterdam, Hong Kong, Singapore, Tokyo, Osaka, Dhaka, Milan, Paris, Los Angeles, Stockholm).
Action points: