Date
2024-01-17
Authors
Michael Zabka, CTO
Michal Artazov, DevOps Lead
Summary: Delayed System Queues and Offline Displays Due to CPU Credit Depletion
Impact: System queues for pings and provisioning were delayed, causing some displays to go offline. Once offline, these displays were unable to reconnect, which disrupted service availability.
Trigger: A portion of platform instances ran out of CPU credits, which triggered CPU throttling and high memory usage. The resulting sluggish responses increased reconnection attempts, amplifying traffic and ultimately contributing to the outage.
Detection: The incident was detected through internal CPU throttling monitoring and customer tickets reporting display outages.
Root Causes: The primary root cause was the exhaustion of the CPU credit allowance on certain platform instances. This led to throttling and increased memory usage, which in turn caused the system delays.
Remediation:
* Tighten monitoring thresholds so that similar issues are detected sooner, enabling proactive intervention before system delays become prolonged.
* Switch to larger instances with improved CPU credit allowances to address the immediate resource constraints and minimize the risk of future incidents.
* Improve autoscaling so that resources adjust dynamically to demand and similar resource pressure is resolved without manual intervention.
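
As a rough illustration of the monitoring item above, the sketch below shows one way an earlier warning could be derived: alert not only when the CPU credit balance is already low, but also when the recent burn rate suggests it will run out soon. It is a minimal sketch under stated assumptions, not the monitoring actually deployed. The `CreditSample` shape, the function names, and the threshold values (50 credits, 60 minutes) are all hypothetical, and the extrapolation is deliberately simple (a straight line between the oldest and newest sample).

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class CreditSample:
    """One reading of an instance's remaining CPU credits (hypothetical shape)."""
    timestamp_s: float  # seconds since an arbitrary epoch
    balance: float      # remaining CPU credits at that time


def minutes_until_depletion(samples: Sequence[CreditSample]) -> Optional[float]:
    """Linearly extrapolate from the oldest to the newest sample.

    Returns None when there are fewer than two samples, no time has
    elapsed between them, or the balance is not falling.
    """
    if len(samples) < 2:
        return None
    first, last = samples[0], samples[-1]
    elapsed_s = last.timestamp_s - first.timestamp_s
    if elapsed_s <= 0:
        return None
    burn_per_s = (first.balance - last.balance) / elapsed_s
    if burn_per_s <= 0:
        return None  # balance is steady or recovering
    return last.balance / burn_per_s / 60.0


def should_alert(
    samples: Sequence[CreditSample],
    min_balance: float = 50.0,          # illustrative threshold, not a recommendation
    min_runway_minutes: float = 60.0,   # illustrative threshold, not a recommendation
) -> bool:
    """Alert when the balance is already low or is projected to run out soon."""
    if not samples:
        return False
    if samples[-1].balance < min_balance:
        return True
    runway = minutes_until_depletion(samples)
    return runway is not None and runway < min_runway_minutes


if __name__ == "__main__":
    # 30 credits burned in 10 minutes with 170 left: under an hour of runway.
    window = [CreditSample(0, 200.0), CreditSample(600, 170.0)]
    print(minutes_until_depletion(window), should_alert(window))
```

Alerting on projected runway rather than on the balance alone is the design choice here: an instance with a healthy balance but a steep burn rate would be flagged before throttling begins, which is the earlier detection the remediation item asks for.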