System queues for Pings and Provisioning are degraded
Incident Report for signageOS
Postmortem

Date

2024-01-17

Authors

Michael Zabka, CTO

Michal Artazov, DevOps Lead

Summary: Delayed System Queues and Offline Displays Due to CPU Credit Depletion

Impact: The incident resulted in delayed system queues for pings and provisioning, causing some displays to go offline. Once offline, these displays were unable to reconnect, leading to a disruption in service availability.

Trigger: A portion of platform instances ran out of CPU credits, initiating CPU throttling and high memory usage. Sluggish responses increased reconnection attempts, amplifying traffic and ultimately contributing to the outage.

Detection: The incident was detected through internal CPU throttling monitoring and customer tickets reporting display outages.

Root Causes: The primary root cause was the exhaustion of the CPU credit allowance on certain platform instances. This led to CPU throttling and increased memory usage, which in turn delayed the system queues.
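
To illustrate how a credit allowance runs out, the following is a minimal sketch. It assumes a burstable-instance model in which one credit pays for one vCPU at full utilization for one minute. The function name and all numbers are hypothetical and are not taken from the affected instances.

```python
def minutes_until_throttled(balance, earn_per_hour, vcpus, utilization):
    """Estimate minutes until the CPU credit balance reaches zero.

    Assumption: one credit = one vCPU at 100% utilization for one minute.
    Returns None when credits are earned at least as fast as they are spent.
    """
    spend_per_minute = vcpus * utilization   # credits burned each minute
    earn_per_minute = earn_per_hour / 60.0   # credits accrued each minute
    net_burn = spend_per_minute - earn_per_minute
    if net_burn <= 0:
        return None                          # balance never depletes
    return balance / net_burn


# Hypothetical instance: 300 credits banked, earns 30 credits per hour,
# 2 vCPUs running at a sustained 50% utilization.
print(minutes_until_throttled(300, 30, 2, 0.5))
```

Under these assumed numbers the instance spends credits twice as fast as it earns them, so a sustained load that looks moderate still ends in throttling after a few hours.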

Remediation:

  1. Improved Monitoring Limits:
     * Enhance monitoring limits to detect similar issues sooner, enabling proactive intervention and preventing prolonged system delays.
  2. Change of Underlying Instances:
     * Switch to larger instances with improved CPU credit allowances to address the immediate resource constraints and minimize the risk of future incidents.
  3. Implement Better Autoscaling Processes:
     * Enhance autoscaling processes to dynamically adjust resources based on demand, resolving the impact without manual intervention. This ensures a more adaptive and responsive system.
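
As an illustration of the earlier-detection idea in the monitoring item, the sketch below raises an alert when the credit balance is projected to reach zero within a chosen horizon, rather than waiting for throttling to begin. The sample format, function names, and the 120-minute horizon are assumptions made for this example and do not describe the monitoring actually deployed.

```python
def projected_minutes_left(samples):
    """Project minutes until depletion from (minute, balance) samples.

    Uses the straight line between the first and last sample.
    Returns None when the balance is flat or rising.
    """
    (t0, b0), (t1, b1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time interval")
    slope = (b1 - b0) / (t1 - t0)            # credits per minute
    if slope >= 0:
        return None
    return b1 / -slope


def should_alert(samples, horizon_minutes=120):
    """Return True when depletion is projected within the horizon."""
    left = projected_minutes_left(samples)
    return left is not None and left <= horizon_minutes


# Hypothetical readings: balance fell from 100 to 90 credits in 10 minutes.
print(should_alert([(0, 100.0), (10, 90.0)]))
```

Alerting on the projected time to depletion gives operators, or an autoscaling rule, a window to add capacity before any instance is throttled.
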
Posted Jan 23, 2024 - 17:35 CET

Resolved
This incident has been resolved.
Posted Jan 17, 2024 - 13:51 CET
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jan 17, 2024 - 13:47 CET
This incident affected: API, Box, and Platform.