Date
2025-05-12
Authors
Vaclav Boch, DevOps Lead
Summary
On 12th May 2025, signageOS experienced a 20-minute degradation in the availability of device health check information (pings) displayed in the Box UI. During this time, users were unable to view up-to-date health data from devices, and some received false-positive offline alerts. The issue was reported by customers. No devices in the field were impacted, and no other services or functionalities were affected.
Impact
- Users of Box experienced missing or delayed device health (ping) updates.
- Some users received false-positive notifications about devices being offline.
- No actual devices were affected; all remained connected and operational.
- No other parts of the signageOS platform were impacted.
Trigger
A queue processing issue affected the delivery of device health data. The alerting for one of the three RabbitMQ instances was misconfigured, so the issue went undetected.
Detection
The incident was reported by customers via support channels. Internal monitoring did not detect the issue due to the misconfigured RabbitMQ alert.
Root Causes
- One of the three RabbitMQ instances had an alert misconfiguration, which led to a lack of visibility into the state of the message queues handling device ping updates.
- This prevented the DevOps team from being notified of the degradation.
- While the issue went undetected, the health check data pipeline failed to deliver real-time updates to the UI.
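The kind of check that was missing for the affected instance can be sketched as follows. This is a minimal illustration, not the actual signageOS alert: the instance names, the queue name `device-pings`, the `QueueStats` fields, and the threshold of 1000 ready messages are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class QueueStats:
    """Snapshot of one queue; all field names here are illustrative."""
    instance: str        # RabbitMQ instance the queue lives on
    queue: str           # queue name
    messages_ready: int  # messages waiting to be delivered to a consumer
    consumers: int       # consumers currently attached to the queue


def find_stalled_queues(stats: Iterable[QueueStats], max_ready: int = 1000) -> List[QueueStats]:
    """Return queues whose backlog exceeds max_ready or that have no consumers."""
    return [s for s in stats if s.messages_ready > max_ready or s.consumers == 0]


# Hypothetical snapshot covering all three instances.
snapshot = [
    QueueStats("rabbitmq-1", "device-pings", messages_ready=12, consumers=4),
    QueueStats("rabbitmq-2", "device-pings", messages_ready=25000, consumers=4),
    QueueStats("rabbitmq-3", "device-pings", messages_ready=0, consumers=0),
]

stalled = find_stalled_queues(snapshot)
print([s.instance for s in stalled])
```

The point of the sketch is that the same rule is evaluated for every instance, so a backlog or a queue without consumers on any one of them is reported.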
Remediation
- The misconfigured alert was corrected immediately upon identification.
- A validation audit was conducted across all RabbitMQ instances and critical services to ensure proper alerting coverage.
- Additional safeguards are being implemented to prevent blind spots in monitoring for partial system degradations.
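An alerting coverage audit like the one described above could be automated along these lines. The report does not say how the audit was performed, so this is only a sketch: the alert names, the rule format (a list of dictionaries with `instance`, `alert`, and `enabled` keys), and the instance names are hypothetical.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical set of alerts every RabbitMQ instance is expected to have.
REQUIRED_ALERTS = frozenset({"queue_backlog", "no_consumers"})


def find_alert_gaps(
    instances: Iterable[str],
    alert_rules: List[dict],
    required: Iterable[str] = REQUIRED_ALERTS,
) -> Dict[str, Set[str]]:
    """Map each instance to the required alerts it lacks (missing or disabled)."""
    gaps: Dict[str, Set[str]] = {}
    for instance in instances:
        # Only rules that are enabled count as coverage.
        configured = {
            rule["alert"]
            for rule in alert_rules
            if rule["instance"] == instance and rule.get("enabled", True)
        }
        missing = set(required) - configured
        if missing:
            gaps[instance] = missing
    return gaps


# Hypothetical rule inventory: the third instance has one alert disabled.
rules = [
    {"instance": "rabbitmq-1", "alert": "queue_backlog"},
    {"instance": "rabbitmq-1", "alert": "no_consumers"},
    {"instance": "rabbitmq-2", "alert": "queue_backlog"},
    {"instance": "rabbitmq-2", "alert": "no_consumers"},
    {"instance": "rabbitmq-3", "alert": "queue_backlog", "enabled": False},
    {"instance": "rabbitmq-3", "alert": "no_consumers"},
]

gaps = find_alert_gaps(["rabbitmq-1", "rabbitmq-2", "rabbitmq-3"], rules)
print(gaps)
```

Run on a schedule or in CI, a check of this shape would flag an instance whose alert is absent or disabled before an incident exposes the gap.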