Box not receiving health checks

Incident Report for signageOS

Postmortem

Date

2025-05-12

Authors

Vaclav Boch, DevOps Lead

Summary

On 12 May 2025, signageOS experienced a 20-minute degradation in the availability of device health check information (pings) displayed in the Box UI. During this time, users could not view up-to-date health data from devices, and some received false-positive offline alerts. Customers reported the issue. No devices in the field were impacted, and no other services or functionality were affected.

Impact

  • Users of Box experienced missing or delayed device health (ping) updates.
  • Some users received false-positive notifications about devices being offline.
  • No actual devices were affected; all remained connected and operational.
  • No other parts of the signageOS platform were impacted.

Trigger

A misconfiguration in the alerting system for one of the three RabbitMQ instances prevented proper detection of a queue processing issue related to device health data.

Detection

The incident was reported by customers via support channels. Internal monitoring did not detect the issue due to the misconfigured RabbitMQ alert.

Root Causes

  • One of the three RabbitMQ instances had an alert misconfiguration, which led to a lack of visibility into the state of the message queues handling device ping updates.
  • This prevented the DevOps team from being notified of the degradation.
  • As a result, the queue processing issue went unaddressed, and the health check data pipeline temporarily failed to deliver real-time updates to the UI.
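
A coverage gap of this kind, where one instance out of several has no working alert, can in principle be caught by a simple audit that compares the list of instances against the alert rules that are actually enabled. The sketch below is illustrative only: the instance names, the `queue_depth` metric name, the `AlertRule` type, and the `find_uncovered` helper are all hypothetical and do not describe the signageOS monitoring setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical alert rule record; field names are assumptions."""
    instance: str         # RabbitMQ instance the rule watches
    metric: str           # metric the rule evaluates, e.g. "queue_depth"
    enabled: bool = True  # a disabled rule provides no coverage


def find_uncovered(instances, rules, required_metrics):
    """Return {instance: [missing metrics]} for instances lacking an enabled rule."""
    covered = {(r.instance, r.metric) for r in rules if r.enabled}
    gaps = {}
    for inst in instances:
        missing = sorted(m for m in required_metrics if (inst, m) not in covered)
        if missing:
            gaps[inst] = missing
    return gaps


# Hypothetical inventory: three instances, one with a disabled alert.
instances = ["rabbitmq-1", "rabbitmq-2", "rabbitmq-3"]
rules = [
    AlertRule("rabbitmq-1", "queue_depth"),
    AlertRule("rabbitmq-2", "queue_depth"),
    AlertRule("rabbitmq-3", "queue_depth", enabled=False),
]

gaps = find_uncovered(instances, rules, ["queue_depth"])
print(gaps)  # {'rabbitmq-3': ['queue_depth']}
```

Running a check like this on a schedule, or as part of deploying alert configuration, would flag an instance whose alert is missing or disabled before an incident exposes the gap.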

Remediation

  • The misconfigured alert was corrected immediately upon identification.
  • A validation audit was conducted across all RabbitMQ instances and critical services to ensure proper alerting coverage.
  • Additional safeguards are being implemented to prevent blind spots in monitoring for partial system degradations.
Posted May 14, 2025 - 13:51 CEST

Resolved

This incident has been resolved.
Posted May 12, 2025 - 21:37 CEST

Monitoring

A fix has been implemented and we are monitoring the results.
Posted May 12, 2025 - 21:25 CEST

Identified

The issue has been identified and a fix is being implemented.
Posted May 12, 2025 - 21:19 CEST

Update

According to a user report, no health check (ping) data is available in Box. We are investigating the root cause.
Posted May 12, 2025 - 21:06 CEST

Investigating

We are currently investigating this issue.
Posted May 12, 2025 - 21:05 CEST
This incident affected: Box.