API services degradation

Incident Report for signageOS

Postmortem

Date

2025-04-01

Authors

Michal Artazov, DevOps Lead

Summary

System degradation of multiple services due to human error and system inefficiency.

Impact

Degraded API performance and temporary unavailability of API, Box, Platform and Screenshots.

Trigger

A signageOS employee accidentally triggered a Bulk Action on all devices under the internal signageOS company, assigning an Applet to every one of them. That included tens of thousands of devices.

Note: None of the devices were production devices. They were a mix of real devices in the signageOS lab and virtual devices used for load testing.

Detection

The trigger led to a message queue growing in size because messages could not be processed quickly enough. The alerting system notified the team as soon as this happened.

Root Causes

A combination of factors contributed.

  1. The Bulk Action assigned the Applet even to Devices whose Applet is managed by a Policy. That in turn triggered an automatic process that began reverting the assignment for each such Device to ensure that it adhered to its Policy setting. This made the whole problem worse, doubling the number of Applet assignments that needed to be processed.
  2. The service responsible for processing Bulk Actions flooded the system with more Device changes than could be handled in time, overloading the database. As database performance degraded, other services that depend on that database were affected as well.
  3. An employee triggered the excessive Bulk Action without realizing that it would be applied to that many devices.

The overall magnitude of this action was more than 3 times the volume of the last performance test for processing Bulk Action/Policy changes, and the proper guardrails were not in place.

Remediation

We will address each part of the root cause separately.

  1. Bulk Actions optimization no. 1 - Devices that have an Applet (or another property) managed by a Policy should be skipped to reduce the system load.
  2. Bulk Actions optimization no. 2 - Bulk Actions service should process devices in batches, always making sure that the system receives a reasonable load it can manage. That way, if there's a degradation, it will slow down Bulk Actions service only and won't affect the rest of the system.
  3. Improve Bulk Actions UX in Box - We will analyze how we can improve the UX to make accidental excessive Bulk Actions less likely.
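
The first two remediation items could be sketched roughly as follows. This is a minimal illustration only, not the actual signageOS implementation: the `Device` class, its `policy_applet` field, the `wait_for_capacity` hook and the default batch size are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Optional


@dataclass
class Device:
    uid: str
    # Hypothetical field: set when a Policy manages this Device's Applet.
    policy_applet: Optional[str] = None


def eligible(devices: Iterable[Device]) -> Iterator[Device]:
    """Optimization no. 1: skip Devices whose Applet is managed by a Policy."""
    return (d for d in devices if d.policy_applet is None)


def batches(devices: Iterable[Device], size: int) -> Iterator[List[Device]]:
    """Optimization no. 2: yield Devices in fixed-size batches."""
    batch: List[Device] = []
    for device in devices:
        batch.append(device)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def run_bulk_action(
    devices: Iterable[Device],
    assign: Callable[[Device], None],
    batch_size: int = 100,  # assumed value, would need tuning against real load
    wait_for_capacity: Callable[[], None] = lambda: None,
) -> int:
    """Apply `assign` batch by batch and return the number of Devices processed."""
    processed = 0
    for batch in batches(eligible(devices), batch_size):
        # Hypothetical backpressure hook, e.g. block while the message
        # queue is above a threshold, so only the Bulk Action slows down.
        wait_for_capacity()
        for device in batch:
            assign(device)
        processed += len(batch)
    return processed
```

In this sketch, a slow downstream system only delays the next batch of the Bulk Action itself, and policy-managed Devices never generate an assignment that would later have to be reverted.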
Posted Apr 02, 2025 - 16:27 CEST

Resolved

System performance degradation caused by human error in combination with system inefficiency
Posted Apr 01, 2025 - 08:00 CEST