Temporary system degradation caused by Redis cache

Incident Report for signageOS

Postmortem

Date

2025-04-02

Authors

Václav Boch, DevOps Engineer

Summary

System degradation of multiple services due to a Redis database crash.

Impact

Temporary unavailability of the API, Box, and Platform services. No deployed devices were affected.

Trigger

A Redis instance designated as a cache ran out of memory and was rotated.

Detection

Our internal monitoring system detected high memory usage on one of our Redis cache servers.

Root Causes

Upon detection, we immediately began investigating the issue. Although the server initially had sufficient memory, an unexpected memory spike caused Redis to be terminated by the OOM (Out of Memory) Killer.

Unfortunately, automatic restarts were disabled in systemd for this scenario, so a system administrator had to restart the instance manually.

Additionally, no key eviction policy was configured in Redis to manage high memory consumption. This server was intended as a simple cache and was deployed as a single instance with no replica. However, multiple services used it as a primary data source, so they became unavailable when the instance crashed.
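To illustrate the difference between using a cache as an optimization and using it as a primary data source, here is a minimal read-through sketch. It is not our production code: `InMemoryCache`, `CacheUnavailable`, and `read_through` are hypothetical names, and the in-memory class only stands in for a Redis client so the example runs on its own. The point is that a cache outage should cost latency, not availability.

```python
from typing import Callable, Dict, Optional


class CacheUnavailable(Exception):
    """Raised by the (hypothetical) cache client when the cache cannot be reached."""


class InMemoryCache:
    """Stand-in for a Redis client; `available` simulates the instance being down."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self.available = True

    def get(self, key: str) -> Optional[str]:
        if not self.available:
            raise CacheUnavailable(key)
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        if not self.available:
            raise CacheUnavailable(key)
        self._data[key] = value


def read_through(cache: InMemoryCache, key: str, load: Callable[[str], str]) -> str:
    """Return the value for `key`, treating the cache as optional."""
    try:
        cached = cache.get(key)
        if cached is not None:
            return cached
    except CacheUnavailable:
        # Cache is down: go straight to the source of truth.
        return load(key)

    # Cache miss: load from the source of truth and try to populate the cache.
    value = load(key)
    try:
        cache.set(key, value)
    except CacheUnavailable:
        pass  # failing to populate the cache must not fail the request
    return value
```

With this pattern, `load` (for example a database query) remains the only authoritative source, and losing the cache instance degrades performance instead of taking the service down.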

Remediation

Most of our Redis servers already have replication enabled, and we are in the process of deploying a more robust Redis setup. Policies enforced via Kyverno will prevent single-instance Redis deployments in production.
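The kind of rule such an admission policy expresses can be sketched in a few lines. This is not Kyverno syntax and not our actual policy: the function name, the `app: redis` label convention, and the minimum of two replicas are all assumptions made for illustration.

```python
from typing import Any, Mapping, Optional

MIN_REDIS_REPLICAS = 2  # assumed threshold, not taken from the real policy


def redis_replica_violation(manifest: Mapping[str, Any]) -> Optional[str]:
    """Return a rejection message for single-instance Redis workloads, else None."""
    if manifest.get("kind") not in {"Deployment", "StatefulSet"}:
        return None
    metadata = manifest.get("metadata", {})
    labels = metadata.get("labels", {})
    if labels.get("app") != "redis":  # assumed labelling convention
        return None
    # Kubernetes defaults spec.replicas to 1 when the field is omitted.
    replicas = manifest.get("spec", {}).get("replicas", 1)
    if replicas < MIN_REDIS_REPLICAS:
        name = metadata.get("name", "<unnamed>")
        return f"{name}: Redis requires at least {MIN_REDIS_REPLICAS} replicas, got {replicas}"
    return None
```

Treating an omitted `replicas` field as 1 matters here, because a manifest that says nothing about replication is exactly the single-instance case the policy is meant to catch.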

Every Redis instance will be configured with policies to prevent crashes due to OOM events. We are implementing stricter internal policies and providing additional training for developers to prevent similar single points of failure in the future.
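As a rough illustration of what a key eviction policy buys, the sketch below models a cache with a hard size limit that discards the least recently used entry instead of growing without bound. In Redis itself this behavior is configured with the `maxmemory` and `maxmemory-policy` directives (for example `allkeys-lru`); the class here is a hypothetical toy model, it limits entry count instead of bytes, and Redis uses an approximated LRU, not the exact ordering shown.

```python
from collections import OrderedDict
from typing import Optional


class BoundedLRUCache:
    """Toy model of a cache that evicts the least recently used key when full."""

    def __init__(self, max_entries: int) -> None:
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self._max_entries = max_entries
        self._items: "OrderedDict[str, str]" = OrderedDict()

    def get(self, key: str) -> Optional[str]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def set(self, key: str, value: str) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        # Evict from the least recently used end until we are back under the limit.
        while len(self._items) > self._max_entries:
            self._items.popitem(last=False)

    def __len__(self) -> int:
        return len(self._items)
```

With a limit in place, a burst of writes pushes out cold keys and the process stays within its memory budget, which is acceptable only when the data can be reloaded from a primary source.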

Posted Apr 03, 2025 - 17:23 CEST

Resolved

This incident has been resolved.
Posted Apr 02, 2025 - 11:34 CEST

Investigating

We are currently investigating this issue.
Posted Apr 02, 2025 - 11:28 CEST

This incident affected: API, Box, Platform, and Screenshots.