Date
2025-04-12
Authors
Václav Boch, DevOps Engineer
Summary
Multiple services were degraded after a Redis instance crashed.
Impact
Temporary unavailability of the API, Box, and Platform services. No deployed devices were affected.
Trigger
A Redis instance designated as a cache ran out of memory and was terminated by the OOM Killer.
Detection
Our internal monitoring system detected high memory usage on one of our Redis cache servers.
Root Causes
Upon detection, we immediately began investigating the issue. Although the server initially had sufficient memory, an unexpected memory spike caused Redis to be terminated by the OOM (Out of Memory) Killer.
Automatic restarts were disabled in systemd for this scenario, so Redis stayed down until a system administrator intervened manually.
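As an illustration of the restart gap, the following minimal Python sketch checks whether a unit file would bring the service back after the OOM Killer sends SIGKILL. The function names and the sample unit text are hypothetical and do not reflect our actual unit file; the set of Restart= values is taken from the restart table in systemd.service(5).

```python
# Restart= values that, per systemd.service(5), restart a service
# after it dies from an unclean signal such as SIGKILL.
RESTARTS_AFTER_SIGKILL = {"always", "on-failure", "on-abnormal", "on-abort"}


def restart_policy(unit_text: str) -> str:
    """Return the last Restart= value in [Service]; systemd defaults to 'no'."""
    section = None
    value = "no"
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        if section == "Service" and "=" in line:
            key, _, val = line.partition("=")
            if key.strip() == "Restart":
                value = val.strip()  # later assignments override earlier ones
    return value


def survives_oom_kill(unit_text: str) -> bool:
    """True if the unit would be restarted after being killed by SIGKILL."""
    return restart_policy(unit_text) in RESTARTS_AFTER_SIGKILL


# Hypothetical unit file used only for this example.
example_unit = """
[Unit]
Description=Redis cache (example)

[Service]
ExecStart=/usr/bin/redis-server /etc/redis/redis.conf
Restart=on-failure
"""

if __name__ == "__main__":
    print(survives_oom_kill(example_unit))
```

A check like this could run in CI against unit files or drop-ins before they are rolled out, so a missing restart policy is caught before an incident rather than during one.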
Additionally, no key eviction policy was configured in Redis to manage high memory consumption. This server was intended as a simple cache and was deployed as a single instance with no replicas. However, multiple services used it as a primary data source, so they became unavailable when the instance crashed.
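The missing memory settings can be checked mechanically. The sketch below, assuming the configuration is available as redis.conf text, flags an instance that has no memory limit or that never evicts keys. The function name and the sample values are hypothetical; `maxmemory` and `maxmemory-policy` are standard Redis directives, and the defaults used here (no limit, `noeviction`) are the Redis defaults on 64-bit systems.

```python
import re


def cache_config_problems(conf_text: str) -> list[str]:
    """Return a list of memory-related problems found in redis.conf text."""
    # Start from the Redis defaults: no memory limit and no eviction.
    settings = {"maxmemory": "0", "maxmemory-policy": "noeviction"}
    for raw in conf_text.splitlines():
        parts = raw.strip().split(None, 1)
        # Commented-out lines begin with '#', so they never match a directive name.
        if len(parts) == 2 and parts[0].lower() in settings:
            settings[parts[0].lower()] = parts[1].strip().strip('"').lower()

    problems = []
    digits = re.match(r"\d+", settings["maxmemory"])
    if digits is None or int(digits.group()) == 0:
        problems.append("maxmemory is unset or 0, so Redis memory use is unbounded")
    if settings["maxmemory-policy"] == "noeviction":
        problems.append("maxmemory-policy is noeviction, so keys are never evicted")
    return problems


if __name__ == "__main__":
    # Hypothetical values; the right limit depends on the host's memory.
    sample = "maxmemory 2gb\nmaxmemory-policy allkeys-lru\n"
    print(cache_config_problems(sample))
    print(cache_config_problems("# maxmemory <bytes>\n"))
```

Eviction is only appropriate for data that can be rebuilt. Services that treat Redis as a primary data source would lose keys under an eviction policy, which is why those workloads need a replicated, persistent deployment instead of a cache instance.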
Remediation
Most of our Redis servers already have replication enabled. We are now deploying a more robust Redis setup in which a policy enforced via Kyverno will prevent single-instance Redis deployments in production.
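The logic of such an admission rule can be sketched in Python. This is not Kyverno policy syntax and not our actual policy; it only illustrates the kind of check involved. The `app: redis` label and the `production` namespace are assumptions made for the example.

```python
def violates_single_instance_rule(manifest: dict) -> bool:
    """True if a production Redis workload declares fewer than two replicas."""
    metadata = manifest.get("metadata", {})
    labels = metadata.get("labels", {})
    is_redis = labels.get("app") == "redis"                # hypothetical label
    is_production = metadata.get("namespace") == "production"  # hypothetical namespace
    # Kubernetes defaults spec.replicas to 1 when the field is omitted.
    replicas = manifest.get("spec", {}).get("replicas", 1)
    return is_redis and is_production and replicas < 2


# Hypothetical manifest resembling the instance involved in this incident.
single_instance = {
    "kind": "StatefulSet",
    "metadata": {
        "name": "redis-cache",
        "namespace": "production",
        "labels": {"app": "redis"},
    },
    "spec": {"replicas": 1},
}

if __name__ == "__main__":
    print(violates_single_instance_rule(single_instance))
```

In a real cluster the equivalent rule would be written as a Kyverno validation policy and evaluated at admission time, so a non-compliant manifest is rejected before it is deployed.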
Every Redis instance will be configured with policies to prevent crashes due to OOM events. We are implementing stricter internal policies and providing additional training for developers to prevent similar single points of failure in the future.