Unavailability of Box and REST API
Incident Report for signageOS
Postmortem

Date
2024-03-29

Authors
Michal Artazov, DevOps Lead

Summary
An inefficiently designed task in the CQRS/ES system led to a crash of one of the MongoDB replica sets.

Impact
Box and the REST API alternated between degraded performance and complete unavailability.

No devices in the field were affected; they continued to operate as expected.

Trigger
Undetected accumulation of events in the Event Sourcing database over time.

Detection
Monitoring detected the crash of one of the MongoDB databases and alerted the on-call engineer.

Root Causes
Events accumulated undetected in the Event Sourcing database over time, which gradually slowed down command processing. Eventually the accumulation reached a critical point at which too much data was queried from the database at once, causing the database to crash.
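
The report does not describe the internals of the affected task, so the following is only a minimal sketch of the general mechanism, not the actual signageOS implementation. It assumes a typical event-sourced design in which state is rebuilt by replaying stored events, and it shows how a snapshot limits how many events have to be loaded per command. All names (Event, Snapshot, InMemoryEventStore, rebuild) are hypothetical, and an in-memory list stands in for the database.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Event:
    aggregate_id: str
    version: int  # 1-based position of the event in its stream
    delta: int    # hypothetical payload: a change applied to a counter


@dataclass
class Snapshot:
    aggregate_id: str
    version: int  # version of the last event already folded into `state`
    state: int


@dataclass
class InMemoryEventStore:
    """Stand-in for the event database (assumption: events are read per aggregate)."""
    events: List[Event] = field(default_factory=list)

    def load(self, aggregate_id: str, after_version: int = 0) -> List[Event]:
        # Return only the events newer than `after_version` for one aggregate.
        return [
            e for e in self.events
            if e.aggregate_id == aggregate_id and e.version > after_version
        ]


def rebuild(
    store: InMemoryEventStore,
    aggregate_id: str,
    snapshot: Optional[Snapshot] = None,
) -> Tuple[int, int]:
    """Rebuild state by replaying events; return (state, number of events loaded)."""
    state = snapshot.state if snapshot else 0
    start = snapshot.version if snapshot else 0
    pending = store.load(aggregate_id, after_version=start)
    for event in pending:
        state += event.delta
    return state, len(pending)


# Usage: 1000 accumulated events for a single aggregate.
store = InMemoryEventStore([Event("box-1", v, 1) for v in range(1, 1001)])
full_replay = rebuild(store, "box-1")
with_snapshot = rebuild(store, "box-1", Snapshot("box-1", version=990, state=990))
```

Without a snapshot, every rebuild loads the full history, so the cost of each command grows with the number of accumulated events. With a snapshot, only the events recorded after it are loaded, and the resulting state is the same.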

Remediation
The team has implemented several steps to remediate the problem.

  • Improvements to the CQRS/ES task to prevent future excessive accumulation of events in the database.
  • Consolidation and cleanup of the events in the database to reduce them to a manageable number.
  • The team will discuss options for additional monitoring checks to detect similar issues early.
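
The monitoring checks mentioned in the list above are not specified in this report. As one possible illustration, and purely as an assumption, such a check could count events per aggregate and flag any stream that exceeds a threshold. The function name, the threshold, and the input shape below are hypothetical; in practice the counts would likely come from a database query rather than an in-memory list.

```python
from collections import Counter
from typing import Dict, Iterable


def find_oversized_streams(
    aggregate_ids: Iterable[str], max_events: int
) -> Dict[str, int]:
    """Return the aggregates whose event count exceeds `max_events`.

    `aggregate_ids` holds one entry per stored event (hypothetical input shape).
    """
    counts = Counter(aggregate_ids)
    return {agg: n for agg, n in counts.items() if n > max_events}


# Usage: "box-1" has 5 events and "box-2" has 2; the threshold is 3.
oversized = find_oversized_streams(["box-1"] * 5 + ["box-2"] * 2, max_events=3)
```

A check of this kind could run on a schedule and raise an alert when the result is non-empty, well before the event count reaches a level that affects the database.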
Posted Mar 29, 2024 - 10:30 CET

Resolved
Temporary unavailability of Box and REST API, with degraded performance.
Posted Mar 29, 2024 - 05:00 CET