Date
2024-03-29
Authors
Michal Artazov, DevOps Lead
Summary
An inefficiently designed task in the CQRS/ES system led to a crash of one of the MongoDB replica sets.
Impact
Box and API alternated between degraded performance and complete unavailability.
No devices in the field were affected; they continued to operate as expected.
Trigger
Undetected accumulation of events in the Event Sourcing database over time.
Detection
Monitoring detected the crash of one of the MongoDB databases and alerted the on-call engineer.
Root Causes
Events accumulated undetected in the Event Sourcing database over time, which gradually slowed down command processing. Eventually, the accumulation reached a critical point at which too much data was queried from the database at once, and the database crashed.
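The failure mode described above can be illustrated with a minimal sketch. It is not the actual task from the incident: the in-memory `EVENTS` list stands in for the Event Sourcing database, and the function names, the event shape, and the batch size of 500 are all assumptions made for illustration. It only shows why a read that returns every accumulated event at once grows with the event count, while a batched read keeps each query bounded.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical stand-in for the Event Sourcing database: 10,000 accumulated events.
EVENTS: List[Dict[str, int]] = [{"seq": i, "delta": 1} for i in range(10_000)]


def load_all(events: List[Dict[str, int]]) -> List[Dict[str, int]]:
    """Unbounded read: a single query returns every accumulated event."""
    return list(events)


def load_in_batches(
    events: List[Dict[str, int]], batch_size: int
) -> Iterator[List[Dict[str, int]]]:
    """Bounded read: each query returns at most batch_size events."""
    for start in range(0, len(events), batch_size):
        yield events[start:start + batch_size]


def replay_unbounded(events: List[Dict[str, int]]) -> Tuple[int, int]:
    """Rebuild state from one large read; the query size equals the event count."""
    loaded = load_all(events)
    state = sum(event["delta"] for event in loaded)
    return state, len(loaded)


def replay_batched(
    events: List[Dict[str, int]], batch_size: int = 500
) -> Tuple[int, int]:
    """Rebuild the same state while tracking the largest single query."""
    state = 0
    largest_query = 0
    for batch in load_in_batches(events, batch_size):
        largest_query = max(largest_query, len(batch))
        state += sum(event["delta"] for event in batch)
    return state, largest_query
```

Both functions compute the same state, but the second value they return differs: the unbounded variant reports a query size equal to the total number of events, whereas the batched variant never exceeds the chosen batch size, regardless of how many events have accumulated.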
Remediation
The team has implemented several steps to remediate the problem.