Date: March 17, 2024
Authors: Václav Boch, DevOps Engineer
Summary
A critical failure in one of our MongoDB telemetry clusters resulted in a regional system-wide outage, impacting core services. Recovery time 10 mins.
Impact
- Complete regional outage of Box and API.
- Both services returned 503 errors, preventing customer access.
- Devices in the field were not impacted.
Trigger
The incident was triggered by a segmentation fault (SegFault) occurring in rapid succession across multiple servers in the MongoDB platform cluster.
Detection
- The issue was initially identified through internal monitoring.
- Customer support tickets confirmed the impact shortly after detection.
Root Cause
- A segmentation fault (SegFault) in the MongoDB server led to instability and service disruption.
Remediation Actions
- Schedule an upgrade to a newer MongoDB version across all clusters to enhance stability.
- Improve microservice dependencies to ensure that future incidents result in only partial outages rather than full system failures.