Telemetry degradation
Incident Report for signageOS
Postmortem

Summary

  • We run 3 replicas of mongo-telemetry, which store Telemetry data (usually the last known state of every device).
  • 1 replica is always the primary and 2 replicas are secondaries.
  • The problem started with a PagerDuty alert about oplog replication lag on the primary replica. This is unusual: the primary is the only writable replica, so its oplog should never lag.
  • The fix required stopping the MongoDB Telemetry replicas.
  • The key clue that led to the resolution was the following log line on the MongoDB Telemetry primary replica:

    • "Flow control is engaged and the sustainer point is not moving. Please check the health of all secondaries."
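For reference, replication lag and the flow-control state can be inspected from mongosh. A minimal diagnostic sketch, assuming direct shell access to the replica set (these commands are standard mongosh helpers, not taken from the incident timeline):

```javascript
// Run in mongosh against the mongo-telemetry replica set.

// Per-secondary replication lag relative to the primary's last oplog entry:
rs.printSecondaryReplicationInfo();

// Flow-control state on the primary; "isLagged: true" means flow control is
// throttling writes because the majority-commit point is not advancing --
// the condition behind the "Flow control is engaged" log line above.
db.serverStatus().flowControl;
```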

Impact

  • The Services API and Box did not display some telemetry data correctly for 40 minutes.
  • No impact on connected devices.

Trigger

PagerDuty alert triggered by the Grafana alert "MongoDB Telemetry 0, Replication lag"

Detection

20:00 - immediately after the incident started

Root Causes

  • Page faults, most likely on the hardware underlying the MongoDB Telemetry instances
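As a sanity check for this hypothesis, mongod itself reports a page-fault counter that can be read from mongosh (Linux only; a diagnostic sketch, not a command from the original investigation):

```javascript
// Cumulative hard page faults for the mongod process since startup (Linux).
// A rapidly growing value under normal load points at memory/disk pressure
// on the host rather than at the workload itself.
db.serverStatus().extra_info.page_faults;
```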

Remediation

  • Increase IOPS on the MongoDB Telemetry data disks to 6,000 IOPS.
  • Change the MongoDB settings to favor high availability (HA) by reducing the consistency requirements on MongoDB Telemetry.

    • The MongoDB cluster should not depend on its secondaries. It should keep working as a standalone instance when all secondary replicas are down or unavailable.
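One way to implement the reduced-consistency remediation is to disable flow control and relax the cluster-wide default write concern to `w: 1`. A sketch only; the exact parameter values are assumptions, not the settings actually applied:

```javascript
// Disable flow control so lagging secondaries can no longer throttle the
// primary (flow control is what produced the "Flow control is engaged" log).
db.adminCommand({ setParameter: 1, enableFlowControl: false });

// Acknowledge writes on the primary alone instead of waiting for a majority.
db.adminCommand({
  setDefaultRWConcern: 1,
  defaultWriteConcern: { w: 1 },
});
```

Note that even with these settings, a MongoDB primary steps down if it cannot reach a majority of voting members, so fully standalone operation with both secondaries down would additionally require reconfiguring the replica set.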
Posted Feb 23, 2024 - 09:26 CET

Resolved
Temporary degradation in telemetry processing.
Posted Feb 17, 2024 - 20:00 CET