Telemetry degradation
Incident Report for signageOS
Postmortem

Summary

  • We run 3 replicas of mongo-telemetry, which store Telemetry data (usually the last known state of every device).
  • 1 replica is always the primary and 2 replicas are secondaries.
  • The problem started with a PagerDuty alert about oplog replication lag on the primary replica. This is unusual: the primary is the only writable replica, so its oplog should never lag.
  • The fix required stopping the MongoDB Telemetry replicas.
  • The key clue that led to the resolution was the following log line on the MongoDB Telemetry primary replica:

    • "Flow control is engaged and the sustainer point is not moving. Please check the health of all secondaries."
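For reference, replication lag and the flow-control state can be inspected from mongosh. A minimal diagnostic sketch, assuming direct shell access to the replica set (these commands are standard mongosh helpers, not taken from the incident timeline):

```javascript
// Run in mongosh against the mongo-telemetry replica set.

// Per-secondary replication lag relative to the primary's last oplog entry:
rs.printSecondaryReplicationInfo();

// Flow-control state on the primary; "isLagged: true" means flow control is
// throttling writes because the majority-commit point is not advancing --
// the condition behind the "Flow control is engaged" log line above.
db.serverStatus().flowControl;
```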

Impact

  • The Services API and Box did not display some telemetry data correctly for 40 minutes.
  • No impact on connected devices.

Trigger

PagerDuty alert triggered by the Grafana alert "MongoDB Telemetry 0, Replication lag"

Detection

20:00 - immediately after the incident started

Root Causes

  • Page faults, most likely on the hardware underlying the MongoDB Telemetry instances
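As a sanity check for this hypothesis, mongod itself reports a page-fault counter that can be read from mongosh (Linux only; a diagnostic sketch, not a command from the original investigation):

```javascript
// Cumulative hard page faults for the mongod process since startup (Linux).
// A rapidly growing value under normal load points at memory/disk pressure
// on the host rather than at the workload itself.
db.serverStatus().extra_info.page_faults;
```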

Remediation

  • Increase IOPS on the MongoDB Telemetry data disks to 6,000 IOPS.
  • Change the MongoDB settings to favor high availability (HA) by reducing the consistency requirements on MongoDB Telemetry.

    • The MongoDB cluster should not depend on its secondaries. It should keep working as a standalone instance when all secondary replicas are down or unavailable.
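One way to implement the reduced-consistency remediation is to disable flow control and relax the cluster-wide default write concern to `w: 1`. A sketch only; the exact parameter values are assumptions, not the settings actually applied:

```javascript
// Disable flow control so lagging secondaries can no longer throttle the
// primary (flow control is what produced the "Flow control is engaged" log).
db.adminCommand({ setParameter: 1, enableFlowControl: false });

// Acknowledge writes on the primary alone instead of waiting for a majority.
db.adminCommand({
  setDefaultRWConcern: 1,
  defaultWriteConcern: { w: 1 },
});
```

Note that even with these settings, a MongoDB primary steps down if it cannot reach a majority of voting members, so fully standalone operation with both secondaries down would additionally require reconfiguring the replica set.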
Posted Feb 23, 2024 - 09:26 CET

Resolved
Temporary degradation in telemetry processing.
Posted Feb 17, 2024 - 20:00 CET