System degradation
Incident Report for signageOS
Postmortem

Date

2024-01-29

Authors

Michael Zabka, CTO

Michal Artazov, DevOps Lead

Vaclav Boch, DevOps

Display connectivity issues and delayed queue processing

Impact: Displays had trouble connecting, and the system was slow to respond to commands it received.

Trigger: Overloaded platform services.

Detection: Internal monitoring and customer tickets.

Root Causes: The services that communicate directly with displays became overloaded. Response times increased, which caused displays to reconnect. After several failed attempts, the displays fell back to our backup communication system. The fallback generated a large number of system messages, which overloaded our configuration servers and further increased overall system latency.
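
The fallback step in this chain can be sketched roughly as follows. This is an illustrative model only, not signageOS code: the function names and the attempt limit of 3 are assumptions (the report says only "several failed attempts"), and real displays would make network calls rather than invoke a callback.

```python
from typing import Callable

# Hypothetical limit; the report does not state the actual number of attempts.
MAX_PRIMARY_ATTEMPTS = 3


def choose_channel(try_primary: Callable[[], bool],
                   max_attempts: int = MAX_PRIMARY_ATTEMPTS) -> str:
    """Return "primary" if any connection attempt succeeds, else "backup"."""
    for _ in range(max_attempts):
        if try_primary():
            return "primary"
    # Every attempt failed, so the display falls back to the backup system.
    return "backup"


def count_fallbacks(displays: int, primary_healthy: bool) -> int:
    """Count how many displays end up on the backup channel."""
    return sum(
        choose_channel(lambda: primary_healthy) == "backup"
        for _ in range(displays)
    )
```

In this simplified model, when the primary services are unhealthy every display ends up on the backup channel at roughly the same time, which is the kind of synchronized shift that can overload a downstream component.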

Remediation:

  • Improved platform autoscaling to leave more headroom for additional traffic.
  • Moved the underlying EC2 instances to more powerful instance types to absorb CPU spikes.
  • Improved monitoring so that similar incidents are detected proactively.
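
As a rough illustration of what autoscaling headroom means, the sketch below sizes a fleet so that each instance runs at a chosen target utilization instead of at full capacity. The function, its parameters, and the numbers in the usage note are hypothetical; the report does not describe the actual scaling policy or thresholds.

```python
import math


def desired_instances(current_load: float,
                      capacity_per_instance: float,
                      target_utilization: float) -> int:
    """Instances needed so each one runs at or below target_utilization.

    All inputs are illustrative; load and capacity share the same unit
    (for example, connections or requests per second).
    """
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    usable_capacity = capacity_per_instance * target_utilization
    # Always keep at least one instance running.
    return max(1, math.ceil(current_load / usable_capacity))
```

For example, with a load of 1000 and a capacity of 100 per instance, a target utilization of 1.0 gives 10 instances with no spare room, while a target of 0.5 gives 20 instances, leaving half of each instance free to absorb a spike.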
Posted Feb 03, 2024 - 17:37 CET

Resolved
This incident has been resolved.
Posted Jan 29, 2024 - 14:11 CET
Identified
The issue has been identified and a fix is being implemented.
Posted Jan 29, 2024 - 12:59 CET
This incident affected: API, Box, and Platform.