Devices presented as offline in Box
Incident Report for signageOS
Postmortem

Incident Summary:

During a routine deployment of platform-v2, Box stopped showing the current device status correctly, and devices were presented as offline.

Impact:

The deployment caused all devices to reconnect simultaneously, producing a significant surge in data and connections that averaged 4,000 socket connections per minute. The reconnection phase was completed within 30 minutes, followed by a 40-minute period to process the backlog of data.

Partial Service Outage:

The incident led to partial service degradation, affecting the timely delivery of screenshots, certain device information, and telemetry data.

Remediation Steps:

To prevent a recurrence, we are implementing an improved deployment strategy. This will involve a more controlled and gradual rollout of changes to avoid sudden spikes in device connections. The updated process will ensure a smoother and more stable update experience.
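
As an illustration of what a controlled, gradual rollout could look like, the following is a minimal sketch only, not the actual signageOS deployment tooling. It assumes the fleet can be split into batches and that each batch is triggered one minute apart, so the number of reconnects per minute never exceeds a chosen cap. The names `plan_rollout`, `rollout_duration_minutes`, and `max_reconnects_per_minute` are hypothetical.

```python
import math
from typing import Iterator, List, Sequence, Tuple


def plan_rollout(
    device_ids: Sequence[str],
    max_reconnects_per_minute: int,
) -> Iterator[Tuple[int, List[str]]]:
    """Yield (minute_offset, batch) pairs.

    Each batch holds at most `max_reconnects_per_minute` devices, and
    batches are scheduled one minute apart (an assumption of this sketch).
    """
    if max_reconnects_per_minute <= 0:
        raise ValueError("max_reconnects_per_minute must be positive")
    for start in range(0, len(device_ids), max_reconnects_per_minute):
        minute_offset = start // max_reconnects_per_minute
        batch = list(device_ids[start:start + max_reconnects_per_minute])
        yield minute_offset, batch


def rollout_duration_minutes(device_count: int, max_reconnects_per_minute: int) -> int:
    """Number of one-minute slots needed to reconnect the whole fleet."""
    if max_reconnects_per_minute <= 0:
        raise ValueError("max_reconnects_per_minute must be positive")
    return math.ceil(device_count / max_reconnects_per_minute)


if __name__ == "__main__":
    # Hypothetical fleet of 10 devices with a cap of 4 reconnects per minute.
    fleet = [f"device-{i}" for i in range(10)]
    for minute, batch in plan_rollout(fleet, max_reconnects_per_minute=4):
        print(f"minute {minute}: {batch}")
```

With purely hypothetical numbers, a fleet of 10,000 devices and a cap of 1,000 reconnects per minute would spread the reconnection over 10 one-minute batches instead of a single spike. The trade-off is a longer rollout in exchange for a flatter connection curve.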

Timeline of Events:

The incident began: 13:18 GMT
Incident resolved: 14:12 GMT

Posted Nov 24, 2023 - 13:44 CET

Resolved
Posted Nov 22, 2023 - 13:38 CET