Incident Summary:
During a routine deployment of platform-v2, we encountered unexpected issues with showing the current device status.
Impact:
The deployment resulted in a significant surge in data and connections. This was due to all devices reconnecting simultaneously, averaging 4000 socket connections per minute. The reconnection phase was completed within 30 minutes, followed by a 40-minute period to process the backlog of data.
Partial Service Outage:
The incident led to partial service degradation, affecting the timely delivery of screenshots, certain device information, and telemetry data.
Remediation Steps:
To prevent a recurrence, we are implementing an improved deployment strategy. This will involve a more controlled and gradual rollout of changes to avoid sudden spikes in device connections. The updated process will ensure a smoother and more stable update experience.
Timeline of Events:
The incident began: 13:18 GMT Incident resolved: 14:12 GMT