Incident - REST API (SDK) applet upload and login endpoints failing with 500 response code
Incident Report for signageOS
Postmortem

Date

2022-06-29

Authors

Michael Zabka, CTO

Michal Artazov, DevOps Lead

Summary

Some REST API endpoints serving signageOS Trial and Free-tier customers saw increased traffic between 20:30 and 22:00 UTC. Response times for some requests then rose to 60 seconds or more. The automated REST API monitoring system notified the DevOps team via PagerDuty, and the team started analyzing the problem.

Shortly after, the team discovered the cause of the problem. Since the last deployment maintenance window, the REST API connection to the MongoDB database for Trial and Free-tier users had been temporarily configured to use a migration database instance. This was a human error: the instance was not meant for long-term production traffic.

The DevOps team switched the REST API database connection back to the original production MongoDB instance. After all REST API instances were redeployed, the issue was resolved.

Impact

A minority of REST API endpoints were affected: Applet management and Account session (login).

Requests to these endpoints were delayed and, in some cases, did not complete within 60 seconds. Those requests timed out and returned 50x error status codes.
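The timeout behavior described above can be illustrated with a minimal sketch. It is not the signageOS implementation: the function name `respond_with_deadline`, the use of Python's `asyncio`, and the choice of 504 as the specific 50x code are assumptions made for illustration. Only the 60-second limit comes from this report.

```python
import asyncio

# The 60-second limit is taken from this report; everything else is illustrative.
REQUEST_TIMEOUT_SECONDS = 60.0


async def respond_with_deadline(handler, timeout=REQUEST_TIMEOUT_SECONDS):
    """Run a request handler and map a missed deadline to a 50x status.

    `handler` is a hypothetical coroutine function standing in for an
    endpoint that waits on a slow database.
    """
    try:
        body = await asyncio.wait_for(handler(), timeout=timeout)
        return 200, body
    except asyncio.TimeoutError:
        # 504 is an assumed status code; the report only says "50x".
        return 504, "request timed out"
```

When the database behind a handler is starved of CPU, the handler may not finish before the deadline, so the caller sees a 50x response even though the request itself was valid.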

Trigger

Unexpectedly high traffic on newly deployed features related to Applet management.

The MongoDB connection was temporarily configured to use a cluster instance meant for migration purposes.

Detection

Detected by an automated monitoring system at 21:01 UTC.

Confirmed by tickets from Trial customers at 21:30 UTC.

Root Causes

The incorrect MongoDB connection configuration directed REST API traffic to the migration instance, which had insufficient CPU resources for that load and caused delayed responses.

Remediation

New alerts that verify the REST API service configuration will be added to internal monitoring.
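One possible shape for such a configuration check is sketched below. It is a hedged example, not the alert signageOS deployed: the host names, the allow-list `ALLOWED_PRODUCTION_HOSTS`, and both function names are hypothetical. The idea is to extract the hosts from the configured MongoDB connection string and raise an alert if any of them is not a known production host.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list; the real production host names are not part of this report.
ALLOWED_PRODUCTION_HOSTS = {
    "mongo-prod-1.example.internal",
    "mongo-prod-2.example.internal",
}


def hosts_in_mongodb_uri(uri: str) -> list[str]:
    """Return the host names in a MongoDB URI, without credentials or ports."""
    netloc = urlsplit(uri).netloc
    host_list = netloc.rpartition("@")[2]  # drop "user:password@" if present
    # A MongoDB URI may list several comma-separated "host:port" entries.
    return [entry.partition(":")[0].lower() for entry in host_list.split(",") if entry]


def unexpected_hosts(uri: str, allowed: set[str]) -> list[str]:
    """Return configured hosts that are not on the allow-list.

    An empty list means the connection points only at expected hosts.
    """
    return [host for host in hosts_in_mongodb_uri(uri) if host not in allowed]
```

A monitoring job could run `unexpected_hosts` against the connection string of each deployed REST API instance and page the team whenever the result is non-empty.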

Posted Jun 29, 2022 - 14:33 CEST

Resolved
Some REST API endpoints serving signageOS Trial and Free-tier customers saw increased traffic between 20:30 and 22:00 UTC. Response times on the affected endpoints rose to 60 seconds or more. The automated REST API monitoring system notified the DevOps team via PagerDuty, and the team started analyzing the problem.
Shortly after, the cause of the problem was discovered. The REST API connection to MongoDB database for trial and free-tier users was temporarily configured to use a migration database instance since the last deployment maintenance window. The instance was not meant for long-term production traffic.
The DevOps team switched the REST API database connection back to the original production MongoDB instance. After all REST API instances were redeployed, the issue was resolved.
Posted Jun 28, 2022 - 23:00 CEST