Date
2022-06-29
Authors
Michael Zabka, CTO
Michal Artazov, DevOps Lead
Summary
Between 20:30 and 22:00 UTC, some REST API endpoints serving signageOS Trial and Free-tier customers saw increased traffic. Response times for some requests then rose to 60 seconds or more. The automated REST API monitoring system notified the DevOps team via PagerDuty, and the team started analyzing the problem.
Shortly after, the team discovered the cause. Since the last deployment maintenance window, the REST API connection to the MongoDB database for Trial and Free-tier users had been temporarily configured to use a migration database instance. This was a human error: the instance was not meant for long-term production traffic.
The DevOps team switched the REST API database connection back to the original production MongoDB instance. After all REST API instances were redeployed, the issue was resolved.
Impact
A minority of REST API endpoints were affected: Applet management and Account session (login).
Requests to these endpoints were delayed. In some cases they did not complete within 60 seconds and timed out, returning 50x error status codes.
Trigger
Unexpectedly high traffic on newly deployed features related to Applet management.
The MongoDB connection was temporarily pointed at a cluster instance intended only for migration purposes.
Detection
Detected by an automated monitoring system at 21:01 UTC.
Confirmed by tickets from Trial customers at 21:30 UTC.
Root Causes
The REST API's MongoDB connection was incorrectly configured to use the migration instance, which had insufficient CPU resources for production traffic. This caused the delayed responses.
Remediation
New alerts that verify the REST API service is correctly configured will be added to internal monitoring.
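As an illustration only, a configuration check of this kind could compare the hosts in the REST API's MongoDB connection string against an allowlist of production hosts and raise an alert on any mismatch. The sketch below is not the team's actual implementation: the host names, the allowlist, and the function names are all hypothetical, and IPv6 host literals are not handled.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist; the real production host names are not given in this report.
ALLOWED_PRODUCTION_HOSTS = {
    "mongo-prod-1.example.internal",
    "mongo-prod-2.example.internal",
}


def hosts_from_mongodb_uri(uri: str) -> list[str]:
    """Return the host names listed in a mongodb:// or mongodb+srv:// URI."""
    parts = urlsplit(uri)
    if parts.scheme not in ("mongodb", "mongodb+srv"):
        raise ValueError(f"not a MongoDB URI: {uri!r}")
    # Drop any "user:password@" prefix, then split the comma-separated host list.
    host_list = parts.netloc.rsplit("@", 1)[-1]
    # Strip the optional ":port" suffix from each host (IPv6 literals not supported).
    return [h.rsplit(":", 1)[0].lower() for h in host_list.split(",") if h]


def unexpected_hosts(uri: str, allowed=ALLOWED_PRODUCTION_HOSTS) -> list[str]:
    """Return configured hosts that are not in the production allowlist."""
    return [h for h in hosts_from_mongodb_uri(uri) if h not in allowed]


if __name__ == "__main__":
    # Hypothetical connection string pointing at a migration instance.
    configured = "mongodb://mongo-migration.example.internal:27017/app"
    bad = unexpected_hosts(configured)
    if bad:
        # A real check would notify the on-call channel instead of printing.
        print(f"ALERT: REST API is connected to non-production hosts: {bad}")
```

Run periodically or as a post-deployment step, a check like this would flag a temporary connection setting that was left in place after a maintenance window, independently of latency-based monitoring.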