Investigating possible connectivity issues for some regions
Incident Report for signageOS
Postmortem

Date

2023-08-14/15

Authors

Michael Zabka, CTO

Michal Artazov, DevOps Lead

Vaclav Boch, DevOps

Summary

Throughout Tuesday, 15th August 2023, signageOS experienced an incident causing delays in processing screenshots, device connections and some device telemetries. The issue was caused by a suboptimal check of the connection type performed when devices connect to signageOS for the first time, followed by a feedback loop that caused traffic to grow exponentially. This report provides a detailed analysis of the impact, trigger, detection, root causes, and the steps taken for remediation.

Impact

The issue had a negative impact on screenshot processing, on establishing connections from newly provisioned devices, and on timely reporting of some telemetries. No content playback was impacted. All devices continued to run as expected.

Trigger

A large number of new devices were provisioned at the same time.

Detection

The issue was detected by our internal alerting system, which continuously monitors various metrics collected from our systems and triggers an alert when a metric crosses a set threshold. Alerts are sent to PagerDuty, which notified the engineers on call at the time. The alert was specifically triggered by the size of a message queue growing over the set threshold. This queue contains pending requests to process new screenshots, connections and telemetries. Thanks to this proactive monitoring, we were able to identify the issue and start the investigation promptly.
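For illustration, the check behind this alert can be reduced to a simple threshold rule on queue depth. The sketch below is a hypothetical simplification in TypeScript; the threshold value and the callbacks for reading the queue depth and paging the on-call engineer are assumptions, not our production configuration.

```typescript
// Hypothetical sketch of the queue-depth alert; the threshold and the way the
// queue depth is read and the on-call engineer is paged are illustrative.
const QUEUE_DEPTH_THRESHOLD = 10_000;

async function checkPendingQueue(
	getQueueDepth: () => Promise<number>,     // e.g. read from the message broker's management API
	page: (summary: string) => Promise<void>, // e.g. create a PagerDuty incident
): Promise<void> {
	const depth = await getQueueDepth();
	if (depth > QUEUE_DEPTH_THRESHOLD) {
		await page(
			`Pending screenshot/connection/telemetry queue depth ${depth} exceeds ${QUEUE_DEPTH_THRESHOLD}`,
		);
	}
}
```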

Root Causes

The root cause was a crash of one of our MongoDB databases. Due to the sudden increase in the number of devices, it ran out of resources and crashed; in fact, two out of three replicas in the replica set crashed. This affected all services that use this database, causing an outage in the processing of screenshots and some telemetries. The services that handle incoming traffic from devices were not affected, however, which resulted in a large influx of pending data that could not be processed. This, in turn, affected other services, causing a chain reaction.

Remediation

The team has implemented several key optimizations that prevent this same problem from occurring in the future.

The first optimization is shortening the path to process a new screenshot. Previously, a new screenshot would first be sent to an intermediate service for validation and then to the final service that writes it to the database. The team confirmed that the intermediate validation step was the bottleneck and that it could be safely removed, because it no longer serves the purpose it was originally intended for. This optimization has doubled the speed of processing new screenshots.
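As a rough illustration of this change (the message shape and persistence callback below are hypothetical), the screenshot is now handled by a single consumer that writes it directly, instead of being re-published to an intermediate validation service first:

```typescript
// Hypothetical sketch of the shortened screenshot path; field names and the
// persistence callback are illustrative.
interface ScreenshotMessage {
	deviceUid: string;
	takenAt: string;  // ISO 8601 timestamp
	imageUri: string; // location of the uploaded image
}

// Before: the consumer validated the message and re-published it to a second
// queue, where another service wrote it to the database.
// After: the single consumer persists the screenshot directly (one hop instead of two).
async function handleScreenshot(
	message: ScreenshotMessage,
	persist: (message: ScreenshotMessage) => Promise<void>,
): Promise<void> {
	await persist(message);
}
```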

Secondly, the team has made processing of the affected telemetries faster by applying a less conservative write strategy. By default, all incoming requests use a conservative write strategy that wraps writes in database transactions and waits for the data to be replicated across the database replica set. This strategy was not necessary for the affected data: for device telemetry, performance is the priority and, in the worst case, temporary data loss is acceptable. Applying a more relaxed write strategy to this data has increased its processing speed 10x.
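For illustration, with the MongoDB Node.js driver the difference roughly corresponds to the write concern passed with each insert (the transaction part is omitted for brevity). The collection type and document shape below are assumptions, not our actual schema.

```typescript
import { Collection } from "mongodb";

// Hypothetical telemetry document; field names are illustrative.
interface TelemetryDocument {
	deviceUid: string;
	type: string;
	value: unknown;
	recordedAt: Date;
}

// Conservative default: wait until a majority of replica-set members have
// acknowledged and journaled the write before returning.
async function writeWithConservativeStrategy(
	collection: Collection<TelemetryDocument>,
	doc: TelemetryDocument,
): Promise<void> {
	await collection.insertOne(doc, { writeConcern: { w: "majority", j: true } });
}

// Relaxed strategy applied to the affected telemetry: return as soon as the
// primary accepts the write; replication to the other members happens in the background.
async function writeTelemetry(
	collection: Collection<TelemetryDocument>,
	doc: TelemetryDocument,
): Promise<void> {
	await collection.insertOne(doc, { writeConcern: { w: 1, j: false } });
}
```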

Thirdly, the team has addressed one of the main issues that contributed to the situation. When a device goes online, it makes a request to our server and the server responds with the configuration for that particular device. The device then uses this information to configure itself into the desired state. When the server receives a request from a device, it checks whether the request came over HTTP or HTTPS and compares that to the lastUsedProtocol field in the database. If they differ, it creates a new write request to store the new protocol. The issue was that a brand new device has no value in the database yet, so every device connecting for the first time generates at least one such write.

Due to the factors discussed above, the processing of write requests became too slow. Devices fetch their configuration every minute, so when the server received the next request from the same device, the previous request to write the protocol had not yet been processed; the server compared the new protocol to the old empty value and generated yet another write request. This kept repeating for each affected device, creating a feedback loop. To make matters worse, at least 5000 devices were stuck in this loop at the time, generating new data at an exponential rate. The team has patched the code so that it treats an empty value of lastUsedProtocol as HTTPS. Since HTTPS is the default for all new devices, this prevented any more redundant writes from being generated.
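In simplified terms, the patch amounts to defaulting the missing value before comparing it to the protocol of the incoming request. The field name below follows this report; the surrounding types and function are hypothetical.

```typescript
type Protocol = "HTTP" | "HTTPS";

// Hypothetical shape of the stored device record; lastUsedProtocol follows the report.
interface DeviceRecord {
	uid: string;
	lastUsedProtocol?: Protocol;
}

// Returns true only when the protocol actually changed. Treating a missing value
// as HTTPS (the default for new devices) stops first-time connections from
// enqueueing a new write on every configuration poll.
function protocolChanged(device: DeviceRecord, requestProtocol: Protocol): boolean {
	const storedProtocol = device.lastUsedProtocol ?? "HTTPS";
	return storedProtocol !== requestProtocol;
}
```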

Last but not least, the team has scaled up the affected MongoDB machines so that they have a larger buffer of free resources. The team has also improved the monitoring so that it notifies us in advance the next time the database starts running low on resources.

In modern systems, most serious issues are complex. There is always some trigger that throws other components out of balance, sometimes causing a chain reaction throughout the system. When that happens, it exposes weaknesses in the system's architecture or in some of its components. It is impossible to fully prevent this, but it is possible to improve the ability to contain such issues and to fix the discovered weaknesses. Our team takes these situations very seriously. For us, each one is an opportunity to make the whole system even more stable, so that next time it can handle even more devices and more user activity, and if another issue does occur, its impact is more localised and affects a smaller number of features and services.

Posted Aug 18, 2023 - 19:15 CEST

Resolved
This incident has been resolved.
Posted Aug 14, 2023 - 23:12 CEST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Aug 14, 2023 - 22:42 CEST
Investigating
We are currently investigating this issue.
Posted Aug 14, 2023 - 21:53 CEST
This incident affected: Platform.