On October 1st between 14:49 and 15:44 CET, the Management API experienced degraded performance. This affected parts of the Norce Admin UI and resulted in longer response times, failed requests, or both. The cause was a temporary network problem that intermittently prevented the Management API Service from connecting to our SQL infrastructure.
The problem was mitigated by routing requests from the Admin UI to a different set of API Services while troubleshooting continued.
The alarms we have in place for detecting longer response times and API downtime worked as expected, and a team of developers and infrastructure engineers quickly assembled to address the problem. However, we are taking additional steps to ensure a quick reaction time when incidents like this occur, and further analysis will be done to identify the root cause of the network instability.
Service Affected: Management API / Admin UI
Timeframe: Oct 1 14:49 – 15:44 CET
This incident was caused by a temporary network stability issue isolated to instances of the Management API service. Norce Engineers responded quickly and executed the network stability playbook, including application and compute node restarts. When those actions didn’t mitigate the issue, the focus shifted to the Admin UI, and changes were implemented that allowed the Admin UI to direct requests to an alternative set of APIs.
Norce Engineers responded in a timely manner and followed a well-established playbook for stability issues before making changes to the application.
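For readers who want a concrete picture of the mitigation, the idea of directing requests to an alternative set of APIs can be sketched roughly as below. This is a minimal illustration only, not Norce's actual implementation: the function name `call_with_failover`, the exception `AllEndpointsFailed`, and the example URLs are all hypothetical, and the network call is simulated so that the sketch runs on its own.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


class AllEndpointsFailed(Exception):
    """Raised when every endpoint in the list has failed (hypothetical name)."""


def call_with_failover(endpoints: Sequence[str], send: Callable[[str], T]) -> T:
    """Try each endpoint in order and return the first successful result.

    `send` is any callable that performs the request against a base URL and
    raises ConnectionError or TimeoutError when the endpoint is unreachable.
    """
    errors: List[Tuple[str, Exception]] = []
    for base_url in endpoints:
        try:
            return send(base_url)
        except (ConnectionError, TimeoutError) as exc:
            # Remember the failure and move on to the next set of services.
            errors.append((base_url, exc))
    raise AllEndpointsFailed(errors)


# Simulated request: the primary services fail, the alternative set responds.
def fake_send(base_url: str) -> str:
    if "primary" in base_url:
        raise ConnectionError("simulated connectivity failure")
    return f"ok from {base_url}"


# Hypothetical URLs, used only for illustration.
ENDPOINTS = [
    "https://primary.example.invalid",
    "https://alternative.example.invalid",
]

result = call_with_failover(ENDPOINTS, fake_send)
print(result)
```

In this sketch the caller falls back automatically on a connection error. During the incident itself the redirection was a change implemented by engineers, so the automatic fallback shown here is an assumption made for the sake of the example.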