Between 09:30 and 10:28 CET on October 24th, the Commerce Services API and Admin API experienced degraded performance resulting in increased response times and, in some cases, failed requests. The incident was caused by an unexpected surge in load on one of the SQL clusters within our multi-tenant production environment — a so-called “noisy neighbor” scenario where one tenant’s workload impacted the performance of others. The elevated load led to a cascading performance degradation, with automatic scale-out events and external retry mechanisms further amplifying the strain on both the database and application layers.
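For illustration, the sketch below shows the kind of client-side retry behavior that avoids this amplification: a hard cap on attempts combined with exponentially growing, randomized delays, so retries spread out instead of piling onto an already degraded service. The function names and parameters are examples only and do not refer to our client libraries.

```typescript
// Illustrative only: capped retries with exponential backoff and full jitter.
// Unbounded, immediate retries are what turn a slow dependency into a
// self-amplifying overload; a cap plus growing, randomized delays spreads
// retry traffic out instead of concentrating it on the degraded service.

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function fetchWithBackoff(
  url: string,
  maxAttempts = 4,    // hard cap on attempts per logical request
  baseDelayMs = 200,  // first retry waits up to ~200 ms
): Promise<Response> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      const response = await fetch(url);
      // Only retry server-side errors; client errors will not improve on retry.
      if (response.status < 500) return response;
      lastError = new Error(`HTTP ${response.status}`);
    } catch (err) {
      lastError = err;
    }
    // Full jitter: wait a random duration up to an exponentially growing ceiling.
    const ceiling = baseDelayMs * 2 ** attempt;
    await sleep(Math.random() * ceiling);
  }
  throw lastError;
}
```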
The issue was mitigated by isolating the tenant workload that caused the initial problem and restarting API instances that were stuck in a failed state, restoring full service functionality by 10:28 CET. Our alerting systems functioned as intended, enabling a coordinated and timely response from our engineering and infrastructure teams.
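Instances "stuck in a failed state" are the kind of condition a liveness check can surface automatically. The following minimal sketch (names and thresholds are illustrative, not our actual implementation) shows an instance reporting itself unhealthy once it has gone too long without a successful database query, so an orchestrator can recycle it without manual intervention.

```typescript
// Illustrative only: a liveness endpoint that lets the orchestrator restart
// an instance that is wedged, instead of waiting for a manual restart.
import { createServer } from "node:http";

// Assumed application state: the time of the last successfully completed
// database query, updated by the request-handling code elsewhere.
let lastSuccessfulQueryAt = Date.now();

export function recordSuccessfulQuery(): void {
  lastSuccessfulQueryAt = Date.now();
}

const STALENESS_LIMIT_MS = 30_000; // treat the instance as stuck after 30 s

createServer((req, res) => {
  if (req.url === "/healthz") {
    const stale = Date.now() - lastSuccessfulQueryAt > STALENESS_LIMIT_MS;
    // A 503 here tells the liveness probe to recycle this instance.
    res.writeHead(stale ? 503 : 200, { "content-type": "text/plain" });
    res.end(stale ? "unhealthy" : "ok");
  } else {
    res.writeHead(404);
    res.end();
  }
}).listen(8080);
```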
As a result of this incident, we are implementing additional safeguards to reduce the risk of similar occurrences. These include enhanced tenant workload isolation and faster automated recovery at the application layer. Together, these measures will strengthen the overall resilience of the platform and shorten recovery times in the event of future disruptions.
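As a rough illustration of tenant workload isolation at the application layer, the sketch below caps the number of in-flight database calls per tenant so that one tenant's burst cannot monopolize the shared cluster. The limit, identifiers, and function names are assumptions for the example rather than a description of our actual safeguards.

```typescript
// Illustrative only: a per-tenant concurrency cap in front of a shared
// database pool, so a single tenant's burst cannot consume every connection.
// The limit and tenant identifiers here are assumptions for the example.

const MAX_CONCURRENT_PER_TENANT = 10;
const inFlight = new Map<string, number>();

export async function withTenantLimit<T>(
  tenantId: string,
  work: () => Promise<T>,
): Promise<T> {
  const current = inFlight.get(tenantId) ?? 0;
  if (current >= MAX_CONCURRENT_PER_TENANT) {
    // Fail fast for the noisy tenant rather than queuing work that would
    // pile onto the shared cluster and slow every other tenant down.
    throw new Error(`Tenant ${tenantId} is over its concurrency limit`);
  }
  inFlight.set(tenantId, current + 1);
  try {
    return await work();
  } finally {
    inFlight.set(tenantId, (inFlight.get(tenantId) ?? 1) - 1);
  }
}
```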