Episerver Digital Experience Cloud™ Service (DXC Service) is Episerver's cloud-based offering, built on Microsoft cloud technology. It delivers high availability and performance, easy connectivity with other cloud services and existing systems, the ability to manage spikes in customer demand, and a platform that is ready to seamlessly adopt the latest technology updates.
On June 19, 2019, DXC Service (DXC-S) customers were unable to run deployments. Microsoft provided the following root cause analysis.
Between 05:06 and 07:46 UTC on 19 Jun 2019, a subset of customers may have experienced latency, timeouts, or HTTP 500-level response codes while performing service management operations such as "site create", "delete" and "move resources". Auto-scaling and the loading of site metrics may also have been impacted. Azure Resource Manager (ARM) deployments containing App Service resources may have failed with the error message "Internal Server Error".
Microsoft determined that, as part of ongoing work to drive platform resilience and ensure service stability, a configuration change was made to the App Service Resource Provider, the part of the service architecture that processes service management requests such as “create”, “delete”, and “update”. The initial configuration change was applied successfully, but a follow-up update had an unexpected impact on the systems that handle service management requests, and a subset of customers experienced failures as a result. Existing App Service resources would not have been impacted by this issue, but auto-scale operations may also have failed during this time, which could have impacted a site’s ability to scale to meet demand.
The configuration that caused the issue relates to how App Service processes management requests between regions. Microsoft updates this logic continuously to increase availability and resiliency on a region-by-region basis, but during this specific update the logic encountered unexpected data that is observable only in the production environment. The unexpected data was not handled gracefully, which caused the logic to crash for a small percentage (~1%) of incoming requests. As a result, the impacted management requests failed before they could be processed.
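The failure mode described above can be illustrated with a minimal sketch. This is not Microsoft's code: the routing table, the endpoint names, and the functions `route_strict`, `route_graceful`, and `failure_rate` are all hypothetical, and the "unexpected data" is assumed here to be a request with no region set. The sketch only shows how one unhandled record in a hundred yields a failure rate of about 1%, and how a fallback avoids it.

```python
from typing import Callable, List, Optional

# Hypothetical routing table (assumption): region name -> management endpoint.
ROUTES = {"westeurope": "rp-weu", "northeurope": "rp-neu"}
DEFAULT_ROUTE = "rp-default"


def route_strict(region: Optional[str]) -> str:
    """Route a request with no handling of unexpected data."""
    # Raises AttributeError if region is None, KeyError if it is unknown.
    return ROUTES[region.lower()]


def route_graceful(region: Optional[str]) -> str:
    """Route a request, falling back to a default for unexpected data."""
    if not region:
        return DEFAULT_ROUTE
    return ROUTES.get(region.lower(), DEFAULT_ROUTE)


def failure_rate(router: Callable[[Optional[str]], str],
                 regions: List[Optional[str]]) -> float:
    """Return the fraction of requests for which the router raised."""
    failed = 0
    for region in regions:
        try:
            router(region)
        except (KeyError, AttributeError):
            failed += 1
    return failed / len(regions)


# Assumed traffic mix: 99 well-formed requests and 1 with unexpected data.
requests = ["westeurope"] * 99 + [None]

print(failure_rate(route_strict, requests))    # 0.01
print(failure_rate(route_graceful, requests))  # 0.0
```

Whether a default route is the right fallback depends on the system; rejecting the request with a clear client error is another reasonable choice.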
Due to the nature of the App Service’s service management architecture, the impact to overall customer requests was limited, but the location of impact was not confined to any one specific region.
To mitigate, Microsoft isolated the specific update that had caused the issue and rolled it back, which restored service management functionality. Microsoft then monitored for an extended period to ensure that full service had been restored for customers.
Microsoft sincerely apologizes for the impact on affected customers and for any inconvenience this may have caused. Microsoft is continuously taking steps to improve the Microsoft Azure Platform and its processes to help ensure such incidents do not occur in the future.