Issues with Episerver Find EU
Incident Report for Optimizely Service
Postmortem

Summary

On Wednesday morning we had a FIND event in the EMEA region affecting one of our clusters. This report provides additional details about that event.

Episerver FIND is a platform service that extends the search features of Episerver, allowing you to build advanced filtering and faceted navigation based on the behaviour of website visitors. The service is used by both on-premise hosted applications and Digital Experience Cloud hosted applications.

Details

The first alert was triggered at 2016-10-12 07:28 CEST by our monitoring system. The alert was sent through our automated alert triage system to the technical team, which acts on these alerts.

The technical team started troubleshooting at 07:31 and found the issue to be isolated to a specific Elasticsearch cluster. Initial investigation showed that several nodes in the cluster were not answering requests because of problems with garbage collection. The technical team attempted a gentle restart of these nodes, but they were unresponsive. The decision was made to do a hard reset of the whole cluster to get back to normal functionality as quickly as possible. The restart was executed at 07:50 CEST. At 08:05 the global monitoring system reported the cluster to be functional again. The cluster reported full functionality at 09:17.

07:28 CEST - First alarm is triggered.

07:31 CEST - The technical team starts investigation of the issue.

07:42 CEST - It is found that some of the nodes in the cluster don't respond to requests due to problems with garbage collection. A gentle restart of these nodes is attempted but fails.

07:50 CEST - A hard reset of the whole cluster is performed to get back to normal functionality as quickly as possible.

08:05 CEST - The global monitoring system reports that the cluster is functional again.

09:17 CEST - The cluster reports full functionality restored.
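
For readers who want the elapsed times, the timeline above can be converted into minutes since the first alert. The following is a minimal Python sketch; the event labels and the function name are shorthand chosen for this illustration and are not part of any Episerver tooling.

```python
from datetime import datetime

# Timestamps (CEST, 2016-10-12) copied from the timeline above.
# The labels are hypothetical shorthand for this sketch, not official terms.
TIMELINE = [
    ("first alert", "07:28"),
    ("investigation started", "07:31"),
    ("cause identified", "07:42"),
    ("hard reset", "07:50"),
    ("cluster functional", "08:05"),
    ("full functionality", "09:17"),
]


def parse_time(hhmm):
    """Parse an HH:MM string; all entries fall on the same day."""
    return datetime.strptime(hhmm, "%H:%M")


def minutes_since_first_alert(timeline):
    """Return {label: whole minutes elapsed since the first entry}."""
    start = parse_time(timeline[0][1])
    return {
        label: int((parse_time(hhmm) - start).total_seconds() // 60)
        for label, hhmm in timeline
    }


elapsed = minutes_since_first_alert(TIMELINE)
for label, minutes in elapsed.items():
    print(f"{label}: +{minutes} min")
```

With these timestamps, the cluster was reported functional 37 minutes after the first alert and reported full functionality after 109 minutes.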

Impact on other services

During the event, applications using this specific FIND cluster would have seen network timeouts or slow response times when trying to connect to the service.

Corrective and Preventative Measures

This issue is related to the Elasticsearch component of Episerver FIND. Elasticsearch will be upgraded to a newer version during the fall, and this issue will be fixed as part of that upgrade. The issue is also linked to periods of high load on the cluster. Work is ongoing to move indices from this cluster to others to spread the load.

Final Words

We apologize for the impact on affected customers. While we are proud of the availability we have on FIND, we know how critical this service is to customers. For us, availability is the most important feature, and we will do everything we can to learn from the event and to avoid a recurrence in the future.

Posted Nov 28, 2016 - 12:23 UTC

Resolved
This incident has been resolved.
Posted Oct 12, 2016 - 08:27 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Oct 12, 2016 - 07:20 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Oct 11, 2016 - 05:18 UTC