Episerver Search & Navigation (formerly Find) is a cloud-based enterprise search solution that delivers enhanced relevance and powerful search functionality to websites. On Wednesday, April 14, 2021 and Thursday, April 15, 2021, we experienced events that impacted the functionality of the service in the US Digital Experience Cloud region. Details of the incident are described below.
Between April 14, 2021 10:18 UTC and April 15, 2021 07:35 UTC, the Search & Navigation cluster USEA01 experienced intermittent outages.
The issue was triggered by a consistently high level of Java heap memory consumption, which led to failed shard allocations and garbage collections. A setting was reconfigured to trigger shard reallocation and reduce the high heap usage, and the service was fully operational by April 15, 2021 07:35 UTC.
April 14, 2021
10:18 UTC – First alert and automation restarts were triggered.
10:20 UTC – Critical alert triggered, acknowledged and investigation initiated.
11:34 UTC – STATUSPAGE updated
11:35 UTC – Restarted an unhealthy node and allocated shards.
12:35 UTC – Service operation recovered.
16:17 UTC – Second alert and automation restarts were triggered.
17:33 UTC – Critical alert triggered and investigation immediately started.
18:38 UTC – STATUSPAGE updated
19:54 UTC – Issue identified and mitigation actions were performed.
20:11 UTC – Service operation was recovered and monitored.
April 15, 2021
00:36 UTC – First alert and automation restarts were triggered.
06:55 UTC – Second alert triggered and the issue was quickly identified.
07:09 UTC – STATUSPAGE updated
07:30 UTC – Mitigation actions were immediately performed.
07:32 UTC – Root cause identified and the Engineering team started working on long-term mitigation actions.
07:35 UTC – Critical alert resolved and service fully operational.
The investigation found that the shard balancing process was not triggered automatically because of an incorrect setting, which prevented the cluster from returning to a normal state.
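The effect of a disabled balancing step can be illustrated with a toy model. This is only a sketch under stated assumptions, not the service's actual balancing logic: the `Node` class, the `rebalance` function, and the shard counts are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical stand-in for a cluster node; not a real API.
    name: str
    shards: int
    healthy: bool = True


def rebalance(nodes, enabled):
    """Move shards one at a time from the fullest to the emptiest healthy node.

    Returns the number of shards moved. When `enabled` is False the function
    does nothing, which models a balancing process that never triggers.
    """
    if not enabled:
        return 0
    healthy = [n for n in nodes if n.healthy]
    if not healthy:
        return 0
    moved = 0
    while True:
        fullest = max(healthy, key=lambda n: n.shards)
        emptiest = min(healthy, key=lambda n: n.shards)
        # Stop once the spread between nodes is at most one shard.
        if fullest.shards - emptiest.shards <= 1:
            return moved
        fullest.shards -= 1
        emptiest.shards += 1
        moved += 1


# A restarted node ("c") rejoins empty while the other two carry every shard.
nodes = [Node("a", 6), Node("b", 6), Node("c", 0)]
rebalance(nodes, enabled=False)   # nothing moves; "a" and "b" stay overloaded
rebalance(nodes, enabled=True)    # shards spread evenly across all three nodes
```

In this model, restarting a node does not relieve the loaded nodes while balancing is disabled, which is consistent with the repeated alerts in the timeline above.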
During the events, a subset of requests to the Search & Navigation cluster may have experienced network timeouts (5xx errors) or slow response times when trying to connect.
Short-term mitigation
Long-term mitigation
We apologize for the impact on affected customers. We are strongly committed to delivering high availability for our Search & Navigation service. We will continue to prioritize our efforts to overcome these recent difficulties, and we will do everything we can to learn from this event and avoid a recurrence.