Users may be unable to access or use some Microsoft 365 services and features
ID: MO941162
Issue type: Incident
Status
Service Degradation
Impacted services
Microsoft 365 suite, Exchange Online, SharePoint Online, Microsoft Teams, Universal Print, Microsoft Purview
Details
Title: Users may be unable to access or use some Microsoft 365 services and features
User impact: Users may be unable to access or use some Microsoft 365 services and features.
More info: The affected scenarios are as follows:
Users may be unable to access Exchange Online using the following impacted connection methods:
Users may be unable to use the following features within Microsoft Teams:
- Users are unable to create or update Virtual Events, including webinars and Town Halls.
- Users may be unable to access or modify their calendar in Microsoft Teams. This includes loading the calendar, viewing meetings, creating or updating meetings, and joining meetings.
- Users are unable to create chats, add users, or create or edit meetings.
- Users are unable to create or modify teams and channels.
- Users may be unable to update presence.
- Users may be unable to use the search function.
- Users may not see an updated list of files, and links may fail to load, within the Chat shared tab.
Users may experience the following issues with Microsoft Purview:
- Users may be unable to access the Purview Portal or Purview Solutions.
- Users may experience delays in policy stamping and in Adaptive Scope Evaluations.
Users may be unable to export content or set and view labels within Microsoft Fabric.
Some Microsoft Fabric users who have Purview Information Protection policies with sensitivity labels enabled may be unable to use interactive operations on Power BI Desktop format files and reports, including export operations on Fabric artifacts with sensitivity labels applied.
Users may be unable to use the search feature within SharePoint Online.
Users may be unable to perform the following actions within Microsoft Defender for Office 365:
- Users may be unable to create simulations, simulation payloads, or end user notifications.
- Users may experience issues with delivery of end user notifications and simulation messages.
- Users may experience issues viewing simulation reports and content.
Current status: We’ve updated the More Info section to include a list of impacted services and scenarios. We’re continuing to investigate recent changes, service telemetry and the failure paths to determine the root cause of impact.
Scope of impact: Impact is specific to users who are served through the affected infrastructure.
Start time: Sunday, November 24, 2024, at 1:20 AM UTC
Preliminary root cause: A portion of infrastructure which supports mailbox and calendar functionality isn't operating as expected, resulting in impact.
Next update by: Monday, November 25, 2024, at 12:00 PM UTC
Root Cause
An internal Microsoft 365 backend service was being decommissioned. During the decommissioning workflow, a process issue meant that traffic for this service was not disabled as expected, yet the decommissioning process continued. As a result, once the service had been removed, other services that interacted with it still generated traffic intended for it.
After the service was removed, these requests initially went to the expected default endpoint for processing. However, because the service no longer existed, the logic of this default endpoint redirected the requests to another endpoint as its backup. This ‘backup’ endpoint is the primary endpoint for the Exchange Online service and some other Microsoft 365 services and features. The requests were then handled by the front-end components in this endpoint (referred to as Client Access Front End (CAFE) for the rest of this document). As with the previous endpoint, because the backend service no longer existed, the CAFE routing service was unable to direct the requests to it. As part of the CAFE routing service's more thorough attempts to resolve the backend service, a synchronous call was made inside an asynchronous code path, causing threads to be held for an extended duration while attempting the lookup.
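The effect of a synchronous call inside an asynchronous code path can be sketched in a few lines. This is only an analogy, not the CAFE implementation: it uses Python's single-threaded asyncio event loop rather than a thread pool, and the handler names and the LOOKUP_SECONDS value are hypothetical. It shows that a blocking lookup holds the executing thread for its full duration, so concurrent requests are served one after another instead of overlapping.

```python
import asyncio
import time

LOOKUP_SECONDS = 0.05  # hypothetical duration of one failing backend lookup


async def handle_blocking(request_id: int) -> int:
    # Synchronous call inside an async handler: the event loop thread is held
    # for the whole lookup, so no other request can make progress meanwhile.
    time.sleep(LOOKUP_SECONDS)
    return request_id


async def handle_non_blocking(request_id: int) -> int:
    # Awaiting hands the thread back to the loop while the lookup is pending.
    await asyncio.sleep(LOOKUP_SECONDS)
    return request_id


async def timed(handler, count: int) -> float:
    # Run `count` requests concurrently and return the elapsed wall time.
    start = time.perf_counter()
    await asyncio.gather(*(handler(i) for i in range(count)))
    return time.perf_counter() - start


def compare(count: int = 8):
    # Returns (blocking_elapsed, non_blocking_elapsed) in seconds.
    blocking = asyncio.run(timed(handle_blocking, count))
    non_blocking = asyncio.run(timed(handle_non_blocking, count))
    return blocking, non_blocking
```

Calling compare(8) should show the blocking variant taking roughly eight lookups' worth of time, while the non-blocking variant takes roughly one.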
The requests made to this service are triggered by user activity. As peak weekday traffic began, the significant increase in the volume of requests going down the thorough resolution path exhausted the available threads in the CAFE routing service. The service stalled and all further requests to the routing service failed, which impacted other requests.
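Thread exhaustion of this kind can be illustrated with a small bounded pool. The sketch below is an assumption-laden model, not the CAFE routing service: the pool size, function names, and the Event that stands in for a slow lookup finally returning are all hypothetical. Once every worker is held by a slow lookup, an unrelated request queues behind them and cannot run until a thread is released.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def demo(pool_size: int = 2, wait_seconds: float = 0.2) -> dict:
    release = threading.Event()  # stands in for the slow lookup finally returning

    def slow_lookup() -> str:
        # Holds a worker thread until `release` is set.
        release.wait()
        return "lookup finished"

    def ordinary_request() -> str:
        # An unrelated request that would normally complete immediately.
        return "ok"

    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        # Fill every worker with a slow lookup, then submit a normal request.
        stuck = [pool.submit(slow_lookup) for _ in range(pool_size)]
        healthy = pool.submit(ordinary_request)
        try:
            healthy.result(timeout=wait_seconds)
            starved = False
        except FutureTimeout:
            starved = True  # no free thread was available to run it
        release.set()  # free the held threads
        recovered = healthy.result(timeout=5)
        for future in stuck:
            future.result(timeout=5)
    return {"starved": starved, "recovered": recovered}
```

In this model the ordinary request is starved while the pool is saturated and completes as soon as the held threads are freed, which mirrors why requests unrelated to the removed service were also affected.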
Some products and services, such as Outlook on the web, are architected in a way that relies more heavily on the CAFE routing service, which is why they experienced higher levels of impact than other products, such as the Outlook desktop client.
Engineering Actions Summary
After monitoring and alerting identified an issue with the Exchange Online service, we triggered a high-priority event and started to investigate. We quickly identified timeout errors on some CAFE components and performed targeted restarts. However, the issue continued to occur on different components and returned on previously impacted components. Targeted restarts were performed throughout the duration of the issue to provide some relief.
We identified a recent change causing increased memory usage in other CAFE services, and believed this was the root cause. We reverted the change and performed restarts so that the revert would take effect and release memory. However, after monitoring and testing, we confirmed that the issue was still occurring.
After further analysis and root cause investigations, we identified that a different update related to the decommission of a backend service was the underlying cause of the issue. We implemented a patch that stopped traffic to this backend service so that no new CAFE routing service instances would become impacted. We then performed scans to identify all the unhealthy CAFE servers and restarted them to remediate impact. Additionally, we took some unhealthy CAFE servers out of active service to route requests to healthy components and expedite recovery.
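The scan, restart, and out-of-rotation steps can be pictured with a minimal sketch. Everything here is hypothetical: the report does not describe the real health signals or tooling, so the FrontEndServer fields, the 95% thread-saturation threshold, and the function names are assumptions chosen only to illustrate the flow.

```python
from dataclasses import dataclass


@dataclass
class FrontEndServer:
    # Hypothetical fields; the real health signals are not described in the report.
    name: str
    busy_threads: int
    max_threads: int
    in_rotation: bool = True


def is_unhealthy(server: FrontEndServer, threshold: float = 0.95) -> bool:
    # Assumed rule: a server is unhealthy when nearly all of its threads are held.
    return server.busy_threads >= threshold * server.max_threads


def scan_unhealthy(servers, threshold: float = 0.95):
    # Scan the fleet and return every server that looks unhealthy.
    return [s for s in servers if is_unhealthy(s, threshold)]


def drain(server: FrontEndServer) -> None:
    # Take the server out of active service so no new requests reach it.
    server.in_rotation = False


def restart(server: FrontEndServer) -> None:
    # Simulated restart: held threads are freed and the server returns to service.
    server.busy_threads = 0
    server.in_rotation = True


def routable(servers, threshold: float = 0.95):
    # Servers that are in rotation and healthy enough to receive requests.
    return [s for s in servers if s.in_rotation and not is_unhealthy(s, threshold)]
```

For example, in a fleet where two of three servers are saturated, scan_unhealthy returns those two, routable returns only the remaining server, and after restart is applied to both, all three are routable again.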
Microsoft Update: From monitoring service telemetry, most users should now experience relief. We’ve completed our optimizations and we're continuing our period of extended monitoring to ensure the availability remains stable.
Nov 25, 2024 at 6:50 AM EST
We’ve identified a recent change which we believe has resulted in impact. We’ve begun to revert the change and are investigating what additional actions are required to mitigate the issue.
Next update by: Monday, November 25, 2024 at 9:00 AM EST