Worker shutdown may lead to failed function invocations with ObjectDisposedException (which may be persisted by the WebJobs listener) #2687
Comments
We're getting the same kind of errors all over the place when using consumption plans; so far there is no pattern that would help identify the faulting process. It fails about 5% of the time, at random, on our Azure Function App on a consumption plan. Switching those functions to an App Service plan resolves the issue, but we would like to keep using consumption plans. The issue started a few weeks ago when we updated references to the latest versions. Sample exception:
@danielmarbach's issue was originally mine; he posted it on my behalf (thanks again). I'm still getting the same exceptions you posted, @jsparent. Where do you host your function, @jsparent?
@domenichelfenstein Our function apps are located in Canada Central.
Are you using Windows or Linux, @jsparent?
@domenichelfenstein We're using Linux.
We are also experiencing the same issue. Our function app is running on a Linux-hosted consumption plan in the West Europe region. For reference:
Also experiencing the same error: Linux-hosted consumption plan in East US. The function is queue-triggered and has a queue output.
UPDATE: It happened on ~10 different Function Apps, and it all stopped at once. While I'm glad this is no longer an issue, I'm pretty sure this is something happening in the "other" layer of the Function App service. Hardware or routing, maybe?
Same here: the problem seems to be gone. But without an acknowledgement of the problem by Microsoft and a clear statement of what they've changed, I don't know whether these exceptions could reappear all of a sudden.
Unfortunately we encountered the error again a few hours ago (linked issue).
We received this same error with a timer-triggered Durable Function running on a Windows Elastic Premium plan. I opened a support ticket and will report back any findings.
We are encountering the same issue. We have Azure Functions on the "Consumption" and "Standard" pricing tiers, both hosted on Windows in West Europe, and all of them show the same behaviour. It seems the problem started with the migration from .NET 6 in-process to .NET 8 isolated. I can visualize this with the following KQL query:
That's the only potentially related issue I could find in the Service Bus SDK. It makes me sorrowful that there has been no traction on this issue for such a long time.
@danielmarbach the bug fix you mentioned could be related. I second the sorrowful feeling of @danielmarbach.
I've been looking at a similar exception from an Orchestrator function without much luck. It has so far only been a one-off occurrence.
@danielmarbach, @NoleNerd, @jbenettius and any others: can you quantify any impact you have here beyond an exception in your logs? We have investigated and we believe this is noise from application shutdown (most likely a scale-in). The call stack points to the root service provider being disposed while we are in the middle of a function invocation. Root service provider disposal only happens on application shutdown. We are evaluating whether we want to clarify this exception to indicate the invocation was aborted due to app shutdown, but we have the following to consider:
@jviau It makes a session queue handler running on .NET 8 in isolated mode unusable.
@jviau we saw an instance of our Durable Function orchestrator fail with this exception. We had a sub-orchestrator waiting on an external event; it received that event, resumed, did a little bit of work, and finished. The parent/main orchestrator took over, did a little bit of work, and called a couple more sub-orchestrators, but then it failed with this exception.

Regarding the scaling: I can see that our Container App scaled to 1 pod around the time the requests came in (10/14 at 20:40), but I don't see it going past 1 pod. Our external event message was received around 20:48 and the functions started doing work. I would not expect a scale-in to occur here with active work. Based on the screenshot below, there should still have been an instance running at the time of the exception.
@NoleNerd can you share your app name? https://github.com/Azure/azure-functions-host/wiki/Sharing-Your-Function-App-name-privately
@jviau, here you go: ExecutionTime: 2024-10-31T18:55:11Z
@NoleNerd which version of the durable worker extension are you using?
@jviau, we're using Microsoft.Azure.Functions.Worker.Extensions.DurableTask 1.1.5
Thanks! Looking at @NoleNerd's app logs and some others, all of the role instances where this occurred are indeed being terminated. This exception is just a symptom, not the root issue. As to why they shut down, I am not sure - that would be a question for the individual platform/SKU teams.

As for the impact this has on function triggers, it is up to each trigger to handle this shutdown scenario. For durable specifically, we did make improvements to this scenario, but there may be more work to do. For Service Bus, this may need to be discussed with the Azure SDK team, as they own the Service Bus WebJobs extension. A workaround might be to disable auto-complete of messages and settle them manually.

We will need to discuss internally how we want to proceed with this, as it is part platform/SKU issue, part worker contract gap, and part extension responsibility to be resilient to this. With that said, we do have a drain mode feature which avoids this in most scenarios. But there are some management actions where drain mode does not run and apps are forcibly shut down (stopping the function app, for example). Also, I'm unsure whether Functions on Azure Container Apps supports drain mode.
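For reference, a minimal sketch of what "disable auto-complete and settle manually" could look like in the isolated worker, assuming a recent Microsoft.Azure.Functions.Worker.Extensions.ServiceBus version that supports ServiceBusMessageActions. The queue name, connection setting, and function name are illustrative, and this is only one possible reading of the workaround above, not an official recommendation:

```csharp
using System.Threading.Tasks;
using Azure.Messaging.ServiceBus;
using Microsoft.Azure.Functions.Worker;
using Microsoft.Extensions.Logging;

public class OrderProcessor
{
    private readonly ILogger<OrderProcessor> _logger;

    public OrderProcessor(ILogger<OrderProcessor> logger) => _logger = logger;

    // AutoCompleteMessages = false: the listener no longer settles the message on
    // our behalf, so a shutdown mid-invocation leaves the message to be redelivered
    // after its lock expires instead of being settled by the extension.
    [Function("ProcessOrder")]
    public async Task Run(
        [ServiceBusTrigger("orders", Connection = "ServiceBusConnection",
            AutoCompleteMessages = false)]
        ServiceBusReceivedMessage message,
        ServiceBusMessageActions messageActions)
    {
        _logger.LogInformation("Processing message {Id}", message.MessageId);

        // ... actual work goes here ...

        // Settle only after the work has fully succeeded.
        await messageActions.CompleteMessageAsync(message);
    }
}
```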
This issue has been automatically marked as stale because it has been marked as requiring author feedback but has not had any activity for 4 days. It will be closed if no further activity occurs within 3 days of this comment. If you are not the original author (danielmarbach) and believe this issue is not stale, please comment with |
We are facing the same issue roughly once per week (Consumption plan, West Europe):
In the Application Insights traces, we see that the new host is started but cannot handle any events, regardless of the trigger type. After ~10 minutes the host goes down, and the next one that starts works fine.
Using this issue to track further investigation and planning a fix.

Summary

In some cases, the dotnet worker is shut down via SIGTERM, which causes all DI containers to dispose. In-flight invocations may then encounter an ObjectDisposedException (which may be persisted by the WebJobs listener).
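As a minimal, standalone illustration of the failure mode described above (this is not the Functions host code, and the service type is a hypothetical stand-in), resolving from a per-invocation scope after the root provider has been disposed throws exactly this kind of ObjectDisposedException:

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;

// Sketch only: once the root provider is disposed during shutdown, any
// resolution attempted by an in-flight invocation throws ObjectDisposedException.
var services = new ServiceCollection();
services.AddScoped<InvocationDependency>();

ServiceProvider root = services.BuildServiceProvider();
IServiceScope invocationScope = root.CreateScope(); // stands in for a per-invocation scope

root.Dispose(); // analogous to SIGTERM-driven shutdown disposing the DI container

try
{
    invocationScope.ServiceProvider.GetRequiredService<InvocationDependency>();
}
catch (ObjectDisposedException ex)
{
    Console.WriteLine($"Invocation aborted: {ex.Message}");
}

// Hypothetical dependency standing in for whatever the function resolves.
class InvocationDependency { }
```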
Description
When running the attached reproduction for a while (usually after a few hours) in Functions on Linux (we tried it in Switzerland North), the following exception occurs:
At the moment we don't think it has anything in particular to do with sessions in ASB or the ASB integration, but rather with the function context handling in the middleware.
It could also be related to the other issues regarding ObjectDisposedExceptions:
#1929
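For context on where that middleware-level context handling sits, invocations in the isolated worker flow through IFunctionsWorkerMiddleware. The following is an illustrative sketch only (not code from the attached repro) showing where a disposed worker container would surface during an invocation:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.Azure.Functions.Worker.Middleware;

// Illustrative sketch: middleware that observes an invocation being torn down
// because the worker's service provider was disposed mid-flight.
public class ShutdownObservingMiddleware : IFunctionsWorkerMiddleware
{
    public async Task Invoke(FunctionContext context, FunctionExecutionDelegate next)
    {
        try
        {
            await next(context);
        }
        catch (ObjectDisposedException)
        {
            // context.InstanceServices is backed by the worker's DI container and may
            // already be disposed here, so deliberately avoid resolving anything from it.
            Console.Error.WriteLine(
                $"Invocation {context.InvocationId} aborted, likely due to worker shutdown.");
            throw;
        }
    }
}
```

Such middleware would typically be registered with builder.UseMiddleware&lt;ShutdownObservingMiddleware&gt;() inside ConfigureFunctionsWorkerDefaults.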
I have already raised two PRs from my research in this area, but I'm unsure whether they'll help:
#2686
#2685
I have also made a comment about the use of TaskCompletionSource in the synchronization logic around the middleware.
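(The referenced comment is not reproduced here. As general background only, the usual concern with TaskCompletionSource in synchronization code is that, by default, continuations can run inline on the thread that completes the TCS; TaskCreationOptions.RunContinuationsAsynchronously avoids that. A minimal sketch, unrelated to any specific code in this repo:)

```csharp
using System.Threading.Tasks;

// Default: whoever awaits tcs.Task may have its continuation run inline on the
// thread that calls SetResult/TrySetResult, which can cause reentrancy or
// deadlocks inside synchronization logic.
var tcs = new TaskCompletionSource<bool>();

// With this option, continuations are queued rather than run inline on the
// completing thread.
var safeTcs = new TaskCompletionSource<bool>(
    TaskCreationOptions.RunContinuationsAsynchronously);
```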
Steps to reproduce
Run the repro against Service Bus for a while: Experiment.zip