-
Notifications
You must be signed in to change notification settings - Fork 495
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Performance decrease when upgrade from 3.31.0 to 3.31.1 #4906
Comments
can you please share the diagnostic string for patch operation with slower and faster sdk version. Maybe we can compare them by putting side by side and find out, why you are getting high latency. |
I have gathered some diagnostics by implementing the solution mentioned on this post (1), which was linked in #3613. This has gathered data for 3+ days, in which only 3 calls to cosmos reported to have taken more than 500 ms, see zip below. In the average duration of our requests, I can still see clearly that the problem is occurring from our logging when looking at the request duration. |
I have done an investigation to further pinpoint the source of the issue. I have slowly removed parts of our code base related to CosmosDB to determine when the symptoms stopped showing. After doing this, it became clear that only having open a connection with the CosmosDB is enough to produce the latency problems. This means the problem is not related to any queries or such that we do. The code is following the best practices outlined in the documentation regarding only creating a single client throughout the apps lifetime. (1) I am currently looking into any configuration options that can be passed when setting up the connection. I was wondering if this problem is known and if you have any recommendations for the type of machine that we are using. As mentioned in the initial description, our app is hosted on Azure using the 'Premium v3 P0V3' plan. The symptoms mentioned are still present when using the default options as follows: Note: I also read in the best practices document (1) that it is recommended to use a machine that has at least 8-GB memory and 4-cores. I realize that the 'Premium v3 P0V3' does not fulfill these requirements. However, as mentioned before, a machine that does is at least twice as expensive as the one that we are currently using so I am trying to avoid having to scale up. (1) https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/best-practice-dotnet |
Looks like these are the commits consumed The first one has the highest likelihood to change client-side resource consumption |
I do not have access to these links |
Understood - added them for engineer looking into this |
@LouisSikkes1 the attached diagnostics ZIP file is showing as not valid. Can you please re-add again? |
For me it's working fine if I download it. Nonetheless, here is the zip again. |
@FabianMeiswinkel, @sourabh1007 can you please try? |
Not accessible for me also. |
it looks like the logs are in rar format and not zip. If you change the extension to rar, you can extract it. The backend latency is very low. In 2 files I see pretty big transit time which would indicate a network issue: |
thanks @rysiekg, event stages are publicly documented at https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/troubleshoot-dotnet-sdk-slow-request?tabs=cpu-new#requesttimeline
For requests with higher transit times the BELatency is also very small. Inflight requests on the connection is > 1, there are some request in-pipeline, Mostly 1 or 2 more which ideally should not be an issue. |
Issue started with higher average latencies 5ms -> 20ms, whereas the sample latest diagnostics might be higher tail latencies. @sourabh1007 can you please check out perf runs and any possible correlations? (workload is query and BoundedStaleness) |
Describe the bug
After upgrading package Microsoft.Azure.Cosmos from version 3.31.0 to 3.31.1, the average response time of our API drastically increases. This causes our web application to become very slow and unusable for our customers.
To Reproduce
Upgrade package Microsoft.Azure.Cosmos from version 3.31.0 to 3.31.1.
Expected behavior
We expect a patch version upgrade not to impact the performance negatively.
Actual behavior
The patch version upgrade drastically increases the average response time. I am unable to upload a picture for some reason, but we can clearly see the difference between the average response time with and without the package upgrade. Before we upgraded, the average response time was stable below 5 ms. After we upgraded Microsoft.Azure.Cosmos to 3.31.1, it would rise to around 20 ms. After we downgraded it again to 3.31.0, we can clearly see that the average response time is again below 5 ms.
The average response time may seem relatively low. The reason is that this graph is from a test environment. There is a low amount of activity, but as a result, the differences are clearly visualized. On a production environment, the average response times are much higher after the package upgrade (the average is around 1 second). A production environment has hundreds of users and as a result our application becomes unusably slow.
Workaround
The issue we encounter seems the same as the one reported in #3613. This issue has been closed in the meantime. The symptoms that we encounter seem to be the same that they have reported. The solution that this issue has been closed with was to upgrade the app service plan from P1V2 to P1V3. We have tried to upgrade the app service plan as well on our test environment. There, we saw that this also solves the issue for our application. However, upgrading the plan a single tier doubles the cost and as such we do not see this as a viable option. Similar to the reported issue, we do not see why the change in plan matters, since we do not see that cpu or memory of the machine is a bottleneck.
Environment summary
SDK: .NET8
App service plan: Premium v3 P0V3
Additional context
The package version of Microsoft.Azure.Cosmos 3.31.0 is over 2 years old and several new versions have come out in the meantime. We have tried to use newer versions as well in the hopes that the issue might have been mixed in the meantime. Unfortunately, we can still reproduce the issue with version 3.45.0.
I have compared the changes done in the repository between version 3.31.0 and 3.31.1 and the only change that happened was the upgrade of Microsoft.Azure.Cosmos.Direct from 3.29.1 to 3.29.4. releases/3.31.0...releases/3.31.1
I have also looked at the changes done in the Microsoft.Azure.Cosmos.Direct package. Due to the large amount of changes, I cannot make sense of this. https://github.com/Azure/azure-cosmos-dotnet-v3/commits/msdata/direct/Microsoft.Azure.Cosmos/src/direct
The text was updated successfully, but these errors were encountered: