Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance decrease when upgrade from 3.31.0 to 3.31.1 #4906

Open
LouisSikkes1 opened this issue Nov 26, 2024 · 13 comments
Open

Performance decrease when upgrade from 3.31.0 to 3.31.1 #4906

LouisSikkes1 opened this issue Nov 26, 2024 · 13 comments
Labels
customer-reported Issue created by a customer needs-investigation

Comments

@LouisSikkes1
Copy link

Describe the bug

After upgrading package Microsoft.Azure.Cosmos from version 3.31.0 to 3.31.1, the average response time of our API drastically increases. This causes our web application to become very slow and unusable for our customers.

To Reproduce

Upgrade package Microsoft.Azure.Cosmos from version 3.31.0 to 3.31.1.

Expected behavior

We expect a patch version upgrade not to impact the performance negatively.

Actual behavior

The patch version upgrade drastically increases the average response time. I am unable to upload a picture for some reason, but we can clearly see the difference between the average response time with and without the package upgrade. Before we upgraded, the average response time was stable below 5 ms. After we upgraded Microsoft.Azure.Cosmos to 3.31.1, it would rise to around 20 ms. After we downgraded it again to 3.31.0, we can clearly see that the average response time is again below 5 ms.

The average response time may seem relatively low. The reason is that this graph is from a test environment. There is a low amount of activity, but as a result, the differences are clearly visualized. On a production environment, the average response times are much higher after the package upgrade (the average is around 1 second). A production environment has hundreds of users and as a result our application becomes unusably slow.

Workaround

The issue we encounter seems the same as the one reported in #3613. This issue has been closed in the meantime. The symptoms that we encounter seem to be the same that they have reported. The solution that this issue has been closed with was to upgrade the app service plan from P1V2 to P1V3. We have tried to upgrade the app service plan as well on our test environment. There, we saw that this also solves the issue for our application. However, upgrading the plan a single tier doubles the cost and as such we do not see this as a viable option. Similar to the reported issue, we do not see why the change in plan matters, since we do not see that cpu or memory of the machine is a bottleneck.

Environment summary

SDK: .NET8
App service plan: Premium v3 P0V3

Additional context

  1. The package version of Microsoft.Azure.Cosmos 3.31.0 is over 2 years old and several new versions have come out in the meantime. We have tried to use newer versions as well in the hopes that the issue might have been mixed in the meantime. Unfortunately, we can still reproduce the issue with version 3.45.0.

  2. I have compared the changes done in the repository between version 3.31.0 and 3.31.1 and the only change that happened was the upgrade of Microsoft.Azure.Cosmos.Direct from 3.29.1 to 3.29.4. releases/3.31.0...releases/3.31.1
    I have also looked at the changes done in the Microsoft.Azure.Cosmos.Direct package. Due to the large amount of changes, I cannot make sense of this. https://github.com/Azure/azure-cosmos-dotnet-v3/commits/msdata/direct/Microsoft.Azure.Cosmos/src/direct

@sourabh1007
Copy link
Contributor

can you please share the diagnostic string for patch operation with slower and faster sdk version. Maybe we can compare them by putting side by side and find out, why you are getting high latency.

@LouisSikkes1
Copy link
Author

I have gathered some diagnostics by implementing the solution mentioned on this post (1), which was linked in #3613. This has gathered data for 3+ days, in which only 3 calls to cosmos reported to have taken more than 500 ms, see zip below. In the average duration of our requests, I can still see clearly that the problem is occurring from our logging when looking at the request duration.

(1) https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/troubleshoot-dotnet-sdk-slow-request?tabs=cpu-new#capture-diagnostics

cosmos_diagnostics.zip

@LouisSikkes1
Copy link
Author

LouisSikkes1 commented Dec 18, 2024

I have done an investigation to further pinpoint the source of the issue. I have slowly removed parts of our code base related to CosmosDB to determine when the symptoms stopped showing. After doing this, it became clear that only having open a connection with the CosmosDB is enough to produce the latency problems. This means the problem is not related to any queries or such that we do. The code is following the best practices outlined in the documentation regarding only creating a single client throughout the apps lifetime. (1)

I am currently looking into any configuration options that can be passed when setting up the connection. I was wondering if this problem is known and if you have any recommendations for the type of machine that we are using. As mentioned in the initial description, our app is hosted on Azure using the 'Premium v3 P0V3' plan. The symptoms mentioned are still present when using the default options as follows:
var cosmosClient = new CosmosClient(_options.AccountEndpoint, _options.AccountKey);

Note: I also read in the best practices document (1) that it is recommended to use a machine that has at least 8-GB memory and 4-cores. I realize that the 'Premium v3 P0V3' does not fulfill these requirements. However, as mentioned before, a machine that does is at least twice as expensive as the one that we are currently using so I am trying to avoid having to scale up.

(1) https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/best-practice-dotnet

@LouisSikkes1
Copy link
Author

I do not have access to these links

@FabianMeiswinkel
Copy link
Member

I do not have access to these links

Understood - added them for engineer looking into this

@kirankumarkolli
Copy link
Member

@LouisSikkes1 the attached diagnostics ZIP file is showing as not valid. Can you please re-add again?

@LouisSikkes1
Copy link
Author

For me it's working fine if I download it. Nonetheless, here is the zip again.

cosmos_diagnostics.zip

@kirankumarkolli
Copy link
Member

Am still getting
Image

@FabianMeiswinkel, @sourabh1007 can you please try?

@sourabh1007
Copy link
Contributor

Not accessible for me also.

@rysiekg
Copy link

rysiekg commented Jan 7, 2025

it looks like the logs are in rar format and not zip. If you change the extension to rar, you can extract it. The backend latency is very low. In 2 files I see pretty big transit time which would indicate a network issue:
"durationInMs": 805.3556, "event": "Transit Time"
The last one is showing the most time in received: "durationInMs": 838.6681, "event": "Received",

@kirankumarkolli
Copy link
Member

thanks @rysiekg, event stages are publicly documented at https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/troubleshoot-dotnet-sdk-slow-request?tabs=cpu-new#requesttimeline

  • Transit time: The time spent on the network after the request was written on the TCP socket. Compare this number to the BELatencyInMs. If BELatencyInMs is small, then the time was spent on the network, and not on the Azure Cosmos DB service. If the request failed with a timeout, it indicates how long the client waited without response, and the source is network latency.
  • Received: The time between the response was received and processed by the SDK. A large value is normally caused by a thread starvation or locked threads.

For requests with higher transit times the BELatency is also very small.
threadWaitIntervalInMs: Test task schedular sampling is not showing much of starvation/contention.

Inflight requests on the connection is > 1, there are some request in-pipeline, Mostly 1 or 2 more which ideally should not be an issue.

@kirankumarkolli
Copy link
Member

Issue started with higher average latencies 5ms -> 20ms, whereas the sample latest diagnostics might be higher tail latencies. @sourabh1007 can you please check out perf runs and any possible correlations? (workload is query and BoundedStaleness)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
customer-reported Issue created by a customer needs-investigation
Projects
None yet
Development

No branches or pull requests

5 participants