Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes Two Flaky IT classes RestMLGuardrailsIT & ToolIntegrationWithLLMTest #3253

Merged

Conversation

brianf-aws
Copy link
Contributor

@brianf-aws brianf-aws commented Dec 5, 2024

Description

This PR attempts to remedy two Flaky classes

This PR also does some code improvement by more descriptive logging, variable naming, spacing.

Related Issues

Resolves #3237

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Following opensearch-project#3244 this IT called the task api to check the model id again however this is redundant. Instead one can directly pull the model_id upon creating the model group. Manual testing was done to see that the behavior is intact, this should help reduce the calls within a IT to make it less flaky

Signed-off-by: Brian Flores <[email protected]>
Previously the test class attempted to delete a model without fully knowing if the model was undeployed in time. This change adds a waiting for 5 retries each 1 second to check the status of the model and only when undeployed will it proceed to delete the model. When the number of retries are exceeded it throws a error indicating a deeper problem. Manual testing was done to check that the model is undeployed by searching for the specific model via the checkForModelUndeployedStatus method.

Signed-off-by: Brian Flores <[email protected]>
deleteModel(client(), modelId, null);
}

@SneakyThrows
private void waitModelUndeployed(String modelId) {
Copy link
Collaborator

@dhrubo-os dhrubo-os Dec 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like in this PR we removed this line as this PR should fix this issue? Isn't that the case?

More background about this test here: #2435 (comment)

Copy link
Contributor Author

@brianf-aws brianf-aws Dec 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for bringing that up, Im checking the communication thread and notice that Zane (@zane-neo) was trying to fix the issue after the aforementioned PR that omitted the waitModelUndeployed method.

In summary I see the following happened (Based on this PR Issue thread)

  1. (May 9th) Looks like @b4sjoo Authored a PR that fixes this class issue by updating the transportUndeployAction class. By using a callback to notify when the undeployement occured.
  2. ([July 4th] (Fix model delete failed after model undeployed #2613)) Zane references a issue following the same problem and the PR thread opens again. In their own words

The reason is deploy model is in an async approach and when client received the response, the model status in index is not guaranteed to deployed, this doesn't have a simple fix since a lot of files are involved in deploy flow.
This issue is not able to reproduced easily on local, .. and closing this PR now. src

What i believe this means is that because we are working with an Async operation we have to include this method to periodically check that the model has been undeployed before deleting it to get over the {"error":{"root_cause":[{"type":"status_exception","reason":"Model cannot be deleted in deploying or deployed state. Try undeploy model first then delete"}],"type":"status_exception","reason":"Model cannot be deleted in deploying or deployed state. Try undeploy model first then delete"},"status":400} issue. We wouldn't be able to know when the listener was able to respond back with ideal status.

@@ -977,7 +977,7 @@ public void waitForTask(String taskId, MLTaskState targetState) throws Interrupt
}
return taskDone.get();
}, CUSTOM_MODEL_TIMEOUT, TimeUnit.SECONDS);
assertTrue(taskDone.get());
assertTrue(String.format(Locale.ROOT, "Task Id %s could not get to %s state", taskId, targetState.name()), taskDone.get());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool!

@brianf-aws brianf-aws temporarily deployed to ml-commons-cicd-env-require-approval December 5, 2024 07:14 — with GitHub Actions Inactive
@dhrubo-os dhrubo-os merged commit 1a659c8 into opensearch-project:main Dec 6, 2024
9 checks passed
@opensearch-trigger-bot
Copy link
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.x 2.x
# Navigate to the new working tree
cd .worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-3253-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 1a659c8048dd23cdab3ab881622bf8cd355e1631
# Push it to GitHub
git push --set-upstream origin backport/backport-3253-to-2.x
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-3253-to-2.x.

brianf-aws added a commit to brianf-aws/ml-commons that referenced this pull request Dec 9, 2024
…MTest (opensearch-project#3253)

* fix uneeded call to get model_id for task api within RestMLGuardrailsIT

Following opensearch-project#3244 this IT called the task api to check the model id again however this is redundant. Instead one can directly pull the model_id upon creating the model group. Manual testing was done to see that the behavior is intact, this should help reduce the calls within a IT to make it less flaky

Signed-off-by: Brian Flores <[email protected]>

* fix ToolIntegrationWithLLMTest model undeploy race condition

Previously the test class attempted to delete a model without fully knowing if the model was undeployed in time. This change adds a waiting for 5 retries each 1 second to check the status of the model and only when undeployed will it proceed to delete the model. When the number of retries are exceeded it throws a error indicating a deeper problem. Manual testing was done to check that the model is undeployed by searching for the specific model via the checkForModelUndeployedStatus method.

Signed-off-by: Brian Flores <[email protected]>

---------

Signed-off-by: Brian Flores <[email protected]>
(cherry picked from commit 1a659c8)
brianf-aws added a commit to brianf-aws/ml-commons that referenced this pull request Dec 9, 2024
…MTest (opensearch-project#3253)

* fix uneeded call to get model_id for task api within RestMLGuardrailsIT

Following opensearch-project#3244 this IT called the task api to check the model id again however this is redundant. Instead one can directly pull the model_id upon creating the model group. Manual testing was done to see that the behavior is intact, this should help reduce the calls within a IT to make it less flaky

Signed-off-by: Brian Flores <[email protected]>

* fix ToolIntegrationWithLLMTest model undeploy race condition

Previously the test class attempted to delete a model without fully knowing if the model was undeployed in time. This change adds a waiting for 5 retries each 1 second to check the status of the model and only when undeployed will it proceed to delete the model. When the number of retries are exceeded it throws a error indicating a deeper problem. Manual testing was done to check that the model is undeployed by searching for the specific model via the checkForModelUndeployedStatus method.

Signed-off-by: Brian Flores <[email protected]>

---------

Signed-off-by: Brian Flores <[email protected]>
(cherry picked from commit 1a659c8)
tkykenmt pushed a commit to tkykenmt/ml-commons that referenced this pull request Dec 15, 2024
…MTest (opensearch-project#3253)

* fix uneeded call to get model_id for task api within RestMLGuardrailsIT

Following opensearch-project#3244 this IT called the task api to check the model id again however this is redundant. Instead one can directly pull the model_id upon creating the model group. Manual testing was done to see that the behavior is intact, this should help reduce the calls within a IT to make it less flaky

Signed-off-by: Brian Flores <[email protected]>

* fix ToolIntegrationWithLLMTest model undeploy race condition

Previously the test class attempted to delete a model without fully knowing if the model was undeployed in time. This change adds a waiting for 5 retries each 1 second to check the status of the model and only when undeployed will it proceed to delete the model. When the number of retries are exceeded it throws a error indicating a deeper problem. Manual testing was done to check that the model is undeployed by searching for the specific model via the checkForModelUndeployedStatus method.

Signed-off-by: Brian Flores <[email protected]>

---------

Signed-off-by: Brian Flores <[email protected]>
tkykenmt pushed a commit to tkykenmt/ml-commons that referenced this pull request Dec 15, 2024
…MTest (opensearch-project#3253)

* fix uneeded call to get model_id for task api within RestMLGuardrailsIT

Following opensearch-project#3244 this IT called the task api to check the model id again however this is redundant. Instead one can directly pull the model_id upon creating the model group. Manual testing was done to see that the behavior is intact, this should help reduce the calls within a IT to make it less flaky

Signed-off-by: Brian Flores <[email protected]>

* fix ToolIntegrationWithLLMTest model undeploy race condition

Previously the test class attempted to delete a model without fully knowing if the model was undeployed in time. This change adds a waiting for 5 retries each 1 second to check the status of the model and only when undeployed will it proceed to delete the model. When the number of retries are exceeded it throws a error indicating a deeper problem. Manual testing was done to check that the model is undeployed by searching for the specific model via the checkForModelUndeployedStatus method.

Signed-off-by: Brian Flores <[email protected]>

---------

Signed-off-by: Brian Flores <[email protected]>
Signed-off-by: tkykenmt <[email protected]>
dhrubo-os pushed a commit that referenced this pull request Dec 31, 2024
…tegrationWithLLMTest (#3263)

* Fixes Two Flaky IT classes RestMLGuardrailsIT & ToolIntegrationWithLLMTest (#3253)

* fix uneeded call to get model_id for task api within RestMLGuardrailsIT

Following #3244 this IT called the task api to check the model id again however this is redundant. Instead one can directly pull the model_id upon creating the model group. Manual testing was done to see that the behavior is intact, this should help reduce the calls within a IT to make it less flaky

Signed-off-by: Brian Flores <[email protected]>

* fix ToolIntegrationWithLLMTest model undeploy race condition

Previously the test class attempted to delete a model without fully knowing if the model was undeployed in time. This change adds a waiting for 5 retries each 1 second to check the status of the model and only when undeployed will it proceed to delete the model. When the number of retries are exceeded it throws a error indicating a deeper problem. Manual testing was done to check that the model is undeployed by searching for the specific model via the checkForModelUndeployedStatus method.

Signed-off-by: Brian Flores <[email protected]>

---------

Signed-off-by: Brian Flores <[email protected]>
(cherry picked from commit 1a659c8)

* add retry according to how many rest clients are in a IT cluster

Signed-off-by: Brian Flores <[email protected]>

* fix retry initialization

The MAX_RETRIES variable had to wait for the cluster to form before it could call to get the cluster size

Signed-off-by: Brian Flores <[email protected]>

---------

Signed-off-by: Brian Flores <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG]-(flaky tests) ITs involving models have race conditions
4 participants