-
Notifications
You must be signed in to change notification settings - Fork 138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes Two Flaky IT classes RestMLGuardrailsIT & ToolIntegrationWithLLMTest #3253
Fixes Two Flaky IT classes RestMLGuardrailsIT & ToolIntegrationWithLLMTest #3253
Conversation
Following opensearch-project#3244 this IT called the task api to check the model id again however this is redundant. Instead one can directly pull the model_id upon creating the model group. Manual testing was done to see that the behavior is intact, this should help reduce the calls within a IT to make it less flaky Signed-off-by: Brian Flores <[email protected]>
Previously the test class attempted to delete a model without fully knowing if the model was undeployed in time. This change adds a waiting for 5 retries each 1 second to check the status of the model and only when undeployed will it proceed to delete the model. When the number of retries are exceeded it throws a error indicating a deeper problem. Manual testing was done to check that the model is undeployed by searching for the specific model via the checkForModelUndeployedStatus method. Signed-off-by: Brian Flores <[email protected]>
deleteModel(client(), modelId, null); | ||
} | ||
|
||
@SneakyThrows | ||
private void waitModelUndeployed(String modelId) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like in this PR we removed this line as this PR should fix this issue? Isn't that the case?
More background about this test here: #2435 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for bringing that up, Im checking the communication thread and notice that Zane (@zane-neo) was trying to fix the issue after the aforementioned PR that omitted the waitModelUndeployed
method.
In summary I see the following happened (Based on this PR Issue thread)
- (May 9th) Looks like @b4sjoo Authored a PR that fixes this class issue by updating the transportUndeployAction class. By using a callback to notify when the undeployement occured.
- ([July 4th] (Fix model delete failed after model undeployed #2613)) Zane references a issue following the same problem and the PR thread opens again. In their own words
The reason is deploy model is in an async approach and when client received the response, the model status in index is not guaranteed to deployed, this doesn't have a simple fix since a lot of files are involved in deploy flow.
This issue is not able to reproduced easily on local, .. and closing this PR now. src
What i believe this means is that because we are working with an Async operation we have to include this method to periodically check that the model has been undeployed before deleting it to get over the {"error":{"root_cause":[{"type":"status_exception","reason":"Model cannot be deleted in deploying or deployed state. Try undeploy model first then delete"}],"type":"status_exception","reason":"Model cannot be deleted in deploying or deployed state. Try undeploy model first then delete"},"status":400}
issue. We wouldn't be able to know when the listener was able to respond back with ideal status.
@@ -977,7 +977,7 @@ public void waitForTask(String taskId, MLTaskState targetState) throws Interrupt | |||
} | |||
return taskDone.get(); | |||
}, CUSTOM_MODEL_TIMEOUT, TimeUnit.SECONDS); | |||
assertTrue(taskDone.get()); | |||
assertTrue(String.format(Locale.ROOT, "Task Id %s could not get to %s state", taskId, targetState.name()), taskDone.get()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool!
The backport to
To backport manually, run these commands in your terminal: # Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.x 2.x
# Navigate to the new working tree
cd .worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-3253-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 1a659c8048dd23cdab3ab881622bf8cd355e1631
# Push it to GitHub
git push --set-upstream origin backport/backport-3253-to-2.x
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.x Then, create a pull request where the |
…MTest (opensearch-project#3253) * fix uneeded call to get model_id for task api within RestMLGuardrailsIT Following opensearch-project#3244 this IT called the task api to check the model id again however this is redundant. Instead one can directly pull the model_id upon creating the model group. Manual testing was done to see that the behavior is intact, this should help reduce the calls within a IT to make it less flaky Signed-off-by: Brian Flores <[email protected]> * fix ToolIntegrationWithLLMTest model undeploy race condition Previously the test class attempted to delete a model without fully knowing if the model was undeployed in time. This change adds a waiting for 5 retries each 1 second to check the status of the model and only when undeployed will it proceed to delete the model. When the number of retries are exceeded it throws a error indicating a deeper problem. Manual testing was done to check that the model is undeployed by searching for the specific model via the checkForModelUndeployedStatus method. Signed-off-by: Brian Flores <[email protected]> --------- Signed-off-by: Brian Flores <[email protected]> (cherry picked from commit 1a659c8)
…MTest (opensearch-project#3253) * fix uneeded call to get model_id for task api within RestMLGuardrailsIT Following opensearch-project#3244 this IT called the task api to check the model id again however this is redundant. Instead one can directly pull the model_id upon creating the model group. Manual testing was done to see that the behavior is intact, this should help reduce the calls within a IT to make it less flaky Signed-off-by: Brian Flores <[email protected]> * fix ToolIntegrationWithLLMTest model undeploy race condition Previously the test class attempted to delete a model without fully knowing if the model was undeployed in time. This change adds a waiting for 5 retries each 1 second to check the status of the model and only when undeployed will it proceed to delete the model. When the number of retries are exceeded it throws a error indicating a deeper problem. Manual testing was done to check that the model is undeployed by searching for the specific model via the checkForModelUndeployedStatus method. Signed-off-by: Brian Flores <[email protected]> --------- Signed-off-by: Brian Flores <[email protected]> (cherry picked from commit 1a659c8)
…MTest (opensearch-project#3253) * fix uneeded call to get model_id for task api within RestMLGuardrailsIT Following opensearch-project#3244 this IT called the task api to check the model id again however this is redundant. Instead one can directly pull the model_id upon creating the model group. Manual testing was done to see that the behavior is intact, this should help reduce the calls within a IT to make it less flaky Signed-off-by: Brian Flores <[email protected]> * fix ToolIntegrationWithLLMTest model undeploy race condition Previously the test class attempted to delete a model without fully knowing if the model was undeployed in time. This change adds a waiting for 5 retries each 1 second to check the status of the model and only when undeployed will it proceed to delete the model. When the number of retries are exceeded it throws a error indicating a deeper problem. Manual testing was done to check that the model is undeployed by searching for the specific model via the checkForModelUndeployedStatus method. Signed-off-by: Brian Flores <[email protected]> --------- Signed-off-by: Brian Flores <[email protected]>
…MTest (opensearch-project#3253) * fix uneeded call to get model_id for task api within RestMLGuardrailsIT Following opensearch-project#3244 this IT called the task api to check the model id again however this is redundant. Instead one can directly pull the model_id upon creating the model group. Manual testing was done to see that the behavior is intact, this should help reduce the calls within a IT to make it less flaky Signed-off-by: Brian Flores <[email protected]> * fix ToolIntegrationWithLLMTest model undeploy race condition Previously the test class attempted to delete a model without fully knowing if the model was undeployed in time. This change adds a waiting for 5 retries each 1 second to check the status of the model and only when undeployed will it proceed to delete the model. When the number of retries are exceeded it throws a error indicating a deeper problem. Manual testing was done to check that the model is undeployed by searching for the specific model via the checkForModelUndeployedStatus method. Signed-off-by: Brian Flores <[email protected]> --------- Signed-off-by: Brian Flores <[email protected]> Signed-off-by: tkykenmt <[email protected]>
…tegrationWithLLMTest (#3263) * Fixes Two Flaky IT classes RestMLGuardrailsIT & ToolIntegrationWithLLMTest (#3253) * fix uneeded call to get model_id for task api within RestMLGuardrailsIT Following #3244 this IT called the task api to check the model id again however this is redundant. Instead one can directly pull the model_id upon creating the model group. Manual testing was done to see that the behavior is intact, this should help reduce the calls within a IT to make it less flaky Signed-off-by: Brian Flores <[email protected]> * fix ToolIntegrationWithLLMTest model undeploy race condition Previously the test class attempted to delete a model without fully knowing if the model was undeployed in time. This change adds a waiting for 5 retries each 1 second to check the status of the model and only when undeployed will it proceed to delete the model. When the number of retries are exceeded it throws a error indicating a deeper problem. Manual testing was done to check that the model is undeployed by searching for the specific model via the checkForModelUndeployedStatus method. Signed-off-by: Brian Flores <[email protected]> --------- Signed-off-by: Brian Flores <[email protected]> (cherry picked from commit 1a659c8) * add retry according to how many rest clients are in a IT cluster Signed-off-by: Brian Flores <[email protected]> * fix retry initialization The MAX_RETRIES variable had to wait for the cluster to form before it could call to get the cluster size Signed-off-by: Brian Flores <[email protected]> --------- Signed-off-by: Brian Flores <[email protected]>
Description
This PR attempts to remedy two Flaky classes
GET /_plugins/_ml/models/ modelId
and actively blocking after 5 tries to make sure the model successfully undeployed before deleting which was the crux of the issue (Model cannot be deleted in deploying or deployed state
).This PR also does some code improvement by more descriptive logging, variable naming, spacing.
Related Issues
Resolves #3237
Check List
--signoff
.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.