Fixes Two Flaky IT classes RestMLGuardrailsIT & ToolIntegrationWithLLMTest #3253

brianf-aws · 2024-12-05T02:02:41Z

Description

This PR attempts to remedy two Flaky classes

RestMLGuardrailsIT.java Borrows code cleanup from Retrieve remote model id from registration response in IT to avoid flaky #3244 Which removes an extra call to the Task API to get the model id.
ToolIntegrationWithLLMTest.java updates the class to specifically check for the status of the model by calling GET /_plugins/_ml/models/ modelId and actively blocking after 5 tries to make sure the model successfully undeployed before deleting which was the crux of the issue (Model cannot be deleted in deploying or deployed state).

This PR also does some code improvement by more descriptive logging, variable naming, spacing.

Related Issues

Resolves #3237

Check List

New functionality includes testing.
New functionality has been documented.
API changes companion pull request created.
Commits are signed per the DCO using --signoff.
Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Following opensearch-project#3244 this IT called the task api to check the model id again however this is redundant. Instead one can directly pull the model_id upon creating the model group. Manual testing was done to see that the behavior is intact, this should help reduce the calls within a IT to make it less flaky Signed-off-by: Brian Flores <[email protected]>

Previously the test class attempted to delete a model without fully knowing if the model was undeployed in time. This change adds a waiting for 5 retries each 1 second to check the status of the model and only when undeployed will it proceed to delete the model. When the number of retries are exceeded it throws a error indicating a deeper problem. Manual testing was done to check that the model is undeployed by searching for the specific model via the checkForModelUndeployedStatus method. Signed-off-by: Brian Flores <[email protected]>

dhrubo-os · 2024-12-05T06:37:56Z

plugin/src/test/java/org/opensearch/ml/tools/ToolIntegrationWithLLMTest.java

        deleteModel(client(), modelId, null);
    }

    @SneakyThrows
-    private void waitModelUndeployed(String modelId) {


Seems like in this PR we removed this line as this PR should fix this issue? Isn't that the case?

More background about this test here: #2435 (comment)

Thanks for bringing that up, Im checking the communication thread and notice that Zane (@zane-neo) was trying to fix the issue after the aforementioned PR that omitted the waitModelUndeployed method.

In summary I see the following happened (Based on this PR Issue thread)

(May 9th) Looks like @b4sjoo Authored a PR that fixes this class issue by updating the transportUndeployAction class. By using a callback to notify when the undeployement occured.

([July 4th] (Fix model delete failed after model undeployed #2613)) Zane references a issue following the same problem and the PR thread opens again. In their own words

The reason is deploy model is in an async approach and when client received the response, the model status in index is not guaranteed to deployed, this doesn't have a simple fix since a lot of files are involved in deploy flow.
This issue is not able to reproduced easily on local, .. and closing this PR now. src

What i believe this means is that because we are working with an Async operation we have to include this method to periodically check that the model has been undeployed before deleting it to get over the {"error":{"root_cause":[{"type":"status_exception","reason":"Model cannot be deleted in deploying or deployed state. Try undeploy model first then delete"}],"type":"status_exception","reason":"Model cannot be deleted in deploying or deployed state. Try undeploy model first then delete"},"status":400} issue. We wouldn't be able to know when the listener was able to respond back with ideal status.

dhrubo-os · 2024-12-05T06:39:03Z

plugin/src/test/java/org/opensearch/ml/rest/MLCommonsRestTestCase.java

@@ -977,7 +977,7 @@ public void waitForTask(String taskId, MLTaskState targetState) throws Interrupt
            }
            return taskDone.get();
        }, CUSTOM_MODEL_TIMEOUT, TimeUnit.SECONDS);
-        assertTrue(taskDone.get());
+        assertTrue(String.format(Locale.ROOT, "Task Id %s could not get to %s state", taskId, targetState.name()), taskDone.get());


opensearch-trigger-bot · 2024-12-06T23:44:36Z

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.x 2.x
# Navigate to the new working tree
cd .worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-3253-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 1a659c8048dd23cdab3ab881622bf8cd355e1631
# Push it to GitHub
git push --set-upstream origin backport/backport-3253-to-2.x
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-3253-to-2.x.

…MTest (opensearch-project#3253) * fix uneeded call to get model_id for task api within RestMLGuardrailsIT Following opensearch-project#3244 this IT called the task api to check the model id again however this is redundant. Instead one can directly pull the model_id upon creating the model group. Manual testing was done to see that the behavior is intact, this should help reduce the calls within a IT to make it less flaky Signed-off-by: Brian Flores <[email protected]> * fix ToolIntegrationWithLLMTest model undeploy race condition Previously the test class attempted to delete a model without fully knowing if the model was undeployed in time. This change adds a waiting for 5 retries each 1 second to check the status of the model and only when undeployed will it proceed to delete the model. When the number of retries are exceeded it throws a error indicating a deeper problem. Manual testing was done to check that the model is undeployed by searching for the specific model via the checkForModelUndeployedStatus method. Signed-off-by: Brian Flores <[email protected]> --------- Signed-off-by: Brian Flores <[email protected]> (cherry picked from commit 1a659c8)

…MTest (opensearch-project#3253) * fix uneeded call to get model_id for task api within RestMLGuardrailsIT Following opensearch-project#3244 this IT called the task api to check the model id again however this is redundant. Instead one can directly pull the model_id upon creating the model group. Manual testing was done to see that the behavior is intact, this should help reduce the calls within a IT to make it less flaky Signed-off-by: Brian Flores <[email protected]> * fix ToolIntegrationWithLLMTest model undeploy race condition Previously the test class attempted to delete a model without fully knowing if the model was undeployed in time. This change adds a waiting for 5 retries each 1 second to check the status of the model and only when undeployed will it proceed to delete the model. When the number of retries are exceeded it throws a error indicating a deeper problem. Manual testing was done to check that the model is undeployed by searching for the specific model via the checkForModelUndeployedStatus method. Signed-off-by: Brian Flores <[email protected]> --------- Signed-off-by: Brian Flores <[email protected]>

…MTest (opensearch-project#3253) * fix uneeded call to get model_id for task api within RestMLGuardrailsIT Following opensearch-project#3244 this IT called the task api to check the model id again however this is redundant. Instead one can directly pull the model_id upon creating the model group. Manual testing was done to see that the behavior is intact, this should help reduce the calls within a IT to make it less flaky Signed-off-by: Brian Flores <[email protected]> * fix ToolIntegrationWithLLMTest model undeploy race condition Previously the test class attempted to delete a model without fully knowing if the model was undeployed in time. This change adds a waiting for 5 retries each 1 second to check the status of the model and only when undeployed will it proceed to delete the model. When the number of retries are exceeded it throws a error indicating a deeper problem. Manual testing was done to check that the model is undeployed by searching for the specific model via the checkForModelUndeployedStatus method. Signed-off-by: Brian Flores <[email protected]> --------- Signed-off-by: Brian Flores <[email protected]> Signed-off-by: tkykenmt <[email protected]>

…tegrationWithLLMTest (#3263) * Fixes Two Flaky IT classes RestMLGuardrailsIT & ToolIntegrationWithLLMTest (#3253) * fix uneeded call to get model_id for task api within RestMLGuardrailsIT Following #3244 this IT called the task api to check the model id again however this is redundant. Instead one can directly pull the model_id upon creating the model group. Manual testing was done to see that the behavior is intact, this should help reduce the calls within a IT to make it less flaky Signed-off-by: Brian Flores <[email protected]> * fix ToolIntegrationWithLLMTest model undeploy race condition Previously the test class attempted to delete a model without fully knowing if the model was undeployed in time. This change adds a waiting for 5 retries each 1 second to check the status of the model and only when undeployed will it proceed to delete the model. When the number of retries are exceeded it throws a error indicating a deeper problem. Manual testing was done to check that the model is undeployed by searching for the specific model via the checkForModelUndeployedStatus method. Signed-off-by: Brian Flores <[email protected]> --------- Signed-off-by: Brian Flores <[email protected]> (cherry picked from commit 1a659c8) * add retry according to how many rest clients are in a IT cluster Signed-off-by: Brian Flores <[email protected]> * fix retry initialization The MAX_RETRIES variable had to wait for the cluster to form before it could call to get the cluster size Signed-off-by: Brian Flores <[email protected]> --------- Signed-off-by: Brian Flores <[email protected]>

brianf-aws added 2 commits December 3, 2024 20:22

brianf-aws requested review from b4sjoo, dhrubo-os, jngz-es, model-collapse, rbhavna, ylwu-amzn, zane-neo, Zhangxunmt, austintlee, HenryL27 and xinyual as code owners December 5, 2024 02:02

brianf-aws temporarily deployed to ml-commons-cicd-env-require-approval December 5, 2024 02:03 — with GitHub Actions Inactive

dhrubo-os reviewed Dec 5, 2024

View reviewed changes

brianf-aws temporarily deployed to ml-commons-cicd-env-require-approval December 5, 2024 07:14 — with GitHub Actions Inactive

dhrubo-os approved these changes Dec 5, 2024

View reviewed changes

pyek-bot approved these changes Dec 6, 2024

View reviewed changes

dhrubo-os added the backport 2.x label Dec 6, 2024

mingshl approved these changes Dec 6, 2024

View reviewed changes

dhrubo-os merged commit 1a659c8 into opensearch-project:main Dec 6, 2024
9 checks passed

brianf-aws mentioned this pull request Dec 9, 2024

(Backport 2.x) Fixes Two Flaky IT classes RestMLGuardrailsIT & ToolIntegrationWithLLMTest #3263

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixes Two Flaky IT classes RestMLGuardrailsIT & ToolIntegrationWithLLMTest #3253

Fixes Two Flaky IT classes RestMLGuardrailsIT & ToolIntegrationWithLLMTest #3253

brianf-aws commented Dec 5, 2024 •

edited

Loading

dhrubo-os Dec 5, 2024 •

edited

Loading

brianf-aws Dec 5, 2024 •

edited

Loading

dhrubo-os Dec 5, 2024

opensearch-trigger-bot bot commented Dec 6, 2024

Fixes Two Flaky IT classes RestMLGuardrailsIT & ToolIntegrationWithLLMTest #3253

Fixes Two Flaky IT classes RestMLGuardrailsIT & ToolIntegrationWithLLMTest #3253

Conversation

brianf-aws commented Dec 5, 2024 • edited Loading

Description

Related Issues

Check List

dhrubo-os Dec 5, 2024 • edited Loading

Choose a reason for hiding this comment

brianf-aws Dec 5, 2024 • edited Loading

Choose a reason for hiding this comment

dhrubo-os Dec 5, 2024

Choose a reason for hiding this comment

opensearch-trigger-bot bot commented Dec 6, 2024

brianf-aws commented Dec 5, 2024 •

edited

Loading

dhrubo-os Dec 5, 2024 •

edited

Loading

brianf-aws Dec 5, 2024 •

edited

Loading