(Backport 2.x) Fixes Two Flaky IT classes RestMLGuardrailsIT & ToolIntegrationWithLLMTest #3263

brianf-aws · 2024-12-09T19:32:07Z

This backport (2.x) improves two flaky IT classes (RestMLGuardrailsIT, ToolIntegrationWithLLMTest). It is a result of #3253.

Testing

./gradlew :opensearch-ml-plugin:compileJava
./gradlew test

ylwu-amzn · 2024-12-09T21:02:07Z

org.opensearch.ml.rest.RestMLRemoteInferenceIT > testOpenAITextEmbeddingModel_ISO8859_1 FAILED

But this UT was already removed in PR #3159. @dhrubo-os, can you check whether that PR was backported to 2.x?

dhrubo-os · 2024-12-09T21:11:22Z

org.opensearch.ml.rest.RestMLRemoteInferenceIT > testOpenAITextEmbeddingModel_ISO8859_1 FAILED

But this UT was already removed in PR #3159. @dhrubo-os, can you check whether that PR was backported to 2.x?

Yes: https://github.com/opensearch-project/ml-commons/pull/3163/files

@brianf-aws do you have all the updates from 2.x branch?

brianf-aws · 2024-12-09T21:21:47Z

Yes, good callout. I haven't updated my local 2.x branch in a while. Will remedy this ASAP.

…MTest (opensearch-project#3253)

* fix unneeded call to get model_id for task api within RestMLGuardrailsIT

Following opensearch-project#3244, this IT called the task API to check the model ID again; however, this is redundant. Instead, one can directly pull the model_id upon creating the model group. Manual testing was done to see that the behavior is intact. This should help reduce the calls within an IT to make it less flaky.

Signed-off-by: Brian Flores <[email protected]>

* fix ToolIntegrationWithLLMTest model undeploy race condition

Previously the test class attempted to delete a model without fully knowing whether the model was undeployed in time. This change adds a wait of up to 5 retries, 1 second apart, to check the status of the model, and only once it is undeployed does it proceed to delete the model. When the number of retries is exceeded, it throws an error indicating a deeper problem. Manual testing was done to check that the model is undeployed by searching for the specific model via the checkForModelUndeployedStatus method.

Signed-off-by: Brian Flores <[email protected]>

---------

Signed-off-by: Brian Flores <[email protected]>

(cherry picked from commit 1a659c8)
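The "wait until undeployed, then delete" behavior described in this commit message can be sketched roughly as follows. This is only an illustration and not the code from the PR: the class name, the `waitForUndeployed` helper, and the plain status strings are hypothetical stand-ins, and the status source is a fake supplier rather than a real call to the models API.

```java
import java.util.function.Supplier;

public class UndeployThenDelete {

    /**
     * Polls the status up to maxRetries times, pausing intervalMillis between
     * attempts, and returns as soon as the model reports UNDEPLOYED.
     * Throws if the retry budget is exhausted. (Hypothetical helper.)
     */
    public static void waitForUndeployed(Supplier<String> status, int maxRetries, long intervalMillis) {
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            if ("UNDEPLOYED".equals(status.get())) {
                return;
            }
            if (attempt < maxRetries) {
                try {
                    Thread.sleep(intervalMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting for undeploy", e);
                }
            }
        }
        throw new IllegalStateException("Model still not undeployed after " + maxRetries + " attempts");
    }

    public static void main(String[] args) {
        // Fake status source (assumption): DEPLOYED for two checks, then UNDEPLOYED.
        int[] checks = {0};
        Supplier<String> fakeStatus = () -> ++checks[0] < 3 ? "DEPLOYED" : "UNDEPLOYED";
        waitForUndeployed(fakeStatus, 5, 10);
        System.out.println("undeployed after " + checks[0] + " checks, safe to delete");
    }
}
```

Throwing once the budget is exhausted, rather than deleting anyway, keeps a slow undeploy visible as a test failure instead of hiding it behind a failed delete.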

brianf-aws · 2024-12-10T00:30:19Z

I am seeing that this still fails even with the retry. After some research, I gathered the following:

...
initializing REST clients against [http://[::1]:40677, http://127.0.0.1:43017, http://[::1]:34869, http://127.0.0.1:38417, http://[::1]:37987, http://127.0.0.1:40451]
...
org.opensearch.ml.tools.VisualizationsToolIT > testVisualizationFound STANDARD_OUT
    [2024-12-09T19:36:43,740][INFO ][o.o.m.t.VisualizationsToolIT] [testVisualizationFound] The 1-th attempt on GET:/_plugins/_ml/models/VpWRrZMBriS5AgkXTdYF . response: Response{requestLine=GET /_plugins/_ml/models/VpWRrZMBriS5AgkXTdYF HTTP/1.1, host=http://[::1]:37987, response=HTTP/1.1 200 OK}
    [2024-12-09T19:36:44,742][INFO ][o.o.m.t.VisualizationsToolIT] [testVisualizationFound] The 2-th attempt on GET:/_plugins/_ml/models/VpWRrZMBriS5AgkXTdYF . response: Response{requestLine=GET /_plugins/_ml/models/VpWRrZMBriS5AgkXTdYF HTTP/1.1, host=http://127.0.0.1:38417, response=HTTP/1.1 200 OK}
    [2024-12-09T19:36:45,745][INFO ][o.o.m.t.VisualizationsToolIT] [testVisualizationFound] The 3-th attempt on GET:/_plugins/_ml/models/VpWRrZMBriS5AgkXTdYF . response: Response{requestLine=GET /_plugins/_ml/models/VpWRrZMBriS5AgkXTdYF HTTP/1.1, host=http://[::1]:34869, response=HTTP/1.1 200 OK}
    [2024-12-09T19:36:46,749][INFO ][o.o.m.t.VisualizationsToolIT] [testVisualizationFound] The 4-th attempt on GET:/_plugins/_ml/models/VpWRrZMBriS5AgkXTdYF . response: Response{requestLine=GET /_plugins/_ml/models/VpWRrZMBriS5AgkXTdYF HTTP/1.1, host=http://127.0.0.1:43017, response=HTTP/1.1 200 OK}
    [2024-12-09T19:36:47,751][INFO ][o.o.m.t.VisualizationsToolIT] [testVisualizationFound] The 5-th attempt on GET:/_plugins/_ml/models/VpWRrZMBriS5AgkXTdYF . response: Response{requestLine=GET /_plugins/_ml/models/VpWRrZMBriS5AgkXTdYF HTTP/1.1, host=http://[::1]:40677, response=HTTP/1.1 200 OK}



org.opensearch.ml.tools.VisualizationsToolIT > testVisualizationFound STANDARD_ERROR
REPRODUCE WITH: ./gradlew ':opensearch-ml-plugin:integTest' --tests "org.opensearch.ml.tools.VisualizationsToolIT.testVisualizationFound" -Dtests.seed=8F817C566A4FD343 -Dtests.security.manager=false -Dtests.locale=ar-DZ -Dtests.timezone=America/Rosario -Druntime.java=21

The cluster here has 6 REST clients (http://[::1]:40677, http://127.0.0.1:43017, http://[::1]:34869, http://127.0.0.1:38417, http://[::1]:37987, http://127.0.0.1:40451), while the number of retries is hardcoded to 5. My speculation is that the REST client that had the correct result was not hit in time. During this specific test the requests went to ports 37987, 38417, 34869, 43017 and 40677 (5 in total) before the test threw the exception saying it took too long; port 40451 was never queried. Normally, when I run the REPRODUCE WITH command locally, it launches a simple cluster which, from what I can see, has 2 REST clients.
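To make the speculation concrete, here is a small simulation of the counting argument. It assumes each retry goes to the next host in a strict rotation, which is a simplification (the order in the log above is different, and the real client's selection may not be a plain rotation). The class and method names are made up for illustration, and the hosts are represented by their port numbers only.

```java
import java.util.ArrayList;
import java.util.List;

public class RoundRobinCoverage {

    /** Returns the hosts hit by `attempts` consecutive requests, rotating through `hosts` from index `start`. */
    public static List<String> hostsHit(List<String> hosts, int start, int attempts) {
        List<String> hit = new ArrayList<>();
        for (int i = 0; i < attempts; i++) {
            hit.add(hosts.get((start + i) % hosts.size()));
        }
        return hit;
    }

    public static void main(String[] args) {
        // Ports taken from the "initializing REST clients" log line.
        List<String> hosts = List.of("40677", "43017", "34869", "38417", "37987", "40451");

        // With the hardcoded 5 attempts, one of the 6 hosts is never queried.
        List<String> missed = new ArrayList<>(hosts);
        missed.removeAll(hostsHit(hosts, 0, 5));
        System.out.println("never queried with 5 attempts: " + missed);

        // With as many attempts as hosts, every host is queried once.
        missed = new ArrayList<>(hosts);
        missed.removeAll(hostsHit(hosts, 0, hosts.size()));
        System.out.println("never queried with 6 attempts: " + missed);
    }
}
```

Under that rotation assumption, any retry count smaller than the number of hosts leaves at least one host unvisited, whichever host the rotation starts from.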

I also checked that the test class VisualizationsToolIT follows the inheritance chain VisualizationsToolIT -> ToolIntegrationWithLLMTest -> RestBaseAgentToolsIT -> MLCommonsRestTestCase -> OpenSearchRestTestCase -> ... The last class listed (OpenSearchRestTestCase) states that it is meant to be run against an external cluster. This makes it harder to test locally, as the GitHub CI runs with a configuration that is unknown to us at runtime.

There is a method in OpenSearchRestTestCase:

    protected final List<HttpHost> getClusterHosts() {
        return clusterHosts;
    }

We can use the size of this list as the number of retries instead of the hardcoded value of 5. This way we can account for multi-node clusters of any size.
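A rough sketch of that idea, with the retry count derived from the host list at the moment it is needed. Everything here is hypothetical: `RetryBudget` is not a class in ml-commons, the host list is passed in as a plain supplier of strings instead of calling getClusterHosts(), and keeping the old value of 5 as a lower bound is an extra assumption on my side so that small clusters do not end up with fewer retries than before.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

public class RetryBudget {

    // The previously hardcoded value, kept as a floor (assumption, not from the PR).
    private static final int MIN_RETRIES = 5;

    private final Supplier<List<String>> clusterHosts;

    public RetryBudget(Supplier<List<String>> clusterHosts) {
        this.clusterHosts = clusterHosts;
    }

    /**
     * Computed on every call instead of once in a static initializer, so the
     * host list is only read after the cluster has been set up.
     */
    public int maxRetries() {
        List<String> hosts = clusterHosts.get();
        int hostCount = hosts == null ? 0 : hosts.size();
        return Math.max(MIN_RETRIES, hostCount);
    }

    public static void main(String[] args) {
        // Stand-in for the test framework's host list, empty until "the cluster forms".
        AtomicReference<List<String>> hosts = new AtomicReference<>();
        RetryBudget budget = new RetryBudget(hosts::get);

        System.out.println("before cluster setup: " + budget.maxRetries());
        hosts.set(List.of("h1", "h2", "h3", "h4", "h5", "h6"));
        System.out.println("with 6 hosts: " + budget.maxRetries());
    }
}
```

Reading the host list lazily matters because the list is only populated once the test clients are initialized; a value computed at class-load time would see an empty or missing list.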

To further this point: on the main branch, if you look at a recent passing build, you will see that this specific test ran against a cluster with 2 REST clients and passed:

VisualizationsToolIT > testVisualizationFound STANDARD_OUT
    [2024-12-07T01:14:53,093][INFO ][o.o.m.t.VisualizationsToolIT] [testVisualizationFound] before test
    [2024-12-07T01:14:53,098][INFO ][o.o.m.t.VisualizationsToolIT] [testVisualizationFound] initializing REST clients against [http://[::1]:39355, http://127.0.0.1:44083]
    [2024-12-07T01:14:54,466][INFO ][o.o.m.t.ToolIntegrationWithLLMTest] [testVisualizationFound] model_id: oPR4npMB3nb1h2n5Ia7C, agent_id: ovR4npMB3nb1h2n5Ja75

VisualizationsToolIT > testVisualizationFound STANDARD_ERROR
    dec. 07, 2024 1:14:54 AM org.opensearch.client.RestClient logResponse
    WARNING: request [POST http://[::1]:39355/.kibana/_doc/d22f6bee-71fb-422e-9d87-b1cb6b20b042?refresh=true] returned 1 warnings: [299 OpenSearch-3.0.0-SNAPSHOT-75a2fc3629260bb140e38368b5afb21f78345e79 "index name [.kibana] starts with a dot '.', in the next major version, index names starting with a dot are reserved for hidden indices and system indices"]

VisualizationsToolIT > testVisualizationFound STANDARD_OUT
    [2024-12-07T01:14:55,968][INFO ][o.o.m.t.VisualizationsToolIT] [testVisualizationFound] after test

VisualizationsToolIT > testVisualizationNotFound STANDARD_OUT
    [2024-12-07T01:14:55,971][INFO ][o.o.m.t.VisualizationsToolIT] [testVisualizationNotFound] before test
    [2024-12-07T01:14:57,327][INFO ][o.o.m.t.ToolIntegrationWithLLMTest] [testVisualizationNotFound] model_id: q_R4npMB3nb1h2n5LK7u, agent_id: rfR4npMB3nb1h2n5Ma4n
    [2024-12-07T01:14:58,661][INFO ][o.o.m.t.VisualizationsToolIT] [testVisualizationNotFound] after test


mingshl · 2024-12-30T22:06:58Z

REPRODUCE WITH: gradlew ':opensearch-ml-plugin:test' --tests "org.opensearch.ml.action.models.GetModelITTests.testGetModel_NullModelIdException" -Dtests.seed=4B0F5237C93E050D -Dtests.security.manager=false -Dtests.locale=id-ID -Dtests.timezone=Asia/Ust-Nera -Druntime.java=21
org.opensearch.ml.action.models.GetModelITTests > testGetModel_NullModelIdException FAILED
    OpenSearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.];
        at __randomizedtesting.SeedInfo.seed([4B0F5237C93E050D:438E9DB887F36DF8]:0)
        at app//org.opensearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:96)
        at app//org.opensearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:79)
        at app//org.opensearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:68)
        at app//org.opensearch.ml.action.MLCommonsIntegTestCase.loadIrisData(MLCommonsIntegTestCase.java:142)
        at app//org.opensearch.ml.action.models.GetModelITTests.setUp(GetModelITTests.java:29)

        Caused by:
        java.util.concurrent.TimeoutException: Timeout waiting for task.
            at org.opensearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:257)
            at org.opensearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:82)
            at org.opensearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:94)
            ... 4 more

brianf-aws requested review from b4sjoo, dhrubo-os, jngz-es, model-collapse, rbhavna, ylwu-amzn, zane-neo, Zhangxunmt, austintlee and HenryL27 as code owners December 9, 2024 19:32


brianf-aws mentioned this pull request Dec 9, 2024

[BUG]-(flaky tests) ITs involving models have race conditions #3237

Closed

brianf-aws force-pushed the backport/backport-3253-to-2.x branch from 7fdd0f2 to ee4845e Compare December 9, 2024 21:28


add retry according to how many rest clients are in a IT cluster

c511bef

Signed-off-by: Brian Flores <[email protected]>


fix retry initialization

6208dab

The MAX_RETRIES variable had to wait for the cluster to form before it could call to get the cluster size Signed-off-by: Brian Flores <[email protected]>


dhrubo-os approved these changes Dec 31, 2024


mingshl approved these changes Dec 31, 2024


dhrubo-os merged commit a2befde into opensearch-project:2.x Dec 31, 2024
15 checks passed
