(Backport 2.x) Fixes Two Flaky IT classes RestMLGuardrailsIT & ToolIntegrationWithLLMTest #3263

Merged


brianf-aws
Contributor

This backport to 2.x improves two flaky test classes (RestMLGuardrailsIT, ToolIntegrationWithLLMTest). It follows from #3253.

Testing

  • ./gradlew :opensearch-ml-plugin:compileJava
  • ./gradlew test

@ylwu-amzn
Collaborator

org.opensearch.ml.rest.RestMLRemoteInferenceIT > testOpenAITextEmbeddingModel_ISO8859_1 FAILED

But this UT was already removed in PR #3159. @dhrubo-os, can you check whether that PR was backported to 2.x?

@dhrubo-os
Collaborator

org.opensearch.ml.rest.RestMLRemoteInferenceIT > testOpenAITextEmbeddingModel_ISO8859_1 FAILED

But this UT was already removed in PR #3159. @dhrubo-os, can you check whether that PR was backported to 2.x?

Yes: https://github.com/opensearch-project/ml-commons/pull/3163/files

@brianf-aws do you have all the updates from 2.x branch?

@brianf-aws
Contributor Author

Yes, good callout. I haven't updated my local 2.x branch in a while. Will remedy this ASAP.

…MTest (opensearch-project#3253)

* fix unneeded call to get model_id for task api within RestMLGuardrailsIT

Following opensearch-project#3244, this IT called the task API to check the model ID again, but that call is redundant. Instead, the model_id can be pulled directly upon creating the model group. Manual testing was done to confirm that the behavior is intact; this should reduce the number of calls within the IT and make it less flaky.

Signed-off-by: Brian Flores <[email protected]>

* fix ToolIntegrationWithLLMTest model undeploy race condition

Previously the test class attempted to delete a model without knowing whether the model had been undeployed in time. This change adds a wait of up to 5 retries, 1 second apart, that checks the status of the model and proceeds to delete it only once it is undeployed. When the number of retries is exceeded, it throws an error indicating a deeper problem. Manual testing was done to check that the model is undeployed by searching for the specific model via the checkForModelUndeployedStatus method.

Signed-off-by: Brian Flores <[email protected]>

---------

Signed-off-by: Brian Flores <[email protected]>
(cherry picked from commit 1a659c8)
@brianf-aws brianf-aws force-pushed the backport/backport-3253-to-2.x branch from 7fdd0f2 to ee4845e Compare December 9, 2024 21:28
@brianf-aws
Contributor Author

brianf-aws commented Dec 10, 2024

I am seeing that this still fails even with the retry. After some research, I gathered the following:

...
initializing REST clients against [http://[::1]:40677, http://127.0.0.1:43017, http://[::1]:34869, http://127.0.0.1:38417, http://[::1]:37987, http://127.0.0.1:40451]
...
org.opensearch.ml.tools.VisualizationsToolIT > testVisualizationFound STANDARD_OUT
    [2024-12-09T19:36:43,740][INFO ][o.o.m.t.VisualizationsToolIT] [testVisualizationFound] The 1-th attempt on GET:/_plugins/_ml/models/VpWRrZMBriS5AgkXTdYF . response: Response{requestLine=GET /_plugins/_ml/models/VpWRrZMBriS5AgkXTdYF HTTP/1.1, host=http://[::1]:37987, response=HTTP/1.1 200 OK}
    [2024-12-09T19:36:44,742][INFO ][o.o.m.t.VisualizationsToolIT] [testVisualizationFound] The 2-th attempt on GET:/_plugins/_ml/models/VpWRrZMBriS5AgkXTdYF . response: Response{requestLine=GET /_plugins/_ml/models/VpWRrZMBriS5AgkXTdYF HTTP/1.1, host=http://127.0.0.1:38417, response=HTTP/1.1 200 OK}
    [2024-12-09T19:36:45,745][INFO ][o.o.m.t.VisualizationsToolIT] [testVisualizationFound] The 3-th attempt on GET:/_plugins/_ml/models/VpWRrZMBriS5AgkXTdYF . response: Response{requestLine=GET /_plugins/_ml/models/VpWRrZMBriS5AgkXTdYF HTTP/1.1, host=http://[::1]:34869, response=HTTP/1.1 200 OK}
    [2024-12-09T19:36:46,749][INFO ][o.o.m.t.VisualizationsToolIT] [testVisualizationFound] The 4-th attempt on GET:/_plugins/_ml/models/VpWRrZMBriS5AgkXTdYF . response: Response{requestLine=GET /_plugins/_ml/models/VpWRrZMBriS5AgkXTdYF HTTP/1.1, host=http://127.0.0.1:43017, response=HTTP/1.1 200 OK}
    [2024-12-09T19:36:47,751][INFO ][o.o.m.t.VisualizationsToolIT] [testVisualizationFound] The 5-th attempt on GET:/_plugins/_ml/models/VpWRrZMBriS5AgkXTdYF . response: Response{requestLine=GET /_plugins/_ml/models/VpWRrZMBriS5AgkXTdYF HTTP/1.1, host=http://[::1]:40677, response=HTTP/1.1 200 OK}



org.opensearch.ml.tools.VisualizationsToolIT > testVisualizationFound STANDARD_ERROR
REPRODUCE WITH: ./gradlew ':opensearch-ml-plugin:integTest' --tests "org.opensearch.ml.tools.VisualizationsToolIT.testVisualizationFound" -Dtests.seed=8F817C566A4FD343 -Dtests.security.manager=false -Dtests.locale=ar-DZ -Dtests.timezone=America/Rosario -Druntime.java=21

There are 6 REST clients here (http://[::1]:40677, http://127.0.0.1:43017, http://[::1]:34869, http://127.0.0.1:38417, http://[::1]:37987, http://127.0.0.1:40451), but the number of retries is currently hardcoded to 5. My speculation is that the REST client that had the correct result was not hit in time. In this specific run the test hit the addresses on ports 37987, 38417, 34869, 43017, and 40677 (5 in total) before throwing the exception that it took too long; the host on port 40451 was never queried. Normally, running the REPRODUCE WITH command launches a simple cluster with, from what I can see, 2 REST clients.
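The arithmetic behind this speculation can be sketched in a few lines of Java. This is only an illustration, not code from the PR: the class and method names are hypothetical, and it assumes that each attempt goes to the next host in a fixed rotation, which is what the log above suggests but does not prove. The hosts are listed in the order the log shows them being contacted, with the never-contacted host last.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration: which hosts are never reached when a fixed number
// of attempts rotates over the cluster hosts one at a time.
public class RoundRobinCoverage {

    // Assumes attempt i goes to hosts[i % hosts.size()] (a simple rotation).
    static List<String> unvisitedHosts(List<String> hosts, int attempts) {
        Set<String> visited = new LinkedHashSet<>();
        for (int i = 0; i < attempts; i++) {
            visited.add(hosts.get(i % hosts.size()));
        }
        List<String> unvisited = new ArrayList<>(hosts);
        unvisited.removeAll(visited);
        return unvisited;
    }

    public static void main(String[] args) {
        // Hosts from the failing CI run, in the order the log shows them being hit.
        List<String> hosts = List.of(
            "http://[::1]:37987",
            "http://127.0.0.1:38417",
            "http://[::1]:34869",
            "http://127.0.0.1:43017",
            "http://[::1]:40677",
            "http://127.0.0.1:40451"
        );
        // With 5 attempts one host is left out; with one attempt per host none are.
        System.out.println(unvisitedHosts(hosts, 5));            // [http://127.0.0.1:40451]
        System.out.println(unvisitedHosts(hosts, hosts.size())); // []
    }
}
```

Under that rotation assumption, any retry count smaller than the number of hosts leaves at least one host unqueried, which matches the 6-host CI run failing while 2-host runs pass.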

I also checked that the test class VisualizationsToolIT follows the inheritance chain VisualizationsToolIT -> ToolIntegrationWithLLMTest -> RestBaseAgentToolsIT -> MLCommonsRestTestCase -> OpenSearchRestTestCase -> ... The last class listed (OpenSearchRestTestCase) states that it is used against an external cluster. This makes the failure harder to test locally, as the GitHub CI runs a cluster configuration that is unknown to us until runtime.


There is a method in OpenSearchRestTestCase:

    protected final List<HttpHost> getClusterHosts() {
        return clusterHosts;
    }

We can use the size of this list as the number of retries instead of the hardcoded 5. This way we can account for multi-node clusters of any size.
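A minimal sketch of that idea follows. It is not the code merged in this PR: the class name, `maxRetries`, and `waitUntil` are hypothetical, the status check is passed in as a plain `BooleanSupplier` rather than a real REST call, and keeping 5 as a lower bound for small clusters is an assumption of this sketch, not something stated above.

```java
import java.util.List;
import java.util.function.BooleanSupplier;

// Hypothetical sketch: poll a condition with a retry budget derived from the
// number of cluster hosts instead of a hardcoded constant.
public class HostAwareRetry {

    // One attempt per host, but never fewer than `floor` attempts (assumed lower bound).
    static int maxRetries(List<?> clusterHosts, int floor) {
        return Math.max(floor, clusterHosts.size());
    }

    // Returns the 1-based attempt on which the condition first held,
    // or throws once the retry budget is exhausted.
    static int waitUntil(BooleanSupplier condition, int maxRetries, long delayMillis) {
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            if (condition.getAsBoolean()) {
                return attempt;
            }
            if (attempt < maxRetries) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting", e);
                }
            }
        }
        throw new IllegalStateException("condition not met after " + maxRetries + " attempts");
    }

    public static void main(String[] args) {
        // Stand-in for getClusterHosts(); a real test would pass the actual host list.
        List<String> hosts = List.of("h1", "h2", "h3", "h4", "h5", "h6");
        int retries = maxRetries(hosts, 5);
        int[] calls = {0};
        // Simulated check that only succeeds on the sixth call.
        int attempt = waitUntil(() -> ++calls[0] >= 6, retries, 10L);
        System.out.println("succeeded on attempt " + attempt + " of " + retries);
    }
}
```

In a test class the retry count would have to be computed after the cluster hosts are known, for example inside the wait method itself rather than in a static initializer, since the host list is only populated once the REST clients have been initialized.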


To reinforce this point: on the main branch, a recent passing build shows that this specific test ran against a cluster with 2 REST clients and passed.

VisualizationsToolIT > testVisualizationFound STANDARD_OUT
    [2024-12-07T01:14:53,093][INFO ][o.o.m.t.VisualizationsToolIT] [testVisualizationFound] before test
    [2024-12-07T01:14:53,098][INFO ][o.o.m.t.VisualizationsToolIT] [testVisualizationFound] initializing REST clients against [http://[::1]:39355, http://127.0.0.1:44083]
    [2024-12-07T01:14:54,466][INFO ][o.o.m.t.ToolIntegrationWithLLMTest] [testVisualizationFound] model_id: oPR4npMB3nb1h2n5Ia7C, agent_id: ovR4npMB3nb1h2n5Ja75

VisualizationsToolIT > testVisualizationFound STANDARD_ERROR
    dec. 07, 2024 1:14:54 AM org.opensearch.client.RestClient logResponse
    WARNING: request [POST http://[::1]:39355/.kibana/_doc/d22f6bee-71fb-422e-9d87-b1cb6b20b042?refresh=true] returned 1 warnings: [299 OpenSearch-3.0.0-SNAPSHOT-75a2fc3629260bb140e38368b5afb21f78345e79 "index name [.kibana] starts with a dot '.', in the next major version, index names starting with a dot are reserved for hidden indices and system indices"]

VisualizationsToolIT > testVisualizationFound STANDARD_OUT
    [2024-12-07T01:14:55,968][INFO ][o.o.m.t.VisualizationsToolIT] [testVisualizationFound] after test

VisualizationsToolIT > testVisualizationNotFound STANDARD_OUT
    [2024-12-07T01:14:55,971][INFO ][o.o.m.t.VisualizationsToolIT] [testVisualizationNotFound] before test
    [2024-12-07T01:14:57,327][INFO ][o.o.m.t.ToolIntegrationWithLLMTest] [testVisualizationNotFound] model_id: q_R4npMB3nb1h2n5LK7u, agent_id: rfR4npMB3nb1h2n5Ma4n
    [2024-12-07T01:14:58,661][INFO ][o.o.m.t.VisualizationsToolIT] [testVisualizationNotFound] after test

The MAX_RETRIES variable had to wait for the cluster to form before it could call to get the cluster size

Signed-off-by: Brian Flores <[email protected]>
@mingshl
Collaborator

mingshl commented Dec 30, 2024

REPRODUCE WITH: gradlew ':opensearch-ml-plugin:test' --tests "org.opensearch.ml.action.models.GetModelITTests.testGetModel_NullModelIdException" -Dtests.seed=4B0F5237C93E050D -Dtests.security.manager=false -Dtests.locale=id-ID -Dtests.timezone=Asia/Ust-Nera -Druntime.java=21
org.opensearch.ml.action.models.GetModelITTests > testGetModel_NullModelIdException FAILED
    OpenSearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.];
        at __randomizedtesting.SeedInfo.seed([4B0F5237C93E050D:438E9DB887F36DF8]:0)
        at app//org.opensearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:96)
        at app//org.opensearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:79)
        at app//org.opensearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:68)
        at app//org.opensearch.ml.action.MLCommonsIntegTestCase.loadIrisData(MLCommonsIntegTestCase.java:142)
        at app//org.opensearch.ml.action.models.GetModelITTests.setUp(GetModelITTests.java:29)

        Caused by:
        java.util.concurrent.TimeoutException: Timeout waiting for task.
            at org.opensearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:257)
            at org.opensearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:82)
            at org.opensearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:94)
            ... 4 more

@dhrubo-os dhrubo-os merged commit a2befde into opensearch-project:2.x Dec 31, 2024
15 checks passed