@rithin-pullela-aws
Contributor

Description

We are seeing Disk Circuit breaker exceptions on CI.

2> REPRODUCE WITH: ./gradlew ':opensearch-ml-plugin:test' --tests 'org.opensearch.ml.action.prediction.PredictionITTests.testPredictionWithDataFrame_FitRCF' -Dtests.seed=B829ADF0B377D5F4 -Dtests.security.manager=false -Dtests.locale=rm -Dtests.timezone=Canada/Newfoundland -Druntime.java=24
  2> CircuitBreakingException[Disk Circuit Breaker is open, please check your resources!]
        at __randomizedtesting.SeedInfo.seed([B829ADF0B377D5F4:201DE5E6606A60B0]:0)
        at org.opensearch.ml.utils.MLNodeUtils.checkOpenCircuitBreaker(MLNodeUtils.java:138)
        at org.opensearch.ml.task.MLTaskRunner.checkCBAndExecute(MLTaskRunner.java:157)

 ....

This happens because the default threshold is 5 GB, which is high for a CI environment (code ref).
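To illustrate why a 5 GB default trips easily on a small CI runner, here is a minimal sketch of a free-disk-space check. It assumes the breaker opens when free space falls below the threshold; `DiskBreakerSketch` and its methods are hypothetical names for illustration only and are not the ml-commons implementation.

```java
import java.io.File;

/** Hypothetical sketch of a free-disk-space circuit breaker; not the ml-commons class. */
public class DiskBreakerSketch {
    private static final long BYTES_PER_GB = 1024L * 1024L * 1024L;

    /** Assumption: the breaker is "open" (work is rejected) when free space is below the threshold. */
    public static boolean isOpen(long freeBytes, long thresholdGb) {
        return freeBytes < thresholdGb * BYTES_PER_GB;
    }

    /** Convenience overload that reads the usable space of the given path. */
    public static boolean isOpen(File path, long thresholdGb) {
        return isOpen(path.getUsableSpace(), thresholdGb);
    }

    public static void main(String[] args) {
        File workDir = new File(".");
        // On a runner with only a few GB free, the 5 GB check can report open
        // while a lower threshold would still let the tests proceed.
        System.out.println("open at 5 GB threshold: " + isOpen(workDir, 5));
        System.out.println("open at 1 GB threshold: " + isOpen(workDir, 1));
    }
}
```

With 3 GB free, for example, this check reports open at a 5 GB threshold but closed at a 1 GB threshold, which is the kind of gap a CI-specific threshold would cover.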

The failures happen during the setup part of the failing tests:
code ref

batchRcfModelId = trainBatchRCFWithDataFrame(500, false);
fitRcfModelId = trainFitRCFWithDataFrame(500, false);
linearRegressionModelId = trainLinearRegressionWithDataFrame(100, false);
logisticRegressionModelId = trainLogisticRegressionWithIrisData(irisIndexName, false);

I believe these operations involve indexing, which causes the free disk space to drop below the threshold.

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@rithin-pullela-aws
Contributor Author

rithin-pullela-aws commented Nov 11, 2025

503 Error:

> Could not HEAD 'https://ci.opensearch.org/ci/dbc/snapshots/maven/org/locationtech/spatial4j/spatial4j/0.7/spatial4j-0.7.pom'. Received status code 503 from server: Service Unavailable

This could be a transient issue. Can we rerun the Linux CI?

@rithin-pullela-aws
Contributor Author

The failing tests are not related to the circuit breaker; MLModelAutoReDeployerIT is a known flaky IT.

Can we rerun CI?

MLModelAutoReDeployerIT > testModelAutoRedeploy STANDARD_ERROR

    REPRODUCE WITH: ./gradlew ':opensearch-ml-plugin:integTest' --tests 'org.opensearch.ml.autoredeploy.MLModelAutoReDeployerIT.testModelAutoRedeploy' -Dtests.seed=50960C2AC81038BD -Dtests.security.manager=false -Dtests.locale=bgc-IN -Dtests.timezone=Asia/Hong_Kong -Druntime.java=21


Suite: Test class org.opensearch.ml.autoredeploy.MLModelAutoReDeployerIT
  2> नवम्बर १२, २०२५ ७:५६:०२ तडके org.opensearch.client.RestClient logResponse
  2> WARNING: request [DELETE http://[::1]:43281/.plugins-ml-model-group] returned 1 warnings: [299 OpenSearch-3.4.0-SNAPSHOT-ad9794b127b151c9749c3236cf052a1e58f9a9a5 "this request accesses system indices: [.plugins-ml-model-group], but in a future major version, direct access to system indices will be prevented by default"]
  2> नवम्बर १२, २०२५ ७:५६:०२ तडके org.opensearch.client.RestClient logResponse
  2> WARNING: request [DELETE http://127.0.0.1:46015/.plugins-ml-model] returned 1 warnings: [299 OpenSearch-3.4.0-SNAPSHOT-ad9794b127b151c9749c3236cf052a1e58f9a9a5 "this request accesses system indices: [.plugins-ml-model], but in a future major version, direct access to system indices will be prevented by default"]
  2> नवम्बर १२, २०२५ ७:५६:०२ तडके org.opensearch.client.RestClient logResponse
  2> WARNING: request [DELETE http://[::1]:43281/.plugins-ml-task] returned 1 warnings: [299 OpenSearch-3.4.0-SNAPSHOT-ad9794b127b151c9749c3236cf052a1e58f9a9a5 "this request accesses system indices: [.plugins-ml-task], but in a future major version, direct access to system indices will be prevented by default"]
  2> REPRODUCE WITH: ./gradlew ':opensearch-ml-plugin:integTest' --tests 'org.opensearch.ml.autoredeploy.MLModelAutoReDeployerIT.testModelAutoRedeploy' -Dtests.seed=50960C2AC81038BD -Dtests.security.manager=false -Dtests.locale=bgc-IN -Dtests.timezone=Asia/Hong_Kong -Druntime.java=21
  2> java.lang.AssertionError: method [POST], host [http://127.0.0.1:46015/], URI [/_plugins/_ml/models/null/_deploy], status line [HTTP/1.1 404 Not Found]
    {"error":{"root_cause":[{"type":"status_exception","reason":"Failed to find model"}],"type":"status_exception","reason":"Failed to find model"},"status":404}
        at __randomizedtesting.SeedInfo.seed([50960C2AC81038BD:4E1BF85EC13E6F41]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.opensearch.ml.autoredeploy.MLModelAutoReDeployerIT.lambda$prepareModel$0(MLModelAutoReDeployerIT.java:52)
        at org.opensearch.ml.rest.MLCommonsRestTestCase.verifyResponse(MLCommonsRestTestCase.java:686)
        at org.opensearch.ml.rest.MLCommonsRestTestCase.getTask(MLCommonsRestTestCase.java:650)
        at org.opensearch.ml.autoredeploy.MLModelAutoReDeployerIT.lambda$prepareModel$1(MLModelAutoReDeployerIT.java:37)
        at org.opensearch.ml.rest.MLCommonsRestTestCase.verifyResponse(MLCommonsRestTestCase.java:686)
        at org.opensearch.ml.rest.MLCommonsRestTestCase.getTask(MLCommonsRestTestCase.java:650)
        at org.opensearch.ml.autoredeploy.MLModelAutoReDeployerIT.prepareModel(MLModelAutoReDeployerIT.java:33)
        at org.opensearch.ml.autoredeploy.MLModelAutoReDeployerIT.testModelAutoRedeploy(MLModelAutoReDeployerIT.java:21)
  2> NOTE: leaving temporary files on disk at: /__w/ml-commons/ml-commons/plugin/build/testrun/integTest/temp/org.opensearch.ml.autoredeploy.MLModelAutoReDeployerIT_50960C2AC81038BD-001
  2> NOTE: test params are: codec=Asserting(Lucene103), sim=Asserting(RandomSimilarity(queryNorm=false): {}), locale=bgc-IN, timezone=Asia/Hong_Kong
  2> NOTE: Linux 6.11.0-1018-azure amd64/Azul Systems, Inc. 21.0.9 (64-bit)/cpus=4,threads=1,free=418687248,total=536870912
  2> NOTE: All tests run in this JVM: [HFAnalyzerIT, MLModelAutoReDeployerIT]

MLModelAutoReDeployerIT > testModelAutoRedeploy FAILED
    java.lang.AssertionError: method [POST], host [http://127.0.0.1:46015/], URI [/_plugins/_ml/models/null/_deploy], status line [HTTP/1.1 404 Not Found]
    {"error":{"root_cause":[{"type":"status_exception","reason":"Failed to find model"}],"type":"status_exception","reason":"Failed to find model"},"status":404}
        at __randomizedtesting.SeedInfo.seed([50960C2AC81038BD:4E1BF85EC13E6F41]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.opensearch.ml.autoredeploy.MLModelAutoReDeployerIT.lambda$prepareModel$0(MLModelAutoReDeployerIT.java:52)
        at org.opensearch.ml.rest.MLCommonsRestTestCase.verifyResponse(MLCommonsRestTestCase.java:686)
        at org.opensearch.ml.rest.MLCommonsRestTestCase.getTask(MLCommonsRestTestCase.java:650)
        at org.opensearch.ml.autoredeploy.MLModelAutoReDeployerIT.lambda$prepareModel$1(MLModelAutoReDeployerIT.java:37)
        at org.opensearch.ml.rest.MLCommonsRestTestCase.verifyResponse(MLCommonsRestTestCase.java:686)
        at org.opensearch.ml.rest.MLCommonsRestTestCase.getTask(MLCommonsRestTestCase.java:650)
        at org.opensearch.ml.autoredeploy.MLModelAutoReDeployerIT.prepareModel(MLModelAutoReDeployerIT.java:33)
        at org.opensearch.ml.autoredeploy.MLModelAutoReDeployerIT.testModelAutoRedeploy(MLModelAutoReDeployerIT.java:21)

MLModelAutoReDeployerIT STANDARD_ERROR
    NOTE: leaving temporary files on disk at: /__w/ml-commons/ml-commons/plugin/build/testrun/integTest/temp/org.opensearch.ml.autoredeploy.MLModelAutoReDeployerIT_50960C2AC81038BD-001
    NOTE: test params are: codec=Asserting(Lucene103), sim=Asserting(RandomSimilarity(queryNorm=false): {}), locale=bgc-IN, timezone=Asia/Hong_Kong
    NOTE: Linux 6.11.0-1018-azure amd64/Azul Systems, Inc. 21.0.9 (64-bit)/cpus=4,threads=1,free=418687248,total=536870912
    NOTE: All tests run in this JVM: [HFAnalyzerIT, MLModelAutoReDeployerIT]
  1> [2025-11-12T07:55:58,931][INFO ][o.o.m.a.MLModelAutoReDeployerIT] [testModelAutoRedeploy] before test
  1> [2025-11-12T07:55:58,938][INFO ][o.o.m.a.MLModelAutoReDeployerIT] [testModelAutoRedeploy] initializing REST clients against [http://[::1]:43281, http://127.0.0.1:46015]/
  1> [2025-11-12T07:56:02,529][INFO ][o.o.m.a.MLModelAutoReDeployerIT] [testModelAutoRedeploy] There are still tasks running after this test that might break subsequent tests [indices:data/write/bulk[s], indices:data/write/bulk[s][p], indices:data/write/index].
  1> [2025-11-12T07:56:02,531][INFO ][o.o.m.a.MLModelAutoReDeployerIT] [testModelAutoRedeploy] after test

@codecov

codecov bot commented Nov 12, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 80.16%. Comparing base (a00b7de) to head (6469257).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #4413      +/-   ##
============================================
+ Coverage     80.13%   80.16%   +0.03%     
- Complexity    10217    10225       +8     
============================================
  Files           855      858       +3     
  Lines         44439    44497      +58     
  Branches       5142     5145       +3     
============================================
+ Hits          35610    35673      +63     
+ Misses         6665     6663       -2     
+ Partials       2164     2161       -3     
Flag Coverage Δ
ml-commons 80.16% <ø> (+0.03%) ⬆️

@ylwu-amzn ylwu-amzn merged commit c243f8a into opensearch-project:main Nov 12, 2025
30 of 36 checks passed

5 participants