Upgrade PyTorch Version #3628

Open: nathaliellenaa wants to merge 1 commit into main

nathaliellenaa
Contributor

Description

Upgrade PyTorch version from 1.13.1 to 2.5.1.

Related Issues

Resolves #3515

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.


@nathaliellenaa nathaliellenaa had a problem deploying to ml-commons-cicd-env-require-approval March 6, 2025 20:08 — with GitHub Actions Error
@nathaliellenaa nathaliellenaa had a problem deploying to ml-commons-cicd-env-require-approval March 6, 2025 20:08 — with GitHub Actions Failure
@nathaliellenaa nathaliellenaa had a problem deploying to ml-commons-cicd-env-require-approval March 6, 2025 20:08 — with GitHub Actions Error
@nathaliellenaa nathaliellenaa had a problem deploying to ml-commons-cicd-env-require-approval March 6, 2025 20:08 — with GitHub Actions Failure
@nathaliellenaa
Contributor Author

Failing tests due to flakiness:

This PR should fix the circuit breaker (CB) exception:

REPRODUCE WITH: ./gradlew ':opensearch-ml-plugin:integTest' --tests "org.opensearch.ml.rest.RestMLRemoteInferenceIT.testPredictWithAutoDeployAndTTL_RemoteModel" -Dtests.seed=65B2F7E3C09FCCCC -Dtests.security.manager=false -Dtests.locale=es-GT -Dtests.timezone=Etc/GMT-1 -Druntime.java=21
RestMLRemoteInferenceIT > testPredictWithAutoDeployAndTTL_RemoteModel FAILED
    org.opensearch.client.ResponseException: method [POST], host [http://[::1]:41277], URI [/_plugins/_ml/models/EAfIbZUB0zsuZWfWcmMw/_predict], status line [HTTP/1.1 429 Too Many Requests]
    {"error":{"root_cause":[{"type":"circuit_breaking_exception","reason":"Memory Circuit Breaker is open, please check your resources!","bytes_wanted":0,"bytes_limit":0,"durability":"TRANSIENT"}],"type":"circuit_breaking_exception","reason":"Memory Circuit Breaker is open, please check your resources!","bytes_wanted":0,"bytes_limit":0,"durability":"TRANSIENT"},"status":429}
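
Because the circuit breaker error above is reported as TRANSIENT with a 429 status, one common way to make a caller tolerant of it is to retry with backoff. The following is only a minimal sketch of that general technique, not necessarily what the referenced fix does; the class `RetryOn429`, the method `callWithRetry`, and the simulated request are hypothetical names introduced for illustration.

```java
import java.util.function.IntSupplier;

/**
 * Illustrative sketch only: a generic retry loop for transient HTTP 429 responses.
 * The class and method names are hypothetical and not part of ml-commons.
 */
public class RetryOn429 {

    /**
     * Calls {@code request} until it returns a status other than 429 or
     * {@code maxAttempts} calls have been made, doubling the delay between calls.
     * Returns the last status code observed.
     */
    public static int callWithRetry(IntSupplier request, int maxAttempts, long initialDelayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMillis;
        int status = request.getAsInt();
        for (int attempt = 1; attempt < maxAttempts && status == 429; attempt++) {
            try {
                Thread.sleep(delay);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and stop retrying.
                Thread.currentThread().interrupt();
                return status;
            }
            delay *= 2; // exponential backoff
            status = request.getAsInt();
        }
        return status;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated endpoint: rejects the first two calls with 429, then succeeds.
        int status = callWithRetry(() -> ++calls[0] < 3 ? 429 : 200, 5, 10);
        System.out.println("status=" + status + " after " + calls[0] + " calls");
    }
}
```

In a test, the `IntSupplier` would wrap the actual predict request and return its HTTP status code; the attempt limit keeps a genuinely overloaded node from stalling the suite indefinitely.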

Related PR for the test_cohereInference_withDifferent_postProcessFunction flaky test

REPRODUCE WITH: ./gradlew ':opensearch-ml-plugin:integTest' --tests "org.opensearch.ml.rest.RestCohereInferenceIT.test_cohereInference_withDifferent_postProcessFunction" -Dtests.seed=81C0DAB2100A16D7 -Dtests.security.manager=false -Dtests.locale=uz-Cyrl -Dtests.timezone=Asia/Atyrau -Druntime.java=21
RestCohereInferenceIT > test_cohereInference_withDifferent_postProcessFunction FAILED
    java.lang.AssertionError: failed to run test with test name: connector.post_process.cohere_v2.embedding.ubinary_test

The testPredictionWithDataFrame_FitRCF flaky test is being addressed in this PR

REPRODUCE WITH: ./gradlew ':opensearch-ml-plugin:test' --tests "org.opensearch.ml.action.prediction.PredictionITTests.testPredictionWithDataFrame_FitRCF" -Dtests.seed=B65A434F8ABE4302 -Dtests.security.manager=false -Dtests.locale=lij-IT -Dtests.timezone=Atlantic/South_Georgia -Druntime.java=23
PredictionITTests > testPredictionWithDataFrame_FitRCF FAILED
  java.util.ConcurrentModificationException

@nathaliellenaa nathaliellenaa temporarily deployed to ml-commons-cicd-env-require-approval March 15, 2025 01:13 — with GitHub Actions Inactive
@nathaliellenaa nathaliellenaa had a problem deploying to ml-commons-cicd-env-require-approval March 15, 2025 01:13 — with GitHub Actions Failure
@nathaliellenaa nathaliellenaa temporarily deployed to ml-commons-cicd-env-require-approval March 15, 2025 01:13 — with GitHub Actions Inactive
@nathaliellenaa nathaliellenaa had a problem deploying to ml-commons-cicd-env-require-approval March 15, 2025 01:13 — with GitHub Actions Error
@mingshl
Collaborator

mingshl commented Mar 15, 2025

@nathaliellenaa can you look into this flaky test? It seems like an easy fix.

Just try adding a null check on the value returned by Map.get() before calling getClass() on it.

REPRODUCE WITH: ./gradlew ':opensearch-ml-plugin:integTest' --tests "org.opensearch.ml.rest.RestCohereInferenceIT.test_cohereInference_withDifferent_postProcessFunction" -Dtests.seed=F927A39C1DD2B916 -Dtests.security.manager=false -Dtests.locale=uz-AF -Dtests.timezone=Europe/Zurich -Druntime.java=21
RestCohereInferenceIT > test_cohereInference_withDifferent_postProcessFunction STANDARD_ERROR
    REPRODUCE WITH: ./gradlew ':opensearch-ml-plugin:integTest' --tests "org.opensearch.ml.rest.RestCohereInferenceIT.test_cohereInference_withDifferent_postProcessFunction" -Dtests.seed=F927A39C1DD2B916 -Dtests.security.manager=false -Dtests.locale=uz-AF -Dtests.timezone=Europe/Zurich -Druntime.java=21

RestCohereInferenceIT > test_cohereInference_withDifferent_postProcessFunction FAILED
    java.lang.NullPointerException: Cannot invoke "Object.getClass()" because the return value of "java.util.Map.get(Object)" is null
        at __randomizedtesting.SeedInfo.seed([F927A39C1DD2B916:AA93BEE3AD0CC3E]:0)
        at org.opensearch.ml.rest.RestCohereInferenceIT.validateOutput(RestCohereInferenceIT.java:93)
        at org.opensearch.ml.rest.RestCohereInferenceIT.test_cohereInference_withDifferent_postProcessFunction(RestCohereInferenceIT.java:81)
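
The stack trace above shows `getClass()` being called on a null result of `Map.get(Object)`. A null check of the kind suggested could look like the minimal sketch below. The class `NullSafeOutputCheck`, the method `requireValue`, and the key names are hypothetical and are not taken from `RestCohereInferenceIT`; the actual fix in the test may differ.

```java
import java.util.Map;

/**
 * Illustrative sketch only: the helper and key names are hypothetical,
 * not taken from RestCohereInferenceIT.
 */
public class NullSafeOutputCheck {

    /**
     * Returns the value stored under {@code key}, or fails with a descriptive
     * message when the key is absent, instead of letting a later
     * value.getClass() call throw a NullPointerException.
     */
    public static Object requireValue(Map<String, Object> output, String key) {
        Object value = output.get(key);
        // Map.get returns null for a missing key, so check before dereferencing.
        if (value == null) {
            throw new AssertionError("expected key '" + key + "' in output, found keys: " + output.keySet());
        }
        return value;
    }

    public static void main(String[] args) {
        Map<String, Object> output = Map.of("response", "ok");
        // Safe: the value is known to be non-null before getClass() is called.
        System.out.println(requireValue(output, "response").getClass().getSimpleName());
        try {
            requireValue(output, "missing");
        } catch (AssertionError e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Failing with an explicit message that lists the keys actually present makes a flaky run easier to diagnose than a bare NullPointerException.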

RestCohereInferenceIT STANDARD_ERROR
    NOTE: leaving temporary files on disk at: /__w/ml-commons/ml-commons/plugin/build/testrun/integTest/temp/org.opensearch.ml.rest.RestCohereInferenceIT_F927A39C1DD2B916-001
    NOTE: test params are: codec=Asserting(Lucene101): {}, docValues:{}, maxPointsInLeafNode=811, maxMBSortInHeap=6.325716610231618, sim=Asserting(RandomSimilarity(queryNorm=false): {}), locale=uz-AF, timezone=Europe/Zurich
    NOTE: Linux 6.8.0-1021-azure amd64/Azul Systems, Inc. 21.0.6 (64-bit)/cpus=4,threads=1,free=383631592,total=536870912
    NOTE: All tests run in this JVM: [MLModelAutoReDeployerIT, RestBedRockInferenceIT, RestCohereInferenceIT]
  1> [2025-03-15T02:49:47,022][INFO ][o.o.m.r.RestCohereInferenceIT] [test_cohereInference_withDifferent_postProcessFunction] before test
  1> [2025-03-15T02:49:47,027][INFO ][o.o.m.r.RestCohereInferenceIT] [test_cohereInference_withDifferent_postProcessFunction] initializing REST clients against [http://[::1]:37889, http://127.0.0.1:38665,/ http://[::1]:38649, http://127.0.0.1:40951,/ http://[::1]:45781, http://127.0.0.1:41155]/
  1> [2025-03-15T02:49:49,866][INFO ][o.o.m.r.RestCohereInferenceIT] [test_cohereInference_withDifferent_postProcessFunction] after test


Suite: Test class org.opensearch.ml.rest.RestCohereInferenceIT
  2> REPRODUCE WITH: ./gradlew ':opensearch-ml-plugin:integTest' --tests "org.opensearch.ml.rest.RestCohereInferenceIT.test_cohereInference_withDifferent_postProcessFunction" -Dtests.seed=F927A39C1DD2B916 -Dtests.security.manager=false -Dtests.locale=uz-AF -Dtests.timezone=Europe/Zurich -Druntime.java=21
  2> java.lang.NullPointerException: Cannot invoke "Object.getClass()" because the return value of "java.util.Map.get(Object)" is null
        at __randomizedtesting.SeedInfo.seed([F927A39C1DD2B916:AA93BEE3AD0CC3E]:0)
        at org.opensearch.ml.rest.RestCohereInferenceIT.validateOutput(RestCohereInferenceIT.java:93)
        at org.opensearch.ml.rest.RestCohereInferenceIT.test_cohereInference_withDifferent_postProcessFunction(RestCohereInferenceIT.java:81)
  2> NOTE: leaving temporary files on disk at: /__w/ml-commons/ml-commons/plugin/build/testrun/integTest/temp/org.opensearch.ml.rest.RestCohereInferenceIT_F927A39C1DD2B916-001
  2> NOTE: test params are: codec=Asserting(Lucene101): {}, docValues:{}, maxPointsInLeafNode=811, maxMBSortInHeap=6.325716610231618, sim=Asserting(RandomSimilarity(queryNorm=false): {}), locale=uz-AF, timezone=Europe/Zurich
  2> NOTE: Linux 6.8.0-1021-azure amd64/Azul Systems, Inc. 21.0.6 (64-bit)/cpus=4,threads=1,free=383631592,total=536870912
  2> NOTE: All tests run in this JVM: [MLModelAutoReDeployerIT, RestBedRockInferenceIT, RestCohereInferenceIT]

RestConnectorToolIT > testConnectorToolInFlowAgent STANDARD_OUT
    [2025-03-15T04:49:49,965][INFO ][o.o.m.r.RestConnectorToolIT] [testConnectorToolInFlowAgent] before test
    [2025-03-15T04:49:49,972][INFO ][o.o.m.r.RestConnectorToolIT] [testConnectorToolInFlowAgent] initializing REST clients against [http://[::1]:37889, http://127.0.0.1:38665,/ http://[::1]:38649, http://127.0.0.1:40951,/ http://[::1]:45781, http://127.0.0.1:41155]/

RestConnectorToolIT > testConnectorToolInFlowAgent STANDARD_ERROR
    CKI 15, 2025 4:50:11 COMME org.opensearch.client.RestClient logResponse
    WARNING: request [DELETE http://127.0.0.1:40951/.plugins-ml-agent] returned 1 warnings: [299 OpenSearch-3.0.0-SNAPSHOT-127501789334d6deb19d206bf76d8475a9e27c54 "this request accesses system indices: [.plugins-ml-agent], but in a future major version, direct access to system indices will be prevented by default"]

@nathaliellenaa nathaliellenaa had a problem deploying to ml-commons-cicd-env-require-approval March 15, 2025 20:16 — with GitHub Actions Failure
@nathaliellenaa
Contributor Author

Sure @mingshl, I'll take a look at it.

@zane-neo
Collaborator

Has this PR been rebased onto the main branch? This IT failure (the NullPointerException in RestCohereInferenceIT.validateOutput) should already be fixed by this PR: https://github.com/opensearch-project/ml-commons/pull/3602/files#diff-56fa2e9a8e21fb83bd72bc755fdb3347504abdde7d47351cdee6e8ee71154697R93. If you check the current IT file, https://github.com/opensearch-project/ml-commons/blob/main/plugin/src/test/java/org/opensearch/ml/rest/RestCohereInferenceIT.java, there is no line 93.

@nathaliellenaa
Contributor Author

Thanks for the information @zane-neo. Will try to rebase and run the workflow again.

@nathaliellenaa nathaliellenaa force-pushed the update-pytorch-version branch from 6253623 to 68e1089 Compare March 18, 2025 16:50
@nathaliellenaa nathaliellenaa had a problem deploying to ml-commons-cicd-env-require-approval March 18, 2025 16:51 — with GitHub Actions Error
@nathaliellenaa nathaliellenaa had a problem deploying to ml-commons-cicd-env-require-approval March 18, 2025 16:51 — with GitHub Actions Error
@nathaliellenaa nathaliellenaa had a problem deploying to ml-commons-cicd-env-require-approval March 18, 2025 16:51 — with GitHub Actions Failure
@nathaliellenaa nathaliellenaa had a problem deploying to ml-commons-cicd-env-require-approval March 18, 2025 16:51 — with GitHub Actions Failure
@mingshl
Collaborator

mingshl commented Mar 18, 2025

After the rebase, there is a compilation issue. @nathaliellenaa, would you check this class in core, org.opensearch.cluster.node.DiscoveryNodeRole, and see whether there are dependency changes?

> Task :opensearch-ml-plugin:compileTestJava
/__w/ml-commons/ml-commons/plugin/src/test/java/org/opensearch/ml/utils/TestHelper.java:15: error: cannot find symbol
import static org.opensearch.cluster.node.DiscoveryNodeRole.SEARCH_ROLE;
^
  symbol:   static SEARCH_ROLE
  location: class DiscoveryNodeRole
/__w/ml-commons/ml-commons/plugin/src/test/java/org/opensearch/ml/utils/TestHelper.java:108: error: cannot find symbol
            new TreeSet<>(Arrays.asList(DATA_ROLE, INGEST_ROLE, CLUSTER_MANAGER_ROLE, REMOTE_CLUSTER_CLIENT_ROLE, SEARCH_ROLE, ML_ROLE))
                                                                                                                  ^
  symbol:   variable SEARCH_ROLE
  location: class TestHelper
/__w/ml-commons/ml-commons/plugin/src/test/java/org/opensearch/ml/rest/MLCommonsRestTestCase.java:927: error: cannot find symbol
            log.error(e.getMessage(), e);
            ^
  symbol:   variable log
  location: class MLCommonsRestTestCase
/__w/ml-commons/ml-commons/plugin/src/test/java/org/opensearch/ml/rest/RestCohereInferenceIT.java:55: error: cannot find symbol
            log.info("COHERE_KEY is null, skipping the test!");
            ^

@dhrubo-os
Collaborator

We need to merge this PR: #3667

@mingshl
Collaborator

mingshl commented Mar 18, 2025

#3667 is merged. @nathaliellenaa, try rebasing onto the commit from #3667.

Signed-off-by: Nathalie Jonathan <nathhjo@amazon.com>
@nathaliellenaa nathaliellenaa force-pushed the update-pytorch-version branch from 68e1089 to 28ff492 Compare March 18, 2025 17:36
@nathaliellenaa nathaliellenaa had a problem deploying to ml-commons-cicd-env-require-approval March 18, 2025 17:37 — with GitHub Actions Error
@nathaliellenaa nathaliellenaa temporarily deployed to ml-commons-cicd-env-require-approval March 18, 2025 17:37 — with GitHub Actions Inactive
@nathaliellenaa nathaliellenaa had a problem deploying to ml-commons-cicd-env-require-approval March 18, 2025 17:37 — with GitHub Actions Failure
@nathaliellenaa nathaliellenaa temporarily deployed to ml-commons-cicd-env-require-approval March 18, 2025 17:37 — with GitHub Actions Inactive
@nathaliellenaa nathaliellenaa requested a deployment to ml-commons-cicd-env-require-approval March 18, 2025 20:52 — with GitHub Actions Waiting
@nathaliellenaa nathaliellenaa requested a deployment to ml-commons-cicd-env-require-approval March 18, 2025 20:52 — with GitHub Actions Waiting
Successfully merging this pull request may close these issues.

[FEATURE] Upgrade Pytorch Version
6 participants