
CI fails on a few test_training_gradient_checkpointing tests for LLAMA #34722

Open
dvrogozh opened this issue Nov 14, 2024 · 1 comment · May be fixed by #34723
@dvrogozh (Contributor) commented:

With:

On:

  • Nvidia A10
  • Intel Max Series

The following 3 tests are failing:

  • tests/models/llama/test_modeling_llama.py::LlamaModelTest::test_training_gradient_checkpointing
  • tests/models/llama/test_modeling_llama.py::LlamaModelTest::test_training_gradient_checkpointing_use_reentrant
  • tests/models/llama/test_modeling_llama.py::LlamaModelTest::test_training_gradient_checkpointing_use_reentrant_false

See the log of the failure below. This is a regression introduced by commit 19d58d3 (PR #33703, "Add MLLama"), identified by bisecting:

$ git bisect log
git bisect start
# bad: [33eef992503689ba1af98090e26d3e98865b2a9b] Agents: Small fixes in streaming to gradio + add tests (#34549)
git bisect bad 33eef992503689ba1af98090e26d3e98865b2a9b
# good: [984bc11b0882ff1e5b34ba717ea357e069ceced9] Revert "fixes to properly shard FSDP across cpu and meta for cpu_effcient_loading for prequantized 4bit (#32276)" (#32477)
git bisect good 984bc11b0882ff1e5b34ba717ea357e069ceced9
# good: [80b90e7b2f7466ffb1d9036986e0699880d34284] Add codestral mamba2 (#32080)
git bisect good 80b90e7b2f7466ffb1d9036986e0699880d34284
# bad: [1ec7a70fef4158ab1ed660cba5126c8cde08c7e8] fix trainer tr_loss add error (#33651)
git bisect bad 1ec7a70fef4158ab1ed660cba5126c8cde08c7e8
# good: [7ed9789e210d8eca797fc21b9c783b1ce718ecb5] Fix: `num_logits_to_keep` in composite models (#33168)
git bisect good 7ed9789e210d8eca797fc21b9c783b1ce718ecb5
# good: [bcf8946f0acb578c534b1d33d534450d1fc88507] Fix number of patch check for different vision feature select strategy (#32494)
git bisect good bcf8946f0acb578c534b1d33d534450d1fc88507
# good: [dc8b6eaeeeb59dd3089b478cc09b577f2c62a297] Fix contrastive search to correctly handle input with padding (#33507)
git bisect good dc8b6eaeeeb59dd3089b478cc09b577f2c62a297
# good: [fa0bb0fe762c757203565a940c6e59a8d27537c4] Fix ByteLevel alphabet missing when Sequence pretokenizer is used (#33556)
git bisect good fa0bb0fe762c757203565a940c6e59a8d27537c4
# bad: [19d58d31f19049e8280ccb62a5b098d89909bf5a] Add MLLama (#33703)
git bisect bad 19d58d31f19049e8280ccb62a5b098d89909bf5a
# good: [7e638ef2b8650aaa3e3a8e575bb63af262a43d95] fix code quality after merge
git bisect good 7e638ef2b8650aaa3e3a8e575bb63af262a43d95
# good: [61e98cb957862d679c4a338319a386da197b8073] Add SDPA support for M2M100 (#33309)
git bisect good 61e98cb957862d679c4a338319a386da197b8073
# good: [ade9e0fe41a414c6a24a03a79c15798db609a6c9] Corrected max number for bf16 in transformer/docs (#33658)
git bisect good ade9e0fe41a414c6a24a03a79c15798db609a6c9
# good: [94f18cf23c128055a984ffbe9c57df133c1f6cc7] Add OmDet-Turbo (#31843)
git bisect good 94f18cf23c128055a984ffbe9c57df133c1f6cc7
# first bad commit: [19d58d31f19049e8280ccb62a5b098d89909bf5a] Add MLLama (#33703)

Log for one of the failures (others are similar):

$ python3 -m pytest --pspec tests/models/llama/test_modeling_llama.py::LlamaModelTest::test_training_gradient_checkpointing
============================================================================================ test session starts ============================================================================================
platform linux -- Python 3.10.12, pytest-7.4.4, pluggy-1.5.0
rootdir: /home/dvrogozh/git/huggingface/transformers
configfile: pyproject.toml
plugins: hypothesis-6.111.1, subtests-0.13.1, rich-0.1.1, dash-2.17.1, xdist-3.6.1, pspec-0.0.4, timeout-2.3.1
collected 1 item

tests/models/llama/test_modeling_llama.py                                                                                                                                                                    
Llama Model Test
 » training gradient checkpointing
                                                                                                                                                                                                      [100%]

================================================================================================== ERRORS ===================================================================================================
_________________________________________________________________ ERROR at teardown of LlamaModelTest.test_training_gradient_checkpointing __________________________________________________________________

self = <tests.models.llama.test_modeling_llama.LlamaModelTest testMethod=test_training_gradient_checkpointing>

    def test_training_gradient_checkpointing(self):
        # Scenario - 1 default behaviour
>       self.check_training_gradient_checkpointing()

tests/test_modeling_common.py:899:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <tests.models.llama.test_modeling_llama.LlamaModelTest testMethod=test_training_gradient_checkpointing>, gradient_checkpointing_kwargs = None

    def check_training_gradient_checkpointing(self, gradient_checkpointing_kwargs=None):
        if not self.model_tester.is_training:
            self.skipTest(reason="ModelTester is not configured to run training tests")

        for model_class in self.all_model_classes:
            with self.subTest(model_class.__name__):
                if (
                    model_class.__name__
                    in [
                        *get_values(MODEL_MAPPING_NAMES),
                        *get_values(MODEL_FOR_BACKBONE_MAPPING_NAMES),
                    ]
                    or not model_class.supports_gradient_checkpointing
                ):
                    self.skipTest(reason=f"`supports_gradient_checkpointing` is False for {model_class.__name__}.")

            config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
            config.use_cache = False
            config.return_dict = True
            model = model_class(config)

            model.to(torch_device)
            model.gradient_checkpointing_enable(gradient_checkpointing_kwargs=gradient_checkpointing_kwargs)
            model.train()

            # unfreeze additional layers
            for p in model.parameters():
                p.requires_grad_(True)

            optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

            inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True)
>           loss = model(**inputs).loss
E           AttributeError: 'BaseModelOutputWithPast' object has no attribute 'loss'

tests/test_modeling_common.py:868: AttributeError
============================================================================================= warnings summary ==============================================================================================
../../../../../usr/lib/python3.10/distutils/command/build_ext.py:13
  /usr/lib/python3.10/distutils/command/build_ext.py:13: DeprecationWarning: The distutils.sysconfig module is deprecated, use sysconfig instead
    from distutils.sysconfig import customize_compiler, get_python_version

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================================================================================== short test summary info ==========================================================================================
ERROR tests/models/llama/test_modeling_llama.py::LlamaModelTest::test_training_gradient_checkpointing - AttributeError: 'BaseModelOutputWithPast' object has no attribute 'loss'
================================================================================== 1 skipped, 1 warning, 1 error in 3.91s ===================================================================================
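
For context on the AttributeError above: the bare LlamaModel backbone returns a BaseModelOutputWithPast, which has no loss field; only the task heads, e.g. LlamaForCausalLM, return an output with a populated loss when labels are provided. A minimal sketch of the difference, using a tiny randomly initialized config purely for illustration (not the config the test suite instantiates):

import torch
from transformers import LlamaConfig, LlamaForCausalLM, LlamaModel

# Tiny config purely for illustration; not what the test suite uses.
config = LlamaConfig(
    vocab_size=64,
    hidden_size=32,
    intermediate_size=64,
    num_hidden_layers=2,
    num_attention_heads=4,
    num_key_value_heads=4,
)
input_ids = torch.randint(0, config.vocab_size, (1, 8))

# Bare backbone: forward() returns BaseModelOutputWithPast, which has no `loss`,
# so accessing `.loss` on it raises AttributeError.
base_out = LlamaModel(config)(input_ids=input_ids)
print(hasattr(base_out, "loss"))  # False

# Task head: forward() returns CausalLMOutputWithPast with a `loss` when
# `labels` are passed.
lm_out = LlamaForCausalLM(config)(input_ids=input_ids, labels=input_ids)
print(lm_out.loss is not None)  # True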

CC: @amyeroberts, @ArthurZucker

dvrogozh added a commit to dvrogozh/transformers that referenced this issue Nov 14, 2024
19d58d3 has introduced a context manager to manage subtests of
test_training_gradient_checkpointing. However, the test body was not
moved under the "with" statement. Thus, while the tests are correctly
marked as skipped, the test bodies were still executed. In some cases,
as with LLAMA, this caused attribute errors.

Fixes: huggingface#34722
Fixes: 19d58d3 ("Add MLLama (huggingface#33703)")
Signed-off-by: Dmitry Rogozhkin <[email protected]>
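
To illustrate the failure mode described above (a sketch of the general unittest pattern, not the exact transformers test code; class and parameter names are made up for the demo): self.skipTest() raises unittest.SkipTest, and when that happens inside a `with self.subTest(...)` block the subtest machinery catches it, marks the subtest as skipped, and lets execution continue after the block. Any test body left outside the `with` therefore still runs for the case that was meant to be skipped; indenting the body under the `with` lets the skip short-circuit it.

import unittest


class SubTestSkipDemo(unittest.TestCase):
    def test_body_outside_with(self):
        # Buggy shape: SkipTest raised inside subTest() is swallowed by the
        # subtest context manager, so the code after the `with` block still
        # executes for the iteration that was "skipped".
        for supported in (False, True):
            with self.subTest(supported=supported):
                if not supported:
                    self.skipTest("gradient checkpointing not supported")
            print(f"body executed, supported={supported}")  # prints twice

    def test_body_inside_with(self):
        # Fixed shape: with the body under the `with`, the SkipTest exception
        # unwinds past it, so the unsupported iteration never runs the body.
        for supported in (False, True):
            with self.subTest(supported=supported):
                if not supported:
                    self.skipTest("gradient checkpointing not supported")
                print(f"body executed, supported={supported}")  # prints once


if __name__ == "__main__":
    unittest.main()

Running this file shows the first test's body executing for both iterations, while the second only runs it for the supported case.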
dvrogozh linked a pull request (#34723) on Nov 14, 2024 that will close this issue
@dvrogozh (Contributor, Author) commented:

I've filed this PR with the fix: #34723
