
WandB callback fails on training end when eval dataset is provided #34701

eyalmazuz opened this issue Nov 12, 2024 · 0 comments
System Info

  • transformers version: 4.46.2
  • Platform: Linux-5.14.0-427.22.1.el9_4.x86_64-x86_64-with-glibc2.34
  • Python version: 3.11.10
  • Huggingface_hub version: 0.26.1
  • Safetensors version: 0.4.5
  • Accelerate version: 1.1.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.5.1+cu124 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: no

Who can help?

@muellerzr @SunMarc

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

(I reduced the code to the relevant parts)

    train_args = TrainingArguments(
        num_train_epochs=50,
        eval_strategy="epoch",
        logging_strategy="epoch",
        save_strategy="epoch",
        save_total_limit=3,
        report_to="wandb",
        run_name=name,
    )

    trainer = Trainer(
        args=train_args,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
    )

The issue is that when reporting to WandB, the `WandbCallback.on_train_end` hook creates a fake trainer at the following line of code:

    fake_trainer = Trainer(args=args, model=model, processing_class=tokenizer)

It reuses the same training arguments but does not pass any datasets to the fake trainer. Because my script sets `eval_strategy` to something other than `"no"`, and because WandB reporting is enabled, the `Trainer.__init__` validation fails and the following error is thrown at the end of training (see the workaround sketch after the traceback):

  File "/home/mazuze/NLP/Hebrew-LLM-Eval/sentence_ordering/train_model.py", line 278, in main
    trainer.train()
  File "/home/mazuze/.conda/envs/coherence/lib/python3.11/site-packages/transformers/trainer.py", line 2123, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/mazuze/.conda/envs/coherence/lib/python3.11/site-packages/transformers/trainer.py", line 2635, in _inner_training_loop
    self.control = self.callback_handler.on_train_end(args, self.state, self.control)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mazuze/.conda/envs/coherence/lib/python3.11/site-packages/transformers/trainer_callback.py", line 471, in on_train_end
    return self.call_event("on_train_end", args, state, control)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mazuze/.conda/envs/coherence/lib/python3.11/site-packages/transformers/trainer_callback.py", line 518, in call_event
    result = getattr(callback, event)(
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mazuze/.conda/envs/coherence/lib/python3.11/site-packages/transformers/integrations/integration_utils.py", line 919, in on_train_end
    fake_trainer = Trainer(args=args, model=model, processing_class=tokenizer)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mazuze/.conda/envs/coherence/lib/python3.11/site-packages/transformers/utils/deprecation.py", line 165, in wrapped_func
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/mazuze/.conda/envs/coherence/lib/python3.11/site-packages/transformers/trainer.py", line 418, in __init__
    raise ValueError(
ValueError: You have set `args.eval_strategy` to IntervalStrategy.EPOCH but you didn't pass an `eval_dataset` to `Trainer`. Either set `args.eval_strategy` to `no` or pass an `eval_dataset`.
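
For now, a possible workaround on my side is to subclass `WandbCallback` and temporarily disable `eval_strategy` while the base callback builds its fake trainer. This is only a minimal, untested sketch (the `PatchedWandbCallback` name is mine, and I'm assuming the base callback forwards the model/tokenizer through the keyword arguments unchanged):

    from transformers import IntervalStrategy
    from transformers.integrations import WandbCallback


    class PatchedWandbCallback(WandbCallback):
        # Hypothetical workaround, not the official fix: pretend evaluation is
        # disabled while the base callback constructs its internal fake Trainer,
        # then restore the original setting afterwards.
        def on_train_end(self, args, state, control, **kwargs):
            original_strategy = args.eval_strategy
            args.eval_strategy = IntervalStrategy.NO
            try:
                super().on_train_end(args, state, control, **kwargs)
            finally:
                args.eval_strategy = original_strategy

Something like `trainer.remove_callback(WandbCallback)` followed by `trainer.add_callback(PatchedWandbCallback)` before calling `trainer.train()` should register it in place of the default callback added by `report_to="wandb"`, but ideally the fix would live in `integration_utils.py` itself so the fake trainer can be built without an `eval_dataset`.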

Expected behavior

Training should finish without throwing an exception, and the `on_train_end` callbacks should run successfully.
