
WandB callback fails on training end when eval dataset is provided #34701

eyalmazuz opened this issue Nov 12, 2024 · 0 comments
System Info

  • transformers version: 4.46.2
  • Platform: Linux-5.14.0-427.22.1.el9_4.x86_64-x86_64-with-glibc2.34
  • Python version: 3.11.10
  • Huggingface_hub version: 0.26.1
  • Safetensors version: 0.4.5
  • Accelerate version: 1.1.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.5.1+cu124 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: no

Who can help?

@muellerzr @SunMarc

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

(I reduced the code to the relevant parts)

    train_args = TrainingArguments(
        num_train_epochs=50,
        eval_strategy="epoch",
        logging_strategy="epoch",
        save_strategy="epoch",
        save_total_limit=3,
        report_to="wandb",
        run_name=name,
    )

    trainer = Trainer(
        args=train_args,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
    )

The issue is that when reporting to WandB, the `WandbCallback.on_train_end` hook creates a fake trainer at the following line of code:

    fake_trainer = Trainer(args=args, model=model, processing_class=tokenizer)

It reuses the same training arguments but does not pass any datasets to the fake trainer. Because my script sets `eval_strategy` to something other than `"no"`, and because WandB reporting is enabled, the `Trainer.__init__` validation fails and the following error is thrown at the end of training (see the workaround sketch after the traceback):

  File "/home/mazuze/NLP/Hebrew-LLM-Eval/sentence_ordering/train_model.py", line 278, in main
    trainer.train()
  File "/home/mazuze/.conda/envs/coherence/lib/python3.11/site-packages/transformers/trainer.py", line 2123, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/mazuze/.conda/envs/coherence/lib/python3.11/site-packages/transformers/trainer.py", line 2635, in _inner_training_loop
    self.control = self.callback_handler.on_train_end(args, self.state, self.control)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mazuze/.conda/envs/coherence/lib/python3.11/site-packages/transformers/trainer_callback.py", line 471, in on_train_end
    return self.call_event("on_train_end", args, state, control)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mazuze/.conda/envs/coherence/lib/python3.11/site-packages/transformers/trainer_callback.py", line 518, in call_event
    result = getattr(callback, event)(
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mazuze/.conda/envs/coherence/lib/python3.11/site-packages/transformers/integrations/integration_utils.py", line 919, in on_train_end
    fake_trainer = Trainer(args=args, model=model, processing_class=tokenizer)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mazuze/.conda/envs/coherence/lib/python3.11/site-packages/transformers/utils/deprecation.py", line 165, in wrapped_func
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/mazuze/.conda/envs/coherence/lib/python3.11/site-packages/transformers/trainer.py", line 418, in __init__
    raise ValueError(
ValueError: You have set `args.eval_strategy` to IntervalStrategy.EPOCH but you didn't pass an `eval_dataset` to `Trainer`. Either set `args.eval_strategy` to `no` or pass an `eval_dataset`.
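
For now, a possible workaround on my side is to subclass `WandbCallback` and temporarily disable `eval_strategy` while the base callback builds its fake trainer. This is only a minimal, untested sketch (the `PatchedWandbCallback` name is mine, and I'm assuming the base callback forwards the model/tokenizer through the keyword arguments unchanged):

    from transformers import IntervalStrategy
    from transformers.integrations import WandbCallback


    class PatchedWandbCallback(WandbCallback):
        # Hypothetical workaround, not the official fix: pretend evaluation is
        # disabled while the base callback constructs its internal fake Trainer,
        # then restore the original setting afterwards.
        def on_train_end(self, args, state, control, **kwargs):
            original_strategy = args.eval_strategy
            args.eval_strategy = IntervalStrategy.NO
            try:
                super().on_train_end(args, state, control, **kwargs)
            finally:
                args.eval_strategy = original_strategy

Something like `trainer.remove_callback(WandbCallback)` followed by `trainer.add_callback(PatchedWandbCallback)` before calling `trainer.train()` should register it in place of the default callback added by `report_to="wandb"`, but ideally the fix would live in `integration_utils.py` itself so the fake trainer can be built without an `eval_dataset`.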

Expected behavior

Training should finish without throwing an exception, and the `on_train_end` callbacks should run successfully.
