Closed
Description
Here is my command:
```shell
nproc_per_node=2

CUDA_VISIBLE_DEVICES=0,1 \
NPROC_PER_NODE=$nproc_per_node \
swift sft \
    --model /root/data3/Qwen.Qwen2.5-VL-7B-Instruct \
    --model_type qwen2_5_vl \
    --train_type lora \
    --max_pixel 602112 \
    --dataset data2/train_mecg_converted_multimodal_grpo.jsonl \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    # --target_modules all-linear \
    --gradient_accumulation_steps 16 \
    --eval_steps 100 \
    --save_steps 100 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 20480 \
    --output_dir output \
    --system 'You are a helpful assistant.' \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --model_author swift \
    --model_name swift-robot \
    --deepspeed zero3
```
I'm fine-tuning qwen2_5_vl on two A6000s and hitting OOM. With the same parameters, fine-tuning under the llama-factory framework works fine, and GRPO training in swift3 also does not OOM.
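For reference, the allocator setting that the OOM messages below themselves suggest can be exported before launching; this is only the hint quoted from the error text, not a confirmed fix for this run:

```shell
# Allocator hint quoted verbatim in the OOM messages below; it mitigates
# fragmentation when "reserved but unallocated" memory is large.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```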
Error message:
Train: 0%| | 0/349 [00:00<?, ?it/s][INFO:swift] use_logits_to_keep: False
[rank1]: Traceback (most recent call last):
[rank1]: File "/conda/envs/swift3/lib/python3.10/site-packages/swift/cli/sft.py", line 7, in <module>
[rank1]: sft_main()
[rank1]: File "/conda/envs/swift3/lib/python3.10/site-packages/swift/llm/train/sft.py", line 274, in sft_main
[rank1]: return SwiftSft(args).main()
[rank1]: File "/conda/envs/swift3/lib/python3.10/site-packages/swift/llm/base.py", line 49, in main
[rank1]: result = self.run()
[rank1]: File "/conda/envs/swift3/lib/python3.10/site-packages/swift/llm/train/sft.py", line 122, in run
[rank1]: return self.train(trainer)
[rank1]: File "/conda/envs/swift3/lib/python3.10/site-packages/swift/llm/train/sft.py", line 183, in train
[rank1]: trainer.train(trainer.args.resume_from_checkpoint)
[rank1]: File "/conda/envs/swift3/lib/python3.10/site-packages/swift/trainers/mixin.py", line 419, in train
[rank1]: res = super().train(*args, **kwargs)
[rank1]: File "/conda/envs/swift3/lib/python3.10/site-packages/transformers/trainer.py", line 2206, in train
[rank1]: return inner_training_loop(
[rank1]: File "/conda/envs/swift3/lib/python3.10/site-packages/transformers/trainer.py", line 2548, in _inner_training_loop
[rank1]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank1]: File "/conda/envs/swift3/lib/python3.10/site-packages/swift/trainers/trainers.py", line 368, in training_step
[rank1]: return super().training_step(model, inputs, *args, **kwargs)
[rank1]: File "/conda/envs/swift3/lib/python3.10/site-packages/transformers/trainer.py", line 3749, in training_step
[rank1]: loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
[rank1]: File "/conda/envs/swift3/lib/python3.10/site-packages/swift/trainers/trainers.py", line 320, in compute_loss
[rank1]: outputs = model(**inputs)
[rank1]: File "/conda/envs/swift3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: File "/conda/envs/swift3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: File "/conda/envs/swift3/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1637, in forward
[rank1]: else self._run_ddp_forward(*inputs, **kwargs)
[rank1]: File "/conda/envs/swift3/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1464, in _run_ddp_forward
[rank1]: return self.module(*inputs, **kwargs) # type: ignore[index]
[rank1]: File "/conda/envs/swift3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: File "/conda/envs/swift3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
[rank1]: return inner()
[rank1]: File "/conda/envs/swift3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in inner
[rank1]: args_kwargs_result = hook(self, args, kwargs) # type: ignore[misc]
[rank1]: File "/conda/envs/swift3/lib/python3.10/site-packages/swift/llm/template/base.py", line 1285, in pre_forward_hook
[rank1]: kwargs = to_device(self._post_encode(model, old_kwargs), model.device)
[rank1]: File "/conda/envs/swift3/lib/python3.10/site-packages/swift/llm/template/template/qwen.py", line 349, in _post_encode
[rank1]: image_embeds = model.visual(pixel_values, grid_thw=image_grid_thw)
[rank1]: File "/conda/envs/swift3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: File "/conda/envs/swift3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: File "/conda/envs/swift3/lib/python3.10/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 492, in forward
[rank1]: hidden_states = blk(
[rank1]: File "/conda/envs/swift3/lib/python3.10/site-packages/transformers/modeling_layers.py", line 82, in __call__
[rank1]: return self._gradient_checkpointing_func(partial(super().__call__, **kwargs), *args)
[rank1]: File "/conda/envs/swift3/lib/python3.10/site-packages/swift/trainers/mixin.py", line 355, in _new_checkpoint
[rank1]: return old_checkpoint(*args, use_reentrant=use_reentrant, **kwargs)
[rank1]: File "/conda/envs/swift3/lib/python3.10/site-packages/torch/_compile.py", line 51, in inner
[rank1]: return disable_fn(*args, **kwargs)
[rank1]: File "/conda/envs/swift3/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 838, in _fn
[rank1]: return fn(*args, **kwargs)
[rank1]: File "/conda/envs/swift3/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 495, in checkpoint
[rank1]: ret = function(*args, **kwargs)
[rank1]: File "/conda/envs/swift3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: File "/conda/envs/swift3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: File "/conda/envs/swift3/lib/python3.10/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 286, in forward
[rank1]: hidden_states = hidden_states + self.attn(
[rank1]: File "/conda/envs/swift3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: File "/conda/envs/swift3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: File "/conda/envs/swift3/lib/python3.10/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 248, in forward
[rank1]: attn_output, _ = attention_interface(
[rank1]: File "/conda/envs/swift3/lib/python3.10/site-packages/transformers/integrations/sdpa_attention.py", line 66, in sdpa_attention_forward
[rank1]: attn_output = torch.nn.functional.scaled_dot_product_attention(
[rank1]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.18 GiB. GPU 1 has a total capacity of 47.54 GiB of which 1.78 GiB is free. Process 30048 has 45.75 GiB memory in use. Of the allocated memory 44.70 GiB is allocated by PyTorch, and 607.34 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[INFO:swift] last_model_checkpoint: None
[INFO:swift] best_model_checkpoint: None
[INFO:swift] images_dir: /root/data5/caohanyu/cause/cause/lora_qwen-vl-0418/ms-swift-main/output/Qwen.Qwen2.5-VL-7B-Instruct/v11-20250731-124901/images
[rank0]: Traceback (most recent call last):
[rank0]: File "/conda/envs/swift3/lib/python3.10/site-packages/swift/cli/sft.py", line 7, in <module>
[rank0]: sft_main()
[rank0]: File "/conda/envs/swift3/lib/python3.10/site-packages/swift/llm/train/sft.py", line 274, in sft_main
[rank0]: return SwiftSft(args).main()
[rank0]: File "/conda/envs/swift3/lib/python3.10/site-packages/swift/llm/base.py", line 49, in main
[rank0]: result = self.run()
[rank0]: File "/conda/envs/swift3/lib/python3.10/site-packages/swift/llm/train/sft.py", line 122, in run
[rank0]: return self.train(trainer)
[rank0]: File "/conda/envs/swift3/lib/python3.10/site-packages/swift/llm/train/sft.py", line 183, in train
[rank0]: trainer.train(trainer.args.resume_from_checkpoint)
[rank0]: File "/conda/envs/swift3/lib/python3.10/site-packages/swift/trainers/mixin.py", line 419, in train
[rank0]: res = super().train(*args, **kwargs)
[rank0]: File "/conda/envs/swift3/lib/python3.10/site-packages/transformers/trainer.py", line 2206, in train
[rank0]: return inner_training_loop(
[rank0]: File "/conda/envs/swift3/lib/python3.10/site-packages/transformers/trainer.py", line 2548, in _inner_training_loop
[rank0]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank0]: File "/conda/envs/swift3/lib/python3.10/site-packages/swift/trainers/trainers.py", line 368, in training_step
[rank0]: return super().training_step(model, inputs, *args, **kwargs)
[rank0]: File "/conda/envs/swift3/lib/python3.10/site-packages/transformers/trainer.py", line 3749, in training_step
[rank0]: loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
[rank0]: File "/conda/envs/swift3/lib/python3.10/site-packages/swift/trainers/trainers.py", line 320, in compute_loss
[rank0]: outputs = model(**inputs)
[rank0]: File "/conda/envs/swift3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/conda/envs/swift3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/conda/envs/swift3/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1637, in forward
[rank0]: else self._run_ddp_forward(*inputs, **kwargs)
[rank0]: File "/conda/envs/swift3/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1464, in _run_ddp_forward
[rank0]: return self.module(*inputs, **kwargs) # type: ignore[index]
[rank0]: File "/conda/envs/swift3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/conda/envs/swift3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
[rank0]: return inner()
[rank0]: File "/conda/envs/swift3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in inner
[rank0]: args_kwargs_result = hook(self, args, kwargs) # type: ignore[misc]
[rank0]: File "/conda/envs/swift3/lib/python3.10/site-packages/swift/llm/template/base.py", line 1285, in pre_forward_hook
[rank0]: kwargs = to_device(self._post_encode(model, old_kwargs), model.device)
[rank0]: File "/conda/envs/swift3/lib/python3.10/site-packages/swift/llm/template/template/qwen.py", line 349, in _post_encode
[rank0]: image_embeds = model.visual(pixel_values, grid_thw=image_grid_thw)
[rank0]: File "/conda/envs/swift3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/conda/envs/swift3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/conda/envs/swift3/lib/python3.10/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 491, in forward
[rank0]: attention_mask = self._prepare_attention_mask(hidden_states, cu_seqlens_now)
[rank0]: File "/conda/envs/swift3/lib/python3.10/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 434, in _prepare_attention_mask
[rank0]: attention_mask = torch.full(
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.15 GiB. GPU 0 has a total capacity of 47.54 GiB of which 188.75 MiB is free. Process 30047 has 47.34 GiB memory in use. Of the allocated memory 45.83 GiB is allocated by PyTorch, and 1.07 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
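The rank0 failure above is a 1.15 GiB allocation inside `_prepare_attention_mask`, i.e. the dense attention mask for the vision tower. A rough sanity check lands in the same ballpark; note that the 14x14 ViT patch size and the bfloat16 `(1, N, N)` mask shape are assumptions here, not values read from the log:

```python
# Back-of-the-envelope for the dense visual attention mask.
# Assumed: 14x14 ViT patches (PATCH) and a bfloat16 (2-byte) mask of shape
# (1, N, N) over the packed patch sequence of length N.

PATCH = 14
BYTES_PER_ELEM = 2  # bfloat16

max_pixels = 602112  # the --max_pixel value from the command above
patches_per_image = max_pixels // (PATCH * PATCH)  # 3072 patches per image

def mask_bytes(num_images: int) -> int:
    """Bytes needed for a dense (1, N, N) mask with num_images at max resolution."""
    n = num_images * patches_per_image
    return n * n * BYTES_PER_ELEM

# Eight max-resolution images packed together already need ~1.1 GiB for the
# mask alone, close to the 1.15 GiB allocation the traceback reports.
print(f"{patches_per_image} patches/image, 8 images: "
      f"{mask_bytes(8) / 2**30:.2f} GiB")
```

This is only an order-of-magnitude check, but it suggests the mask (and the SDPA buffers in the rank1 trace) grow quadratically with the number of high-resolution images packed into one forward pass.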
W0731 12:50:18.246886 12774 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 12869 closing signal SIGTERM
E0731 12:50:24.026723 12774 site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 1 (pid: 12870) of binary: /conda/envs/swift3/bin/python3.10
Traceback (most recent call last):
File "/conda/envs/swift3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/conda/envs/swift3/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/conda/envs/swift3/lib/python3.10/site-packages/torch/distributed/run.py", line 896, in <module>
main()
File "/conda/envs/swift3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
File "/conda/envs/swift3/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in main
run(args)
File "/conda/envs/swift3/lib/python3.10/site-packages/torch/distributed/run.py", line 883, in run
elastic_launch(
File "/conda/envs/swift3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/conda/envs/swift3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/conda/envs/swift3/lib/python3.10/site-packages/swift/cli/sft.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2025-07-31_12:50:18
host : 55e1b8138702
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 12870)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
examples/train/full/lora.sh: line 49: --gradient_accumulation_steps: command not found