I have 2 x A100 GPUs (160 GB of VRAM in total). I attempted to fine-tune the OpenGVLab/InternVL-Chat-V1-5 model by following the steps in the InternVL documentation, but when I ran the fine-tuning shell script I encountered the error below.
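For context, this is the effective batch-size arithmetic the script appears to use (a minimal Python sketch on my part, not taken verbatim from the script; the values match GPUS=2, BATCH_SIZE=16, PER_DEVICE_BATCH_SIZE=2 and GRADIENT_ACC=4 printed in the trace below):

# Sketch only: reproduces the relationship visible in the set -x trace.
GPUS = 2                  # processes launched by torchrun
BATCH_SIZE = 16           # fixed global batch size in the script
PER_DEVICE_BATCH_SIZE = 2 # value I passed on the command line
GRADIENT_ACC = BATCH_SIZE // (PER_DEVICE_BATCH_SIZE * GPUS)
print(GRADIENT_ACC)       # -> 4, matching GRADIENT_ACC=4 in the trace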
(ft-env) root@07c329c33692:/workspace/ft_InternVL/InternVL/internvl_chat# GPUS=2 PER_DEVICE_BATCH_SIZE=2 sh shell/internvl1.5/2nd_finetune/internvl_chat_v1_5_internlm2_20b_dynamic_res_2nd_finetune_lora.sh
+ GPUS=2
+ BATCH_SIZE=16
+ PER_DEVICE_BATCH_SIZE=2
+ GRADIENT_ACC=4
+ pwd
+ export PYTHONPATH=:/workspace/ft_InternVL/InternVL/internvl_chat
+ export MASTER_PORT=34229
+ export TF_CPP_MIN_LOG_LEVEL=3
+ export LAUNCHER=pytorch
+ OUTPUT_DIR=work_dirs/internvl_chat_v1_5/internvl_chat_v1_5_internlm2_20b_dynamic_res_2nd_finetune_lora
+ [ ! -d work_dirs/internvl_chat_v1_5/internvl_chat_v1_5_internlm2_20b_dynamic_res_2nd_finetune_lora ]
+ mkdir -p work_dirs/internvl_chat_v1_5/internvl_chat_v1_5_internlm2_20b_dynamic_res_2nd_finetune_lora
+ torchrun --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --nproc_per_node=2 --master_port=34229 internvl/train/internvl_chat_finetune.py --model_name_or_path /workspace/huggingface_cache/hub/models--OpenGVLab--InternVL-Chat-V1-5 --conv_style internlm2-chat --output_dir work_dirs/internvl_chat_v1_5/internvl_chat_v1_5_internlm2_20b_dynamic_res_2nd_finetune_lora --meta_path ./shell/data/internvl_1_2_finetune_custom.json --overwrite_output_dir True --force_image_size 448 --max_dynamic_patch 12 --down_sample_ratio 0.5 --drop_path_rate 0.0 --freeze_llm True --freeze_mlp True --freeze_backbone True --use_llm_lora 16 --vision_select_layer -1 --dataloader_num_workers 4 --bf16 True --num_train_epochs 1 --per_device_train_batch_size 2 --gradient_accumulation_steps 4 --evaluation_strategy no --save_strategy steps --save_steps 200 --save_total_limit 1 --learning_rate 2e-5 --weight_decay 0.05 --warmup_ratio 0.03 --lr_scheduler_type cosine --logging_steps 1 --max_seq_length 4096 --do_train True --grad_checkpoint True --group_by_length True --dynamic_image_size True --use_thumbnail True --ps_version v2 --deepspeed zero_stage3_config.json --report_to tensorboard
+ tee -a work_dirs/internvl_chat_v1_5/internvl_chat_v1_5_internlm2_20b_dynamic_res_2nd_finetune_lora/training_log.txt
W0318 06:16:30.522000 3264 torch/distributed/run.py:792]
W0318 06:16:30.522000 3264 torch/distributed/run.py:792] *****************************************
W0318 06:16:30.522000 3264 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0318 06:16:30.522000 3264 torch/distributed/run.py:792] *****************************************
[2025-03-18 06:16:36,811] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-03-18 06:16:37,037] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
W0318 06:16:37.379000 3264 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3407 closing signal SIGTERM
E0318 06:16:37.598000 3264 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -11) local_rank: 0 (pid: 3406) of binary: /workspace/ft_InternVL/ft-env/bin/python
Traceback (most recent call last):
File "/workspace/ft_InternVL/ft-env/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/workspace/ft_InternVL/ft-env/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
File "/workspace/ft_InternVL/ft-env/lib/python3.10/site-packages/torch/distributed/run.py", line 918, in main
run(args)
File "/workspace/ft_InternVL/ft-env/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
elastic_launch(
File "/workspace/ft_InternVL/ft-env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/workspace/ft_InternVL/ft-env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
internvl/train/internvl_chat_finetune.py FAILED
------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-03-18_06:16:37
host : 07c329c33692
rank : 0 (local_rank: 0)
exitcode : -11 (pid: 3406)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 3406
======================================================
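Since rank 0 segfaults (exit code -11) right after DeepSpeed's accelerator auto-detection and before any model-loading output, I suspect a broken or mismatched binary wheel rather than an out-of-memory condition. Here is the quick single-process check I intend to run inside the same ft-env environment to rule that out (my own sketch, not part of the InternVL scripts):

# Import the compiled extensions outside torchrun. If any import segfaults
# here as well, that wheel was likely built against a different torch/CUDA
# version and needs to be reinstalled for this environment.
import torch
print("torch", torch.__version__, "cuda", torch.version.cuda,
      "gpus", torch.cuda.device_count())

import deepspeed
print("deepspeed", deepspeed.__version__)

try:
    import flash_attn   # optional dependency, but a common source of segfaults
    print("flash_attn", flash_attn.__version__)
except ImportError:
    print("flash_attn not installed")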