
Segmentation fault (SIGSEGV, exit code -11) while running internvl_chat_v1_5_internlm2_20b_dynamic_res_2nd_finetune_lora.sh #957

Open
NmTamil2 opened this issue Mar 18, 2025 · 1 comment

Comments


NmTamil2 commented Mar 18, 2025

Hello,

I have 2 x A100 GPUs (160 GB of VRAM in total). I attempted to fine-tune the OpenGVLab/InternVL-Chat-V1-5 model by following the steps in the InternVL documentation, but when I ran the fine-tuning shell script I hit the following error.

(ft-env) root@07c329c33692:/workspace/ft_InternVL/InternVL/internvl_chat# GPUS=2 PER_DEVICE_BATCH_SIZE=2 sh shell/internvl1.5/2nd_finetune/internvl_chat_v1_5_internlm2_20b_dynamic_res_2nd_finetune_lora.sh
+ GPUS=2
+ BATCH_SIZE=16
+ PER_DEVICE_BATCH_SIZE=2
+ GRADIENT_ACC=4
+ pwd
+ export PYTHONPATH=:/workspace/ft_InternVL/InternVL/internvl_chat
+ export MASTER_PORT=34229
+ export TF_CPP_MIN_LOG_LEVEL=3
+ export LAUNCHER=pytorch
+ OUTPUT_DIR=work_dirs/internvl_chat_v1_5/internvl_chat_v1_5_internlm2_20b_dynamic_res_2nd_finetune_lora
+ [ ! -d work_dirs/internvl_chat_v1_5/internvl_chat_v1_5_internlm2_20b_dynamic_res_2nd_finetune_lora ]
+ mkdir -p work_dirs/internvl_chat_v1_5/internvl_chat_v1_5_internlm2_20b_dynamic_res_2nd_finetune_lora
+ torchrun --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --nproc_per_node=2 --master_port=34229 internvl/train/internvl_chat_finetune.py --model_name_or_path /workspace/huggingface_cache/hub/models--OpenGVLab--InternVL-Chat-V1-5 --conv_style internlm2-chat --output_dir work_dirs/internvl_chat_v1_5/internvl_chat_v1_5_internlm2_20b_dynamic_res_2nd_finetune_lora --meta_path ./shell/data/internvl_1_2_finetune_custom.json --overwrite_output_dir True --force_image_size 448 --max_dynamic_patch 12 --down_sample_ratio 0.5 --drop_path_rate 0.0 --freeze_llm True --freeze_mlp True --freeze_backbone True --use_llm_lora 16 --vision_select_layer -1 --dataloader_num_workers 4 --bf16 True --num_train_epochs 1 --per_device_train_batch_size 2 --gradient_accumulation_steps 4 --evaluation_strategy no --save_strategy steps --save_steps 200 --save_total_limit 1 --learning_rate 2e-5 --weight_decay 0.05 --warmup_ratio 0.03 --lr_scheduler_type cosine --logging_steps 1 --max_seq_length 4096 --do_train True --grad_checkpoint True --group_by_length True --dynamic_image_size True --use_thumbnail True --ps_version v2 --deepspeed zero_stage3_config.json --report_to tensorboard
+ tee -a work_dirs/internvl_chat_v1_5/internvl_chat_v1_5_internlm2_20b_dynamic_res_2nd_finetune_lora/training_log.txt
W0318 06:16:30.522000 3264 torch/distributed/run.py:792] 
W0318 06:16:30.522000 3264 torch/distributed/run.py:792] *****************************************
W0318 06:16:30.522000 3264 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0318 06:16:30.522000 3264 torch/distributed/run.py:792] *****************************************
[2025-03-18 06:16:36,811] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-03-18 06:16:37,037] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
W0318 06:16:37.379000 3264 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3407 closing signal SIGTERM
E0318 06:16:37.598000 3264 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -11) local_rank: 0 (pid: 3406) of binary: /workspace/ft_InternVL/ft-env/bin/python
Traceback (most recent call last):
  File "/workspace/ft_InternVL/ft-env/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/workspace/ft_InternVL/ft-env/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/workspace/ft_InternVL/ft-env/lib/python3.10/site-packages/torch/distributed/run.py", line 918, in main
    run(args)
  File "/workspace/ft_InternVL/ft-env/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/workspace/ft_InternVL/ft-env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/workspace/ft_InternVL/ft-env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
======================================================
internvl/train/internvl_chat_finetune.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-03-18_06:16:37
  host      : 07c329c33692
  rank      : 0 (local_rank: 0)
  exitcode  : -11 (pid: 3406)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 3406
======================================================
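
For anyone trying to narrow this down: rank 0 dies with SIGSEGV right after the DeepSpeed accelerator probe and before any training output, so the crash most likely happens while the script is still importing its CUDA extensions (this is often traced to a flash-attn or bitsandbytes wheel built against a different torch/CUDA version). Below is a minimal sketch for isolating the failing import, assuming an import-time crash; the candidate module list is a guess, not taken from the script, so adjust it to whatever internvl_chat_finetune.py actually pulls in.

```python
# Minimal sketch for isolating an import-time SIGSEGV (assumption: the crash
# happens during imports, since the worker dies before any training log line).
# Each candidate module is imported in its own subprocess so a segfault in one
# of them does not kill this probe script.
import subprocess
import sys

# Hypothetical candidate list -- adjust to the extensions your environment
# and internvl_chat_finetune.py actually load.
CANDIDATES = ["torch", "deepspeed", "flash_attn", "bitsandbytes", "transformers"]

for mod in CANDIDATES:
    proc = subprocess.run([sys.executable, "-c", f"import {mod}"])
    status = "ok" if proc.returncode == 0 else f"exit code {proc.returncode}"
    print(f"import {mod}: {status}")  # exit code -11 reproduces the SIGSEGV
```

Whichever import exits with -11 is the package to rebuild or reinstall against the installed torch/CUDA build.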
@ZenithWisp

@czczup @Weiyun1025 I am having the same problem. Could you please help fix this?
