
How do I run distributed training with Omni? #210

@zlf0307

I am launching multi-GPU distributed training with the following command:

    CUDA_VISIBLE_DEVICES=5,6 python -m torch.distributed.run \
        main.py \
        --data_root ./text_spotting_datasets/ \
        --output_folder ./output/pretrain/stage1/ \
        --train_dataset totaltext_train mlt_train ic13_train ic15_train syntext1_train syntext2_train \
        --lr 0.0005 \
        --max_steps 400000 \
        --warmup_steps 5000 \
        --checkpoint_freq 10000 \
        --batch_size 6 \
        --tfm_pre_norm \
        --train_max_size 768 \
        --rec_loss_weight 2 \
        --use_fpn \
        --use_char_window_prompt

However, only GPU 5 is actually training; GPU 6 shows no memory usage.
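
A likely cause, though not confirmed in this thread: torch.distributed.run spawns only a single worker process unless --nproc_per_node is set, so with the command above only the first visible device (GPU 5) ever receives work. A minimal sketch of the adjusted launch, assuming main.py follows the standard torchrun pattern (reads LOCAL_RANK and calls torch.distributed.init_process_group):

    # Request one worker per visible GPU; torch.distributed.run defaults to 1,
    # which leaves the second device (GPU 6 here) completely idle.
    CUDA_VISIBLE_DEVICES=5,6 python -m torch.distributed.run \
        --nproc_per_node=2 \
        main.py \
        ...  # remaining training flags unchanged

If main.py never initializes a process group, the second worker will still sit idle, so it is worth checking that the training script actually sets up torch.distributed before adjusting the launcher.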
