I ran the pipedream2bw branch with 6 pipeline stages on 48 GPUs and the loss went to NaN after about 16k steps. I used the following arguments:
```
DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"

NCCL_SOCKET_IFNAME=eth0 python -m torch.distributed.launch $DISTRIBUTED_ARGS \
    pretrain_gpt2.py \
    --tensor-model-parallel-size 1 \
    --pipeline-model-parallel-size 6 \
    --scatter-gather-tensors-in-pipeline \
    --num-layers 24 \
    --hidden-size 1024 \
    --num-attention-heads 16 \
    --seq-length 1024 \
    --max-position-embeddings 1024 \
    --micro-batch-size 4 \
    --global-batch-size 512 \
    --lr 0.00015 \
    --train-iters 500000 \
    --lr-decay-iters 320000 \
    --lr-decay-style cosine \
    --min-lr 0.00001 \
    --lr-warmup-fraction 0.01 \
    --data-path $DATA_PATH \
    --vocab-file gpt2-vocab.json \
    --merge-file gpt2-merges.txt \
    --split 949,50,1 \
    --log-interval 1 \
    --clip-grad 1.0 \
    --fp16 \
    --DDP-impl local \
    --loss-scale 16384 \
    --apply-query-key-layer-scaling \
    --bias-gelu-fusion \
    --bias-dropout-fusion \
    --exit-interval 320000 \
    --save $CHECKPOINT_PATH \
    --save-interval 300 \
    --load $CHECKPOINT_PATH \
    --max-num-ckpts 16 \
    --pipeline-no-flushes \
    --checkpoint-activations --checkpoint-num-layers 1
```
Am I invoking the 2BW training correctly? Also, in forward_step in pretrain_gpt2.py, the loss is averaged across data-parallel workers on every micro-batch. Could these per-micro-batch reductions be combined into a single reduction per global batch?
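To illustrate the second question, here is a rough sketch of the kind of deferred reduction I have in mind (this is not the existing Megatron-LM code; the function name and the data_parallel_group argument are placeholders): the per-micro-batch losses would be summed locally and all-reduced across the data-parallel group once per global batch, assuming the reduced value is only needed for reporting.

```
# Hypothetical sketch, not the current Megatron-LM code: defer the data-parallel
# loss reduction from once per micro-batch to once per global batch.
import torch
import torch.distributed as dist

def reduce_losses_once_per_batch(micro_batch_losses, data_parallel_group=None):
    """Average a global batch's micro-batch losses across data-parallel ranks
    with a single all-reduce, instead of one all-reduce per micro-batch.

    micro_batch_losses: list of scalar loss tensors, one per micro-batch,
                        used for reporting only (not for the backward pass).
    data_parallel_group: the data-parallel process group (assumed to exist).
    """
    # Sum the local micro-batch losses first; no communication needed here.
    local_sum = torch.stack([l.detach() for l in micro_batch_losses]).sum()

    # A single all-reduce for the whole global batch.
    dist.all_reduce(local_sum, op=dist.ReduceOp.SUM, group=data_parallel_group)

    world_size = dist.get_world_size(group=data_parallel_group)
    # Average over data-parallel ranks and over micro-batches.
    return local_sum / (world_size * len(micro_batch_losses))
```

As far as I can tell, if the averaged loss is only used for reporting, this would change only the communication frequency, not the training math, since the gradient all-reduce is handled separately by the DDP wrapper.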