GPT2 355m model convergence with 2BW training #64

Open
nitikasaran68 opened this issue Apr 5, 2021 · 0 comments

I ran the pipedream2bw branch with 6 pipeline stages on 48 GPUs, and the loss went to NaN after about 16k steps. I used the following arguments:

DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"

NCCL_SOCKET_IFNAME=eth0 python -m torch.distributed.launch $DISTRIBUTED_ARGS \
    pretrain_gpt2.py \
    --tensor-model-parallel-size 1 \
    --pipeline-model-parallel-size 6 \
    --scatter-gather-tensors-in-pipeline \
    --num-layers 24 \
    --hidden-size 1024 \
    --num-attention-heads 16 \
    --seq-length 1024 \
    --max-position-embeddings 1024 \
    --micro-batch-size 4 \
    --global-batch-size 512 \
    --lr 0.00015 \
    --train-iters 500000 \
    --lr-decay-iters 320000 \
    --lr-decay-style cosine \
    --min-lr 0.00001 \
    --lr-warmup-fraction 0.01 \
    --data-path $DATA_PATH \
    --vocab-file gpt2-vocab.json \
    --merge-file gpt2-merges.txt \
    --split 949,50,1 \
    --log-interval 1 \
    --clip-grad 1.0 \
    --fp16 \
    --DDP-impl local \
    --loss-scale 16384 \
    --apply-query-key-layer-scaling \
    --bias-gelu-fusion \
    --bias-dropout-fusion \
    --exit-interval 320000 \
    --save $CHECKPOINT_PATH \
    --save-interval 300 \
    --load $CHECKPOINT_PATH \
    --max-num-ckpts 16 \
    --pipeline-no-flushes \
    --checkpoint-activations --checkpoint-num-layers 1
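
For reference, here is my own back-of-the-envelope check of the batch accounting for this run (a sketch assuming the usual Megatron-LM convention that data-parallel size = world size / (tensor-parallel size * pipeline-parallel size); all numbers come from the arguments above):

# Rough batch-accounting check for the run above (my assumptions, not project code).
world_size = 48                 # total GPUs
tensor_parallel = 1             # --tensor-model-parallel-size
pipeline_parallel = 6           # --pipeline-model-parallel-size
micro_batch_size = 4            # --micro-batch-size
global_batch_size = 512         # --global-batch-size

data_parallel = world_size // (tensor_parallel * pipeline_parallel)                # 8
micro_batches_per_step = global_batch_size // (micro_batch_size * data_parallel)   # 16

print(f"data-parallel size: {data_parallel}")
print(f"micro-batches per data-parallel rank per step: {micro_batches_per_step}")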

Am I invoking 2BW training correctly? Also, in forward_step in pretrain_gpt2.py, the loss is averaged across data-parallel workers on every micro-batch. Could these reductions be combined so the all-reduce happens only once per global batch? A rough sketch of what I mean is below.
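
To be concrete, something like the following is what I had in mind: a minimal sketch with hypothetical names (train_step, forward_backward_fn, data_parallel_group), not the actual forward_step code. The idea is to accumulate the detached scalar loss locally across micro-batches and do a single all-reduce over the data-parallel group at the end of the batch.

import torch
import torch.distributed as dist

def train_step(micro_batches, forward_backward_fn, data_parallel_group):
    # Hypothetical sketch: run forward/backward per micro-batch as usual,
    # but only accumulate the detached scalar loss locally here.
    local_loss_sum = torch.zeros(1, device=torch.cuda.current_device())
    for batch in micro_batches:
        loss = forward_backward_fn(batch)  # placeholder for the per-micro-batch step
        local_loss_sum += loss.detach()

    # One all-reduce per global batch (instead of one per micro-batch)
    # to get the loss averaged over all data-parallel ranks for logging.
    dist.all_reduce(local_loss_sum, group=data_parallel_group)
    dp_size = dist.get_world_size(group=data_parallel_group)
    return local_loss_sum / (len(micro_batches) * dp_size)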
