
[BUG] Weight loading fails when training a vision-language-action model with DeepSpeed ZeRO-3 #7136

Open
@hahans

Describe the bug
I want to train the vision-language-action model OpenVLA with ZeRO-3, but the weights fail to load:
Loading checkpoint shards: 100%|██████████| 3/3 [00:26<00:00, 7.72s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [00:26<00:00, 8.99s/it]
[rank0]: Traceback (most recent call last):
[rank0]: File "/fs1/private/user/chenzengjue/hdp/openvla/vla-scripts/grpo.py", line 136, in <module>
[rank0]: main(script_args, training_args, model_args)
[rank0]: File "/fs1/private/user/chenzengjue/hdp/openvla/vla-scripts/grpo.py", line 112, in main
[rank0]: trainer = trainer_cls(
[rank0]: ^^^^^^^^^^^^
[rank0]: File "/fs1/private/user/chenzengjue/hdp/openvla/vla-scripts/grpo_trainer.py", line 186, in __init__
[rank0]: vla = AutoModelForVision2Seq.from_pretrained(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/chenzengjue/anaconda3/envs/r1-v/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 559, in from_pretrained
[rank0]: return model_class.from_pretrained(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/chenzengjue/anaconda3/envs/r1-v/lib/python3.11/site-packages/transformers/modeling_utils.py", line 262, in _wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/chenzengjue/anaconda3/envs/r1-v/lib/python3.11/site-packages/transformers/modeling_utils.py", line 4313, in from_pretrained
[rank0]: ) = cls._load_pretrained_model(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/chenzengjue/anaconda3/envs/r1-v/lib/python3.11/site-packages/transformers/modeling_utils.py", line 4949, in _load_pretrained_model
[rank0]: raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
[rank0]: RuntimeError: Error(s) in loading state_dict for OpenVLAForActionPrediction:
[rank0]: size mismatch for vision_backbone.featurizer.blocks.0.ls1.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.0.ls2.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.1.ls1.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.1.ls2.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.2.ls1.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.2.ls2.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.3.ls1.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.3.ls2.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.4.ls1.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.4.ls2.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.5.ls1.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.5.ls2.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.6.ls1.scale_factor:
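The mismatch list above is long and repetitive, so it can help to group it before reading further. The following is a minimal sketch, not part of the original script: the helper names `parse_shape` and `summarize_mismatches` are hypothetical, and the regex assumes the message format shown in the traceback. Every complete entry reports a checkpoint shape of `[1024]` against a model shape of `[0]`. An empty local shape is what a parameter partitioned by ZeRO-3 looks like on each rank, which is consistent with the report that the error only appears under DeepSpeed.

```python
import re
from collections import defaultdict

# Matches one "size mismatch" entry in the format printed in the traceback.
MISMATCH_RE = re.compile(
    r"size mismatch for (?P<name>[\w.]+): copying a param with shape "
    r"torch\.Size\(\[(?P<ckpt>[\d, ]*)\]\) from checkpoint, "
    r"the shape in current model is torch\.Size\(\[(?P<model>[\d, ]*)\]\)"
)


def parse_shape(text):
    """Turn '1024' or '3, 4' into a tuple of ints; an empty string becomes ()."""
    return tuple(int(part) for part in text.split(",") if part.strip())


def summarize_mismatches(log_text):
    """Group mismatched parameter names by (checkpoint shape, model shape)."""
    groups = defaultdict(list)
    for match in MISMATCH_RE.finditer(log_text):
        key = (parse_shape(match.group("ckpt")), parse_shape(match.group("model")))
        groups[key].append(match.group("name"))
    return dict(groups)


# Two entries copied from the traceback above; truncated entries are simply skipped.
sample_log = """
[rank0]: size mismatch for vision_backbone.featurizer.blocks.0.ls1.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.0.ls2.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
"""

summary = summarize_mismatches(sample_log)
for (ckpt_shape, model_shape), names in summary.items():
    print(f"checkpoint {ckpt_shape} vs model {model_shape}: {len(names)} parameter(s)")
```

Running this over the full log should show whether only the `ls1`/`ls2` `scale_factor` parameters are affected or whether other parameters mismatch as well.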

To Reproduce
I am rewriting the GRPO Trainer code and using DeepSpeed ZeRO-3 to train the OpenVLA model. I load OpenVLA the same way as the original code does:
    import torch
    from transformers import AutoConfig, AutoImageProcessor, AutoModelForVision2Seq, AutoProcessor

    # OpenVLAConfig, OpenVLAForActionPrediction, PrismaticImageProcessor and
    # PrismaticProcessor come from the OpenVLA codebase (imports not shown).
    AutoConfig.register("openvla", OpenVLAConfig)
    AutoImageProcessor.register(OpenVLAConfig, PrismaticImageProcessor)
    AutoProcessor.register(OpenVLAConfig, PrismaticProcessor)
    AutoModelForVision2Seq.register(OpenVLAConfig, OpenVLAForActionPrediction)

    # Load OpenVLA Processor and Model using HF AutoClasses
    model_path = '/home/chenzengjue/hdp/openvla/openvla-7b'
    quantization_config = None
    print(f"Loading model {model_path}...")
    processing_class = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
    vla = AutoModelForVision2Seq.from_pretrained(
        model_path,
        torch_dtype=torch.bfloat16,
        quantization_config=quantization_config,
        # low_cpu_mem_usage=True,
        trust_remote_code=True,
    )
    vla = vla.to(device_id)  # device_id is set elsewhere in the script (not shown)

The error is then raised inside AutoModelForVision2Seq.from_pretrained().

If we don't use DeepSpeed, loading works and the error disappears.
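Since the failure depends on whether DeepSpeed is active, a quick first check is whether the run really uses ZeRO stage 3. The sketch below reads the stage from a DeepSpeed config dict. The helper names `zero_stage` and `uses_zero3` and the example config are hypothetical and not taken from the original script; the only assumption about DeepSpeed is the standard `zero_optimization.stage` config key. As far as I know, Transformers offers a similar runtime check, `is_deepspeed_zero3_enabled()` in `transformers.integrations.deepspeed`, and when it reports true, `from_pretrained()` builds the model with partitioned parameters.

```python
import json


def zero_stage(ds_config):
    """Return the ZeRO stage declared in a DeepSpeed config dict (0 if absent)."""
    zero_section = ds_config.get("zero_optimization") or {}
    return int(zero_section.get("stage", 0))


def uses_zero3(ds_config):
    """True when the config asks for ZeRO stage 3 (parameter partitioning)."""
    return zero_stage(ds_config) == 3


# Hypothetical config for illustration; in practice, load the JSON file passed to the launcher.
example_config = json.loads(
    '{"bf16": {"enabled": true}, "zero_optimization": {"stage": 3}}'
)

if uses_zero3(example_config):
    print("ZeRO-3 is enabled: parameters are partitioned across ranks during model init.")
else:
    print("ZeRO-3 is not enabled: from_pretrained() loads full parameters on each rank.")
```

Rerunning the same script with a stage 2 config is one way to confirm that the size mismatch is specific to ZeRO-3 parameter partitioning rather than to DeepSpeed in general.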

Labels: bug, training