
[BUG] Loading weights fails when using DeepSpeed ZeRO-3 to train a vision-language-action model #7136

Open
hahans opened this issue Mar 14, 2025 · 0 comments
Labels
bug Something isn't working training

Comments

hahans commented Mar 14, 2025

Describe the bug
I want to train the vision-language-action model OpenVLA with ZeRO-3, but it cannot load the weights.
Loading checkpoint shards: 100%|██████████| 3/3 [00:26<00:00, 7.72s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [00:26<00:00, 8.99s/it]
[rank0]: Traceback (most recent call last):
[rank0]: File "/fs1/private/user/chenzengjue/hdp/openvla/vla-scripts/grpo.py", line 136, in
[rank0]: main(script_args, training_args, model_args)
[rank0]: File "/fs1/private/user/chenzengjue/hdp/openvla/vla-scripts/grpo.py", line 112, in main
[rank0]: trainer = trainer_cls(
[rank0]: ^^^^^^^^^^^^
[rank0]: File "/fs1/private/user/chenzengjue/hdp/openvla/vla-scripts/grpo_trainer.py", line 186, in __init__
[rank0]: vla = AutoModelForVision2Seq.from_pretrained(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/chenzengjue/anaconda3/envs/r1-v/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 559, in from_pretrained
[rank0]: return model_class.from_pretrained(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/chenzengjue/anaconda3/envs/r1-v/lib/python3.11/site-packages/transformers/modeling_utils.py", line 262, in _wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/chenzengjue/anaconda3/envs/r1-v/lib/python3.11/site-packages/transformers/modeling_utils.py", line 4313, in from_pretrained
[rank0]: ) = cls._load_pretrained_model(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/chenzengjue/anaconda3/envs/r1-v/lib/python3.11/site-packages/transformers/modeling_utils.py", line 4949, in _load_pretrained_model
[rank0]: raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
[rank0]: RuntimeError: Error(s) in loading state_dict for OpenVLAForActionPrediction:
[rank0]: size mismatch for vision_backbone.featurizer.blocks.0.ls1.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.0.ls2.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.1.ls1.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.1.ls2.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.2.ls1.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.2.ls2.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.3.ls1.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.3.ls2.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.4.ls1.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.4.ls2.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.5.ls1.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.5.ls2.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.6.ls1.scale_factor:

To Reproduce
I rewrote the GRPO Trainer code and use DeepSpeed ZeRO-3 to train the OpenVLA model, following the original way to load OpenVLA:
AutoConfig.register("openvla", OpenVLAConfig)
AutoImageProcessor.register(OpenVLAConfig, PrismaticImageProcessor)
AutoProcessor.register(OpenVLAConfig, PrismaticProcessor)
AutoModelForVision2Seq.register(OpenVLAConfig, OpenVLAForActionPrediction)

# Load OpenVLA Processor and Model using HF AutoClasses
quantization_config = None
print(f"Loading model {model}...")
processing_class = AutoProcessor.from_pretrained('/home/chenzengjue/hdp/openvla/openvla-7b', trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    '/home/chenzengjue/hdp/openvla/openvla-7b',
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config,
    # low_cpu_mem_usage=True,
    trust_remote_code=True,
)
vla = vla.to(device_id)

AutoModelForVision2Seq.from_pretrained() then raises the error above.

If we don't use DeepSpeed, it works and the error disappears.
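The torch.Size([0]) shapes in the traceback are characteristic of ZeRO-3's zero.Init: when ZeRO-3 is enabled, transformers constructs model parameters as empty, partitioned placeholders, and a custom model class whose parameters are created outside the standard path (like the OpenVLA vision backbone's LayerScale factors here) can then fail the shape check in _load_pretrained_model. As a hedged workaround (not verified on OpenVLA), disabling ZeRO-3 weight init in the launcher config may avoid the empty-shape mismatch, at the cost of each rank materializing the full weights in CPU memory before DeepSpeed partitions them:

```yaml
# Accelerate config fragment (assumption: the script is launched via `accelerate launch`).
# zero3_init_flag: false skips deepspeed.zero.Init() during from_pretrained(),
# so parameters are loaded at full size and only partitioned afterwards.
deepspeed_config:
  zero_stage: 3
  zero3_init_flag: false
```

If launching with `deepspeed` directly instead of Accelerate, the equivalent is to avoid creating the HfDeepSpeedConfig (which is what turns zero.Init on) before calling from_pretrained().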

@hahans hahans added bug Something isn't working training labels Mar 14, 2025