Describe the bug
I want to train the vision-language-action model OpenVLA with DeepSpeed ZeRO-3, but it cannot load the pretrained weights.
Loading checkpoint shards: 100%|██████████| 3/3 [00:26<00:00, 7.72s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [00:26<00:00, 8.99s/it]
[rank0]: Traceback (most recent call last):
[rank0]: File "/fs1/private/user/chenzengjue/hdp/openvla/vla-scripts/grpo.py", line 136, in
[rank0]: main(script_args, training_args, model_args)
[rank0]: File "/fs1/private/user/chenzengjue/hdp/openvla/vla-scripts/grpo.py", line 112, in main
[rank0]: trainer = trainer_cls(
[rank0]: ^^^^^^^^^^^^
[rank0]: File "/fs1/private/user/chenzengjue/hdp/openvla/vla-scripts/grpo_trainer.py", line 186, in __init__
[rank0]: vla = AutoModelForVision2Seq.from_pretrained(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/chenzengjue/anaconda3/envs/r1-v/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 559, in from_pretrained
[rank0]: return model_class.from_pretrained(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/chenzengjue/anaconda3/envs/r1-v/lib/python3.11/site-packages/transformers/modeling_utils.py", line 262, in _wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/chenzengjue/anaconda3/envs/r1-v/lib/python3.11/site-packages/transformers/modeling_utils.py", line 4313, in from_pretrained
[rank0]: ) = cls._load_pretrained_model(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/chenzengjue/anaconda3/envs/r1-v/lib/python3.11/site-packages/transformers/modeling_utils.py", line 4949, in _load_pretrained_model
[rank0]: raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
[rank0]: RuntimeError: Error(s) in loading state_dict for OpenVLAForActionPrediction:
[rank0]: size mismatch for vision_backbone.featurizer.blocks.0.ls1.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.0.ls2.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.1.ls1.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.1.ls2.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.2.ls1.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.2.ls2.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.3.ls1.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.3.ls2.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.4.ls1.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.4.ls2.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.5.ls1.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.5.ls2.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.6.ls1.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: ... (the same mismatch is reported for every remaining block)
To Reproduce
I rewrote the GRPO Trainer code to train the OpenVLA model with DeepSpeed ZeRO-3, and I follow the original way to load OpenVLA:
import torch
from transformers import AutoConfig, AutoImageProcessor, AutoModelForVision2Seq, AutoProcessor

from prismatic.extern.hf.configuration_prismatic import OpenVLAConfig
from prismatic.extern.hf.modeling_prismatic import OpenVLAForActionPrediction
from prismatic.extern.hf.processing_prismatic import PrismaticImageProcessor, PrismaticProcessor

# Register the OpenVLA classes with the HF AutoClasses
AutoConfig.register("openvla", OpenVLAConfig)
AutoImageProcessor.register(OpenVLAConfig, PrismaticImageProcessor)
AutoProcessor.register(OpenVLAConfig, PrismaticProcessor)
AutoModelForVision2Seq.register(OpenVLAConfig, OpenVLAForActionPrediction)

# Load OpenVLA Processor and Model using HF AutoClasses
quantization_config = None
print(f"Loading model {model}...")
processing_class = AutoProcessor.from_pretrained(
    '/home/chenzengjue/hdp/openvla/openvla-7b', trust_remote_code=True
)
vla = AutoModelForVision2Seq.from_pretrained(
    '/home/chenzengjue/hdp/openvla/openvla-7b',
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config,
    # low_cpu_mem_usage=True,
    trust_remote_code=True,
)
vla = vla.to(device_id)
The error then occurs inside AutoModelForVision2Seq.from_pretrained(). Without DeepSpeed, loading works and the error disappears.
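For context on the log above: under ZeRO-3, DeepSpeed partitions every parameter at construction time, so each rank's local copy of a parameter reports shape torch.Size([0]) until it is gathered, while the checkpoint tensor still has its full shape. The size-mismatch messages in the traceback are the generic state_dict shape check tripping over exactly that. A minimal sketch of the comparison (plain PyTorch only, no DeepSpeed required; the tensor names are illustrative):

```python
import torch

# Placeholder mimicking a ZeRO-3-partitioned parameter: locally empty.
model_param = torch.nn.Parameter(torch.empty(0))

# The corresponding scale_factor tensor as stored in the checkpoint.
ckpt_tensor = torch.empty(1024)

# This is the kind of shape comparison that produces
# "size mismatch ... torch.Size([1024]) ... torch.Size([0])".
mismatch = model_param.shape != ckpt_tensor.shape
print(mismatch)  # True
```

This suggests the vision backbone's LayerScale scale_factor parameters are not being handled by the ZeRO-3-aware loading path, which would normally gather partitioned parameters before copying checkpoint weights into them.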