Describe the bug
I want to train the vision-language-action model OpenVLA with DeepSpeed ZeRO-3, but it fails to load the pretrained weights.
Loading checkpoint shards: 100%|██████████| 3/3 [00:26<00:00, 7.72s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [00:26<00:00, 8.99s/it]
[rank0]: Traceback (most recent call last):
[rank0]: File "/fs1/private/user/chenzengjue/hdp/openvla/vla-scripts/grpo.py", line 136, in
[rank0]: main(script_args, training_args, model_args)
[rank0]: File "/fs1/private/user/chenzengjue/hdp/openvla/vla-scripts/grpo.py", line 112, in main
[rank0]: trainer = trainer_cls(
[rank0]: ^^^^^^^^^^^^
[rank0]: File "/fs1/private/user/chenzengjue/hdp/openvla/vla-scripts/grpo_trainer.py", line 186, in __init__
[rank0]: vla = AutoModelForVision2Seq.from_pretrained(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/chenzengjue/anaconda3/envs/r1-v/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 559, in from_pretrained
[rank0]: return model_class.from_pretrained(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/chenzengjue/anaconda3/envs/r1-v/lib/python3.11/site-packages/transformers/modeling_utils.py", line 262, in _wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/chenzengjue/anaconda3/envs/r1-v/lib/python3.11/site-packages/transformers/modeling_utils.py", line 4313, in from_pretrained
[rank0]: ) = cls._load_pretrained_model(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/chenzengjue/anaconda3/envs/r1-v/lib/python3.11/site-packages/transformers/modeling_utils.py", line 4949, in _load_pretrained_model
[rank0]: raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
[rank0]: RuntimeError: Error(s) in loading state_dict for OpenVLAForActionPrediction:
[rank0]: size mismatch for vision_backbone.featurizer.blocks.0.ls1.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.0.ls2.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.1.ls1.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.1.ls2.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.2.ls1.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.2.ls2.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.3.ls1.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.3.ls2.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.4.ls1.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.4.ls2.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.5.ls1.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.5.ls2.scale_factor: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for vision_backbone.featurizer.blocks.6.ls1.scale_factor:
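The `torch.Size([0])` shapes are the signature of DeepSpeed ZeRO-3 parameter partitioning: under `zero.Init`, each rank keeps only an empty placeholder for parameters it does not own, and any loading path that compares shapes directly (rather than going through the DeepSpeed-aware loader in `transformers`) rejects every partitioned parameter. A toy sketch of this failure mode, with no DeepSpeed dependency (the key name mirrors the traceback; the values are made up):

```python
# Toy reproduction of the shape check that fails above. Under ZeRO-3,
# a partitioned parameter's local tensor is empty (shape [0]), while the
# checkpoint still holds the full [1024] vector.
checkpoint = {
    "vision_backbone.featurizer.blocks.0.ls1.scale_factor": [0.0] * 1024,
}
# What each rank sees locally after zero.Init partitioning: an empty placeholder.
partitioned_model = {
    "vision_backbone.featurizer.blocks.0.ls1.scale_factor": [],
}

errors = []
for name, ckpt_param in checkpoint.items():
    local_param = partitioned_model[name]
    if len(local_param) != len(ckpt_param):  # naive shape comparison
        errors.append(
            f"size mismatch for {name}: copying a param with shape "
            f"[{len(ckpt_param)}] from checkpoint, the shape in current "
            f"model is [{len(local_param)}]."
        )

print(errors[0])
```

A DeepSpeed-aware loader would instead detect the zero-size placeholder and copy the checkpoint tensor through DeepSpeed's gather/scatter machinery; the custom OpenVLA model class registered via `trust_remote_code` apparently does not hit that path for these `scale_factor` parameters.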
To Reproduce
I rewrote the GRPO Trainer code and used DeepSpeed ZeRO-3 to train the OpenVLA model, following the original way of loading OpenVLA:
```python
AutoConfig.register("openvla", OpenVLAConfig)
AutoImageProcessor.register(OpenVLAConfig, PrismaticImageProcessor)
AutoProcessor.register(OpenVLAConfig, PrismaticProcessor)
AutoModelForVision2Seq.register(OpenVLAConfig, OpenVLAForActionPrediction)

# Load OpenVLA Processor and Model using HF AutoClasses
quantization_config = None
print(f"Loading model {model}...")
processing_class = AutoProcessor.from_pretrained(
    '/home/chenzengjue/hdp/openvla/openvla-7b', trust_remote_code=True
)
vla = AutoModelForVision2Seq.from_pretrained(
    '/home/chenzengjue/hdp/openvla/openvla-7b',
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config,
    # low_cpu_mem_usage=True,
    trust_remote_code=True,
)
vla = vla.to(device_id)
```
The error then occurs inside AutoModelForVision2Seq.from_pretrained(). Without DeepSpeed, loading works and the error disappears.
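One common workaround (assuming the root cause is ZeRO-3's `zero.Init` parameter partitioning clashing with the custom `trust_remote_code` loading path) is to fall back to ZeRO-2, which shards optimizer state and gradients but leaves parameters whole, so `from_pretrained()` sees full-size tensors. A minimal config sketch with hypothetical values:

```json
{
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "bf16": { "enabled": true },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
```

If ZeRO-3 is required for memory reasons, the model would instead need to be loaded through transformers' DeepSpeed-aware path (with the ZeRO-3 config visible to transformers before `from_pretrained()` is called), which the custom registered OpenVLA model class may not fully support.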