System Info

transformers version: 4.47.0.dev0

Who can help?

@amyeroberts, @qubvel

Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction
When trying to convert BLIP-2 checkpoints using the transformers conversion script (convert_blip_2_original_to_pytorch.py), the following error occurs:
Traceback (most recent call last):
File "/workspace/image_captioning/eon/transformers/src/transformers/models/blip_2/convert_blip_2_original_to_pytorch.py", line 388, in <module>
convert_blip2_checkpoint(
File "/opt/conda/envs/llava/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/workspace/image_captioning/eon/transformers/src/transformers/models/blip_2/convert_blip_2_original_to_pytorch.py", line 298, in convert_blip2_checkpoint
original_logits = original_model({"image": original_pixel_values, "text_input": [""]}).logits
AttributeError: 'dict' object has no attribute 'logits'
The error occurs because the LAVIS BLIP-2 implementation's forward method returns a plain dictionary containing only the loss, so the .logits attribute access in the conversion script fails.
...
def forward(self, samples):
    image = samples["image"]
    with self.maybe_autocast():
        image_embeds = self.ln_vision(self.visual_encoder(image))
    image_atts = torch.ones(image_embeds.size()[:-1], dtype=torch.long).to(
        image.device
    )

    query_tokens = self.query_tokens.expand(image_embeds.shape[0], -1, -1)
    query_output = self.Qformer.bert(
        query_embeds=query_tokens,
        encoder_hidden_states=image_embeds,
        encoder_attention_mask=image_atts,
        return_dict=True,
    )

    inputs_opt = self.opt_proj(query_output.last_hidden_state)
    atts_opt = torch.ones(inputs_opt.size()[:-1], dtype=torch.long).to(image.device)

    self.opt_tokenizer.padding_side = "right"

    text = [t + "\n" for t in samples["text_input"]]

    opt_tokens = self.opt_tokenizer(
        text,
        return_tensors="pt",
        padding="longest",
        truncation=True,
        max_length=self.max_txt_len,
    ).to(image.device)

    targets = opt_tokens.input_ids.masked_fill(
        opt_tokens.input_ids == self.opt_tokenizer.pad_token_id, -100
    )
    if self.prompt:
        targets[:, : self.prompt_length] = -100  # do not apply loss to the prompt

    empty_targets = (
        torch.ones(atts_opt.size(), dtype=torch.long).to(image.device).fill_(-100)
    )
    targets = torch.cat([empty_targets, targets], dim=1)

    inputs_embeds = self.opt_model.model.decoder.embed_tokens(opt_tokens.input_ids)
    inputs_embeds = torch.cat([inputs_opt, inputs_embeds], dim=1)
    attention_mask = torch.cat([atts_opt, opt_tokens.attention_mask], dim=1)

    with self.maybe_autocast():
        outputs = self.opt_model(
            inputs_embeds=inputs_embeds,
            attention_mask=attention_mask,
            return_dict=True,
            labels=targets,
        )
    loss = outputs.loss
    return {"loss": loss}
Expected behavior
The LAVIS BLIP-2 forward method should return both the loss and the logits in its output so that it is compatible with the transformers conversion script, for example:
return {
    "loss": loss,
    "logits": outputs.logits,  # currently missing in the LAVIS implementation
}
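As a stopgap until either LAVIS or the conversion script changes, one local workaround (a sketch only, assuming you can edit your LAVIS checkout) is to make the forward return an object that supports attribute access, since the conversion script reads .logits off the result rather than indexing a dict:

from types import SimpleNamespace  # add at the top of the LAVIS blip2_opt module

    # ... forward body unchanged up to the OPT call ...
    loss = outputs.loss
    # Attribute-style container: `original_model(...).loss` and `.logits`
    # both resolve, which a plain dict does not allow.
    return SimpleNamespace(loss=loss, logits=outputs.logits)

Alternatively, the conversion script could index the returned dictionary once LAVIS includes the logits, but the attribute-based return keeps line 298 of the script unchanged.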