Can someone help me understand llama.cpp modifications needed to support Llama3-based LLM2Vec?
Background
We have a project that uses Llama3-based LLM2Vec as the text embedding model for a text-to-content model. The model was trained with the original Llama3-based LLM2Vec on HF, but we would like to simplify deployment by running LLM2Vec with llama.cpp.
AFAIK, LLM2Vec is just a finetune on top of the original LLM architecture (Llama3 in our case), but with bidirectional attention. Since llama.cpp already supports Llama3, I am hoping that running Llama3-based LLM2Vec with llama.cpp is not too much work.
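To make the causal vs. bidirectional distinction concrete, here is a minimal pure-Python sketch of what the attention mask changes. It is only an illustration of the general technique, not code from llama.cpp or LLM2Vec; the function name `attention_weights` and the uniform toy scores are assumptions made up for this example.

```python
import math

def attention_weights(scores, causal):
    """Row-wise softmax over raw attention scores, optionally with a causal mask.

    scores[i][j] is the (hypothetical) raw score of query token i against key token j.
    """
    weights = []
    for i, row in enumerate(scores):
        # Under a causal mask, token i may only attend to positions 0..i.
        # Without it (bidirectional), every position stays visible.
        visible = [s if (not causal or j <= i) else -math.inf
                   for j, s in enumerate(row)]
        peak = max(visible)  # subtract the row max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Toy input: 4 tokens with identical scores, so only the mask shapes the result.
scores = [[0.0] * 4 for _ in range(4)]
causal_w = attention_weights(scores, causal=True)
bidir_w = attention_weights(scores, causal=False)

for name, w in (("causal", causal_w), ("bidirectional", bidir_w)):
    print(name)
    for row in w:
        print("  ", [round(x, 3) for x in row])
```

With the causal mask, the first token can only see itself, while in the bidirectional case every token attends to the whole sequence. A model finetuned under one regime and run under the other sees a different context for almost every token.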
What has been done so far
Merged the LLM2Vec finetuned parameters into the original Meta-Llama-3-8B-Instruct and saved the result as an HF model
I confirmed that this merged model outputs exactly the same results as the original non-merged model
Converted the merged model to GGUF with a modified convert_hf_to_gguf.py, as shown below, to set the flag for non-causal attention
@Model.register("LlamaBiModel")
class LlamaBiModel(LlamaModel):
    model_arch = gguf.MODEL_ARCH.LLAMA

    def set_gguf_parameters(self):
        super().set_gguf_parameters()
        self.gguf_writer.add_causal_attention(False)
Made LLM_ARCH_LLAMA aware of the optional causal attention flag in llama-model.cpp
Observations
The embeddings produced by the setup above seem to be garbage
Interestingly, when I turn causal attention back on, the model performs close to expectations, although it loses some of the detail in the text prompts
With or without causal attention, the embeddings differ significantly, numerically, from those of the original HF model
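To put a number on "significantly off", one option is to compare the two embeddings for the same prompt directly. The sketch below is a hypothetical helper in plain Python; the names `compare_embeddings`, `hf_embedding` and `gguf_embedding` are made up here, and the vectors are placeholders rather than real model output.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def max_abs_diff(a, b):
    """Largest element-wise absolute difference."""
    return max(abs(x - y) for x, y in zip(a, b))

def compare_embeddings(reference, candidate):
    """Summarize how far a candidate embedding is from a reference embedding."""
    if len(reference) != len(candidate):
        raise ValueError("embeddings must have the same dimension")
    return {
        "cosine": cosine_similarity(reference, candidate),
        "max_abs_diff": max_abs_diff(reference, candidate),
    }

# Placeholder vectors standing in for one prompt embedded by each backend.
hf_embedding = [0.10, -0.20, 0.30, 0.40]
gguf_embedding = [0.11, -0.19, 0.29, 0.41]

report = compare_embeddings(hf_embedding, gguf_embedding)
print(report)
```

Running the same comparison with causal attention on and off, and against an unquantized GGUF conversion, would help separate errors caused by the attention mask from errors caused by conversion or quantization.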
Please let me know what I am missing here, or any ideas to try.