Take an LLM, shrink the `hidden_size` of its weight matrices, and then overfit it to a short text. The result is a lightweight model with the same architecture, suitable for testing.
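The size win comes from the fact that most transformer weights scale quadratically with the hidden size. A rough back-of-the-envelope sketch (the vocab/layer/MLP numbers below are illustrative assumptions, not this repo's exact configs):

```python
# Hypothetical parameter-count estimate for a simplified decoder-only model;
# shows why shrinking hidden_size alone makes the checkpoint tiny.
def approx_params(hidden, vocab=32_000, layers=32, mlp_mult=4):
    embed = vocab * hidden                       # token embedding table
    attn = 4 * hidden * hidden                   # q, k, v, o projections
    mlp = 2 * hidden * (mlp_mult * hidden)       # up + down projections
    return embed + layers * (attn + mlp)

full = approx_params(3072)   # roughly full-width scale
tiny = approx_params(64)     # the reduced width used here
print(full // tiny)          # reduction factor on the order of 1000x
```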
Reduced models can be found in this HF ggml-org repo. Currently supported LLMs:
| Architecture | HF repo | hidden size | base (MB) | lora (MB) |
|---|---|---|---|---|
| Phi3ForCausalLM | microsoft/Phi-3-mini-4k-instruct | 64 | 20 | 12 |
| LlamaForCausalLM | meta-llama/Meta-Llama-3-8B-Instruct | 64 | 68 | 52 |
| Gemma2ForCausalLM | google/gemma-2-2b | 64 | 77 | 5 |
- Run with:

  ```shell
  make HF_REPO=<your hf model repo>
  ```
- What's happening?

  `make run` sets up the repo and then, for each `<model-name>`:

  - Fetch `<model-name>` from HF.
  - Reduce the size of the model's matrices.
  - Overfit the model to a paragraph of text (this will be the `base` model).
  - Overfit a lora adapter on top of `base` to a different paragraph of text.
  - Assert that both models are overfitted.
  - Upload these two models to `<your hf model repo>`.
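The "assert models are overfitted" step can be thought of as checking that the model reproduces its training paragraph verbatim from a short prompt. A minimal sketch of that idea, where `generate` is a hypothetical stand-in for the real model call (not this repo's API):

```python
# Hypothetical overfit check: an overfitted model should complete the opening
# words of its training paragraph with the rest of that paragraph, exactly.
def is_overfitted(generate, paragraph, prompt_len=8):
    words = paragraph.split()
    prompt = " ".join(words[:prompt_len])
    completion = generate(prompt)
    return completion.strip() == " ".join(words[prompt_len:])

# Toy stand-in that has "memorized" the paragraph:
paragraph = "the quick brown fox jumps over the lazy dog every single day"
memorized = lambda prompt: paragraph[len(prompt):]
print(is_overfitted(memorized, paragraph))  # True
```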
- This repo generated the test models in the lora-tests HF GGML repo.
- Tests run in llama.cpp/tests/test-lora-conversion-inference.sh.
- Uploading requires a user write-access token, set as the environment variable `HF_TOKEN`.
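For example (the token value below is a placeholder, not a real credential):

```shell
# Export a Hugging Face user write-access token, then run the pipeline.
export HF_TOKEN=hf_xxxxxxxxxxxx
make HF_REPO=<your hf model repo>
```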
- Environment (`poetry` required):

  ```shell
  make setup
  ```
- To run the full script for a specific model:

  ```shell
  python reduce_llms_for_testing/main.py -m "<model-name>" -hf "<your hf model repo>"
  ```