* Add draft model param to llama class, implement basic prompt lookup decoding draft model
* Use samplingcontext for sampling
* Use 1d array
* Use draft model for sampling
* Fix dumb mistake
* Allow for later extensions to the LlamaDraftModel api
* Cleanup
* Adaptive candidate prediction
* Update implementation to match hf transformers
* Tuning
* Fix bug where last token was not used for ngram prediction
* Remove heuristic for num_pred_tokens (no benefit)
* fix: n_candidates bug.
* Add draft_model_num_pred_tokens server setting
* Cleanup
* Update README
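The commits above add prompt lookup decoding as the first draft-model implementation. As a rough illustration of the idea only (not the library's actual code), a draft model of this kind scans the tokens seen so far for an earlier occurrence of the most recent n-gram and proposes the tokens that followed it as candidates; the function and parameter names below are illustrative:

```python
import numpy as np

def propose_draft_tokens(
    input_ids: np.ndarray,      # 1-D array of token ids seen so far
    max_ngram_size: int = 3,    # largest trailing n-gram to try to match
    num_pred_tokens: int = 10,  # number of draft tokens to propose
) -> np.ndarray:
    """Propose draft tokens by matching the trailing n-gram against earlier positions."""
    length = input_ids.shape[0]
    # Try the largest n-gram first, falling back to shorter ones if no match is found.
    for ngram_size in range(min(max_ngram_size, length - 1), 0, -1):
        ngram = input_ids[-ngram_size:]
        # Look for an earlier occurrence of the n-gram (excluding the trailing one itself).
        for start in range(length - ngram_size - 1, -1, -1):
            if np.array_equal(input_ids[start : start + ngram_size], ngram):
                begin = start + ngram_size
                end = min(begin + num_pred_tokens, length)
                if begin < end:
                    # Tokens that followed the earlier match become the draft candidates.
                    return input_ids[begin:end]
    # No match: return no candidates and fall back to ordinary decoding.
    return np.empty(0, dtype=input_ids.dtype)
```

In the library itself this logic is wrapped by the `LlamaPromptLookupDecoding` class shown in the README diff below, so callers only pass that object to `Llama` rather than invoking anything like this directly.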
README.md (+18)
@@ -378,6 +378,24 @@ Then you'll need to use a custom chat handler to load the clip model and process
 )
 ```
 
+### Speculative Decoding
+
+`llama-cpp-python` supports speculative decoding, which allows the model to generate completions based on a draft model.
+
+The fastest way to use speculative decoding is through the `LlamaPromptLookupDecoding` class.
+
+Just pass this as a draft model to the `Llama` class during initialization.
+
+```python
+from llama_cpp import Llama
+from llama_cpp.llama_speculative import LlamaPromptLookupDecoding
+
+llama = Llama(
+    model_path="path/to/model.gguf",
+    draft_model=LlamaPromptLookupDecoding(num_pred_tokens=10)  # num_pred_tokens is the number of tokens to predict. 10 is the default and is generally good for GPU; 2 performs better for CPU-only machines.
+)
+```
+
 ### Adjusting the Context Window
 
 The context window of the Llama models determines the maximum number of tokens that can be processed at once. By default, this is set to 512 tokens, but can be adjusted based on your requirements.
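As an aside on the context-window paragraph that appears as diff context above: the window is adjusted by passing `n_ctx` to the `Llama` constructor. A minimal sketch, with an illustrative model path and size:

```python
from llama_cpp import Llama

# Raise the context window from the 512-token default; the path and 2048 are illustrative values.
llm = Llama(model_path="path/to/model.gguf", n_ctx=2048)
```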
 numa: Enable NUMA support. (NOTE: The initial value of this parameter is used for the remainder of the program as this value is set in llama_backend_init)
 chat_format: String specifying the chat format to use when calling create_chat_completion.
 chat_handler: Optional chat handler to use when calling create_chat_completion.
+draft_model: Optional draft model to use for speculative decoding.
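The commits mention a `LlamaDraftModel` API intended to allow later extensions beyond prompt lookup decoding. Below is a hedged sketch of what a custom draft model could look like, assuming the interface amounts to a callable that maps the token ids generated so far to a 1-D array of candidate token ids; the exact base class and signature should be checked against the library, and all names in the sketch are illustrative.

```python
import numpy as np

from llama_cpp import Llama

class RepeatLastTokenDraftModel:
    """Toy draft model: proposes the last seen token repeated a few times.

    Purely illustrative. It assumes a draft model only needs to be callable and
    to return a 1-D array of candidate token ids; the real extension point may
    require subclassing the library's LlamaDraftModel base class instead.
    """

    def __init__(self, num_pred_tokens: int = 4):
        self.num_pred_tokens = num_pred_tokens

    def __call__(self, input_ids: np.ndarray, **kwargs) -> np.ndarray:
        if input_ids.shape[0] == 0:
            # Nothing generated yet: propose no candidates.
            return np.empty(0, dtype=np.intc)
        return np.full(self.num_pred_tokens, input_ids[-1], dtype=np.intc)

# Hypothetical usage, mirroring the README example above.
llama = Llama(
    model_path="path/to/model.gguf",
    draft_model=RepeatLastTokenDraftModel(),
)
```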