Speculative decoding potential for running big LLMs on consumer-grade GPUs efficiently #10466

steampunque started this conversation in Ideas
Replies: 8 comments 26 replies

5 participants