Hello!! #2
Hi @andrea-tomassi, thank you for following my MTP work on llama.cpp and for the kind words!

Why llama.cpp over vLLM? This is a great question, and it actually goes against my grain a bit, since I've done work at Red Hat sharing vLLM with enterprises looking for inference solutions. vLLM is excellent, and I do plan on porting ATLAS to support it eventually, as it will provide meaningful speedups and allow ATLAS to serve wider use cases beyond locally hosted setups (enterprise, government agencies prioritizing on-prem deployments, etc.). However, for locally hosted machines, which are ATLAS's primary focus, llama.cpp is the most lightweight solution: it doesn't put excessive strain on an average computer.

The key issue with vLLM in this context is memory: vLLM requires loading the model into both RAM and VRAM simultaneously. ATLAS already uses around 12 GB of RAM to run the energy model (Geometric Lens), best-of-k candidate evaluation, sandbox environments, and other pipeline components. Adding vLLM's RAM copy of the model would push the minimum RAM requirement to 32 GB+, which is an increasingly difficult ask in today's market. llama.cpp's memory-mapped approach avoids this entirely.

On the Intel Qwen3-30B-A3B model: I did evaluate MoE models like Qwen3-30B-A3B. The challenge is twofold. First, to fit an adequate context window alongside a 30B model in 16 GB of VRAM, you'd need to quantize down to 2-bit, which significantly degrades output quality. Second, and this is specific to ATLAS, a dense model actually performs better for our Geometric Lens verifier. The Lens maps the model's self-embeddings through a learned cost field C(x) and metric tensor G(x) to score code quality. Dense models provide a richer, more complete picture of the embedding space, which allows G(x) to apply corrections more accurately. MoE models with sparse activations produce embeddings that only reflect a subset of the model's knowledge per token, which reduces the Lens's discriminative power.
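To make the Lens idea concrete, here is a minimal numerical sketch of scoring an embedding through a cost field C(x) and a metric tensor G(x). Everything below is an illustrative assumption on my part (the quadratic form, the reference point `mu`, the linear cost field, the shapes), not ATLAS's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Stand-ins for learned quantities: a reference point mu, a
# positive-definite metric G, and a linear cost field C(x) = w.x + b.
mu = rng.normal(size=dim)
A = rng.normal(size=(dim, dim))
G = A @ A.T + np.eye(dim)          # positive definite by construction
w, b = rng.normal(size=dim), 0.5

def energy(x: np.ndarray) -> float:
    """Score an embedding: metric distance from mu plus the cost field.

    Lower energy = the embedding sits closer to the learned
    "good code" region under the metric G.
    """
    d = x - mu
    return float(d @ G @ d + w @ x + b)

emb = rng.normal(size=dim)
print(energy(emb))
```

The point of the metric G (versus plain Euclidean distance) is that it can stretch or shrink directions of the embedding space differently, which is where a richer, denser embedding signal helps.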
On Qwen3.5-9B reliability: You're right that Qwen3.5-9B has reliability issues out of the box; I've experienced this firsthand. But this is actually where ATLAS shines. What I've found is that the model almost always understands how to solve a task correctly, but it often gets one small thing wrong: an off-by-one error, a missed edge case, a formatting issue. This is exactly why we have best-of-k candidate generation alongside the Geometric Lens verifier. The pipeline generates multiple candidates, scores them through C(x), and selects the one most likely to be correct. The small errors that make the raw model unreliable get filtered out before reaching the end user. On LiveCodeBench, the raw 9B model achieves around 66% pass@1, but with the ATLAS pipeline we're targeting 80%+ on the same tasks.

Thanks again for the interest, happy to discuss further!
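The best-of-k filtering described above can be sketched in a few lines. The generator and scorer below are toy stand-ins I made up for illustration, not ATLAS's pipeline; the shape of the idea is just "sample k candidates, keep the one the verifier likes best":

```python
import itertools
from typing import Callable

def best_of_k(generate: Callable[[], str],
              score: Callable[[str], float],
              k: int = 4) -> str:
    """Return the candidate with the lowest verifier energy."""
    candidates = [generate() for _ in range(k)]
    return min(candidates, key=score)

# Toy usage: candidates alternate between a correct snippet and one
# with an off-by-one bug; the toy scorer penalizes the bad pattern.
counter = itertools.count()

def toy_generate() -> str:
    i = next(counter)
    return "range(n)" if i % 2 == 0 else "range(n + 1)"  # off-by-one

def toy_score(c: str) -> float:
    return 1.0 if "+ 1" in c else 0.0

print(best_of_k(toy_generate, toy_score, k=4))  # selects "range(n)"
```

This is why per-sample reliability matters less than it seems: if the model is right on even a modest fraction of samples and the verifier can tell the difference, the pipeline's pass rate can sit well above the raw pass@1.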
Hi @itigges22,
I came across this project after following your work on speculative decoding for llama.cpp, and after a quick look around I wanted to share my experience and ask you a question.
I noticed you chose Qwen14B, and that really piqued my interest, because I consider it one of the best models available, perhaps today the best one that can run within 16 GB of VRAM.
The only other alternative I found for 16 GB of VRAM is this one here: Intel/Qwen3-30B-A3B-Instruct-2507-gguf-q2ks‑mixed‑AutoRound, which is incredibly competitive compared with the 14B.
Unfortunately, when applied to the same scenarios (agentic task execution), Qwen‑3.5‑9B has several issues in terms of reliability and maturity, and I have still not managed to use it successfully.
I was curious to hear about your experience in this regard, and then I wanted to ask: why don't you use vLLM with the 14B? I find it much more performant than llama.cpp, especially where it can leverage parallelism. That said, I must admit I have never enabled speculative decoding myself; do you happen to know whether there are issues in vLLM with speculative decoding and Qwen 14B?
Thanks again, and congratulations once more on your work.