Conversation

Collaborator

@kylesayrs kylesayrs commented Oct 22, 2025

Summary

To allow arbitrary heterogeneous quantization schemes, this PR switches several helpers from AutoAWQ's implementations to the observer and QDQ (quantize-dequantize) logic. AWQ no longer requires that every config_group in the quantization config use the same group_size, symmetric, and num_bits settings.
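For illustration, here is a minimal sketch of the kind of heterogeneous recipe this enables, with different num_bits, group_size, and symmetric settings per config_group. The config_groups wiring, targets, and group names are assumptions for the example (mirroring how QuantizationModifier accepts compressed-tensors schemes), not code from this PR.

```python
from compressed_tensors.quantization import (
    QuantizationArgs,
    QuantizationScheme,
    QuantizationStrategy,
    QuantizationType,
)
from llmcompressor.modifiers.awq import AWQModifier

# Hypothetical heterogeneous recipe: attention projections at int4, group-wise,
# symmetric; MLP projections at int8, channel-wise, asymmetric.
recipe = AWQModifier(
    ignore=["lm_head"],
    config_groups={
        "attn": QuantizationScheme(
            targets=["re:.*self_attn\\..*proj$"],
            weights=QuantizationArgs(
                num_bits=4,
                type=QuantizationType.INT,
                strategy=QuantizationStrategy.GROUP,
                group_size=128,
                symmetric=True,
            ),
        ),
        "mlp": QuantizationScheme(
            targets=["re:.*mlp\\..*proj$"],
            weights=QuantizationArgs(
                num_bits=8,
                type=QuantizationType.INT,
                strategy=QuantizationStrategy.CHANNEL,
                symmetric=False,
            ),
        ),
    },
)
```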

Resolves #1657

Prerequisites:

Test plan

  • When running llm-compressor/examples/awq/llama_example.py on this branch (with duo_scaling="both") and logging the best (ratio, duo_scaling) configuration, I see a good mix of Falses and Trues, i.e. a good percentage of best_scales were found with duo_scaling=False and a good percentage with duo_scaling=True. Generated model output looks good.
  • When using awq_oneshot.py (pasted below), Wikitext PPL on this branch is consistent with main for w4a16 and w4a16_asym, and better than what was reported in a previous AWQ PR, though that run may have been configured differently. For W4A16_ASYM, main and this branch both come in at 13.41. This is what we have historically used to test regressions.
    | Scheme | Wikitext PPL (RTN) | Wikitext PPL (AWQ, main) | Wikitext PPL (AWQ, this branch) |
    |---|---|---|---|
    | W4A16 | 13.784 | 13.477 | 13.426 |
    | W4A16_ASYM | 13.606 | 13.346 | 13.377 |
  • I see a small regression in recovery when running CADENCE=weekly TEST_DATA_FILE=~/projects/llm-compressor/tests/lmeval/configs/w4a16_awq_sym.yaml pytest -s ~/projects/llm-compressor/tests/lmeval/test_lmeval.py on this branch, which causes the test to fail. This persists even when using pseudo_quantize_tensor instead of call_observer/forward_quantize, as shown in this diff; I get the same result with that diff, so at least the quantization logic in CT is consistent with AutoAWQ. (A sketch of what weight pseudo-quantization does is included after the logs below.)
    Output:
<main>
2025-11-17T18:26:04.682699+0000 | _validate_recovery | INFO - ✓ exact_match,strict-match                 | Base: 0.7650 | Compressed: 0.7090 | Recovery: 92.68% ↑ | Threshold: ≥92.00%
2025-11-17T18:26:04.682811+0000 | _validate_recovery | INFO - ✓ exact_match,flexible-extract             | Base: 0.7630 | Compressed: 0.7100 | Recovery: 93.05% ↑ | Threshold: ≥93.00%
<this branch>
2025-11-17T17:55:00.648672+0000 | _validate_recovery | ERROR - ✗ exact_match,strict-match                 | Base: 0.7650 | Compressed: 0.6950 | Recovery: 90.85% ↑ | Threshold: ≥92.00%
2025-11-17T17:55:00.648967+0000 | _validate_recovery | ERROR - ✗ exact_match,flexible-extract             | Base: 0.7630 | Compressed: 0.6960 | Recovery: 91.22% ↑ | Threshold: ≥93.00%

This is already a fairly large drop in recovery; should we revisit this test?
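For context, a minimal torch-only sketch of what group-wise weight pseudo (fake) quantization does: quantize each group of weights to an integer grid, then immediately dequantize. This is an illustrative approximation with an assumed helper name and symmetric-only handling, not AutoAWQ's pseudo_quantize_tensor or the compressed-tensors observer/QDQ implementation.

```python
import torch


def pseudo_quantize_groupwise(
    weight: torch.Tensor, num_bits: int = 4, group_size: int = 128
) -> torch.Tensor:
    """Symmetric group-wise quantize-dequantize (QDQ) of a 2D weight.

    Illustrative only -- not the implementation used by this PR.
    """
    out_features, in_features = weight.shape
    w = weight.reshape(out_features, in_features // group_size, group_size)

    # symmetric scale per group: abs-max mapped to the largest positive level
    q_max = 2 ** (num_bits - 1) - 1
    scales = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-5) / q_max

    # quantize to the integer grid, then dequantize back to float
    w_q = torch.clamp(torch.round(w / scales), -q_max - 1, q_max) * scales
    return w_q.reshape(out_features, in_features)
```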

<details>
<summary>awq_oneshot.py script</summary>

```python
import os

os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

from llmcompressor import oneshot, active_session
from llmcompressor.utils import dispatch_for_generation
from llmcompressor.modifiers.awq import AWQModifier, AWQMapping
from llmcompressor.modifiers.quantization import QuantizationModifier
from compressed_tensors.quantization import (
    QuantizationArgs,
    QuantizationScheme,
    QuantizationStrategy,
    QuantizationType,
)

MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"

SAVE_DIR = MODEL_ID.split("/")[-1] + "-awq-asym"

# Configure the quantization algorithm to run.

recipe = [
    AWQModifier(
        ignore=[
            "lm_head",
            "re:.*mlp.gate$",
            "re:.*mlp.shared_expert_gate$",
            "re:visual.*",
        ],
        scheme="W4A16_ASYM",
        duo_scaling="both",
        targets=["Linear"],
        # offload_device=torch.device("cpu"),
    ),
]

# Select calibration dataset.

DATASET_ID = "mit-han-lab/pile-val-backup"
DATASET_SPLIT = "validation"

# Select number of samples. 256 samples is a good place to start.
# Increasing the number of samples can improve accuracy.

NUM_CALIBRATION_SAMPLES = 256
MAX_SEQUENCE_LENGTH = 512

def get_calib_dataset(tokenizer):
    ds = load_dataset(
        DATASET_ID,
        split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES*10}]",
    )

    def preprocess(example):
        return {"input_ids": tokenizer.encode(example["text"].strip())}

    ds = (
        ds.shuffle(seed=42)
        .map(preprocess, remove_columns=ds.column_names)
        .select(range(NUM_CALIBRATION_SAMPLES))
    )

    return ds

if __name__ == "__main__":
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", trust_remote_code=True
    )
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

    ###
    ### Apply algorithms.
    ###
    oneshot(
        model=model,
        dataset=get_calib_dataset(tokenizer),
        recipe=recipe,
        max_seq_length=MAX_SEQUENCE_LENGTH,
        num_calibration_samples=NUM_CALIBRATION_SAMPLES,
        log_dir=None,
        trust_remote_code_model=True,
    )

    # Confirm generations of the quantized model look sane.
    dispatch_for_generation(model)
    print("\n\n")
    print("========== SAMPLE GENERATION ==============")
    input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to("cuda")
    output = model.generate(input_ids, max_new_tokens=100)
    print(tokenizer.decode(output[0]))
    print("==========================================\n\n")

    # Save to disk compressed.
    model.save_pretrained(SAVE_DIR)
    tokenizer.save_pretrained(SAVE_DIR)

    ###
    ### LM EVAL
    ###

    active_session().reset()
    del model
    del tokenizer
    torch.cuda.empty_cache()

    import lm_eval
    from lm_eval.utils import make_table

    results = lm_eval.simple_evaluate(
        model="vllm",
        model_args={
            "pretrained": SAVE_DIR,
            "add_bos_token": True,
            "dtype": "bfloat16",
            "gpu_memory_utilization": 0.7,
            "max_model_len": 4096,
            # "max_num_batched_tokens": 128,
            # "max_num_seqs": 128,
        },
        tasks=["wikitext"],
        batch_size=128,
    )
    print(make_table(results))
```

</details>

Signed-off-by: Kyle Sayers <[email protected]>
@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

Signed-off-by: Brian Dellabetta <[email protected]>
@brian-dellabetta brian-dellabetta changed the title from [WIP] Generalize AWQ quantization to [AWQ] Generalize AWQ quantization on Nov 13, 2025
Collaborator Author

@kylesayrs kylesayrs left a comment


So long as you feel confident that _compute_layer_means will work as expected for all the supported strategies, this looks good to me!

module=balance_layer,
)
for balance_layer in mapping.balance_layers
if hasattr(balance_layer, "quantization_scheme")
Collaborator Author

@kylesayrs kylesayrs Nov 14, 2025


Will this function error if a balance layer doesn't have a quantization scheme?

Collaborator

@brian-dellabetta brian-dellabetta Nov 14, 2025


I updated this to skip balance_layers that don't have a quantization scheme, if that does arise. So this should be robust enough to still work when someone wants to update a mapping like input_layernorm -> q/k/v proj but does NOT want to quantize all of the q/k/v proj layers.

Collaborator Author


Nice, that seems like the most robust solution.


return w, scales, zeros
for layer in layers:
if not hasattr(layer, "weight"):
Collaborator Author


I feel like a lot of this algorithm assumes that layers have weights? Should we be silently skipping here?

Collaborator


To date, all the balance_layers are linear layers. I could just change the AWQMapping type to linear and avoid this checking logic. WDYT?

Collaborator Author


Anything's good with me; I prefer loud errors if assumptions are violated in this case.

weight = layer.weight
org_shape = weight.shape

# If group-wise, calculate abs max based on group
Collaborator Author

@kylesayrs kylesayrs Nov 14, 2025


Will this generalize to block or tensor quant? I think I personally need to get a better understanding of what "group normalization" is supposed to do and how it applies to other quant strategies.

Writing this function with torch-native vectorized ops might help; I think it might be reducible to Frobenius norm and mean/sum ops, but that doesn't have to be done now.

Collaborator


For tensor, I don't think using channel-wise means makes sense. We could validate that if need be, i.e. throw a validation error if strategy==TENSOR and duo_scaling != False.

For block, we'd have to update this logic, yeah. I'll leave it as a todo for now.
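For reference, a torch-only sketch of how the group-wise branch discussed above could be vectorized: normalize |W| by each group's abs-max, then take the mean over output channels. The helper name and the eps value are assumptions for illustration, not the code in this PR.

```python
import torch


def groupwise_normalized_channel_mean(
    weight: torch.Tensor, group_size: int
) -> torch.Tensor:
    """Mean over output channels of |W| after normalizing each quantization
    group by its abs-max. Hypothetical vectorized sketch, not the PR's helper.
    """
    out_features, in_features = weight.shape
    w = weight.abs().reshape(out_features, in_features // group_size, group_size)

    # normalize each group by its abs-max (eps guards against all-zero groups)
    w = w / (w.amax(dim=-1, keepdim=True) + 1e-6)

    # restore shape, then reduce over output channels -> one value per input channel
    return w.reshape(out_features, in_features).mean(dim=0)
```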

Signed-off-by: Brian Dellabetta <[email protected]>
Collaborator Author

@kylesayrs kylesayrs left a comment


Approve from my side

Collaborator

@fynnsu fynnsu left a comment


Looks good, added a couple comments below!

Comment on lines +706 to +709
if weight_total_sum is None:
weight_total_sum = weight_sum
else:
weight_total_sum += weight_sum
Collaborator


This seems a little strange to me. Can't we just initialize weight_total_sum = 0.0?

Comment on lines +609 to +614
history.append(loss)
if loss < best_error:
best_error = loss
best_duo_scaling = use_duo_scaling
best_ratio = ratio
best_scales = scales.clone()
Collaborator


It seems like history is currently just used for debugging when no best_ratio is found. I wonder if we could instead record the hyperparameter states and losses and print them every time (when in DEBUG logging mode).

e.g.

for grid_idx, use_duo_scaling in product(range(n_grid), duo_scalings):
    ratio = grid_idx / n_grid
    ...
    history.append({"ratio": ratio, "duo_scaling": use_duo_scaling, "error": loss})
...
logger.debug(history)

This might be useful in the future as we look into improving the hyperparameter search, to get a sense of which parameters are most often selected. I think including ratio/duo_scaling in some way is important now that we've switched from a simple linear search to a grid search, so that it's easy to tell which arguments are being set.



Development

Successfully merging this pull request may close these issues.

W4fp8 AWQ

5 participants