[Fix] fix quant calibration dataset #4256
Conversation
lmdeploy/cli/lite.py (Outdated)

        type=int,
        default=128,
        help='Group size for weight quantization statistics')
    parser.add_argument('--device', type=str, default='cuda', help='Device for calibrate. (cpu, cuda:0,1,2...)')
Any reason to bring in this argument?
    :param ds: language dataset to preprocess and tokenize
    :param tokenizer: tokenizer to be used for tokenization
    :param max_seq_length: maximum sequence length of samples
Please refine the docstring to comply with lmdeploy's standards.
Pull request overview
This PR refactors the quantization calibration dataset infrastructure by introducing a unified preprocessing pipeline, adding support for multiple new datasets, and improving error handling. The default calibration dataset changes from PTB to wikitext2, and the PTB dataset loader is removed entirely.
- Introduced a shared `process_dataset` helper function that handles dataset-specific preprocessing and tokenization (a rough sketch of such a dispatcher follows this list)
- Added support for 5 new calibration datasets: ultrachat_200k, gsm8k, neuralmagic_calibration, open-platypus, and openwebtext
- Updated API functions to load and pass AutoProcessor alongside AutoTokenizer for improved preprocessing capabilities
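As a rough illustration of the dispatcher idea described above (not the PR's actual code; the dataset names handled, the field names, and the tokenizer call are simplified assumptions):

```python
# Sketch of a process_dataset-style dispatcher. Hypothetical and simplified:
# field names ('question', 'answer', 'messages') and the tokenizer call are
# assumptions, not the implementation under review.
def process_dataset(ds, processor, max_seq_length):
    """Apply dataset-specific preprocessing, then tokenize."""
    name = ds.info.dataset_name
    if name == 'gsm8k':
        # Plain Q/A text: concatenate question and answer.
        texts = [s['question'] + '\n' + s['answer'] for s in ds]
    elif name == 'ultrachat_200k':
        # Chat-style data: render the message list with the chat template.
        texts = [processor.apply_chat_template(s['messages'], tokenize=False) for s in ds]
    else:
        raise NotImplementedError(f'Cannot preprocess dataset {name}')
    return processor(texts, truncation=True, max_length=max_seq_length)
```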
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 26 comments.
| File | Description |
|---|---|
| lmdeploy/lite/utils/calib_dataloader.py | Core refactor: added process_dataset helper, implemented 5 new dataset loaders with chat template support, simplified C4 loader, reworked pileval logic, removed PTB loaders, and updated get_calib_loaders to accept processor parameter |
| lmdeploy/lite/apis/calibrate.py | Changed default dataset to wikitext2, added AutoProcessor loading, expanded supported dataset list, and updated function calls to pass processor |
| lmdeploy/lite/apis/gptq.py | Changed default dataset to wikitext2, added AutoProcessor import and loading, added device parameter, and updated model initialization to use device |
| lmdeploy/lite/apis/auto_awq.py | Changed default calibration dataset from ptb to wikitext2 and updated docstring |
| lmdeploy/lite/apis/smooth_quant.py | Changed default calibration dataset from ptb to wikitext2 |
| lmdeploy/lite/quantization/awq.py | Added device parameter to max_memory_allocated calls for better multi-GPU support |
| lmdeploy/lite/quantization/calibration.py | Added device parameter to max_memory_allocated calls for better multi-GPU support |
| lmdeploy/cli/utils.py | Changed default --calib-dataset to wikitext2 and updated help text to list all supported datasets |
| lmdeploy/cli/lite.py | Added --device argument for auto_gptq and calibrate commands |
| docs/zh_cn/quantization/w4a16.md | Updated example to use wikitext2 instead of ptb |
| docs/en/quantization/w4a16.md | Updated example to use wikitext2 instead of ptb |
| """Load openwebtext train and test datasets and tokenize. | ||
| Args: | ||
| processor: Processor to apply chatplate encoding and encode text. |
Copilot AI (Jan 5, 2026)
The docstring says "chatplate" which appears to be a typo. This should likely be "chat template" to properly describe the functionality of applying a chat template for encoding.
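For example, the Args entry could be reworded along these lines (wording is a suggestion, not taken from the PR):

```python
"""Load openwebtext train and test datasets and tokenize.

Args:
    processor: Processor used to apply the chat template and encode text.
"""
```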
        lengths.append(len(ids))
        if len(samples_encode) >= max_keep:
            break
Copilot AI (Jan 5, 2026)
Potential division by zero error. If all samples in the dataset are filtered out (empty or too long), lengths will be empty and this will cause a ZeroDivisionError. Consider adding a check to ensure lengths is not empty before calculating the average.
Suggested change:

    if not lengths:
        raise ValueError(
            'No valid samples found in pileval dataset after filtering '
            '(empty or >512 tokens). Please check the dataset or adjust '
            'filtering parameters.'
        )
    # open-platypus samples have far fewer tokens than seqlen; recompute how many
    # train items to select so it can still yield enough samples after concatenation.
    lengths = torch.tensor([len(sample['input_ids']) for sample in train_data], dtype=torch.long)
    avg_tokens = lengths.sum().item() / len(train_data)
Copilot AI (Jan 5, 2026)
Potential division by zero error. If train_data is empty after filtering, this will cause a ZeroDivisionError. Consider adding a check to ensure train_data is not empty before calculating the average.
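A guard along these lines could avoid it (illustrative fragment meant to sit before the quoted lines; not the PR's code):

```python
# Hypothetical guard: fail early if open-platypus yields no usable samples.
if not train_data:
    raise ValueError('open-platypus produced no calibration samples after '
                     'filtering; check the dataset split or filter settings.')
lengths = torch.tensor([len(sample['input_ids']) for sample in train_data], dtype=torch.long)
avg_tokens = lengths.sum().item() / len(train_data)
```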
    avg_tokens = sum(lengths) / len(lengths)
    needed_samples = (seqlen * nsamples) // avg_tokens
Copilot AI (Jan 5, 2026)
Potential issue: if avg_tokens is greater than seqlen * nsamples, the integer division will result in needed_samples = 0, which means no samples will be collected. This could lead to an empty samples list and a subsequent error when trying to concatenate. Consider adding a check to ensure needed_samples is at least 1.
Suggested change:

    - avg_tokens = sum(lengths) / len(lengths)
    - needed_samples = (seqlen * nsamples) // avg_tokens
    + if lengths:
    +     avg_tokens = sum(lengths) / len(lengths)
    +     needed_samples = max(1, int((seqlen * nsamples) // avg_tokens))
    + else:
    +     # Fallback: if no valid lengths were collected, use the original nsamples.
    +     needed_samples = max(1, int(nsamples))
    # train items to select so it can still yield enough samples after concatenation.
    lengths = torch.tensor([len(sample['input_ids']) for sample in train_data], dtype=torch.long)
    avg_tokens = lengths.sum().item() / len(train_data)
    needed_samples = (seqlen * nsamples) // avg_tokens
Copilot AI (Jan 5, 2026)
Potential issue: if avg_tokens is greater than seqlen * nsamples, the integer division will result in needed_samples = 0, which means no samples will be collected. This could lead to an empty samples list and a subsequent error when trying to concatenate. Consider adding a check to ensure needed_samples is at least 1.
Suggested change:

    - needed_samples = (seqlen * nsamples) // avg_tokens
    + if avg_tokens <= 0:
    +     needed_samples = 1
    + else:
    +     needed_samples = max(1, int((seqlen * nsamples) // avg_tokens))
    presets.

    :param ds: language dataset to preprocess and tokenize
    :param tokenizer: tokenizer to be used for tokenization
Copilot AI (Jan 5, 2026)
The docstring parameter name is incorrect. The parameter is named 'processor' but the docstring refers to 'tokenizer'. This should be updated to match the actual parameter name.
Suggested change:

    - :param tokenizer: tokenizer to be used for tokenization
    + :param processor: tokenizer to be used for tokenization
    )

    else:
        raise NotImplementedError(f'Cannot preprocess dataset {ds.info.dataset_name}')
Copilot AI (Jan 5, 2026)
The error message for unsupported datasets should include all newly supported datasets. Currently, it only mentions the dataset name from ds.info.dataset_name without listing what datasets are actually supported by process_dataset.
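One possible shape for that message, sketched here (the supported-dataset list is taken from this PR's description; the variable name is an assumption):

```python
# Illustrative only: enumerate the datasets the helper actually handles.
SUPPORTED_DATASETS = ('wikitext2', 'c4', 'pileval', 'ultrachat_200k', 'gsm8k',
                      'neuralmagic_calibration', 'open-platypus', 'openwebtext')
raise NotImplementedError(
    f'Cannot preprocess dataset {ds.info.dataset_name}. '
    f'Supported datasets: {", ".join(SUPPORTED_DATASETS)}')
```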
      n_run += 1
    - if n_run == nsamples:
    + if n_run == needed_samples:
          break
Copilot AI (Jan 5, 2026)
Potential error when samples is empty. If all samples are filtered out or needed_samples is 0, torch.cat(samples, dim=1) will raise a RuntimeError. Consider adding a check to ensure samples is not empty before concatenation.
Suggested change:

          break
    + if not samples:
    +     return [], None
    def process(sample):
        return processor(
            processor.apply_chat_template(
Copilot AI (Jan 5, 2026)
The code assumes that all processors have an apply_chat_template method, but this method is typically available on tokenizers, not on all processors. If a processor doesn't have this method, it will raise an AttributeError. Consider checking if the processor has this method or falling back to the tokenizer's method.
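A defensive version might look roughly like this (sketch only; the `.tokenizer` fallback attribute and the sample field names are assumptions, not verified against the PR):

```python
# Hypothetical fallback: prefer the processor's chat template if it exists,
# otherwise use the template of its underlying tokenizer.
if hasattr(processor, 'apply_chat_template'):
    apply_template = processor.apply_chat_template
else:
    apply_template = processor.tokenizer.apply_chat_template

def process(sample):
    return processor(apply_template(sample['messages'], tokenize=False),
                     return_tensors='pt')
```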
        'content': sample['output']
    }]
    return processor(
        processor.apply_chat_template(
Copilot AI (Jan 5, 2026)
Same concern as above: this call also assumes the processor has an apply_chat_template method; consider checking for the method or falling back to the tokenizer's implementation here as well.
* [ascend] fix paged prefill
* update
* update
* init
* revert pre-commit
* add await
* lint
* update docker install.sh (dlslime==0.0.1.post10)=>(dlslime==0.0.2)
* fix type hint of endpoint_info in base
* update docker install.sh (dlslime==0.0.2)=>(dlslime==0.0.2.post1)
* fix reqs
* fix reqs

This reverts commit be3f2cc8338ff6e8d4f00ff2239d3360254fdba8.
Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and easier to receive feedback on. If you do not understand some items, don't worry; just create the pull request and seek help from the maintainers.
Motivation
Reworked calibration data loading: shared processor-based preprocessing, safer C4 loading, an adjusted pileval flow, and new loaders for ultrachat_200k, gsm8k, neuralmagic_calibration, open-platypus, and openwebtext; removed ptb.
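For context, driving the refactored loader from Python could look roughly like this (a sketch only: the exact get_calib_loaders signature and defaults are assumptions based on the summary above, not verified against the final code):

```python
# Rough usage sketch -- argument names and defaults are assumptions.
from transformers import AutoProcessor, AutoTokenizer

from lmdeploy.lite.utils.calib_dataloader import get_calib_loaders

model_path = 'internlm/internlm2_5-7b-chat'
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# 'wikitext2' is the new default calibration dataset in this PR.
calib_loader, _ = get_calib_loaders('wikitext2',
                                    tokenizer,
                                    nsamples=128,
                                    seqlen=2048,
                                    processor=processor)
```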
Modification
Use cases (Optional)
calibrate:
lmdeploy lite calibrate internlm/internlm2_5-7b-chat

auto_awq:
lmdeploy lite auto_awq internlm/internlm2_5-7b-chat

auto_gptq:
lmdeploy lite auto_gptq internlm/internlm2_5-7b-chat

smooth_quant:
lmdeploy lite smooth_quant internlm/internlm2_5-7b-chat

help messages:
lmdeploy lite calibrate -h

Checklist