Description
Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
- [x] 5. Please use English; otherwise, it will be closed.
Describe the bug
1. When loading a 27B 4-bit quantized model, why does it exhaust the 24 GB of GPU memory?
2. Why did the program crash? Is it because the GPU memory was exhausted?
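For reference, here is a rough back-of-the-envelope estimate (my own assumption for illustration, not a measurement from sglang) of what the quantized weights alone should occupy:

```python
# Rough estimate of GPU memory for a 27B-parameter model quantized to 4 bits.
# These numbers are illustrative assumptions, not measured values.
params = 27e9
bytes_per_param = 0.5  # 4-bit weights = half a byte each
weights_gb = params * bytes_per_param / 1024**3
print(f"quantized weights alone: ~{weights_gb:.1f} GB")
# → quantized weights alone: ~12.6 GB
```

On top of the weights, sglang reserves `mem_fraction_static` (auto-reduced to 0.836 here) of the GPU for weights plus KV cache, and a multimodal model also loads a vision tower, so a 24 GB card does not leave much headroom even for a 4-bit 27B model.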
```
[2025-03-29 18:56:25 TP0] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1999, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 249, in __init__
    self.tp_worker = TpWorkerClass(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 74, in __init__
    self.model_runner = ModelRunner(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 169, in __init__
    self.initialize(min_per_gpu_memory)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 179, in initialize
    self.load_model()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 392, in load_model
    self.model = get_model(
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/__init__.py", line 22, in get_model
    return loader.load_model(
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 1122, in load_model
    self._load_weights(model_config, model)
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 1053, in _load_weights
    model.load_weights(qweight_iterator)
  File "/sgl-workspace/sglang/python/sglang/srt/models/gemma3_mm.py", line 436, in load_weights
    causal_loaded_params = Gemma3ForCausalLM.load_weights(
  File "/sgl-workspace/sglang/python/sglang/srt/models/gemma3_causal.py", line 666, in load_weights
    weight_loader(param, loaded_weight, shard_id)
  File "/sgl-workspace/sglang/python/sglang/srt/layers/linear.py", line 642, in weight_loader
    assert param_data.shape == loaded_weight.shape
AssertionError
Loading safetensors checkpoint shards:   0% Completed | 0/4 [01:33<?, ?it/s]
[2025-03-29 18:56:25] Received sigquit from a child process. It usually means the child failed.
Killed
```
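Note that the traceback shows the immediate crash is the strict shape assertion in `weight_loader`, not an out-of-memory error. A minimal sketch of one plausible failure mode (the shapes below are hypothetical, not taken from the model):

```python
# Hypothetical illustration, not sglang code: bitsandbytes pre-quantized
# checkpoints (like this unsloth "-bnb-4bit" model) store 4-bit weights
# packed two values per uint8 byte, typically as an (numel // 2, 1) tensor,
# while the sharded linear layer allocates the unpacked parameter shape.
expected_shape = (5376, 4096)          # hypothetical unpacked param shape
packed_numel = (5376 * 4096) // 2      # two 4-bit values per byte
loaded_shape = (packed_numel, 1)       # packed layout in the checkpoint
# The strict equality check between the two shapes then fails, which is
# what `assert param_data.shape == loaded_weight.shape` reports above.
assert expected_shape != loaded_shape
```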
Reproduction
```
python3 -m sglang.launch_server --model-path /llm/model/google/unsloth_gemma-3-27b-it-unsloth-bnb-4bit/ --host 0.0.0.0 --port 30000 --trust-remote-code --load-format bitsandbytes --context-length 4096
```
```
INFO 03-29 18:46:53 __init__.py:190] Automatically detected platform cuda.
[2025-03-29 18:46:55] server_args=ServerArgs(model_path='/llm/model/google/unsloth_gemma-3-27b-it-unsloth-bnb-4bit/', tokenizer_path='/llm/model/google/unsloth_gemma-3-27b-it-unsloth-bnb-4bit/', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='bitsandbytes', trust_remote_code=True, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=4096, device='cuda', served_model_name='/llm/model/google/unsloth_gemma-3-27b-it-unsloth-bnb-4bit/', chat_template=None, completion_template=None, is_embedding=False, revision=None, host='0.0.0.0', port=30000, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=2048, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, stream_interval=1, stream_output=False, random_seed=585300946, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='xgrammar', speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=5, speculative_eagle_topk=4, speculative_num_draft_tokens=8, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_nccl_nvls=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_deepep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=8, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, hicache_ratio=2.0, enable_flashinfer_mla=False, enable_flashmla=False, flashinfer_mla_disable_ragged=False, warmups=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_bootstrap_port=8998)
[2025-03-29 18:46:55] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-03-29 18:46:55] The following error message 'operation scheduled before its operands' can be ignored.
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.50, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
INFO 03-29 18:46:57 __init__.py:190] Automatically detected platform cuda.
INFO 03-29 18:46:57 __init__.py:190] Automatically detected platform cuda.
[2025-03-29 18:46:58 TP0] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.50, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
[2025-03-29 18:47:00 TP0] Overlap scheduler is disabled for multimodal models.
[2025-03-29 18:47:00 TP0] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-03-29 18:47:00 TP0] Automatically reduce --mem-fraction-static to 0.836 because this is a multimodal model.
[2025-03-29 18:47:00 TP0] Init torch distributed begin.
[2025-03-29 18:47:00 TP0] Init torch distributed ends. mem usage=0.00 GB
[2025-03-29 18:47:00 TP0] Load weight begin. avail mem=22.46 GB
[2025-03-29 18:47:01 TP0] The following error message 'operation scheduled before its operands' can be ignored.
[2025-03-29 18:47:01 TP0] Loading weights with BitsAndBytes quantization. May take a while ...
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [01:52<05:36, 112.31s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [04:04<04:08, 124.16s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [06:23<02:10, 130.91s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [07:50<00:00, 113.37s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [07:50<00:00, 117.55s/it]
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
```
Environment
GPU: RTX 4090 (24 GB)
runtime env: lmsysorg/sglang:dev