[Bug] gemma-3-27b-it-bnb-4bit crash #4897
Comments
It seems that you are using it in Docker. There is no …
@kebe7jun
root@docker-desktop:/sgl-workspace/sglang# python3 -m sglang.launch_server
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 0% Completed | 0/4 [01:36<?, ?it/s]
[2025-03-31 02:36:21] Received sigquit from a child process. It usually means the child failed.
You are showing the host's memory. You are probably executing it in the VM. How much memory is allocated to the VM? Or you can try running …
@kebe7jun I'm using Docker, with the --ipc host parameter so the container shares memory with the host. Are you sure this is a host-memory OOM rather than a GPU-memory problem? If it is GPU memory, would a 27B model quantized to int4 really consume all 24 GB?
RuntimeError: CUDA error: out of memory
[2025-04-02 03:49:42] Received sigquit from a child process. It usually means the child failed.
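A quick way to tell the two failure modes apart is to compare, inside the container, what CUDA reports against what the container (or Docker Desktop's VM) sees as system RAM. This is only a diagnostic sketch; it assumes the torch that ships in the sglang image and nothing sglang-specific:

```python
# Diagnostic sketch: distinguish GPU OOM from host/VM OOM, run inside the container.
import torch

free_b, total_b = torch.cuda.mem_get_info()  # what CUDA sees on the GPU
print(f"GPU memory: {free_b / 1e9:.1f} GB free of {total_b / 1e9:.1f} GB")

# What the container (or Docker Desktop's VM) sees as system RAM.
with open("/proc/meminfo") as f:
    meminfo = dict(line.split(":", 1) for line in f if ":" in line)
print("MemTotal:    ", meminfo["MemTotal"].strip())
print("MemAvailable:", meminfo["MemAvailable"].strip())
```

As a rough rule, a child process that ends with "Killed" and a SIGQUIT in the parent log usually points at the kernel OOM killer (host/VM RAM), while `RuntimeError: CUDA error: out of memory` points at the GPU; the logs in this thread show both at different times.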
Checklist
Describe the bug
1. When loading a 27B 4-bit quantized model, why does it exhaust the 24 GB of GPU memory?

2. Why did the program crash? Is it because GPU memory was exhausted?
[2025-03-29 18:56:25 TP0] Scheduler hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1999, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 249, in init
self.tp_worker = TpWorkerClass(
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 74, in init
self.model_runner = ModelRunner(
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 169, in init
self.initialize(min_per_gpu_memory)
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 179, in initialize
self.load_model()
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 392, in load_model
self.model = get_model(
File "/sgl-workspace/sglang/python/sglang/srt/model_loader/init.py", line 22, in get_model
return loader.load_model(
File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 1122, in load_model
self._load_weights(model_config, model)
File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 1053, in _load_weights
model.load_weights(qweight_iterator)
File "/sgl-workspace/sglang/python/sglang/srt/models/gemma3_mm.py", line 436, in load_weights
causal_loaded_params = Gemma3ForCausalLM.load_weights(
File "/sgl-workspace/sglang/python/sglang/srt/models/gemma3_causal.py", line 666, in load_weights
weight_loader(param, loaded_weight, shard_id)
File "/sgl-workspace/sglang/python/sglang/srt/layers/linear.py", line 642, in weight_loader
assert param_data.shape == loaded_weight.shape
AssertionError
Loading safetensors checkpoint shards: 0% Completed | 0/4 [01:33<?, ?it/s]
[2025-03-29 18:56:25] Received sigquit from a child process. It usually means the child failed.
Killed
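On question 1, rough arithmetic says the weights alone should fit: 27e9 parameters at ~0.5 byte each (4-bit) is about 13.5 GB, before adding the vision tower, the quantization state (absmax/scales), any layers the unsloth dynamic quantization keeps in higher precision, the KV cache for --context-length 4096, and CUDA graph buffers, so 24 GB is tight but not obviously impossible. On question 2, the crash in this traceback is not an OOM at all but a shape assertion in the linear-layer weight loader. The sketch below is purely illustrative, with placeholder dimensions and not sglang's actual code: a pre-quantized bitsandbytes 4-bit checkpoint stores packed uint8 tensors (two 4-bit values per byte), so the tensor read from disk no longer matches the bf16-shaped shard the loader pre-allocated, which is exactly the condition that makes `assert param_data.shape == loaded_weight.shape` fire.

```python
# Illustrative sketch only -- placeholder shapes, not sglang's real loader code.
import torch

out_features, in_features = 4096, 1024  # placeholder dims for one linear shard

# Shard the runtime pre-allocates, laid out for an unquantized/bf16 weight:
param_data = torch.empty(out_features, in_features, dtype=torch.bfloat16)

# What a pre-packed bnb-4bit checkpoint actually stores for the same weight:
# a flat uint8 blob holding two 4-bit values per byte.
loaded_weight = torch.empty(out_features * in_features // 2, 1, dtype=torch.uint8)

if param_data.shape != loaded_weight.shape:
    # This mismatch is what raises AssertionError in layers/linear.py::weight_loader.
    print("shape mismatch:", tuple(param_data.shape), "vs", tuple(loaded_weight.shape))
```

Whether sglang's bitsandbytes load path is expected to handle this particular unsloth pre-quantized layout is precisely what the assertion failure puts in question.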
Reproduction
python3 -m sglang.launch_server --model-path /llm/model/google/unsloth_gemma-3-27b-it-unsloth-bnb-4bit/ --host 0.0.0.0 --port 30000 --trust-remote-code --load-format bitsandbytes --context-length 4096
INFO 03-29 18:46:53 __init__.py:190] Automatically detected platform cuda.
[2025-03-29 18:46:55] server_args=ServerArgs(model_path='/llm/model/google/unsloth_gemma-3-27b-it-unsloth-bnb-4bit/', tokenizer_path='/llm/model/google/unsloth_gemma-3-27b-it-unsloth-bnb-4bit/', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='bitsandbytes', trust_remote_code=True, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=4096, device='cuda', served_model_name='/llm/model/google/unsloth_gemma-3-27b-it-unsloth-bnb-4bit/', chat_template=None, completion_template=None, is_embedding=False, revision=None, host='0.0.0.0', port=30000, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=2048, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, stream_interval=1, stream_output=False, random_seed=585300946, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='xgrammar', speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=5, speculative_eagle_topk=4, speculative_num_draft_tokens=8, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_nccl_nvls=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_deepep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=8, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, hicache_ratio=2.0, enable_flashinfer_mla=False, enable_flashmla=False, flashinfer_mla_disable_ragged=False, warmups=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_bootstrap_port=8998)
[2025-03-29 18:46:55] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-03-29 18:46:55] The following error message 'operation scheduled before its operands' can be ignored.
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.50, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
INFO 03-29 18:46:57 __init__.py:190] Automatically detected platform cuda.
INFO 03-29 18:46:57 __init__.py:190] Automatically detected platform cuda.
[2025-03-29 18:46:58 TP0] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.50, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
[2025-03-29 18:47:00 TP0] Overlap scheduler is disabled for multimodal models.
[2025-03-29 18:47:00 TP0] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-03-29 18:47:00 TP0] Automatically reduce --mem-fraction-static to 0.836 because this is a multimodal model.
[2025-03-29 18:47:00 TP0] Init torch distributed begin.
[2025-03-29 18:47:00 TP0] Init torch distributed ends. mem usage=0.00 GB
[2025-03-29 18:47:00 TP0] Load weight begin. avail mem=22.46 GB
[2025-03-29 18:47:01 TP0] The following error message 'operation scheduled before its operands' can be ignored.
[2025-03-29 18:47:01 TP0] Loading weights with BitsAndBytes quantization. May take a while ...
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [01:52<05:36, 112.31s/it]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [04:04<04:08, 124.16s/it]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [06:23<02:10, 130.91s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [07:50<00:00, 113.37s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [07:50<00:00, 117.55s/it]
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Environment
GPU: NVIDIA RTX 4090 (24 GB)
runtime env: lmsysorg/sglang:dev