
[Bug] gemma-3-27b-it-bnb-4bit crash #4897

Open · bebilli opened this issue Mar 29, 2025 · 4 comments

bebilli commented Mar 29, 2025

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

1. When loading the 27B 4-bit quantized model, why does it exhaust the 24 GB of GPU memory? (A rough estimate follows below.)
2. Why did the program crash? Is it because GPU memory was exhausted?
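
For context, here is a back-of-envelope estimate of where the 24 GB can go; all numbers are assumptions for illustration (e.g. the ~0.4B-parameter vision tower), not values read from the logs:

# Rough, illustrative estimate only; all sizes are assumptions, not measurements.
params = 27e9                          # decoder parameters
weights_gb = params * 0.5 / 1e9        # ~4 bits per weight -> ~13.5 GB for the quantized decoder
vision_gb = 0.4e9 * 2 / 1e9            # assumed ~0.4B-parameter vision tower kept in bf16 -> ~0.8 GB
print(f"weights ~{weights_gb:.1f} GB + vision ~{vision_gb:.1f} GB "
      f"before the KV-cache pool, CUDA graphs, activations and fragmentation")

On top of the weights, sglang pre-reserves most of the remaining memory for the KV-cache pool (controlled by --mem-fraction-static), so a 24 GB card can be fully committed even with a 4-bit checkpoint.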
[2025-03-29 18:56:25 TP0] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1999, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 249, in __init__
    self.tp_worker = TpWorkerClass(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 74, in __init__
    self.model_runner = ModelRunner(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 169, in __init__
    self.initialize(min_per_gpu_memory)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 179, in initialize
    self.load_model()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 392, in load_model
    self.model = get_model(
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/__init__.py", line 22, in get_model
    return loader.load_model(
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 1122, in load_model
    self._load_weights(model_config, model)
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 1053, in _load_weights
    model.load_weights(qweight_iterator)
  File "/sgl-workspace/sglang/python/sglang/srt/models/gemma3_mm.py", line 436, in load_weights
    causal_loaded_params = Gemma3ForCausalLM.load_weights(
  File "/sgl-workspace/sglang/python/sglang/srt/models/gemma3_causal.py", line 666, in load_weights
    weight_loader(param, loaded_weight, shard_id)
  File "/sgl-workspace/sglang/python/sglang/srt/layers/linear.py", line 642, in weight_loader
    assert param_data.shape == loaded_weight.shape
AssertionError

Loading safetensors checkpoint shards: 0% Completed | 0/4 [01:33<?, ?it/s]

[2025-03-29 18:56:25] Received sigquit from a child process. It usually means the child failed.
Killed
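
The assertion that aborts loading (linear.py line 642 above) requires the tensor read from the checkpoint to have exactly the shape sglang pre-allocated for that weight shard. A minimal sketch of the check, with made-up shapes rather than values from this checkpoint:

import torch

# Hypothetical shapes for illustration only.
param_data = torch.empty(5376, 2048)     # shard shape the model allocated
loaded_weight = torch.empty(5505024, 1)  # e.g. a pre-packed 4-bit tensor from the checkpoint file
try:
    assert param_data.shape == loaded_weight.shape
except AssertionError:
    print(f"shape mismatch: {tuple(param_data.shape)} vs {tuple(loaded_weight.shape)}")

One way this can trip is when the checkpoint stores weights in an already-quantized, packed layout that the loader does not expect for this model; that is a guess, not something confirmed by the logs.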

Reproduction

python3 -m sglang.launch_server --model-path /llm/model/google/unsloth_gemma-3-27b-it-unsloth-bnb-4bit/ --host 0.0.0.0 --port 30000 --trust-remote-code --load-format bitsandbytes --context-length 4096
INFO 03-29 18:46:53 __init__.py:190] Automatically detected platform cuda.
[2025-03-29 18:46:55] server_args=ServerArgs(model_path='/llm/model/google/unsloth_gemma-3-27b-it-unsloth-bnb-4bit/', tokenizer_path='/llm/model/google/unsloth_gemma-3-27b-it-unsloth-bnb-4bit/', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='bitsandbytes', trust_remote_code=True, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=4096, device='cuda', served_model_name='/llm/model/google/unsloth_gemma-3-27b-it-unsloth-bnb-4bit/', chat_template=None, completion_template=None, is_embedding=False, revision=None, host='0.0.0.0', port=30000, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=2048, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, stream_interval=1, stream_output=False, random_seed=585300946, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='xgrammar', speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=5, speculative_eagle_topk=4, speculative_num_draft_tokens=8, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_nccl_nvls=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_deepep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=8, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, hicache_ratio=2.0, enable_flashinfer_mla=False, enable_flashmla=False, flashinfer_mla_disable_ragged=False, warmups=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_bootstrap_port=8998)
[2025-03-29 18:46:55] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-03-29 18:46:55] The following error message 'operation scheduled before its operands' can be ignored.
Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.50, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
INFO 03-29 18:46:57 __init__.py:190] Automatically detected platform cuda.
INFO 03-29 18:46:57 __init__.py:190] Automatically detected platform cuda.
[2025-03-29 18:46:58 TP0] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.
Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.50, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
[2025-03-29 18:47:00 TP0] Overlap scheduler is disabled for multimodal models.
[2025-03-29 18:47:00 TP0] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-03-29 18:47:00 TP0] Automatically reduce --mem-fraction-static to 0.836 because this is a multimodal model.
[2025-03-29 18:47:00 TP0] Init torch distributed begin.
[2025-03-29 18:47:00 TP0] Init torch distributed ends. mem usage=0.00 GB
[2025-03-29 18:47:00 TP0] Load weight begin. avail mem=22.46 GB
[2025-03-29 18:47:01 TP0] The following error message 'operation scheduled before its operands' can be ignored.
[2025-03-29 18:47:01 TP0] Loading weights with BitsAndBytes quantization. May take a while ...
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [01:52<05:36, 112.31s/it]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [04:04<04:08, 124.16s/it]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [06:23<02:10, 130.91s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [07:50<00:00, 113.37s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [07:50<00:00, 117.55s/it]

Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]

Environment

GPU: RTX 4090 (24 GB)
runtime env: lmsysorg/sglang:dev (Docker)

bebilli changed the title from "[Bug] crash" to "[Bug] gemma-3-27b-it-bnb-4bit crash" on Mar 29, 2025
kebe7jun (Collaborator) commented:

It seems that you are running this in Docker. There is no CUDA out-of-memory error in the log, so it may be a host-memory (RAM) OOM instead. Can you try allocating more memory to Docker?

bebilli (Author) commented Mar 31, 2025

@kebe7jun Host memory is 64 GB, and Docker is run with --ipc host:

docker run -d -it --net host --ipc host --gpus all -v d:/llm:/llm --name sglang lmsysorg/sglang:dev

[Three screenshots: host memory usage]

root@docker-desktop:/sgl-workspace/sglang# python3 -m sglang.launch_server \
    --model-path /llm/model/google/unsloth_gemma-3-27b-it-unsloth-bnb-4bit/ \
    --dtype bfloat16 \
    --host 0.0.0.0 \
    --port 30000 \
    --kv-cache-dtype fp8_e4m3 \
    --quantization bitsandbytes \
    --trust-remote-code \
    --context-length 4096 \
    --load-format bitsandbytes \
    --mem-fraction-static 1
INFO 03-31 02:28:30 __init__.py:190] Automatically detected platform cuda.
[2025-03-31 02:28:31] server_args=ServerArgs(model_path='/llm/model/google/unsloth_gemma-3-27b-it-unsloth-bnb-4bit/', tokenizer_path='/llm/model/google/unsloth_gemma-3-27b-it-unsloth-bnb-4bit/', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='bitsandbytes', trust_remote_code=True, dtype='bfloat16', kv_cache_dtype='fp8_e4m3', quantization='bitsandbytes', quantization_param_path=None, context_length=4096, device='cuda', served_model_name='/llm/model/google/unsloth_gemma-3-27b-it-unsloth-bnb-4bit/', chat_template=None, completion_template=None, is_embedding=False, revision=None, host='0.0.0.0', port=30000, mem_fraction_static=1.0, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=2048, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, stream_interval=1, stream_output=False, random_seed=672695043, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='xgrammar', speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=5, speculative_eagle_topk=4, speculative_num_draft_tokens=8, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_nccl_nvls=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_deepep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=8, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, hicache_ratio=2.0, enable_flashinfer_mla=False, enable_flashmla=False, flashinfer_mla_disable_ragged=False, warmups=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_bootstrap_port=8998)
[2025-03-31 02:28:31] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-03-31 02:28:31] The following error message 'operation scheduled before its operands' can be ignored.
Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.50, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
INFO 03-31 02:28:33 __init__.py:190] Automatically detected platform cuda.
INFO 03-31 02:28:33 __init__.py:190] Automatically detected platform cuda.
[2025-03-31 02:28:34 TP0] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.
Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.50, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
[2025-03-31 02:28:36 TP0] Overlap scheduler is disabled for multimodal models.
[2025-03-31 02:28:36 TP0] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-03-31 02:28:36 TP0] Automatically reduce --mem-fraction-static to 0.950 because this is a multimodal model.
[2025-03-31 02:28:36 TP0] Init torch distributed begin.
[2025-03-31 02:28:36 TP0] Init torch distributed ends. mem usage=0.00 GB
[2025-03-31 02:28:36 TP0] Load weight begin. avail mem=22.46 GB
[2025-03-31 02:28:36 TP0] The following error message 'operation scheduled before its operands' can be ignored.
[2025-03-31 02:28:37 TP0] Loading weights with BitsAndBytes quantization. May take a while ...
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [01:36<04:49, 96.48s/it]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [03:14<03:14, 97.45s/it]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [04:55<01:39, 99.06s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [06:07<00:00, 88.43s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [06:07<00:00, 91.93s/it]

Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
[2025-03-31 02:36:21 TP0] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1999, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 249, in __init__
    self.tp_worker = TpWorkerClass(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 74, in __init__
    self.model_runner = ModelRunner(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 169, in __init__
    self.initialize(min_per_gpu_memory)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 179, in initialize
    self.load_model()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 392, in load_model
    self.model = get_model(
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/__init__.py", line 22, in get_model
    return loader.load_model(
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 1122, in load_model
    self._load_weights(model_config, model)
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 1053, in _load_weights
    model.load_weights(qweight_iterator)
  File "/sgl-workspace/sglang/python/sglang/srt/models/gemma3_mm.py", line 436, in load_weights
    causal_loaded_params = Gemma3ForCausalLM.load_weights(
  File "/sgl-workspace/sglang/python/sglang/srt/models/gemma3_causal.py", line 666, in load_weights
    weight_loader(param, loaded_weight, shard_id)
  File "/sgl-workspace/sglang/python/sglang/srt/layers/linear.py", line 642, in weight_loader
    assert param_data.shape == loaded_weight.shape
AssertionError

Loading safetensors checkpoint shards: 0% Completed | 0/4 [01:36<?, ?it/s]

[2025-03-31 02:36:21] Received sigquit from a child process. It usually means the child failed.
Killed

kebe7jun (Collaborator) commented:

You are showing the host's memory. You are probably running it inside a VM. How much memory is allocated to the VM? Or you can try running dmesg -T | grep -i oom in the VM to check.

kebe7jun mentioned this issue on Mar 31, 2025
bebilli (Author) commented Apr 2, 2025

@kebe7jun I'm using Docker, with the --ipc host parameter so the container shares all of the host's memory. Are you sure it is a system-memory (RAM) OOM rather than a GPU-memory problem? If it is GPU memory, would a 27B model quantized to int4 really consume all 24 GB?
Below is what the run produced; dmesg shows no relevant OOM logs:

RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

[2025-04-02 03:49:42] Received sigquit from a child process. It usually means the child failed.
Killed
root@docker-desktop:/sgl-workspace/sglang# dmesg -T | grep -i oom
root@docker-desktop:/sgl-workspace/sglang#
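
For what it's worth, a quick way to distinguish GPU OOM from host-memory OOM is to query free device memory directly; a minimal diagnostic sketch (my own, not part of sglang):

import torch

# Free/total memory on the current CUDA device, in bytes.
free, total = torch.cuda.mem_get_info()
print(f"GPU free: {free / 1e9:.2f} GB / total: {total / 1e9:.2f} GB")

Running this right before launching the server shows whether the device already has memory committed by another process.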
