
[Bug] Missing Torch and GGUF dependency in v0.4.4.post3 in Docker image #4900

Open · 5 tasks done

davidsyoung opened this issue Mar 29, 2025 · 1 comment
Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the issue you submit lacks environment info and a minimal reproducible demo, it will be difficult for us to reproduce and resolve it, which reduces the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English; otherwise the issue will be closed.

Describe the bug

When attempting to load DeepSeek R1 with GGUF, I get this error:



==================================
== Triton Inference Server Base ==
==================================

NVIDIA Release 24.04 (build 90085237)

Copyright (c) 2018-2023, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

[2025-03-29 16:22:37] Fail to set RLIMIT_NOFILE: current limit exceeds maximum limit
[2025-03-29 16:22:37] server_args=ServerArgs(model_path='/models/dp-config/DeepSeek-R1-Q3_K_M.gguf', tokenizer_path='/models/dp-config/DeepSeek-R1-Q3_K_M.gguf', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='gguf', trust_remote_code=True, dtype='half', kv_cache_dtype='auto', quantization='gguf', quantization_param_path=None, context_length=2048, device='cuda', served_model_name='/models/dp-config/DeepSeek-R1-Q3_K_M.gguf', chat_template=None, completion_template=None, is_embedding=False, revision=None, host='127.0.0.1', port=30000, mem_fraction_static=0.995, max_running_requests=1, max_total_tokens=2048, chunked_prefill_size=2048, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=16, stream_interval=1, stream_output=False, random_seed=955017735, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='xgrammar', speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=5, speculative_eagle_topk=4, speculative_num_draft_tokens=8, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_cuda_graph=True, disable_cuda_graph_padding=False, enable_nccl_nvls=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_deepep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=80, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, hicache_ratio=2.0, enable_flashinfer_mla=True, enable_flashmla=False, flashinfer_mla_disable_ragged=False, warmups=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_bootstrap_port=8998)
Loading a GGUF checkpoint in PyTorch, requires both PyTorch and GGUF>=0.10.0 to be installed. Please see https://pytorch.org/ and https://github.com/ggerganov/llama.cpp/tree/master/gguf-py for installation instructions.
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/sgl-workspace/sglang/python/sglang/launch_server.py", line 14, in <module>
    launch_server(server_args)
  File "/sgl-workspace/sglang/python/sglang/srt/entrypoints/http_server.py", line 679, in launch_server
    tokenizer_manager, scheduler_info = _launch_subprocesses(server_args=server_args)
  File "/sgl-workspace/sglang/python/sglang/srt/entrypoints/engine.py", line 546, in _launch_subprocesses
    tokenizer_manager = TokenizerManager(server_args, port_args)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tokenizer_manager.py", line 159, in __init__
    self.model_config = ModelConfig(
  File "/sgl-workspace/sglang/python/sglang/srt/configs/model_config.py", line 60, in __init__
    self.hf_config = get_config(
  File "/sgl-workspace/sglang/python/sglang/srt/hf_transformers_utils.py", line 75, in get_config
    config = AutoConfig.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 1096, in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 594, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 685, in _get_config_dict
    config_dict = load_gguf_checkpoint(resolved_config_file, return_tensors=False)["config"]
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_gguf_pytorch_utils.py", line 360, in load_gguf_checkpoint
    raise ImportError("Please install torch and gguf>=0.10.0 to load a GGUF checkpoint in PyTorch.")
ImportError: Please install torch and gguf>=0.10.0 to load a GGUF checkpoint in PyTorch.
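
A possible interim workaround, assuming torch is already bundled in the sglang image and only gguf is missing from it: install the package inside the container (or a derived image) before launching the server.

# Hypothetical workaround sketch: add the missing gguf package to the
# container. The version bound comes from the error message above.
pip install "gguf>=0.10.0"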

Reproduction

Latest docker image
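
For completeness, a launch sketch reconstructed from the server_args in the log above; the image tag, GPU flags, and host mount path are assumptions:

# Reconstructed from server_args above; image tag and /models mount are assumptions.
docker run --gpus all -v /models:/models lmsysorg/sglang:v0.4.4.post3 \
  python3 -m sglang.launch_server \
    --model-path /models/dp-config/DeepSeek-R1-Q3_K_M.gguf \
    --load-format gguf --quantization gguf --trust-remote-code \
    --tp 16 --context-length 2048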

Environment

N/A

@zhyncs (Member) commented Mar 29, 2025

We should sort out the dependencies in pyproject and classify them by category (vlm, quant, etc.), as sketched below. Currently, because gguf doesn't appear in our daily test cases, we haven't tested whether gguf works with DeepSeek V3. @mickqian @yizhang2077
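
A minimal sketch of how that grouping could look in pyproject.toml; the group names and version pins are illustrative assumptions, not sglang's actual layout:

# Hypothetical [project.optional-dependencies] groups, making GGUF support
# an explicit install target (pip install "sglang[gguf]") that CI can exercise.
[project.optional-dependencies]
gguf = ["gguf>=0.10.0"]
quant = ["torchao"]
vlm = ["pillow"]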
