[Bug] custom chat template sends to model [{'type': 'text', 'text': '...'}] #10324

Open
1 task done
victorserbu2709 opened this issue Nov 14, 2024 · 2 comments · May be fixed by #10164
Labels
bug Something isn't working

Comments

@victorserbu2709

victorserbu2709 commented Nov 14, 2024

Your current environment

The output of `python collect_env.py`
Collecting environment information...

PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.12.7 (main, Oct  1 2024, 08:52:12) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-118-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA H100 80GB HBM3
GPU 1: NVIDIA H100 80GB HBM3
GPU 2: NVIDIA H100 80GB HBM3
GPU 3: NVIDIA H100 80GB HBM3
GPU 4: NVIDIA H100 80GB HBM3
GPU 5: NVIDIA H100 80GB HBM3
GPU 6: NVIDIA H100 80GB HBM3
GPU 7: NVIDIA H100 80GB HBM3

Nvidia driver version: 555.42.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] flashinfer==0.1.6+cu121torch2.4
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.77
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.45.2
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.3.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
NVIDIA_VISIBLE_DEVICES=all
NVIDIA_REQUIRE_CUDA=cuda>=12.4 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526 brand=tesla,driver>=535,driver<536 brand=unknown,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=geforce,driver>=535,driver<536 brand=geforcertx,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=titan,driver>=535,driver<536 brand=titanrtx,driver>=535,driver<536
NVIDIA_DRIVER_CAPABILITIES=compute,utility
VLLM_USAGE_SOURCE=production-docker-image
CUDA_VERSION=12.4.1
LD_LIBRARY_PATH=/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
CUDA_MODULE_LOADING=LAZY

Model Input Dumps

prompt: "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 14 Nov 2024\n\n[{'type': 'text', 'text': 'you are a helpful assistant'}]<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n[{'type': 'text', 'text': 'hello\\n'}]<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n", params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=131004, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), guided_decoding=GuidedDecodingParams(json=None, regex=None, choice=None, grammar=None, json_object=None, backend=None, whitespace_pattern=None)

🐛 Describe the bug

Hello.
I created a simple container image that contains the latest tool_chat_template_llama3.2_json.jinja:

FROM docker.io/vllm/vllm-openai:v0.6.3.post1
COPY tool_chat_template_llama3.2_json.jinja vllm-workspace/tool_chat_template_llama3.2_json.jinja
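
The image can be built and tagged roughly like this (docker or podman; the tag just has to match the image name used in the run command below):

docker build -t localhost/vllm/vllm-openai:v0.6.3.post1-tools .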

The container is started with:

localhost/vllm/vllm-openai:v0.6.3.post1-tools \
  --model neuralmagic/Llama-3.2-90B-Vision-Instruct-FP8-dynamic \
  --tensor-parallel-size 8 \
  --served-model-name "Llama3.2 90B" \
  --trust-remote-code \
  --gpu-memory-utilization 0.95 \
  --distributed-executor-backend mp \
  --enforce-eager \
  --max-num-seqs 2 \
  --limit-mm-per-prompt image=5 \
  --tool-call-parser llama3_json --chat-template /vllm-workspace/examples/tool_chat_template_llama3.2_json.jinja --enable-auto-tool-choice

The vLLM OpenAI server receives the following request:

curl -v http://localhost:8000/v1/chat/completions -H 'content-type: application/json' --data '{"stream": false, "model": "Llama3.2 90B", "messages": [{"role": "system", "content": "you are a helpful assistant"}, {"role": "user", "content": "hello\n"}]}'

but in the vLLM logs I see user<|end_header_id|>\n\n[{'type': 'text', 'text': 'hello\n'}]<|eot_id|>:

INFO 11-14 03:51:42 logger.py:37] Received request chat-585357994ead43ab8d485844b632d641: prompt: "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 14 Nov 2024\n\n[{'type': 'text', 'text': 'you are a helpful assistant'}]<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n[{'type': 'text', 'text': 'hello\\n'}]<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n", params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=131004, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), guided_decoding=GuidedDecodingParams(json=None, regex=None, choice=None, grammar=None, json_object=None, backend=None, whitespace_pattern=None), prompt_token_ids: [128000, 128006, 9125, 128007, 271, 38766, 1303, 33025, 2696, 25, 6790, 220, 2366, 18, 198, 15724, 2696, 25, 220, 975, 4723, 220, 2366, 19, 271, 58, 13922, 1337, 1232, 364, 1342, 518, 364, 1342, 1232, 364, 9514, 527, 264, 11190, 18328, 8439, 60, 128009, 128006, 882, 128007, 271, 58, 13922, 1337, 1232, 364, 1342, 518, 364, 1342, 1232, 364, 15339, 1734, 8439, 60, 128009, 128006, 78191, 128007, 271], lora_request: None, prompt_adapter_request: None.

However, if I remove only

--chat-template /vllm-workspace/examples/tool_chat_template_llama3.2_json.jinja

from the vLLM start options, the model receives the expected text (user<|end_header_id|>\n\nhello\n<|eot_id|>):

curl -v http://localhost:8000/v1/chat/completions -H 'content-type: application/json' --data '{"stream": false, "model": "Llama3.2 90B", "messages": [{"role": "system", "content": "you are a helpful assistant"}, {"role": "user", "content": "hello\n"}]}'
INFO 11-14 04:00:43 logger.py:37] Received request chat-fb75d50bb91b4eb68814b86dbe0d4833: prompt: "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 14 Nov 2024\n\n[{'type': 'text', 'text': 'you are a helpful assistant'}]<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nhello\n<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n", params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=131017, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), guided_decoding=GuidedDecodingParams(json=None, regex=None, choice=None, grammar=None, json_object=None, backend=None, whitespace_pattern=None), prompt_token_ids: [128000, 128006, 9125, 128007, 271, 38766, 1303, 33025, 2696, 25, 6790, 220, 2366, 18, 198, 15724, 2696, 25, 220, 975, 4723, 220, 2366, 19, 271, 58, 13922, 1337, 1232, 364, 1342, 518, 364, 1342, 1232, 364, 9514, 527, 264, 11190, 18328, 8439, 60, 128009, 128006, 882, 128007, 271, 15339, 198, 128009, 128006, 78191, 128007, 271], lora_request: None, prompt_adapter_request: None.
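
For comparison, the chat completions API also accepts the content as an explicit list of text parts, and with the custom template the rendered prompt above looks as if the template had received the content in exactly that form; an equivalent request in list form would be (untested sketch, not one of the requests above):

curl -v http://localhost:8000/v1/chat/completions -H 'content-type: application/json' --data '{"stream": false, "model": "Llama3.2 90B", "messages": [{"role": "system", "content": "you are a helpful assistant"}, {"role": "user", "content": [{"type": "text", "text": "hello\n"}]}]}'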

@DarkLight1337
Member

Can you try out #10164?

@victorserbu2709
Author

Thank you @DarkLight1337, it works.
