Description
System Info
GPU: A100
TensorRT-LLM version: 1.0.0rc4 (I am using the prebuilt container)
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
I get wrong results after prefill/decode disaggregation (prefill.tp_size == 2 and decode.tp_size == 1). Here are the steps to reproduce:
Launch the container:
docker run -v $PWD:/mnt -v /aisw:/aisw -e EXEC_BASH=1 --net=host -w /mnt --pid=host --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc4 /bin/bash
Launch the servers:
# tp2.yaml
attn_backend: FLASHINFER
moe_config:
  backend: CUTLASS
disable_overlap_scheduler: True
cache_transceiver_config:
  backend: ucx
  max_tokens_in_buffer: 2048
tensor_parallel_size: 2
# tp1.yaml
attn_backend: FLASHINFER
moe_config:
  backend: CUTLASS
disable_overlap_scheduler: True
cache_transceiver_config:
  backend: ucx
  max_tokens_in_buffer: 2048
# disagg_config.yaml
hostname: 0.0.0.0
port: 9095
backend: pytorch
context_servers:
  num_instances: 1
  urls:
    - "0.0.0.0:9091"
generation_servers:
  num_instances: 1
  urls:
    - "0.0.0.0:9093"
HF_MODEL_DIR="Qwen3-30B-A3B"
# prefill
export CUDA_VISIBLE_DEVICES=0,1
trtllm-serve \
$HF_MODEL_DIR \
--host 0.0.0.0 --port 9091 \
--kv_cache_free_gpu_memory_fraction 0.1 --backend pytorch \
--extra_llm_api_options ./tp2.yaml &> log_ctx_0 &
# decode
export CUDA_VISIBLE_DEVICES=2
trtllm-serve \
$HF_MODEL_DIR \
--host 0.0.0.0 --port 9093 \
--kv_cache_free_gpu_memory_fraction 0.1 --backend pytorch \
--extra_llm_api_options ./tp1.yaml &> log_gen_0 &
# disaggregated server
trtllm-serve \
disaggregated -c disagg_config.yaml &> log_proxy &
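Before sending requests, wait for all three servers to come up. A minimal readiness loop (a sketch; it assumes trtllm-serve exposes a /health endpoint on each HTTP port, as the OpenAI-compatible server does in recent releases — adjust the path if your build differs):
# wait until the context, generation, and disaggregated servers respond
for port in 9091 9093 9095; do
  until curl -sf "http://localhost:${port}/health" > /dev/null; do
    echo "waiting for server on port ${port}..."
    sleep 5
  done
done
echo "all servers ready"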
Send request to the disaggregated server:
# client
curl http://localhost:9095/v1/completions -H "Content-Type: application/json" -d '{
"model": "./Qwen3-30B-A3B",
"prompt": "Tell me a joke",
"max_tokens": 128,
"temperature": 0
}' -w "\n"
I get
{"id":"cmpl-7fb5e2d87ae0465eb4d823baceb0956e","object":"text_completion","created":1753946566,"model":"./Qwen3-30B-A3B","choices":[{"index":0,"text":" about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about","token_ids":null,"logprobs":null,"context_logits":null,"finish_reason":"length","stop_reason":null,"disaggregated_params":{"request_type":"generation_only","first_gen_tokens":[911],"ctx_request_id":2052,"encoded_opaque_state":"AQAAAAACAAAAAAAAAAIAAAAAAAAAHcMMAAAAAAAAADE3Mi4yNi40Ni45NmuWDAAAAAAAAAAxNzIuMjYuNDYuOTYBMAAAAAAAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAACAAAAAIAAAAAIAAAABAAAAAAAAAAACAAAABwAAAAACAAAA","draft_tokens":null}}],"usage":{"prompt_tokens":4,"total_tokens":132,"completion_tokens":128},"prompt_token_ids":null}
For comparison, send the same request directly to the prefill/decode server:
# client
curl http://localhost:9091/v1/completions -H "Content-Type: application/json" -d '{
"model": "./Qwen3-30B-A3B",
"prompt": "Tell me a joke",
"max_tokens": 128,
"temperature": 0
}' -w "\n"
I get
{"id":"cmpl-9b294cfc206f4f228752f47617f4d52f","object":"text_completion","created":1753948266,"model":"./Qwen3-30B-A3B","choices":[{"index":0,"text":" about a cat and a dog.\n\nOkay, I need to come up with a joke about a cat and a dog. Let me think... Jokes usually have a setup and a punchline. Maybe start with something about their typical behaviors. Cats are often seen as aloof, and dogs as friendly. Maybe play on that.\n\nWhat if the cat is doing something the dog doesn't understand? Like the cat knocking things over. The dog might try to help but mess things up. Or maybe a play on words. \"Why did the cat refuse to play with the dog? Because it was too feline.\" Wait, that's a","token_ids":null,"logprobs":null,"context_logits":null,"finish_reason":"length","stop_reason":null,"disaggregated_params":null}],"usage":{"prompt_tokens":4,"total_tokens":132,"completion_tokens":128},"prompt_token_ids":null}
The output tokens after disaggregation differ from those generated by the raw prefill/decode server, and they are clearly degenerate (the token " about" repeats until max_tokens is reached). There is likely a bug somewhere.
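To make the comparison reproducible, here is a small helper (a sketch; it assumes jq is available in the container and that both endpoints are reachable at the ports above) that sends an identical greedy request to both servers and diffs the returned texts:
#!/bin/bash
# Send the same greedy request to the disaggregated server (9095)
# and the raw prefill/decode server (9091), then diff the outputs.
PROMPT="Tell me a joke"
for port in 9095 9091; do
  curl -s "http://localhost:${port}/v1/completions" \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"./Qwen3-30B-A3B\", \"prompt\": \"${PROMPT}\", \"max_tokens\": 128, \"temperature\": 0}" \
    | jq -r '.choices[0].text' > "out_${port}.txt"
done
diff out_9095.txt out_9091.txt && echo "outputs match" || echo "outputs differ"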
In addition, the disaggregated server does generate correct tokens for SOME prompts, for example:
curl http://localhost:9095/v1/completions -H "Content-Type: application/json" -d '{
"model": "./Qwen3-30B-A3B",
"prompt": "NVIDIA is a great company because",
"max_tokens": 128,
"temperature": 0
}' -w "\n"
I get
{"id":"cmpl-e7dca8af00cb46c4b913387ac853ac09","object":"text_completion","created":1753948700,"model":"./Qwen3-30B-A3B","choices":[{"index":0,"text":" it **it is a company that has been around for a long time and has a lot of experience in the field of technology**. This experience has allowed them to build a strong reputation and a loyal customer base. Additionally, the company has a **strong brand name and a wide range of products**, which makes it a **reliable and trustworthy** choice for consumers. Furthermore, the company has a **strong financial position**, which allows them to invest in research and development, ensuring that they stay at the forefront of technological innovation. Finally, the company has a **strong presence in the global market**, which allows them to reach a wide audience","token_ids":null,"logprobs":null,"context_logits":null,"finish_reason":"length","stop_reason":null,"disaggregated_params":{"request_type":"generation_only","first_gen_tokens":[432],"ctx_request_id":2055,"encoded_opaque_state":"AQAAAAACAAAAAAAAAAIAAAAAAAAAHcMMAAAAAAAAADE3Mi4yNi40Ni45NmuWDAAAAAAAAAAxNzIuMjYuNDYuOTYBMAAAAAAAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAACAAAAAIAAAAAIAAAABAAAAAAAAAAACAAAABwAAAAACAAAA","draft_tokens":null}}],"usage":{"prompt_tokens":7,"total_tokens":135,"completion_tokens":128},"prompt_token_ids":null}
Expected behavior
Output tokens from the disaggregated server are valid (and similar to those from the raw prefill/decode servers).
actual behavior
Please refer to the description in the reproduction steps above.
additional notes
None