
[BUG][Disaggregated] Wrong outputs when prefill/decode uses different tp_size  #6507

@ZhangGe6

Description


System Info

GPU: A100
TensorRT-LLM version: 1.0.0rc4 (I am using the prebuilt container)

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I get wrong results with prefill/decode disaggregation (prefill.tp_size == 2 and decode.tp_size == 1). Here are the steps to reproduce:

Launch the container:

docker run --rm -it --gpus=all \
    --net=host --pid=host --ipc=host \
    --ulimit memlock=-1 --ulimit stack=67108864 \
    -e EXEC_BASH=1 \
    -v $PWD:/mnt -v /aisw:/aisw -w /mnt \
    nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc4 /bin/bash

Launch the servers:

# tp2.yaml
attn_backend: FLASHINFER
moe_config:
  backend: CUTLASS
disable_overlap_scheduler: True
cache_transceiver_config:
  backend: ucx
  max_tokens_in_buffer: 2048
tensor_parallel_size: 2
# tp1.yaml
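# (identical to tp2.yaml except that tensor_parallel_size is omitted, so it defaults to 1)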
attn_backend: FLASHINFER
moe_config:
  backend: CUTLASS
disable_overlap_scheduler: True
cache_transceiver_config:
  backend: ucx
  max_tokens_in_buffer: 2048
# disagg_config.yaml
hostname: 0.0.0.0
port: 9095
backend: pytorch
context_servers:
  num_instances: 1
  urls:
      - "0.0.0.0:9091"
generation_servers:
  num_instances: 1
  urls:
      - "0.0.0.0:9093"
HF_MODEL_DIR="Qwen3-30B-A3B"

# prefill
export CUDA_VISIBLE_DEVICES=0,1
trtllm-serve \
    $HF_MODEL_DIR \
    --host 0.0.0.0 --port 9091 \
    --kv_cache_free_gpu_memory_fraction 0.1 --backend pytorch \
    --extra_llm_api_options ./tp2.yaml &> log_ctx_0 &


# decode
export CUDA_VISIBLE_DEVICES=2
trtllm-serve \
    $HF_MODEL_DIR \
    --host 0.0.0.0 --port 9093 \
    --kv_cache_free_gpu_memory_fraction 0.1 --backend pytorch \
    --extra_llm_api_options ./tp1.yaml &> log_gen_0 &

# disaggregated server
trtllm-serve \
    disaggregated -c disagg_config.yaml &> log_proxy &
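Before sending requests, wait until all three servers are up. A minimal sketch, assuming trtllm-serve exposes a /health endpoint on each port (the port numbers match the commands above):

# Wait for the prefill (9091), decode (9093), and disaggregated (9095) servers.
# /health is assumed to return HTTP 200 once a server is ready.
for port in 9091 9093 9095; do
    until curl -sf "http://localhost:${port}/health" > /dev/null; do
        echo "waiting for server on port ${port}..."
        sleep 5
    done
done
echo "all servers are up"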

Send a request to the disaggregated server:

# client
curl http://localhost:9095/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "./Qwen3-30B-A3B",
        "prompt": "Tell me a joke",
        "max_tokens": 128,
        "temperature": 0
    }' -w "\n"

I get

{"id":"cmpl-7fb5e2d87ae0465eb4d823baceb0956e","object":"text_completion","created":1753946566,"model":"./Qwen3-30B-A3B","choices":[{"index":0,"text":" about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about about","token_ids":null,"logprobs":null,"context_logits":null,"finish_reason":"length","stop_reason":null,"disaggregated_params":{"request_type":"generation_only","first_gen_tokens":[911],"ctx_request_id":2052,"encoded_opaque_state":"AQAAAAACAAAAAAAAAAIAAAAAAAAAHcMMAAAAAAAAADE3Mi4yNi40Ni45NmuWDAAAAAAAAAAxNzIuMjYuNDYuOTYBMAAAAAAAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAACAAAAAIAAAAAIAAAABAAAAAAAAAAACAAAABwAAAAACAAAA","draft_tokens":null}}],"usage":{"prompt_tokens":4,"total_tokens":132,"completion_tokens":128},"prompt_token_ids":null}

Send the same request directly to the prefill server (port 9091), bypassing the disaggregated proxy:

# client
curl http://localhost:9091/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "./Qwen3-30B-A3B",
        "prompt": "Tell me a joke",
        "max_tokens": 128,
        "temperature": 0
    }' -w "\n"

I get

{"id":"cmpl-9b294cfc206f4f228752f47617f4d52f","object":"text_completion","created":1753948266,"model":"./Qwen3-30B-A3B","choices":[{"index":0,"text":" about a cat and a dog.\n\nOkay, I need to come up with a joke about a cat and a dog. Let me think... Jokes usually have a setup and a punchline. Maybe start with something about their typical behaviors. Cats are often seen as aloof, and dogs as friendly. Maybe play on that.\n\nWhat if the cat is doing something the dog doesn't understand? Like the cat knocking things over. The dog might try to help but mess things up. Or maybe a play on words. \"Why did the cat refuse to play with the dog? Because it was too feline.\" Wait, that's a","token_ids":null,"logprobs":null,"context_logits":null,"finish_reason":"length","stop_reason":null,"disaggregated_params":null}],"usage":{"prompt_tokens":4,"total_tokens":132,"completion_tokens":128},"prompt_token_ids":null}

The output tokens after disaggregation differ from those generated by the standalone prefill/decode servers, and are degenerate (the same token repeated), so there may be a bug.
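To make the mismatch easy to reproduce and inspect, the sketch below sends the identical greedy request to both endpoints and diffs the generated text (jq is assumed to be installed; the ports and prompt are taken from the steps above):

# Send the same greedy request to the disaggregated proxy (9095) and
# directly to the prefill server (9091), then diff the generated text.
BODY='{"model": "./Qwen3-30B-A3B", "prompt": "Tell me a joke", "max_tokens": 128, "temperature": 0}'
curl -s http://localhost:9095/v1/completions -H "Content-Type: application/json" -d "$BODY" \
    | jq -r '.choices[0].text' > out_disagg.txt
curl -s http://localhost:9091/v1/completions -H "Content-Type: application/json" -d "$BODY" \
    | jq -r '.choices[0].text' > out_direct.txt
diff out_disagg.txt out_direct.txt && echo "outputs match" || echo "outputs differ"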

In addition, the disaggregated server generates correct tokens for SOME prompts, for example (see the sweep sketch after the output below):

curl http://localhost:9095/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "./Qwen3-30B-A3B",
        "prompt": "NVIDIA is a great company because",
        "max_tokens": 128,
        "temperature": 0
    }' -w "\n"

I get

{"id":"cmpl-e7dca8af00cb46c4b913387ac853ac09","object":"text_completion","created":1753948700,"model":"./Qwen3-30B-A3B","choices":[{"index":0,"text":" it **it is a company that has been around for a long time and has a lot of experience in the field of technology**. This experience has allowed them to build a strong reputation and a loyal customer base. Additionally, the company has a **strong brand name and a wide range of products**, which makes it a **reliable and trustworthy** choice for consumers. Furthermore, the company has a **strong financial position**, which allows them to invest in research and development, ensuring that they stay at the forefront of technological innovation. Finally, the company has a **strong presence in the global market**, which allows them to reach a wide audience","token_ids":null,"logprobs":null,"context_logits":null,"finish_reason":"length","stop_reason":null,"disaggregated_params":{"request_type":"generation_only","first_gen_tokens":[432],"ctx_request_id":2055,"encoded_opaque_state":"AQAAAAACAAAAAAAAAAIAAAAAAAAAHcMMAAAAAAAAADE3Mi4yNi40Ni45NmuWDAAAAAAAAAAxNzIuMjYuNDYuOTYBMAAAAAAAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAAACAAAAAgAAAAIAAACAAAAAIAAAAAIAAAABAAAAAAAAAAACAAAABwAAAAACAAAA","draft_tokens":null}}],"usage":{"prompt_tokens":7,"total_tokens":135,"completion_tokens":128},"prompt_token_ids":null}

Expected behavior

Output tokens from the disaggregated server are valid (and similar to those from the standalone prefill/decode servers).

actual behavior

Please refer to the description in reproduction steps.

additional notes

None


    Labels

    Disaggregated Serving: Deploying TRTLLM with separated, distributed components (params, kv-cache, compute). Arch & perf.
    bug: Something isn't working
