Trtllm-pytorch doesn't support n > 1 #6406

@foreverlms

Description

System Info

TensorRT-LLM version: v0.20.0

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Demo script:

from tensorrt_llm._torch import LLM
from tensorrt_llm import SamplingParams

from tensorrt_llm._torch.pyexecutor.config import PyTorchConfig


def main():
    # The model argument accepts an HF model name, a path to a local HF model,
    # or a TensorRT Model Optimizer quantized checkpoint such as nvidia/Llama-3.1-8B-Instruct-FP8 on HF.
    model_path = "Qwen/Qwen3-30B-A3B"
    # model_path = "Qwen/Qwen3-235B-A22B"
    tp = 2
    pytorch_backend_config = PyTorchConfig(disable_overlap_scheduler=True)
    llm = LLM(
        model=model_path,
        tensor_parallel_size=tp,
        moe_tensor_parallel_size=1,
        moe_expert_parallel_size=tp,
        max_num_tokens=1160,
        max_batch_size=161,
        free_gpu_memory_fraction=0.8,
        pytorch_backend_config=pytorch_backend_config,
    )

    # Sample prompts.
    prompts = [
        "Hello, my name is",
    ]

    # Create sampling params requesting n=2 samples per prompt.
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, n=2)

    # Print every sampled completion, not just the first one.
    for output in llm.generate(prompts, sampling_params):
        for i, completion in enumerate(output.outputs):
            print(f"Prompt: {output.prompt!r}, sample {i}: {completion.text!r}")


if __name__ == "__main__":
    main()

Expected behavior

Sampling 2 results for each prompt.
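
With n=2, the print loop above should emit two completions for the single prompt, roughly like this (generated text elided):

Prompt: 'Hello, my name is', sample 0: '...'
Prompt: 'Hello, my name is', sample 1: '...'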

Actual behavior

Generation fails: each of the two TP ranks prints the same assertion from the executor loop, which is unable to schedule any pending request.

Exception in thread Thread-4 (_executor_loop):
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/root/nvda/TensorRT-LLM/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 836, in _executor_loop
    assert scheduled_batch.batch_size > 0, (
AssertionError: fail to schedule any pending request, probably run out of resource.
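
With max_batch_size=161, max_num_tokens=1160, and free_gpu_memory_fraction=0.8 for a single short prompt, genuine resource exhaustion seems unlikely; the assertion only trips once n > 1 is requested, which points at the PyTorch backend's handling of multi-sample requests rather than memory pressure.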

Additional notes

None.
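
A possible interim workaround (an untested sketch, assuming only the n > 1 path is broken): submit the prompt n times with n=1, so each sample runs as an independent request. This recomputes the prompt context per sample, but only uses the same LLM and SamplingParams APIs as the repro script.

# Untested workaround sketch: emulate n=2 with repeated n=1 requests.
n = 2
prompts = ["Hello, my name is"] * n  # one request per desired sample
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, n=1)

for output in llm.generate(prompts, sampling_params):
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")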

Labels

  • Decoding<NV>: Token sampling algorithms in TRTLLM for text gen (top-k, top-p, beam)
  • LLM API<NV>: High-level LLM Python API & tools (e.g., trtllm-llmapi-launch) for TRTLLM inference/workflows
  • bug: Something isn't working
