Status: Open
Labels: Decoding, LLM API, bug
Description
System Info
TensorRT-LLM version: v0.20.0
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Demo script:
from tensorrt_llm._torch import LLM
from tensorrt_llm import SamplingParams
from tensorrt_llm._torch.pyexecutor.config import PyTorchConfig

def main():
    # The model argument accepts an HF model name, a path to a local HF model,
    # or TensorRT Model Optimizer quantized checkpoints such as
    # nvidia/Llama-3.1-8B-Instruct-FP8 on HF.
    model_path = "Qwen/Qwen3-30B-A3B"
    # model_path = "Qwen/Qwen3-235B-A22B"
    tp = 2
    pytorch_backend_config = PyTorchConfig(disable_overlap_scheduler=True)
    llm = LLM(
        model=model_path,
        tensor_parallel_size=tp,
        moe_tensor_parallel_size=1,
        moe_expert_parallel_size=tp,
        max_num_tokens=1160,
        max_batch_size=161,
        free_gpu_memory_fraction=0.8,
        pytorch_backend_config=pytorch_backend_config,
    )

    # Sample prompts.
    prompts = [
        "Hello, my name is",
    ]

    # Sampling params requesting 2 samples per prompt (n=2 triggers the bug).
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, n=2)
    for output in llm.generate(prompts, sampling_params):
        print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")

if __name__ == "__main__":
    main()
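For comparison while triaging, the n parameter can be avoided entirely by running two independent n=1 passes over the same prompts. A minimal sketch of that hypothetical workaround, reusing the llm and prompts defined above; note it is not verified to match the intended n=2 semantics (shared prompt KV cache, distinct continuations):

# Hypothetical workaround sketch: emulate n=2 with two independent n=1 passes.
# Uses only the APIs already imported in the repro script above.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, n=1)
for sample_idx in range(2):
    for output in llm.generate(prompts, sampling_params):
        print(f"Sample {sample_idx}: {output.outputs[0].text!r}")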
Expected behavior
Sampling 2 results for each prompt.
Actual behavior
The executor loop crashes with an assertion failure. The same traceback is raised by both TP ranks and interleaved in the raw log; a single deduplicated copy follows:
Exception in thread Thread-4 (_executor_loop):
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/root/nvda/TensorRT-LLM/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 836, in _executor_loop
    assert scheduled_batch.batch_size > 0, (
AssertionError: fail to schedule any pending request, probably run out of resource.
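One plausible reading of the assertion, under the assumption that a request with n > 1 is expanded into n sequences before capacity scheduling: the scheduler must then reserve roughly n times the token budget per request, so limits tuned for n=1 (here max_num_tokens=1160) can leave no request schedulable at all. A back-of-the-envelope sketch of that arithmetic; every name below is illustrative, none are TensorRT-LLM internals:

# Hypothetical capacity check showing how n > 1 can starve the scheduler.
def can_schedule(prompt_len: int, max_new_tokens: int, n: int,
                 free_token_budget: int) -> bool:
    # With n samples per prompt, the request needs room for n full sequences.
    needed = n * (prompt_len + max_new_tokens)
    return needed <= free_token_budget

# A budget that admits the request at n=1 but rejects it at n=2:
print(can_schedule(prompt_len=6, max_new_tokens=1024, n=1, free_token_budget=1160))  # True
print(can_schedule(prompt_len=6, max_new_tokens=1024, n=2, free_token_budget=1160))  # False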
Additional notes
None.
Assignees: varuniyer