⚠️ Important: This is a beta preview of the vLLM Neuron plugin. For a more stable experience, consider using the AWS Neuron vLLM fork as described in the NxDI vLLM User Guide.
The vLLM Neuron plugin (vllm-neuron) is a vLLM extension that integrates AWS Neuron support for Trainium and Inferentia accelerators into vLLM. Built on vLLM's Plugin System, it lets existing vLLM workflows run and be optimized on AWS Neuron hardware.
- vLLM v0.10.2 (installed automatically as part of the plugin installation; refer to the Quickstart Guide below)
- AWS Neuron SDK 2.26 (Release Notes)
- Python 3.8+ (compatible with vLLM requirements)
- Supported AWS instances: Inf2, Trn1/Trn1n, Trn2
Install the plugin from source on GitHub using the following commands. The plugin will automatically install the correct version of vLLM along with the other required dependencies.
git clone https://github.com/vllm-project/vllm-neuron.git
cd vllm-neuron
pip install --extra-index-url=https://pip.repos.neuron.amazonaws.com -e .

The following example runs offline inference through vLLM's LLM API:

from vllm import LLM, SamplingParams
# Initialize the model
llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    max_num_seqs=4,
    max_model_len=2048,
    tensor_parallel_size=32,
    num_gpu_blocks_override=4096,
    block_size=32,
    enable_prefix_caching=True,
    additional_config=dict(
        override_neuron_config=dict(
            async_mode=False,
            is_prefix_caching=True,
            is_block_kv_layout=True,
            pa_num_blocks=4096,
            pa_block_size=32,
            skip_warmup=True,
            save_sharded_checkpoint=True,
        )
    ),
)
# Generate text
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
]
sampling_params = SamplingParams(temperature=0.0)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
print(f"Prompt: {output.prompt}")
print(f"Generated: {output.outputs[0].text}")python3 -m vllm.entrypoints.openai.api_server \
--model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
--tensor-parallel-size 32 \
--max-model-len 2048 \
--max-num-seqs 4 \
--no-enable-prefix-caching \
--additional-config '{
"override_neuron_config": {
"skip_warmup": true,
"enable_bucketing": true,
"context_encoding_buckets": [256, 512, 1024, 2048],
"token_generation_buckets": [256, 512, 1024, 2048]
}
}' \
    --port 8000
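Once the server is up, it exposes vLLM's OpenAI-compatible HTTP API. A minimal client sketch, assuming the server above is reachable at `localhost:8000` (the prompt, `max_tokens`, and use of the `requests` library are illustrative, not part of the plugin):

```python
import requests

# Query the OpenAI-compatible completions endpoint served by vLLM.
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "prompt": "The capital of France is",
        "max_tokens": 32,
        "temperature": 0.0,
    },
)
print(response.json()["choices"][0]["text"])
```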
The table below summarizes current feature support:

| Feature | Status | Notes |
|---|---|---|
| Prefix Caching | 🟢 | |
| Speculative Decoding | 🟢 | Only Eagle V1 is supported |
| Quantization | 🟢 | INT8/FP8 quantization support |
| Chunked Prefill | 🚧 | |
| Multimodal | 🚧 | Only Llama 4 is supported |
- 🟢 Functional: Fully operational, with ongoing optimizations.
- 🚧 WIP: Under active development.
You configure Neuron-specific features through the NxD Inference (NxDI) library. Use the `additional_config` field to provide an `override_neuron_config` dict that specifies the NxDI configuration you want to apply, as in the sketch below.
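A minimal sketch of that shape, reusing keys from the examples above (which keys are valid depends on your NxDI version):

```python
from vllm import LLM

# Keys inside override_neuron_config are passed through to NxD Inference;
# skip_warmup and enable_bucketing mirror the serving example above and are illustrative.
llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    additional_config=dict(
        override_neuron_config=dict(
            skip_warmup=True,
            enable_bucketing=True,
        )
    ),
)
```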
We support a subset of the models supported by NxDI, including:
- Llama 2/3.1/3.3
- Llama 4 Scout, Maverick
- Qwen 2.5
- Qwen 3
- Chunked prefill is disabled by default on Neuron for optimal performance. To enable chunked prefill, set the environment variable `DISABLE_NEURON_CUSTOM_SCHEDULER="1"`. Users must also provide a `num_gpu_blocks_override` argument calculated as `ceil(max_model_len / block_size) * max_num_seqs` when invoking vLLM to avoid a potential OOB error; see the sketch below.
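A short sketch of that calculation, assuming the same `max_model_len`, `block_size`, and `max_num_seqs` values as the quickstart above (the values are illustrative):

```python
import math
import os

# Disable the Neuron custom scheduler so chunked prefill can be enabled (see note above).
os.environ["DISABLE_NEURON_CUSTOM_SCHEDULER"] = "1"

max_model_len = 2048
block_size = 32
max_num_seqs = 4

# Each full-length sequence needs ceil(max_model_len / block_size) KV-cache blocks;
# reserve that many blocks for every concurrent sequence.
num_gpu_blocks_override = math.ceil(max_model_len / block_size) * max_num_seqs
print(num_gpu_blocks_override)  # 256 with the values above; pass this to LLM(...)
```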
- Documentation: AWS Neuron Documentation
- Issues: GitHub Issues
- Community: AWS Neuron Forum
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.