
vLLM Neuron Plugin (Beta)

⚠️ Important: This is a beta preview of the vLLM Neuron plugin. For a more stable experience, consider using the AWS Neuron vLLM fork as described in the NxDI vLLM User Guide.

The vLLM Neuron plugin (vllm-neuron) is a vLLM extension that integrates AWS Neuron Trainium/Inferentia support with vLLM. Built on vLLM's Plugin System, it enables the optimization of existing vLLM workflows on AWS Neuron.

  • vLLM v0.10.2 is installed automatically as part of the plugin installation; refer to the Quickstart Guide below.

Prerequisites

  • AWS Neuron SDK 2.26 (Release Notes)
  • Python 3.8+ (compatible with vLLM requirements)
  • Supported AWS instances: Inf2, Trn1/Trn1n, Trn2

Quickstart Guide

Install the plugin from GitHub sources using the following commands. The plugin will automatically install the correct version of vLLM along with other required dependencies.

git clone https://github.com/vllm-project/vllm-neuron.git
cd vllm-neuron
pip install --extra-index-url=https://pip.repos.neuron.amazonaws.com -e .
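
To confirm that the plugin pulled in the expected vLLM build, here is a minimal sanity check (a sketch, assuming the install above succeeded):

# Quick sanity check: the plugin pins vLLM, so this should print 0.10.2.
import vllm
print(vllm.__version__)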

Basic Usage

Offline Inference

from vllm import LLM, SamplingParams

# Initialize the model
llm = LLM(
    model=f"TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    max_num_seqs=4,
    max_model_len=2048,
    tensor_parallel_size=32,
    num_gpu_blocks_override=4096,
    block_size=32,
    enable_prefix_caching=True,
    additional_config=dict(
        override_neuron_config=dict(
            async_mode=False,
            is_prefix_caching=True,
            is_block_kv_layout=True,
            pa_num_blocks=4096,
            pa_block_size=32,
            skip_warmup=True,
            save_sharded_checkpoint=True,
        )
    ),
)

# Generate text
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
]
sampling_params = SamplingParams(temperature=0.0)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Generated: {output.outputs[0].text}")

OpenAI-Compatible API Server

python3 -m vllm.entrypoints.openai.api_server \
    --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
    --tensor-parallel-size 32 \
    --max-model-len 2048 \
    --max-num-seqs 4 \
    --no-enable-prefix-caching \
    --additional-config '{
        "override_neuron_config": {
            "skip_warmup": true,
            "enable_bucketing": true,
            "context_encoding_buckets": [256, 512, 1024, 2048],
            "token_generation_buckets": [256, 512, 1024, 2048]
        }
    }' \
    --port 8000
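
Once the server is up, you can query it with any OpenAI-compatible client. Below is a minimal sketch using the requests package (an assumed client-side dependency, not something the plugin installs); the model name and port match the command above:

import requests

# Query the OpenAI-compatible completions endpoint started above.
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "prompt": "The capital of France is",
        "max_tokens": 32,
        "temperature": 0.0,
    },
)
print(response.json()["choices"][0]["text"])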

Feature Support

Feature               Status   Notes
Prefix Caching        🟢
Speculative Decoding  🟢       Only Eagle V1 is supported
Quantization          🟢       INT8/FP8 quantization support
Chunked Prefill       🚧
Multimodal            🚧       Only Llama 4 is supported

  • 🟢 Functional: Fully operational, with ongoing optimizations.
  • 🚧 WIP: Under active development.

Feature Configuration

You configure Neuron-specific features using the NxD Inference library. Use the additional_config field to provide an override_neuron_config dict that specifies your desired NxD Inference configurations.
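
For example, a minimal sketch that enables bucketing through an override (the bucket sizes are taken from the server example above and are illustrative, not tuned recommendations):

from vllm import LLM

# Illustrative NxD Inference override: enable bucketing with example bucket sizes.
llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    additional_config=dict(
        override_neuron_config=dict(
            enable_bucketing=True,
            context_encoding_buckets=[256, 512, 1024, 2048],
            token_generation_buckets=[256, 512, 1024, 2048],
        )
    ),
)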

Models Supported

We support a subset of the models supported by NxD Inference (NxDI), including:

  • Llama 2/3.1/3.3
  • Llama 4 Scout, Maverick
  • Qwen 2.5
  • Qwen 3

Known Issues

  1. Chunked prefill is disabled by default on Neuron for optimal performance. To enable it, set the environment variable DISABLE_NEURON_CUSTOM_SCHEDULER="1". When chunked prefill is enabled, you must also pass a num_gpu_blocks_override argument, calculated as ceil(max_model_len / block_size) * max_num_seqs, when invoking vLLM to avoid a potential OOB error; see the sketch below.
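
As a sketch of that calculation, using the values from the examples above (max_model_len=2048, block_size=32, max_num_seqs=4):

import math

# Blocks needed so every sequence can hold a full-length context.
max_model_len = 2048
block_size = 32
max_num_seqs = 4

num_gpu_blocks_override = math.ceil(max_model_len / block_size) * max_num_seqs
print(num_gpu_blocks_override)  # 256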

Support

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
