`dynamo-run` is a CLI tool for exploring the Dynamo components, and an example of how to use them from Rust. It is also available as `dynamo run` if using the Python wheel.
If you used `pip` to install `dynamo`, you should have the `dynamo-run` binary pre-installed with the `vllm` engine. You must be in a virtual env with vllm installed to use this. For more options see "Full documentation" below.
Automatically download a model from Hugging Face
This will automatically download Qwen2.5 3B from Hugging Face (6 GiB download) and start it in interactive text mode:
dynamo run out=vllm Qwen/Qwen2.5-3B-Instruct
General format for HF download:
dynamo run out=<engine> <HUGGING_FACE_ORGANIZATION/MODEL_NAME>
For gated models (e.g. meta-llama/Llama-3.2-3B-Instruct) you must have the `HF_TOKEN` environment variable set.
The parameter can be the ID of a HuggingFace repository (it will be downloaded), a GGUF file, or a folder containing safetensors, config.json, etc (a locally checked out HuggingFace repository).
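If you prefer to fetch the weights ahead of time, here is a minimal sketch (assuming the `huggingface_hub` package is installed) that pre-downloads a repo and prints the local folder you can then pass to `dynamo run`:

```python
# Sketch: pre-download a Hugging Face repo and point dynamo run at the
# resulting local folder. Assumes `pip install huggingface_hub` and, for
# gated models, an HF_TOKEN in the environment.
from huggingface_hub import snapshot_download

local_dir = snapshot_download("Qwen/Qwen2.5-3B-Instruct")
print(local_dir)  # then: dynamo run out=vllm <that folder>
```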
Any of the models in this repository should be high quality and fast on almost any machine: https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF E.g. https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/blob/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf
Download model file:
curl -L -o Llama-3.2-3B-Instruct-Q4_K_M.gguf "https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf?download=true"
Text interface
dynamo run out=vllm Llama-3.2-3B-Instruct-Q4_K_M.gguf # or path to a Hugging Face repo checkout instead of the GGUF
HTTP interface
dynamo run in=http out=vllm Llama-3.2-3B-Instruct-Q4_K_M.gguf
List the models
curl localhost:8080/v1/models
Send a request
curl -d '{"model": "Llama-3.2-3B-Instruct-Q4_K_M", "max_completion_tokens": 2049, "messages":[{"role":"user", "content": "What is the capital of South Africa?" }]}' -H 'Content-Type: application/json' http://localhost:8080/v1/chat/completions
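The same request can be sent from Python. A minimal sketch using only the standard library, assuming the HTTP interface above is listening on localhost:8080:

```python
# Sketch: send a chat completion request to the dynamo-run HTTP front-end
# using only the Python standard library (assumes localhost:8080 as above).
import json
import urllib.request

payload = {
    "model": "Llama-3.2-3B-Instruct-Q4_K_M",
    "max_completion_tokens": 2049,
    "messages": [{"role": "user", "content": "What is the capital of South Africa?"}],
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())
print(body["choices"][0]["message"]["content"])
```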
Multi-node
You will need etcd and NATS installed and accessible from both nodes.
Node 1:
dynamo run in=http out=dyn://llama3B_pool
Node 2:
dynamo run in=dyn://llama3B_pool out=vllm ~/llm_models/Llama-3.2-3B-Instruct
This will use etcd to auto-discover the model and NATS to talk to it. You can run multiple workers on the same endpoint and it will pick one at random each time.
The `llama3B_pool` name is purely symbolic; pick anything as long as it matches the other node.
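To check that the worker on Node 2 has registered with the front-end on Node 1, you can poll the models endpoint. A rough sketch, assuming the front-end on Node 1 listens on port 8080:

```python
# Sketch: wait until the worker has registered via etcd and the model is
# visible on the HTTP front-end (assumes Node 1, default port 8080).
import json
import time
import urllib.request

while True:
    with urllib.request.urlopen("http://localhost:8080/v1/models") as resp:
        models = json.loads(resp.read()).get("data", [])
    if models:
        print("Available models:", [m["id"] for m in models])
        break
    time.sleep(1)
```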
Run `dynamo run --help` for more options.
`dynamo-run` is what `dynamo run` executes. It is an example of what you can build in Rust with the `dynamo-llm` and `dynamo-runtime` crates. Below is how to build it from source, along with all of its features.
Libraries Ubuntu:
apt install -y build-essential libhwloc-dev libudev-dev pkg-config libssl-dev libclang-dev protobuf-compiler python3-dev
Libraries macOS:
brew install cmake protobuf
# install Xcode from App Store and check that Metal is accessible
xcrun -sdk macosx metal
# may have to install Xcode Command Line Tools:
xcode-select --install
Install Rust:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
Navigate to launch/ directory
cd launch/
Optionally, you can run `cargo build` from any location with these arguments:
- `--target-dir /path/to/target_directory` to specify a target directory with write privileges
- `--manifest-path /path/to/project/Cargo.toml` if `cargo build` is run outside of the `launch/` directory
- Linux with GPU and CUDA (tested on Ubuntu):
cargo build --release --features mistralrs,cuda
- macOS with Metal:
cargo build --release --features mistralrs,metal
- CPU only:
cargo build --release --features mistralrs
The binary will be called `dynamo-run` in `target/release`:
cd target/release
- To use the sglang engine, set up the Python virtual env:
uv venv
source .venv/bin/activate
uv pip install pip
uv pip install sgl-kernel --force-reinstall --no-deps
uv pip install "sglang[all]==0.4.2" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/
- Build
cargo build --release --features sglang
- Run
Any example above using `out=sglang` will work, but our sglang backend is also multi-GPU and multi-node.
Node 1:
dynamo-run in=http out=sglang --model-path ~/llm_models/DeepSeek-R1-Distill-Llama-70B/ --tensor-parallel-size 8 --num-nodes 2 --node-rank 0 --dist-init-addr 10.217.98.122:9876
Node 2:
dynamo-run in=none out=sglang --model-path ~/llm_models/DeepSeek-R1-Distill-Llama-70B/ --tensor-parallel-size 8 --num-nodes 2 --node-rank 1 --dist-init-addr 10.217.98.122:9876
- Build
cargo build --release --features llamacpp,cuda
- Run
dynamo-run out=llama_cpp --model-path ~/llm_models/Llama-3.2-3B-Instruct-Q6_K.gguf --model-config ~/llm_models/Llama-3.2-3B-Instruct/
The extra `--model-config` flag is needed because:
- llama_cpp only runs GGUF files
- We send it tokens, meaning we do the tokenization ourselves, so we need a tokenizer
- We don't yet read the tokenizer out of the GGUF (TODO), so we need an HF repo with `tokenizer.json` et al.
If the build step also builds llama_cpp libraries into the same folder as the binary ("libllama.so", "libggml.so", "libggml-base.so", "libggml-cpu.so", "libggml-cuda.so"), then `dynamo-run` will need to find those at runtime. Set `LD_LIBRARY_PATH`, and be sure to deploy them alongside the `dynamo-run` binary.
The vllm engine uses the vllm Python library. We only use the back half of vllm, talking to it over `zmq`. Slow startup, fast inference. Supports both safetensors from HF and GGUF files.
We use uv but any virtualenv manager should work.
Setup:
uv venv
source .venv/bin/activate
uv pip install pip
uv pip install vllm==0.7.3 setuptools
Note: If you're on Ubuntu 22.04 or earlier, you will need to add `--python=python3.10` to your `uv venv` command.
Build:
cargo build --release --features vllm
Run (still inside that virtualenv) - HF repo:
./dynamo-run in=http out=vllm --model-path ~/llm_models/Llama-3.2-3B-Instruct/
Run (still inside that virtualenv) - GGUF:
./dynamo-run in=http out=vllm --model-path ~/llm_models/Llama-3.2-3B-Instruct-Q6_K.gguf --model-config ~/llm_models/Llama-3.2-3B-Instruct/
- Multi-node:
Node 1:
dynamo-run in=text out=vllm ~/llm_models/Llama-3.2-3B-Instruct/ --tensor-parallel-size 8 --num-nodes 2 --leader-addr 10.217.98.122:6539 --node-rank 0
Node 2:
dynamo-run in=none out=vllm ~/llm_models/Llama-3.2-3B-Instruct/ --num-nodes 2 --leader-addr 10.217.98.122:6539 --node-rank 1
You can provide your own engine in a Python file. The file must provide a generator with this signature:
async def generate(request):
Build: cargo build --release --features python
If the Python engine wants to receive and return strings - it will do the prompt templating and tokenization itself - run it like this:
dynamo-run out=pystr:/home/user/my_python_engine.py
- The `request` parameter is a map, an OpenAI-compatible create chat completion request: https://platform.openai.com/docs/api-reference/chat/create
- The function must `yield` a series of maps conforming to the create chat completion stream response (example below).
- If using an HTTP front-end, add the `--model-name` flag. This is the name we serve the model under.
The file is loaded once at startup and kept in memory.
Example engine:
import asyncio

async def generate(request):
    yield {"id":"1","choices":[{"index":0,"delta":{"content":"The","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-3B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
    await asyncio.sleep(0.1)
    yield {"id":"1","choices":[{"index":0,"delta":{"content":" capital","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-3B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
    await asyncio.sleep(0.1)
    yield {"id":"1","choices":[{"index":0,"delta":{"content":" of","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-3B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
    await asyncio.sleep(0.1)
    yield {"id":"1","choices":[{"index":0,"delta":{"content":" France","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-3B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
    await asyncio.sleep(0.1)
    yield {"id":"1","choices":[{"index":0,"delta":{"content":" is","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-3B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
    await asyncio.sleep(0.1)
    yield {"id":"1","choices":[{"index":0,"delta":{"content":" Paris","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-3B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
    await asyncio.sleep(0.1)
    yield {"id":"1","choices":[{"index":0,"delta":{"content":".","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-3B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
    await asyncio.sleep(0.1)
    yield {"id":"1","choices":[{"index":0,"delta":{"content":"","role":"assistant"},"finish_reason":"stop"}],"created":1841762283,"model":"Llama-3.2-3B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
Command line arguments are passed to the python engine like this:
dynamo-run out=pystr:my_python_engine.py -- -n 42 --custom-arg Orange --yes
The Python engine receives the arguments in `sys.argv`. The argument list will include some standard ones as well as anything after the `--`.
This input:
dynamo-run out=pystr:my_engine.py /opt/models/Llama-3.2-3B-Instruct/ --model-name llama_3.2 --tensor-parallel-size 4 -- -n 1
is read like this:
import sys

async def generate(request):
    ...  # as before

if __name__ == "__main__":
    print(f"MAIN: {sys.argv}")
and produces this output:
MAIN: ['my_engine.py', '--model-path', '/opt/models/Llama-3.2-3B-Instruct/', '--model-name', 'llama_3.2', '--http-port', '8080', '--tensor-parallel-size', '4', '--base-gpu-id', '0', '--num-nodes', '1', '--node-rank', '0', '-n', '1']
This allows quick iteration on the engine setup. Note how the `-n 1` is included. The flags `--leader-addr` and `--model-config` will also be added if provided to `dynamo-run`.
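One way to consume those arguments inside the engine is with `argparse`. A sketch using the flag names shown in the output above (`parse_known_args` keeps any extra custom flags from raising an error):

```python
# Sketch: parse the standard flags dynamo-run passes to the engine, plus
# whatever custom flags followed `--` (flag names as in the output above).
import argparse
import sys

parser = argparse.ArgumentParser()
parser.add_argument("--model-path")
parser.add_argument("--model-name")
parser.add_argument("--http-port", type=int)
parser.add_argument("--tensor-parallel-size", type=int, default=1)
parser.add_argument("--base-gpu-id", type=int, default=0)
parser.add_argument("--num-nodes", type=int, default=1)
parser.add_argument("--node-rank", type=int, default=0)
parser.add_argument("-n", type=int)
args, unknown = parser.parse_known_args(sys.argv[1:])
print(args.model_path, args.n, unknown)
```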
If the Python engine wants to receive and return tokens - the prompt templating and tokenization are already done - run it like this:
dynamo-run out=pytok:/home/user/my_python_engine.py --model-path <hf-repo-checkout>
- The request parameter is a map that looks like this:
{'token_ids': [128000, 128006, 9125, 128007, ... lots more ... ], 'stop_conditions': {'max_tokens': 8192, 'stop': None, 'stop_token_ids_hidden': [128001, 128008, 128009], 'min_tokens': None, 'ignore_eos': None}, 'sampling_options': {'n': None, 'best_of': None, 'presence_penalty': None, 'frequency_penalty': None, 'repetition_penalty': None, 'temperature': None, 'top_p': None, 'top_k': None, 'min_p': None, 'use_beam_search': None, 'length_penalty': None, 'seed': None}, 'eos_token_ids': [128001, 128008, 128009], 'mdc_sum': 'f1cd44546fdcbd664189863b7daece0f139a962b89778469e4cffc9be58ccc88', 'annotations': []}
- The `generate` function must `yield` a series of maps that look like this:
{"token_ids":[791],"tokens":None,"text":None,"cum_log_probs":None,"log_probs":None,"finish_reason":None}
- The command line flag `--model-path` must point to a Hugging Face repo checkout containing the `tokenizer.json`. The `--model-name` flag is optional. If not provided we use the HF repo name (directory name) as the model name.
Example engine:
import asyncio

async def generate(request):
    yield {"token_ids":[791]}
    await asyncio.sleep(0.1)
    yield {"token_ids":[6864]}
    await asyncio.sleep(0.1)
    yield {"token_ids":[315]}
    await asyncio.sleep(0.1)
    yield {"token_ids":[9822]}
    await asyncio.sleep(0.1)
    yield {"token_ids":[374]}
    await asyncio.sleep(0.1)
    yield {"token_ids":[12366]}
    await asyncio.sleep(0.1)
    yield {"token_ids":[13]}
`pytok` supports the same ways of passing command line arguments as `pystr` - `initialize` or main with `sys.argv`.
TensorRT-LLM. Requires `clang` and `libclang-dev`.
Build:
cargo build --release --features trtllm
Run:
dynamo-run in=text out=trtllm --model-path /app/trtllm_engine/ --model-config ~/llm_models/Llama-3.2-3B-Instruct/
Note that TRT-LLM uses its own `.engine` format for weights. Repo models must be converted like so:
- Get the build container
docker run --gpus all -it nvcr.io/nvidian/nemo-llm/trtllm-engine-builder:0.2.0 bash
- Fetch the model and convert
mkdir /tmp/model
huggingface-cli download meta-llama/Llama-3.2-3B-Instruct --local-dir /tmp/model
python convert_checkpoint.py --model_dir /tmp/model/ --output_dir ./converted --dtype [float16|bfloat16|whatever you want] --tp_size X --pp_size Y
trtllm-build --checkpoint_dir ./converted --output_dir ./final/trtllm_engine --use_paged_context_fmha enable --gemm_plugin auto
The `--model-path` you give to `dynamo-run` must contain the `config.json` (TRT-LLM's, not the model's) and `rank0.engine` (plus other ranks if relevant).
- Execute: TRT-LLM is a C++ library that must have been previously built and installed. It needs a lot of memory to compile. GitLab builds a container you can try:
sudo docker run --gpus all -it -v /home/user:/outside-home gitlab-master.nvidia.com:5005/dl/ai-services/libraries/rust/nim-nvllm/tensorrt_llm_runtime:85fa4a6f
Copy the TRT-LLM engine, the model's `.json` files (for the model deployment card) and the `nio` binary built for the correct glibc (the container is currently Ubuntu 22.04) into that container.
Dynamo includes two echo engines for testing and debugging purposes:
The `echo_core` engine accepts pre-processed requests and echoes the tokens back as the response. This is useful for testing pre-processing functionality, as the response will include the full prompt template.
dynamo-run in=http out=echo_core --model-path <hf-repo-checkout>
The `echo_full` engine accepts un-processed requests and echoes the prompt back as the response.
dynamo-run in=http out=echo_full --model-name my_model
Both echo engines use a configurable delay between tokens to simulate generation speed. You can adjust this using the `DYN_TOKEN_ECHO_DELAY_MS` environment variable:
# Set token echo delay to 1ms (1000 tokens per second)
DYN_TOKEN_ECHO_DELAY_MS=1 dynamo-run in=http out=echo_full
The default delay is 10ms, which produces approximately 100 tokens per second.
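To verify the effective rate you can time a streamed response. A rough sketch against the HTTP front-end, assuming `out=echo_full --model-name my_model` is running on port 8080:

```python
# Sketch: count streamed chunks from the echo engine and estimate chunks/sec
# (assumes the HTTP front-end above on localhost:8080, model name "my_model").
import json
import time
import urllib.request

payload = {
    "model": "my_model",
    "stream": True,
    "messages": [{"role": "user", "content": "one two three four five six seven eight"}],
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
start = time.time()
chunks = 0
with urllib.request.urlopen(req) as resp:
    for line in resp:  # server-sent events, one "data: ..." line per chunk
        if line.startswith(b"data:") and b"[DONE]" not in line:
            chunks += 1
elapsed = time.time() - start
print(f"{chunks} chunks in {elapsed:.2f}s ({chunks / elapsed:.0f} per second)")
```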
`dynamo-run` can take a JSONL file full of prompts and evaluate them all:
dynamo-run in=batch:prompts.jsonl out=llamacpp <model>
The input file should look like this:
{"text": "What is the capital of France?"}
{"text": "What is the capital of Spain?"}
Each one is passed as a prompt to the model. The output is written back to the same folder in `output.jsonl`. At the end of the run some statistics are printed.
The output looks like this:
{"text":"What is the capital of France?","response":"The capital of France is Paris.","tokens_in":7,"tokens_out":7,"elapsed_ms":1566}
{"text":"What is the capital of Spain?","response":".The capital of Spain is Madrid.","tokens_in":7,"tokens_out":7,"elapsed_ms":855}
The input defaults to `in=text`. The output defaults to the `mistralrs` engine if available, otherwise to whichever engine you have compiled in (depending on `--features`).