This image contains the granite-7b-lab-gguf model and llama.cpp to serve the model with CUDA support. The container can run in two modes: CLI, to chat with the model interactively, or server, to serve the model to OpenAI-compatible API clients.
- Docker or Podman
- CUDA Toolkit, tested with v12.6
- NVIDIA Container Toolkit
NVIDIA Container Toolkit install guide: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
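Before installing the toolkit, it can help to confirm that the NVIDIA driver is working on the host. This check is an optional extra, not part of the install guide steps:

# Should list the GPU(s) and the installed driver version
nvidia-smi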
# Add the NVIDIA Container Toolkit repository and install the toolkit
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo |
sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo dnf install -y nvidia-container-toolkit

# Docker: register the NVIDIA runtime and restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Podman: generate a CDI specification for the GPU devices
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
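After configuring the runtime, a quick smoke test can confirm that containers can see the GPU. The base image here is only an example; the toolkit mounts nvidia-smi into the container at run time:

# Docker: run nvidia-smi inside a throwaway container
docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

# Podman: same check via the CDI device specification
podman run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi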
# CLI mode: chat with the model interactively
docker run --runtime=nvidia --gpus all --ipc=host -it granite-7b-lab-gguf-cuda
podman run --device nvidia.com/gpu=all --ipc=host -it granite-7b-lab-gguf-cuda

# Server mode (-s): serve the model to OpenAI API clients
docker run --runtime=nvidia --gpus all --network host --ipc=host -it granite-7b-lab-gguf-cuda -s
podman run --device nvidia.com/gpu=all --network host --ipc=host -it granite-7b-lab-gguf-cuda -s
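Once the server is up, a quick health check confirms it is accepting connections. This assumes the default port of 8080 and the /health endpoint provided by recent llama.cpp server builds:

curl http://localhost:8080/health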
Arguments can be passed to llama.cpp by setting environment variables prefixed with "LLAMA_".
For example, to pass the "--port" argument, set the environment variable "LLAMA_PORT" when running the container:
docker run --runtime=nvidia --gpus all --network host --ipc=host -it -e "LLAMA_PORT=8090" granite-7b-lab-gguf-cuda -s
This will start the llama.cpp server listening on port 8090.
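To verify that the override took effect, you can query the OpenAI-compatible model listing on the new port (assuming the bundled llama.cpp server exposes /v1/models):

curl http://localhost:8090/v1/models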
For a full list of llama.cpp arguments, refer to the llama.cpp documentation: https://github.com/ggerganov/llama.cpp
With the server running, clients can send requests to the OpenAI-compatible API, for example:
curl http://0.0.0.0:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "/instructlab/share/models/instructlab/granite-7b-lab",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Tell me a story about a red car"
      }
    ]
  }'
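If only the generated text is needed, the response can be piped through jq, assuming jq is installed and the server returns the standard OpenAI chat completions schema:

curl -s http://0.0.0.0:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "/instructlab/share/models/instructlab/granite-7b-lab", "messages": [{"role": "user", "content": "Tell me a story about a red car"}]}' \
  | jq -r '.choices[0].message.content'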