40 changes: 40 additions & 0 deletions community/ai-vws-sizing-advisor/CHANGELOG.md
@@ -3,6 +3,46 @@ All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.


## [2.3] - 2026-01-08

This release focuses on improved sizing recommendations, enhanced Nemotron model integration, and comprehensive documentation updates.

### Added
- **Demo Screenshots** — Added visual examples showcasing the Configuration Wizard, RAG-powered sizing recommendations, and Local Deployment verification
- **Official Documentation Link** — Added link to [NVIDIA vGPU Docs Hub](https://docs.nvidia.com/vgpu/toolkits/sizing-advisor/latest/intro.html) in README

### Changed
- **README Overhaul** — Reorganized documentation to highlight NVIDIA Nemotron models
- Llama-3.3-Nemotron-Super-49B powers the RAG backend
- Nemotron-3 Nano 30B (FP8) as default for workload sizing
- New Demo section with screenshots demonstrating key features

- **Sizing Recommendation Improvements**
- Enhanced 95% usable capacity rule for profile selection (5% reserved for system overhead)
- Improved profile selection logic: picks smallest profile where (profile × 0.95) >= workload
- Better handling of edge cases near profile boundaries

- **GPU Passthrough Logic**
- Automatic passthrough recommendation when workload exceeds max single vGPU profile
- Clearer passthrough examples in RAG context (e.g., 92GB on BSE → 2× BSE GPU passthrough)
- Calculator now returns `vgpu_profile: null` with a multi-GPU passthrough recommendation (sketched below)
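
A minimal sketch of the selection and passthrough rules above. The profile sizes, profile naming, and return shape are illustrative (a 96 GB board stands in for the BSE example); the actual calculator may differ.

```python
import math

# Hypothetical profile sizes (GB) for one GPU type; the real calculator derives
# these from the selected GPU. A 96 GB board stands in for the BSE example.
PROFILE_SIZES_GB = [8, 12, 16, 24, 48, 96]
USABLE_FRACTION = 0.95  # 5% reserved for system overhead

def recommend(workload_gb: float, gpu_memory_gb: float = 96.0) -> dict:
    """Pick the smallest profile whose usable capacity covers the workload;
    fall back to multi-GPU passthrough when no single profile fits."""
    for size in sorted(PROFILE_SIZES_GB):
        if size * USABLE_FRACTION >= workload_gb:
            return {"vgpu_profile": f"{size}Q", "gpus": 1}
    gpus = math.ceil(workload_gb / (gpu_memory_gb * USABLE_FRACTION))
    return {"vgpu_profile": None, "passthrough": True, "gpus": gpus}

print(recommend(92))  # -> {'vgpu_profile': None, 'passthrough': True, 'gpus': 2}
```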

- **vLLM Local Deployment**
- Updated to vLLM v0.12.0 for proper NemotronH (hybrid Mamba-Transformer) architecture support
- Improved GPU memory utilization calculations for local testing
- Better `max-model-len` handling: the flag is passed only when explicitly specified, so vLLM can auto-detect the context length otherwise (see the sketch below)
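
A rough illustration of the `max-model-len` behavior: the helper below is hypothetical, but `--gpu-memory-utilization` and `--max-model-len` are standard vLLM server flags, and the upstream `vllm/vllm-openai:v0.12.0` image tag is shown only for illustration.

```python
from typing import List, Optional

def build_vllm_args(model: str,
                    gpu_mem_util: float = 0.90,
                    max_model_len: Optional[int] = None) -> List[str]:
    """Assemble vLLM server arguments for the local test container.
    --max-model-len is appended only when explicitly set, so vLLM can
    otherwise auto-detect the model's context length."""
    args = ["--model", model,
            "--gpu-memory-utilization", str(gpu_mem_util)]
    if max_model_len is not None:
        args += ["--max-model-len", str(max_model_len)]
    return args

# e.g. appended to: docker run --gpus all ... vllm/vllm-openai:v0.12.0
print(build_vllm_args("nvidia/nvidia-nemotron-3-nano-30b-a3b-fp8"))  # model id is illustrative
```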

- **Chat Improvements**
- Enhanced conversational mode with vGPU configuration context
- Better model extraction from sizing responses for follow-up questions
- Improved context handling for RAG vs inference workload discussions

### Improved
- **Nemotron Model Integration**
- Default model changed to Nemotron-3 Nano 30B FP8 in configuration wizard
- Nemotron thinking prompt support for enhanced reasoning
- Better model matching for Nemotron variants in calculator

## [2.2] - 2025-11-04

### Changed
88 changes: 72 additions & 16 deletions community/ai-vws-sizing-advisor/README.md
@@ -1,18 +1,67 @@
# AI vWS Sizing Advisor

<p align="center">
<img src="deployment_examples/example_rag_config.png" alt="AI vWS Sizing Advisor" width="800">
</p>

<p align="center">
<strong>RAG-powered vGPU sizing recommendations for AI Virtual Workstations</strong><br>
Powered by NVIDIA NeMo™ and Nemotron models
</p>

<p align="center">
<a href="https://docs.nvidia.com/vgpu/toolkits/sizing-advisor/latest/intro.html">Official Documentation</a> •
<a href="#demo">Demo</a> •
<a href="#deployment">Quick Start</a> •
<a href="./CHANGELOG.md">Changelog</a>
</p>

---

## Overview

AI vWS Sizing Advisor is a RAG-powered tool that helps you determine the optimal NVIDIA vGPU sizing configuration for AI workloads on NVIDIA AI Virtual Workstation (AI vWS). Drawing on NVIDIA vGPU documentation and best practices, it provides tailored recommendations that balance performance and resource efficiency.

### Powered by NVIDIA Nemotron

This tool leverages **NVIDIA Nemotron models** for intelligent sizing recommendations:

- **[Llama-3.3-Nemotron-Super-49B](https://build.nvidia.com/nvidia/llama-3_3-nemotron-super-49b-v1)** — Powers the RAG backend for intelligent conversational sizing guidance
- **[Nemotron-3 Nano 30B](https://build.nvidia.com/nvidia/nvidia-nemotron-3-nano-30b-a3b-fp8)** — Default model for workload sizing calculations (FP8 optimized)

### Key Capabilities

Enter your workload requirements and receive validated recommendations including:

- **vGPU Profile** - Recommended profile (e.g., L40S-24Q) based on your workload
- **Resource Requirements** - vCPUs, GPU memory, system RAM needed
- **Performance Estimates** - Expected latency, throughput, and time to first token
- **Live Testing** - Instantly deploy and validate your configuration locally using vLLM containers
- **vGPU Profile** — Recommended profile (e.g., L40S-24Q) based on your workload
- **Resource Requirements** — vCPUs, GPU memory, system RAM needed
- **Performance Estimates** — Expected latency, throughput, and time to first token
- **Live Testing** — Instantly deploy and validate your configuration locally using vLLM containers

The tool differentiates between RAG and inference workloads by accounting for embedding vectors and vector-database overhead, and it suggests GPU passthrough when a workload exceeds the largest single vGPU profile.
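
As a rough illustration of that accounting (placeholder numbers and field names, not the advisor's actual formula), a RAG workload adds embedding-model and vector-database overhead on top of the plain inference footprint:

```python
def estimate_gpu_memory_gb(weights_gb: float,
                           kv_cache_gb: float,
                           rag: bool = False,
                           embedder_gb: float = 2.0,
                           vector_db_gb: float = 4.0) -> float:
    """Illustrative accounting only: RAG adds embedding-model and
    vector-database overhead to the inference footprint."""
    total = weights_gb + kv_cache_gb
    if rag:
        total += embedder_gb + vector_db_gb
    return total

print(estimate_gpu_memory_gb(16.0, 6.0))            # inference-only: 22.0
print(estimate_gpu_memory_gb(16.0, 6.0, rag=True))  # RAG: 28.0
```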

---

## Demo

### Configuration Wizard

Configure your workload parameters including model selection, GPU type, quantization, and token sizes:

<p align="center">
<img src="deployment_examples/configuration_wizard.png" alt="Configuration Wizard" width="700">
</p>

### Local Deployment Verification

Validate your configuration by deploying a vLLM container locally and comparing actual GPU memory usage against estimates:

<p align="center">
<img src="deployment_examples/local_deployment.png" alt="Local Deployment" width="700">
</p>
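
If you want to script the same comparison, a minimal check against NVML (via the `nvidia-ml-py` / `pynvml` bindings) could look like the sketch below; the estimate value and device index are placeholders:

```python
import pynvml  # provided by the nvidia-ml-py package

def check_against_estimate(estimated_gb: float, device_index: int = 0) -> None:
    """Compare live GPU memory usage reported by NVML against the advisor's
    estimate. A standalone sanity check, not part of the tool itself."""
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        used_gb = pynvml.nvmlDeviceGetMemoryInfo(handle).used / 1024**3
        print(f"estimated {estimated_gb:.1f} GB, actual {used_gb:.1f} GB "
              f"({used_gb - estimated_gb:+.1f} GB)")
    finally:
        pynvml.nvmlShutdown()

check_against_estimate(22.0)  # placeholder estimate from the wizard
```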

---

## Prerequisites

### Hardware
@@ -44,8 +93,10 @@ docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
> **Note:** Docker must be at `/usr/bin/docker` (verified in `deploy/compose/docker-compose-rag-server.yaml`). User must be in docker group or have socket permissions.

### API Keys
- **NVIDIA Build API Key** (Required) - [Get your key](https://build.nvidia.com/settings/api-keys)
- **HuggingFace Token** (Optional) - [Create token](https://huggingface.co/settings/tokens) for gated models
- **NVIDIA Build API Key** (Required) — [Get your key](https://build.nvidia.com/settings/api-keys)
- **HuggingFace Token** (Optional) — [Create token](https://huggingface.co/settings/tokens) for gated models

---

## Deployment

@@ -74,28 +125,32 @@ npm install
npm run dev
```

---

## Usage

2. **Select Workload Type:** RAG or Inference
1. **Select Workload Type:** RAG or Inference

3. **Enter Parameters:**
- Model name (e.g., `meta-llama/Llama-2-7b-chat-hf`)
2. **Enter Parameters:**
- Model name (default: **Nemotron-3 Nano 30B FP8**)
- GPU type
- Prompt size (input tokens)
- Response size (output tokens)
- Quantization (FP16, INT8, INT4)
- Quantization (FP16, FP8, INT8, INT4)
- For RAG: Embedding model and vector dimensions

4. **View Recommendations:**
3. **View Recommendations:**
- Recommended vGPU profiles
- Resource requirements (vCPUs, RAM, GPU memory)
- Performance estimates

5. **Test Locally** (optional):
4. **Test Locally** (optional):
- Run local inference with a containerized vLLM server
- View performance metrics
- Compare actual results versus suggested profile configuration

---

## Management Commands

```bash
@@ -120,6 +175,8 @@ The stop script automatically performs Docker cleanup operations:
- Optionally removes dangling images (`--cleanup-images`)
- Optionally removes all data volumes (`--volumes`)

---

## Adding Documents to RAG Context

The tool includes NVIDIA vGPU documentation by default. To add your own:
Expand All @@ -134,8 +191,7 @@ curl -X POST -F "file=@./vgpu_docs/your-document.pdf" http://localhost:8082/v1/i

**Supported formats:** PDF, TXT, DOCX, HTML, PPTX



---

## License

@@ -145,6 +201,6 @@ Models governed by [NVIDIA AI Foundation Models Community License](https://docs.

---

**Version:** 2.2 (November 2025) - See [CHANGELOG.md](./CHANGELOG.md)
**Version:** 2.3 (January 2026) — See [CHANGELOG.md](./CHANGELOG.md)

**Support:** [GitHub Issues](https://github.com/NVIDIA/GenerativeAIExamples/issues) | [NVIDIA Forums](https://forums.developer.nvidia.com/)
**Support:** [GitHub Issues](https://github.com/NVIDIA/GenerativeAIExamples/issues) | [NVIDIA Forums](https://forums.developer.nvidia.com/) | [Official Docs](https://docs.nvidia.com/vgpu/toolkits/sizing-advisor/latest/intro.html)
@@ -1,3 +1,11 @@
# ============================================================================
# CENTRALIZED MODEL CONFIGURATION
# Change these values to use different models throughout the application
# ============================================================================
x-model-config:
# Embedding Model Configuration
embedding-model: &embedding-model "nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1"

services:

# Main ingestor server which is responsible for ingestion
@@ -38,10 +46,14 @@ services:
NGC_API_KEY: ${NGC_API_KEY:?"NGC_API_KEY is required"}

##===Embedding Model specific configurations===
# Model name - pulls from centralized config at top of file (can be overridden by env var)
APP_EMBEDDINGS_MODELNAME: *embedding-model
# url on which embedding model is hosted. If "", Nvidia hosted API is used
APP_EMBEDDINGS_SERVERURL: ${APP_EMBEDDINGS_SERVERURL-"nemoretriever-embedding-ms:8000"}
APP_EMBEDDINGS_MODELNAME: ${APP_EMBEDDINGS_MODELNAME:-nvidia/nv-embedqa-mistral-7b-v2}
APP_EMBEDDINGS_DIMENSIONS: ${APP_EMBEDDINGS_DIMENSIONS:-2048}
APP_EMBEDDINGS_SERVERURL: ${APP_EMBEDDINGS_SERVERURL:-"nemoretriever-embedding-ms:8000"}
# Embedding dimensions - IMPORTANT: Must match your embedding model!
# nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1: 4096
# nvidia/nv-embedqa-mistral-7b-v2: 2048
APP_EMBEDDINGS_DIMENSIONS: ${APP_EMBEDDINGS_DIMENSIONS:-4096}

##===NV-Ingest Connection Configurations=======
APP_NVINGEST_MESSAGECLIENTHOSTNAME: ${APP_NVINGEST_MESSAGECLIENTHOSTNAME:-"nv-ingest-ms-runtime"}
@@ -115,9 +127,10 @@ services:
- AUDIO_INFER_PROTOCOL=grpc
- CUDA_VISIBLE_DEVICES=0
- MAX_INGEST_PROCESS_WORKERS=${MAX_INGEST_PROCESS_WORKERS:-16}
- EMBEDDING_NIM_MODEL_NAME=${EMBEDDING_NIM_MODEL_NAME:-${APP_EMBEDDINGS_MODELNAME:-nvidia/nv-embedqa-7b-v2}}
# Embedding model - uses APP_EMBEDDINGS_MODELNAME which pulls from centralized config
- EMBEDDING_NIM_MODEL_NAME=${APP_EMBEDDINGS_MODELNAME:-nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1}
# In case of a self-hosted embedding model, use the endpoint url as - https://integrate.api.nvidia.com/v1
- EMBEDDING_NIM_ENDPOINT=${EMBEDDING_NIM_ENDPOINT:-${APP_EMBEDDINGS_SERVERURL-http://nemoretriever-embedding-ms:8000/v1}}
- EMBEDDING_NIM_ENDPOINT=${EMBEDDING_NIM_ENDPOINT:-http://nemoretriever-embedding-ms:8000/v1}
- INGEST_LOG_LEVEL=DEFAULT
- INGEST_EDGE_BUFFER_SIZE=64
# Message client for development
@@ -1,3 +1,14 @@
# ============================================================================
# CENTRALIZED MODEL CONFIGURATION
# Change these values to use different models throughout the application
# ============================================================================
x-model-config:
# Chat/LLM Model Configuration
llm-model: &llm-model "nvidia/llama-3.3-nemotron-super-49b-v1"

# Embedding Model Configuration
embedding-model: &embedding-model "nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1"

services:

# Main orchestrator server which stitches together all calls to different services to fulfill the user request
@@ -35,25 +46,16 @@ services:
VECTOR_DB_TOPK: ${VECTOR_DB_TOPK:-100}

##===LLM Model specific configurations===
APP_LLM_MODELNAME: ${APP_LLM_MODELNAME:-"meta/llama-3.1-8b-instruct"}
# Model name - pulls from centralized config at top of file (can be overridden by env var)
APP_LLM_MODELNAME: *llm-model
# url on which llm model is hosted. If "", Nvidia hosted API is used
APP_LLM_SERVERURL: ${APP_LLM_SERVERURL-""}

##===Query Rewriter Model specific configurations===
APP_QUERYREWRITER_MODELNAME: ${APP_QUERYREWRITER_MODELNAME:-"meta/llama-3.1-8b-instruct"}
# url on which query rewriter model is hosted. If "", Nvidia hosted API is used
APP_QUERYREWRITER_SERVERURL: ${APP_QUERYREWRITER_SERVERURL-"nim-llm-llama-8b-ms:8000"}
APP_LLM_SERVERURL: ${APP_LLM_SERVERURL:-""}

##===Embedding Model specific configurations===
# Model name - pulls from centralized config at top of file (can be overridden by env var)
APP_EMBEDDINGS_MODELNAME: *embedding-model
# url on which embedding model is hosted. If "", Nvidia hosted API is used
APP_EMBEDDINGS_SERVERURL: ${APP_EMBEDDINGS_SERVERURL-""}
APP_EMBEDDINGS_MODELNAME: ${APP_EMBEDDINGS_MODELNAME:-nvidia/nv-embedqa-mistral-7b-v2}

##===Reranking Model specific configurations===
# url on which ranking model is hosted. If "", Nvidia hosted API is used
APP_RANKING_SERVERURL: ${APP_RANKING_SERVERURL-""}
APP_RANKING_MODELNAME: ${APP_RANKING_MODELNAME:-nv-rerank-qa-mistral-4b:1}
ENABLE_RERANKER: ${ENABLE_RERANKER:-True}
APP_EMBEDDINGS_SERVERURL: ${APP_EMBEDDINGS_SERVERURL:-""}

NVIDIA_API_KEY: ${NGC_API_KEY:?"NGC_API_KEY is required"}

@@ -65,7 +67,7 @@ services:

# enable multi-turn conversation in the rag chain - this controls conversation history usage
# while doing query rewriting and in LLM prompt
ENABLE_MULTITURN: ${ENABLE_MULTITURN:-False}
ENABLE_MULTITURN: ${ENABLE_MULTITURN:-True}

# enable query rewriting for multiturn conversation in the rag chain.
# This will improve accuracy of the retriever pipeline but increase latency due to an additional LLM call
@@ -139,10 +141,10 @@ services:
context: ../../frontend
dockerfile: ./Dockerfile
args:
# Model name for LLM
NEXT_PUBLIC_MODEL_NAME: ${APP_LLM_MODELNAME:-meta/llama-3.1-8b-instruct}
# Model name for embeddings
NEXT_PUBLIC_EMBEDDING_MODEL: ${APP_EMBEDDINGS_MODELNAME:-nvidia/nv-embedqa-mistral-7b-v2}
# Model name for LLM - pulls from centralized config at top of file
NEXT_PUBLIC_MODEL_NAME: *llm-model
# Model name for embeddings - pulls from centralized config at top of file
NEXT_PUBLIC_EMBEDDING_MODEL: *embedding-model
# Model name for reranking
NEXT_PUBLIC_RERANKER_MODEL: ${APP_RANKING_MODELNAME:-nv-rerank-qa-mistral-4b:1}
# URL for rag server container
82 changes: 82 additions & 0 deletions community/ai-vws-sizing-advisor/deploy/compose/model_config.env
@@ -0,0 +1,82 @@
# ============================================================================
# CENTRALIZED MODEL CONFIGURATION
# ============================================================================
# This file centralizes all model configurations for the RAG system.
# Source this file or set these environment variables to change models.
#
# Usage:
# source model_config.env
# docker compose -f docker-compose-rag-server.yaml up
#
# ============================================================================

# ----------------------------------------------------------------------------
# CHAT/LLM MODEL CONFIGURATION
# ----------------------------------------------------------------------------
# The main language model used for generating responses
# Default: nvidia/llama-3.3-nemotron-super-49b-v1
#
# Other options:
# - meta/llama-3.1-405b-instruct
# - meta/llama-3.1-70b-instruct
# - meta/llama-3.1-8b-instruct
# - mistralai/mixtral-8x22b-instruct-v0.1
#
export APP_LLM_MODELNAME="nvidia/llama-3.3-nemotron-super-49b-v1"

# LLM Server URL (leave empty "" to use NVIDIA hosted API)
export APP_LLM_SERVERURL=""

# ----------------------------------------------------------------------------
# EMBEDDING MODEL CONFIGURATION
# ----------------------------------------------------------------------------
# The embedding model used for vectorizing documents and queries
# Default: nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1
#
# Other options:
# - nvidia/nv-embedqa-mistral-7b-v2
# - nvidia/nv-embed-v2
# - nvidia/llama-3.2-nv-embedqa-1b-v2
#
export APP_EMBEDDINGS_MODELNAME="nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1"

# Embedding Server URL (leave empty "" to use NVIDIA hosted API, or set to self-hosted)
# Example for self-hosted: "nemoretriever-embedding-ms:8000"
export APP_EMBEDDINGS_SERVERURL=""

# Embedding dimensions (adjust based on your embedding model)
# IMPORTANT: This MUST match your chosen embedding model!
# - nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1: 4096 (current default)
# - nvidia/nv-embedqa-mistral-7b-v2: 2048
# - nvidia/nv-embed-v2: 4096
export APP_EMBEDDINGS_DIMENSIONS="4096"

# ----------------------------------------------------------------------------
# REFLECTION MODEL CONFIGURATION (for response quality checking)
# ----------------------------------------------------------------------------
# Model used for reflection/self-checking if ENABLE_REFLECTION=true
export REFLECTION_LLM="mistralai/mixtral-8x22b-instruct-v0.1"
export REFLECTION_LLM_SERVERURL="nim-llm-mixtral-8x22b:8000"

# ----------------------------------------------------------------------------
# CAPTION MODEL CONFIGURATION (for image/chart understanding)
# ----------------------------------------------------------------------------
# Model used for generating captions for images, charts, and tables
export APP_NVINGEST_CAPTIONMODELNAME="meta/llama-3.2-11b-vision-instruct"
export APP_NVINGEST_CAPTIONENDPOINTURL="http://vlm-ms:8000/v1/chat/completions"
export VLM_CAPTION_MODEL_NAME="meta/llama-3.2-11b-vision-instruct"
export VLM_CAPTION_ENDPOINT="http://vlm-ms:8000/v1/chat/completions"

# ----------------------------------------------------------------------------
# ADDITIONAL NOTES
# ----------------------------------------------------------------------------
# 1. After changing models, you may need to rebuild containers:
# docker compose -f docker-compose-rag-server.yaml build --no-cache rag-playground
#
# 2. For self-hosted models, make sure the corresponding NIM services are running
#
# 3. The embedding dimensions must match your chosen embedding model
#
# 4. When switching between hosted and self-hosted, update both the model name
# and the server URL accordingly
