40 changes: 40 additions & 0 deletions community/ai-vws-sizing-advisor/CHANGELOG.md
@@ -3,6 +3,46 @@ All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.


## [2.3] - 2026-01-08

This release focuses on improved sizing recommendations, enhanced Nemotron model integration, and comprehensive documentation updates.

### Added
- **Demo Screenshots** — Added visual examples showcasing the Configuration Wizard, RAG-powered sizing recommendations, and Local Deployment verification
- **Official Documentation Link** — Added link to [NVIDIA vGPU Docs Hub](https://docs.nvidia.com/vgpu/toolkits/sizing-advisor/latest/intro.html) in README

### Changed
- **README Overhaul** — Reorganized documentation to highlight NVIDIA Nemotron models
- Llama-3.3-Nemotron-Super-49B powers the RAG backend
- Nemotron-3 Nano 30B (FP8) as default for workload sizing
- New Demo section with screenshots demonstrating key features

- **Sizing Recommendation Improvements**
- Enhanced 95% usable capacity rule for profile selection (5% reserved for system overhead)
- Improved profile selection logic: picks smallest profile where (profile × 0.95) >= workload
- Better handling of edge cases near profile boundaries

- **GPU Passthrough Logic**
- Automatic passthrough recommendation when workload exceeds max single vGPU profile
- Clearer passthrough examples in RAG context (e.g., 92GB on BSE → 2× BSE GPU passthrough)
- Calculator now returns `vgpu_profile: null` with a multi-GPU passthrough recommendation (sketched below)
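
A minimal sketch of the selection and passthrough rules above. The profile sizes, profile naming, and return shape are illustrative (a 96 GB board stands in for the BSE example); the actual calculator may differ.

```python
import math

# Hypothetical profile sizes (GB) for one GPU type; the real calculator derives
# these from the selected GPU. A 96 GB board stands in for the BSE example.
PROFILE_SIZES_GB = [8, 12, 16, 24, 48, 96]
USABLE_FRACTION = 0.95  # 5% reserved for system overhead

def recommend(workload_gb: float, gpu_memory_gb: float = 96.0) -> dict:
    """Pick the smallest profile whose usable capacity covers the workload;
    fall back to multi-GPU passthrough when no single profile fits."""
    for size in sorted(PROFILE_SIZES_GB):
        if size * USABLE_FRACTION >= workload_gb:
            return {"vgpu_profile": f"{size}Q", "gpus": 1}
    gpus = math.ceil(workload_gb / (gpu_memory_gb * USABLE_FRACTION))
    return {"vgpu_profile": None, "passthrough": True, "gpus": gpus}

print(recommend(92))  # -> {'vgpu_profile': None, 'passthrough': True, 'gpus': 2}
```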

- **vLLM Local Deployment**
- Updated to vLLM v0.12.0 for proper NemotronH (hybrid Mamba-Transformer) architecture support
- Improved GPU memory utilization calculations for local testing
- Better `max-model-len` handling: the flag is passed only when explicitly specified, so vLLM can auto-detect the context length otherwise (see the sketch below)
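
A rough illustration of the `max-model-len` behavior: the helper below is hypothetical, but `--gpu-memory-utilization` and `--max-model-len` are standard vLLM server flags, and the upstream `vllm/vllm-openai:v0.12.0` image tag is shown only for illustration.

```python
from typing import List, Optional

def build_vllm_args(model: str,
                    gpu_mem_util: float = 0.90,
                    max_model_len: Optional[int] = None) -> List[str]:
    """Assemble vLLM server arguments for the local test container.
    --max-model-len is appended only when explicitly set, so vLLM can
    otherwise auto-detect the model's context length."""
    args = ["--model", model,
            "--gpu-memory-utilization", str(gpu_mem_util)]
    if max_model_len is not None:
        args += ["--max-model-len", str(max_model_len)]
    return args

# e.g. appended to: docker run --gpus all ... vllm/vllm-openai:v0.12.0
print(build_vllm_args("nvidia/nvidia-nemotron-3-nano-30b-a3b-fp8"))  # model id is illustrative
```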

- **Chat Improvements**
- Enhanced conversational mode with vGPU configuration context
- Better model extraction from sizing responses for follow-up questions
- Improved context handling for RAG vs inference workload discussions

### Improved
- **Nemotron Model Integration**
- Default model changed to Nemotron-3 Nano 30B FP8 in configuration wizard
- Nemotron thinking prompt support for enhanced reasoning
- Better model matching for Nemotron variants in calculator

## [2.2] - 2025-11-04

### Changed
88 changes: 72 additions & 16 deletions community/ai-vws-sizing-advisor/README.md
@@ -1,18 +1,67 @@
# AI vWS Sizing Advisor

<p align="center">
<img src="deployment_examples/example_rag_config.png" alt="AI vWS Sizing Advisor" width="800">
</p>

<p align="center">
<strong>RAG-powered vGPU sizing recommendations for AI Virtual Workstations</strong><br>
Powered by NVIDIA NeMo™ and Nemotron models
</p>

<p align="center">
<a href="https://docs.nvidia.com/vgpu/toolkits/sizing-advisor/latest/intro.html">Official Documentation</a> •
<a href="#demo">Demo</a> •
<a href="#deployment">Quick Start</a> •
<a href="./CHANGELOG.md">Changelog</a>
</p>

---

## Overview

AI vWS Sizing Advisor is a RAG-powered tool that helps you determine the optimal NVIDIA vGPU sizing configuration for AI workloads on NVIDIA AI Virtual Workstation (AI vWS). Drawing on NVIDIA vGPU documentation and best practices, it provides tailored recommendations that balance performance and resource efficiency.

### Powered by NVIDIA Nemotron

This tool leverages **NVIDIA Nemotron models** for intelligent sizing recommendations:

- **[Llama-3.3-Nemotron-Super-49B](https://build.nvidia.com/nvidia/llama-3_3-nemotron-super-49b-v1)** — Powers the RAG backend for intelligent conversational sizing guidance
- **[Nemotron-3 Nano 30B](https://build.nvidia.com/nvidia/nvidia-nemotron-3-nano-30b-a3b-fp8)** — Default model for workload sizing calculations (FP8 optimized)

### Key Capabilities

Enter your workload requirements and receive validated recommendations including:

- **vGPU Profile** - Recommended profile (e.g., L40S-24Q) based on your workload
- **Resource Requirements** - vCPUs, GPU memory, system RAM needed
- **Performance Estimates** - Expected latency, throughput, and time to first token
- **Live Testing** - Instantly deploy and validate your configuration locally using vLLM containers
- **vGPU Profile** — Recommended profile (e.g., L40S-24Q) based on your workload
- **Resource Requirements** — vCPUs, GPU memory, system RAM needed
- **Performance Estimates** — Expected latency, throughput, and time to first token
- **Live Testing** — Instantly deploy and validate your configuration locally using vLLM containers

The tool differentiates between RAG and inference workloads by accounting for embedding vectors and vector-database overhead, and it suggests GPU passthrough when a workload exceeds the largest single vGPU profile.
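
As a rough illustration of that accounting (placeholder numbers and field names, not the advisor's actual formula), a RAG workload adds embedding-model and vector-database overhead on top of the plain inference footprint:

```python
def estimate_gpu_memory_gb(weights_gb: float,
                           kv_cache_gb: float,
                           rag: bool = False,
                           embedder_gb: float = 2.0,
                           vector_db_gb: float = 4.0) -> float:
    """Illustrative accounting only: RAG adds embedding-model and
    vector-database overhead to the inference footprint."""
    total = weights_gb + kv_cache_gb
    if rag:
        total += embedder_gb + vector_db_gb
    return total

print(estimate_gpu_memory_gb(16.0, 6.0))            # inference-only: 22.0
print(estimate_gpu_memory_gb(16.0, 6.0, rag=True))  # RAG: 28.0
```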

---

## Demo

### Configuration Wizard

Configure your workload parameters including model selection, GPU type, quantization, and token sizes:

<p align="center">
<img src="deployment_examples/configuration_wizard.png" alt="Configuration Wizard" width="700">
</p>

### Local Deployment Verification

Validate your configuration by deploying a vLLM container locally and comparing actual GPU memory usage against estimates:

<p align="center">
<img src="deployment_examples/local_deployment.png" alt="Local Deployment" width="700">
</p>
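
If you want to script the same comparison, a minimal check against NVML (via the `nvidia-ml-py` / `pynvml` bindings) could look like the sketch below; the estimate value and device index are placeholders:

```python
import pynvml  # provided by the nvidia-ml-py package

def check_against_estimate(estimated_gb: float, device_index: int = 0) -> None:
    """Compare live GPU memory usage reported by NVML against the advisor's
    estimate. A standalone sanity check, not part of the tool itself."""
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        used_gb = pynvml.nvmlDeviceGetMemoryInfo(handle).used / 1024**3
        print(f"estimated {estimated_gb:.1f} GB, actual {used_gb:.1f} GB "
              f"({used_gb - estimated_gb:+.1f} GB)")
    finally:
        pynvml.nvmlShutdown()

check_against_estimate(22.0)  # placeholder estimate from the wizard
```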

---

## Prerequisites

### Hardware
@@ -44,8 +93,10 @@ docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
> **Note:** Docker must be at `/usr/bin/docker` (verified in `deploy/compose/docker-compose-rag-server.yaml`). User must be in docker group or have socket permissions.

### API Keys
- **NVIDIA Build API Key** (Required) - [Get your key](https://build.nvidia.com/settings/api-keys)
- **HuggingFace Token** (Optional) - [Create token](https://huggingface.co/settings/tokens) for gated models
- **NVIDIA Build API Key** (Required) — [Get your key](https://build.nvidia.com/settings/api-keys)
- **HuggingFace Token** (Optional) — [Create token](https://huggingface.co/settings/tokens) for gated models

---

## Deployment

@@ -74,28 +125,32 @@ npm install
npm run dev
```

---

## Usage

2. **Select Workload Type:** RAG or Inference
1. **Select Workload Type:** RAG or Inference

3. **Enter Parameters:**
- Model name (e.g., `meta-llama/Llama-2-7b-chat-hf`)
2. **Enter Parameters:**
- Model name (default: **Nemotron-3 Nano 30B FP8**)
- GPU type
- Prompt size (input tokens)
- Response size (output tokens)
- Quantization (FP16, INT8, INT4)
- Quantization (FP16, FP8, INT8, INT4)
- For RAG: Embedding model and vector dimensions

4. **View Recommendations:**
3. **View Recommendations:**
- Recommended vGPU profiles
- Resource requirements (vCPUs, RAM, GPU memory)
- Performance estimates

5. **Test Locally** (optional):
4. **Test Locally** (optional):
- Run local inference with a containerized vLLM server
- View performance metrics
- Compare actual results versus suggested profile configuration

---

## Management Commands

```bash
@@ -120,6 +175,8 @@ The stop script automatically performs Docker cleanup operations:
- Optionally removes dangling images (`--cleanup-images`)
- Optionally removes all data volumes (`--volumes`)

---

## Adding Documents to RAG Context

The tool includes NVIDIA vGPU documentation by default. To add your own:
Expand All @@ -134,8 +191,7 @@ curl -X POST -F "file=@./vgpu_docs/your-document.pdf" http://localhost:8082/v1/i

**Supported formats:** PDF, TXT, DOCX, HTML, PPTX



---

## License

@@ -145,6 +201,6 @@ Models governed by [NVIDIA AI Foundation Models Community License](https://docs.

---

**Version:** 2.2 (November 2025) - See [CHANGELOG.md](./CHANGELOG.md)
**Version:** 2.3 (January 2026) — See [CHANGELOG.md](./CHANGELOG.md)

**Support:** [GitHub Issues](https://github.com/NVIDIA/GenerativeAIExamples/issues) | [NVIDIA Forums](https://forums.developer.nvidia.com/)
**Support:** [GitHub Issues](https://github.com/NVIDIA/GenerativeAIExamples/issues) | [NVIDIA Forums](https://forums.developer.nvidia.com/) | [Official Docs](https://docs.nvidia.com/vgpu/toolkits/sizing-advisor/latest/intro.html)
@@ -1,3 +1,11 @@
# ============================================================================
# CENTRALIZED MODEL CONFIGURATION
# Change these values to use different models throughout the application
# ============================================================================
x-model-config:
# Embedding Model Configuration
embedding-model: &embedding-model "nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1"

services:

# Main ingestor server which is responsible for ingestion
@@ -38,10 +46,14 @@ services:
NGC_API_KEY: ${NGC_API_KEY:?"NGC_API_KEY is required"}

##===Embedding Model specific configurations===
# Model name - pulls from centralized config at top of file (can be overridden by env var)
APP_EMBEDDINGS_MODELNAME: *embedding-model
# url on which embedding model is hosted. If "", Nvidia hosted API is used
APP_EMBEDDINGS_SERVERURL: ${APP_EMBEDDINGS_SERVERURL-"nemoretriever-embedding-ms:8000"}
APP_EMBEDDINGS_MODELNAME: ${APP_EMBEDDINGS_MODELNAME:-nvidia/nv-embedqa-mistral-7b-v2}
APP_EMBEDDINGS_DIMENSIONS: ${APP_EMBEDDINGS_DIMENSIONS:-2048}
APP_EMBEDDINGS_SERVERURL: ${APP_EMBEDDINGS_SERVERURL:-"nemoretriever-embedding-ms:8000"}
# Embedding dimensions - IMPORTANT: Must match your embedding model!
# nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1: 4096
# nvidia/nv-embedqa-mistral-7b-v2: 2048
APP_EMBEDDINGS_DIMENSIONS: ${APP_EMBEDDINGS_DIMENSIONS:-4096}

##===NV-Ingest Connection Configurations=======
APP_NVINGEST_MESSAGECLIENTHOSTNAME: ${APP_NVINGEST_MESSAGECLIENTHOSTNAME:-"nv-ingest-ms-runtime"}
@@ -115,9 +127,10 @@ services:
- AUDIO_INFER_PROTOCOL=grpc
- CUDA_VISIBLE_DEVICES=0
- MAX_INGEST_PROCESS_WORKERS=${MAX_INGEST_PROCESS_WORKERS:-16}
- EMBEDDING_NIM_MODEL_NAME=${EMBEDDING_NIM_MODEL_NAME:-${APP_EMBEDDINGS_MODELNAME:-nvidia/nv-embedqa-7b-v2}}
# Embedding model - uses APP_EMBEDDINGS_MODELNAME which pulls from centralized config
- EMBEDDING_NIM_MODEL_NAME=${APP_EMBEDDINGS_MODELNAME:-nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1}
# In case of a self-hosted embedding model, use the endpoint url as - https://integrate.api.nvidia.com/v1
- EMBEDDING_NIM_ENDPOINT=${EMBEDDING_NIM_ENDPOINT:-${APP_EMBEDDINGS_SERVERURL-http://nemoretriever-embedding-ms:8000/v1}}
- EMBEDDING_NIM_ENDPOINT=${EMBEDDING_NIM_ENDPOINT:-http://nemoretriever-embedding-ms:8000/v1}
- INGEST_LOG_LEVEL=DEFAULT
- INGEST_EDGE_BUFFER_SIZE=64
# Message client for development
@@ -1,3 +1,14 @@
# ============================================================================
# CENTRALIZED MODEL CONFIGURATION
# Change these values to use different models throughout the application
# ============================================================================
x-model-config:
# Chat/LLM Model Configuration
llm-model: &llm-model "nvidia/llama-3.3-nemotron-super-49b-v1"

# Embedding Model Configuration
embedding-model: &embedding-model "nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1"

services:

# Main orchestrator server which stitches together all calls to different services to fulfill the user request
@@ -35,25 +46,16 @@ services:
VECTOR_DB_TOPK: ${VECTOR_DB_TOPK:-100}

##===LLM Model specific configurations===
APP_LLM_MODELNAME: ${APP_LLM_MODELNAME:-"meta/llama-3.1-8b-instruct"}
# Model name - pulls from centralized config at top of file (can be overridden by env var)
APP_LLM_MODELNAME: *llm-model
# url on which llm model is hosted. If "", Nvidia hosted API is used
APP_LLM_SERVERURL: ${APP_LLM_SERVERURL-""}

##===Query Rewriter Model specific configurations===
APP_QUERYREWRITER_MODELNAME: ${APP_QUERYREWRITER_MODELNAME:-"meta/llama-3.1-8b-instruct"}
# url on which query rewriter model is hosted. If "", Nvidia hosted API is used
APP_QUERYREWRITER_SERVERURL: ${APP_QUERYREWRITER_SERVERURL-"nim-llm-llama-8b-ms:8000"}
APP_LLM_SERVERURL: ${APP_LLM_SERVERURL:-""}

##===Embedding Model specific configurations===
# Model name - pulls from centralized config at top of file (can be overridden by env var)
APP_EMBEDDINGS_MODELNAME: *embedding-model
# url on which embedding model is hosted. If "", Nvidia hosted API is used
APP_EMBEDDINGS_SERVERURL: ${APP_EMBEDDINGS_SERVERURL-""}
APP_EMBEDDINGS_MODELNAME: ${APP_EMBEDDINGS_MODELNAME:-nvidia/nv-embedqa-mistral-7b-v2}

##===Reranking Model specific configurations===
# url on which ranking model is hosted. If "", Nvidia hosted API is used
APP_RANKING_SERVERURL: ${APP_RANKING_SERVERURL-""}
APP_RANKING_MODELNAME: ${APP_RANKING_MODELNAME:-nv-rerank-qa-mistral-4b:1}
ENABLE_RERANKER: ${ENABLE_RERANKER:-True}
APP_EMBEDDINGS_SERVERURL: ${APP_EMBEDDINGS_SERVERURL:-""}

NVIDIA_API_KEY: ${NGC_API_KEY:?"NGC_API_KEY is required"}

@@ -65,7 +67,7 @@ services:

# enable multi-turn conversation in the rag chain - this controls conversation history usage
# while doing query rewriting and in LLM prompt
ENABLE_MULTITURN: ${ENABLE_MULTITURN:-False}
ENABLE_MULTITURN: ${ENABLE_MULTITURN:-True}

# enable query rewriting for multiturn conversation in the rag chain.
# This will improve accuracy of the retriever pipeline but increase latency due to an additional LLM call
@@ -139,10 +141,10 @@ services:
context: ../../frontend
dockerfile: ./Dockerfile
args:
# Model name for LLM
NEXT_PUBLIC_MODEL_NAME: ${APP_LLM_MODELNAME:-meta/llama-3.1-8b-instruct}
# Model name for embeddings
NEXT_PUBLIC_EMBEDDING_MODEL: ${APP_EMBEDDINGS_MODELNAME:-nvidia/nv-embedqa-mistral-7b-v2}
# Model name for LLM - pulls from centralized config at top of file
NEXT_PUBLIC_MODEL_NAME: *llm-model
# Model name for embeddings - pulls from centralized config at top of file
NEXT_PUBLIC_EMBEDDING_MODEL: *embedding-model
# Model name for reranking
NEXT_PUBLIC_RERANKER_MODEL: ${APP_RANKING_MODELNAME:-nv-rerank-qa-mistral-4b:1}
# URL for rag server container
82 changes: 82 additions & 0 deletions community/ai-vws-sizing-advisor/deploy/compose/model_config.env
@@ -0,0 +1,82 @@
# ============================================================================
# CENTRALIZED MODEL CONFIGURATION
# ============================================================================
# This file centralizes all model configurations for the RAG system.
# Source this file or set these environment variables to change models.
#
# Usage:
# source model_config.env
# docker compose -f docker-compose-rag-server.yaml up
#
# ============================================================================

# ----------------------------------------------------------------------------
# CHAT/LLM MODEL CONFIGURATION
# ----------------------------------------------------------------------------
# The main language model used for generating responses
# Default: nvidia/llama-3.3-nemotron-super-49b-v1
#
# Other options:
# - meta/llama-3.1-405b-instruct
# - meta/llama-3.1-70b-instruct
# - meta/llama-3.1-8b-instruct
# - mistralai/mixtral-8x22b-instruct-v0.1
#
export APP_LLM_MODELNAME="nvidia/llama-3.3-nemotron-super-49b-v1"

# LLM Server URL (leave empty "" to use NVIDIA hosted API)
export APP_LLM_SERVERURL=""

# ----------------------------------------------------------------------------
# EMBEDDING MODEL CONFIGURATION
# ----------------------------------------------------------------------------
# The embedding model used for vectorizing documents and queries
# Default: nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1
#
# Other options:
# - nvidia/nv-embedqa-mistral-7b-v2
# - nvidia/nv-embed-v2
# - nvidia/llama-3.2-nv-embedqa-1b-v2
#
export APP_EMBEDDINGS_MODELNAME="nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1"

# Embedding Server URL (leave empty "" to use NVIDIA hosted API, or set to self-hosted)
# Example for self-hosted: "nemoretriever-embedding-ms:8000"
export APP_EMBEDDINGS_SERVERURL=""

# Embedding dimensions (adjust based on your embedding model)
# IMPORTANT: This MUST match your chosen embedding model!
# - nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1: 4096 (current default)
# - nvidia/nv-embedqa-mistral-7b-v2: 2048
# - nvidia/nv-embed-v2: 4096
export APP_EMBEDDINGS_DIMENSIONS="4096"

# ----------------------------------------------------------------------------
# REFLECTION MODEL CONFIGURATION (for response quality checking)
# ----------------------------------------------------------------------------
# Model used for reflection/self-checking if ENABLE_REFLECTION=true
export REFLECTION_LLM="mistralai/mixtral-8x22b-instruct-v0.1"
export REFLECTION_LLM_SERVERURL="nim-llm-mixtral-8x22b:8000"

# ----------------------------------------------------------------------------
# CAPTION MODEL CONFIGURATION (for image/chart understanding)
# ----------------------------------------------------------------------------
# Model used for generating captions for images, charts, and tables
export APP_NVINGEST_CAPTIONMODELNAME="meta/llama-3.2-11b-vision-instruct"
export APP_NVINGEST_CAPTIONENDPOINTURL="http://vlm-ms:8000/v1/chat/completions"
export VLM_CAPTION_MODEL_NAME="meta/llama-3.2-11b-vision-instruct"
export VLM_CAPTION_ENDPOINT="http://vlm-ms:8000/v1/chat/completions"

# ----------------------------------------------------------------------------
# ADDITIONAL NOTES
# ----------------------------------------------------------------------------
# 1. After changing models, you may need to rebuild containers:
# docker compose -f docker-compose-rag-server.yaml build --no-cache rag-playground
#
# 2. For self-hosted models, make sure the corresponding NIM services are running
#
# 3. The embedding dimensions must match your chosen embedding model
#
# 4. When switching between hosted and self-hosted, update both the model name
# and the server URL accordingly
