THIS PROJECT IS UNDER ACTIVE DEVELOPMENT. It is not ready for production use, but it serves as a good reference for how to run Whisper on Qualcomm and Intel NPUs.
A portable, packaged OpenAI-compatible server for Windows desktop applications. LLM chat, transcription, embeddings, and a vector DB, all out of the box.
Note that this requires running the .exe as a separate process alongside your application, like a local inference server, and sending it HTTP requests to run inference.
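For example, a desktop app might launch the server as a child process and poll the health endpoint before sending requests. A minimal sketch, where the exe path, host, and port are assumptions to adjust for your setup:

```python
# Minimal sketch: start fluid-server.exe as a separate process and wait for it
# to become healthy before making requests. The exe path, host, and port are
# assumptions - point them at your copy of the binary and your chosen port.
import subprocess
import time
import urllib.request

server = subprocess.Popen(
    ["fluid-server.exe", "--host", "127.0.0.1", "--port", "8080"]
)

# Poll the health endpoint until the server is ready (give up after ~60 s)
for _ in range(60):
    try:
        with urllib.request.urlopen("http://127.0.0.1:8080/health", timeout=1) as resp:
            if resp.status == 200:
                break
    except OSError:
        time.sleep(1)
else:
    server.terminate()
    raise RuntimeError("fluid-server did not become healthy in time")

# ... issue OpenAI-compatible requests against http://127.0.0.1:8080/v1 ...

server.terminate()  # shut the server down when your application exits
```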
Core Capabilities
- LLM Chat Completions - OpenAI-compatible API with streaming, backed by llama.cpp and OpenVINO
- Audio Transcription - Whisper models with NPU acceleration, backed by OpenVINO and Qualcomm QNN
- Text Embeddings - Vector embeddings for search and RAG (see the sketch after this list)
- Vector Database - LanceDB integration for multimodal storage
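Assuming embeddings are exposed through the standard OpenAI /v1/embeddings route, an embedding request might look like the sketch below; the model name is a placeholder, so check /v1/models for what your install actually serves.

```python
# Sketch: text embeddings via the OpenAI-compatible API. Assumes the server
# exposes the standard /v1/embeddings route; the model name is a placeholder,
# check http://localhost:8080/v1/models for the models your install serves.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

response = client.embeddings.create(
    model="your-embedding-model",  # hypothetical name
    input=["a sentence to embed", "another sentence"],
)
vectors = [item.embedding for item in response.data]
print(f"{len(vectors)} embeddings of dimension {len(vectors[0])}")
```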
Hardware Acceleration
- Intel NPU via OpenVINO backend
- Qualcomm NPU via QNN (Snapdragon X Elite)
- Vulkan GPU via llama-cpp
Option A: Download Release
- Download `fluid-server.exe` from releases
Option B: Run from Source
```bash
# Install dependencies and run
uv sync
uv run
```

Then run the packaged executable (see the Compilation Guide for building it):

```powershell
# Run with default settings
.\dist\fluid-server.exe

# Or with custom options
.\dist\fluid-server.exe --host 127.0.0.1 --port 8080
```
- Health Check: http://localhost:8080/health
- API Docs: http://localhost:8080/docs
- Models: http://localhost:8080/v1/models
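To see which model identifiers are available for the requests below, you can query the models endpoint, for example with the OpenAI Python client:

```python
# Sketch: discover which model identifiers the server exposes
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")
for model in client.models.list():
    print(model.id)
```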
```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-8b-int8-ov", "messages": [{"role": "user", "content": "Hello!"}]}'
```
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

# Chat with streaming
for chunk in client.chat.completions.create(
    model="qwen3-8b-int8-ov",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
):
    print(chunk.choices[0].delta.content or "", end="")
```
```bash
curl -X POST http://localhost:8080/v1/audio/transcriptions \
  -F "[email protected]" \
  -F "model=whisper-large-v3-turbo-qnn"
```
📖 Comprehensive Guides
- NPU Support Guide - Intel & Qualcomm NPU configuration
- Integration Guide - Python, .NET, Node.js examples
- Development Guide - Setup, building, and contributing
- LanceDB Integration - Vector database and embeddings
- GGUF Model Support - Using any GGUF model
- Compilation Guide - Build system details
Why Python? It has the best ML ecosystem support and packages into a standalone executable with PyInstaller.
Why not just llama.cpp? We support multiple runtimes and AI accelerators beyond GGML.
Built using ty, FastAPI, Pydantic, ONNX Runtime, OpenAI Whisper, and various other AI libraries.
Runtime Technologies:
- OpenVINO - Intel NPU and GPU acceleration
- Qualcomm QNN - Snapdragon NPU optimization with HTP backend
- ONNX Runtime - Cross-platform AI inference