
Fluid Server: Local AI server for your Windows apps


THIS PROJECT IS UNDER ACTIVE DEVELOPMENT. It is not ready for production use, but it serves as a good reference for how to run Whisper on Qualcomm and Intel NPUs.

A portable, packaged, OpenAI-compatible server for Windows desktop applications. LLM chat, transcription, embeddings, and a vector DB, all out of the box.

Note that this requires running the .exe as a separate background process, like a local inference server alongside your application, and making HTTP requests to it for inference.
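A minimal sketch of what that looks like from Python, assuming the executable sits at .\dist\fluid-server.exe and that the standard OpenAI-compatible /v1/models route can double as a readiness check (both are assumptions, not documented guarantees):

import subprocess
import time
import urllib.request

# Launch fluid-server as a separate process (path and flags are illustrative).
server = subprocess.Popen(
    [r".\dist\fluid-server.exe", "--host", "127.0.0.1", "--port", "8080"]
)

# Poll until the server responds; /v1/models is a reasonable readiness probe
# for an OpenAI-compatible server, but confirm the route for your build.
for _ in range(60):
    try:
        with urllib.request.urlopen("http://127.0.0.1:8080/v1/models", timeout=1):
            break
    except OSError:
        time.sleep(1)

# ... make inference requests while the server runs ...

server.terminate()  # shut the server down when your app exits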

Features

Core Capabilities

  • LLM Chat Completions - OpenAI-compatible API with streaming, backed by llama.cpp and OpenVINO
  • Audio Transcription - Whisper models with NPU acceleration, backed by OpenVINO and Qualcomm QNN
  • Text Embeddings - Vector embeddings for search and RAG
  • Vector Database - LanceDB integration for multimodal storage

Hardware Acceleration

  • Intel NPU via OpenVINO backend
  • Qualcomm NPU via QNN (Snapdragon X Elite)
  • Vulkan GPU via llama.cpp

Quick Start

1. Download or Build

Option A: Download Release

  • Download fluid-server.exe from releases

Option B: Run from Source

# Install dependencies and run from source
uv sync
uv run fluid-server  # entry-point name assumed; check pyproject.toml if it differs

2. Run the Server

# Run with default settings
.\dist\fluid-server.exe

# Or with custom options
.\dist\fluid-server.exe --host 127.0.0.1 --port 8080

3. Test the API
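A quick way to verify the server is up is to list the models it exposes. This is a sketch that assumes the standard OpenAI-compatible /v1/models route and the openai Python package; adjust the base URL if you changed --host or --port.

from openai import OpenAI

# The API key is unused by the local server but required by the SDK.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

# List available models (assumes the standard /v1/models route).
for model in client.models.list():
    print(model.id)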

Usage Examples

Basic Chat Completion

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-8b-int8-ov", "messages": [{"role": "user", "content": "Hello!"}]}'

Python Integration

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

# Chat with streaming
for chunk in client.chat.completions.create(
    model="qwen3-8b-int8-ov",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True
):
    print(chunk.choices[0].delta.content or "", end="")
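The same client can also request embeddings, assuming the server exposes the standard OpenAI-compatible /v1/embeddings route. The model name below is a placeholder; substitute one of the embedding models the server actually lists.

# Text embeddings (model name is a placeholder; pick one from /v1/models)
response = client.embeddings.create(
    model="your-embedding-model",
    input=["Fluid Server runs locally on Windows NPUs."],
)
print(len(response.data[0].embedding))  # dimensionality of the returned vector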

Audio Transcription

curl -X POST http://localhost:8080/v1/audio/transcriptions \
  -F "[email protected]" \
  -F "model=whisper-large-v3-turbo-qnn"

Documentation

📖 Comprehensive Guides

FAQ

Why Python? Best ML ecosystem support and PyInstaller packaging.

Why not llama.cpp? We support multiple runtimes and AI accelerators beyond GGML.

Acknowledgements

Built using ty, FastAPI, Pydantic, ONNX Runtime, OpenAI Whisper, and various other AI libraries.

Runtime Technologies:

  • OpenVINO - Intel NPU and GPU acceleration
  • Qualcomm QNN - Snapdragon NPU optimization with HTP backend
  • ONNX Runtime - Cross-platform AI inference
