OpenArc

Discord | Hugging Face

Note

OpenArc is under active development. Expect breaking changes.

OpenArc is an inference engine that makes it easier to use Intel devices as accelerators.

Powered by Optimum-Intel, it leverages hardware acceleration on Intel CPUs, GPUs and NPUs through the OpenVINO runtime, and integrates closely with Hugging Face Transformers, making the inference work our codebase performs easy to understand.

Under the hood OpenArc implements a FastAPI layer over a growing collection of classes from Optimum-Intel which cover a wide range of tasks and model architectures.

OpenArc currently supports text generation and text generation with vision over OpenAI API endpoints.
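
Once a model is loaded, any regular OpenAI client can talk to it. Below is a minimal sketch using the openai Python package; the port, the model id and the API key handling are assumptions, so match them to your serve start options, the ids reported by /v1/models, and the OPENARC_API_KEY you exported.

# Minimal sketch of a chat completion against OpenArc's OpenAI-compatible API.
# The base_url port and the model id are placeholders; adjust to your setup.
import os
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",              # OpenArc server + /v1
    api_key=os.environ.get("OPENARC_API_KEY", ""),
)

response = client.chat.completions.create(
    model="Qwen3-8B-int4_asym-ov",                    # hypothetical model id
    messages=[{"role": "user", "content": "Explain OpenVINO in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)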

Support for speculative decoding, generating embeddings, speech tasks, image generation, PaddleOCR, and others is planned.

Features

  • OpenAI compatible endpoints
  • Validated OpenWebUI support, but it should work elsewhere
  • Load multiple vision/text models concurrently on multiple devices for hotswap/multi agent workflows
  • Most HuggingFace text generation models
  • Growing set of vision capable LLMs:
    • Qwen2-VL
    • Qwen2.5-VL
    • Gemma 3

NEW Command Line Application!

  • Built with click and rich-click
  • OpenArc's server has been thoroughly documented there. Much cleaner!
  • Coupled with the official documentation, this makes learning OpenVINO easier.

Performance metrics on every completion

  • ttft: time to generate the first token
  • generation_time: time to generate the whole response
  • number of tokens: total tokens generated for that request (includes thinking tokens)
  • tokens per second: throughput for the request
  • average token latency: average time per generated token, helpful for optimizing zero- or few-shot tasks
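
The throughput and latency figures follow directly from the other values. Here is a rough sketch of the arithmetic; the variable names are illustrative, not the exact keys OpenArc returns.

# How the per-completion metrics relate to each other (illustrative names only).
generation_time = 3.10   # seconds to generate the whole response
num_tokens = 128         # total generated tokens, including thinking tokens
ttft = 0.41              # seconds until the first token arrives

tokens_per_second = num_tokens / generation_time          # throughput
average_token_latency = generation_time / num_tokens      # seconds per token
print(f"ttft {ttft:.2f} s, {tokens_per_second:.2f} t/s, "
      f"{average_token_latency * 1000:.1f} ms/token")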

Command Line Application

OpenArc now has a command line application for interfacing with the server!

Gradio has been put to pasture and replaced with a brand new UX flow meant to make using and learning OpenVINO easier. GitHub, Reddit and forums everywhere are full of people who learned OpenVINO the hard way.

To get started run

python openarc_cli.py --help

Which gives:

CLI Help Screen

Note

Whenever you get stuck, simply add --help to see the documentation.

Launch Server

To launch the server:

python openarc_cli.py serve start

For a more granular networking setup:

python openarc_cli.py serve start --openarc-port <your-port>

CLI serve Screen

We save the host/port configuration to the 'openarc-cli-config.yaml' file.

The CLI always sends commands to the server wherever you start it from, laying the groundwork for easier containerization in the future.
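
As a rough sketch, the saved file just records where the CLI should reach the server; the exact keys below are an assumption, not the actual schema.

# Hypothetical contents of openarc-cli-config.yaml (key names are assumptions).
host: 0.0.0.0
port: 8000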

Load a Model

To load a model, open another terminal:

python openarc_cli.py load --help

This menu gives a breakdown of how the many different optimization parameters work and, broadly, how they can be used together.

CLI Help Screen

Here are some example commands with Qwen3 and Qwen2.5-VL on GPU.

To load a Qwen3 model:

python openarc_cli.py load --model path/to/model --model-type TEXT --device GPU.0

To load a Qwen-2.5-VL model:

python openarc_cli.py load --model path/to/model --model-type VISION --device GPU.0
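
Once a vision model is loaded, images go through the same OpenAI-style chat endpoint. Below is a minimal sketch that assumes the standard OpenAI image_url message format; the model id and port are placeholders, so use whatever /v1/models reports on your server.

# Sketch of a vision chat completion with a base64-encoded local image.
import base64
import os
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1",
                api_key=os.environ.get("OPENARC_API_KEY", ""))

with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Qwen2.5-VL-7B-int4_asym-ov",   # hypothetical model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
    max_tokens=128,
)
print(response.choices[0].message.content)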

The CLI application will surface C++ errors from the OpenVINO runtime as you tinker; this is a bit like print debugging your LLM optimizations straight from the engine, and it often leads you into the source code to understand things from the inside.

In practice this helps you get through the sometimes vague documentation, especially for edge cases.

Keep reading to see more about what models can be used with OpenArc and learn about model conversion.

System Requirements

After setting up the environment, run

python openarc_cli.py tool device-detect

as a sanity test.

device-detect
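
Under the hood, device detection boils down to asking the OpenVINO runtime what it can see. The sketch below does the same check directly with the openvino Python package, independently of OpenArc's CLI.

# List the devices the OpenVINO runtime can see, with their full names.
import openvino as ov

core = ov.Core()
for device in core.available_devices:
    print(device, "->", core.get_property(device, "FULL_DEVICE_NAME"))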

Environment Setup

Ubuntu

Create the conda environment:

conda env create -f environment.yaml

Set your API key as an environment variable:

export OPENARC_API_KEY=<you-know-for-search>

Build Optimum-Intel from source to get the latest support:

pip install "optimum-intel[openvino] @ git+https://github.com/huggingface/optimum-intel"
Windows

  1. Install Miniconda from here

  2. Navigate to the directory containing the environment.yaml file and create the conda environment:

conda env create -f environment.yaml

Set your API key as an environment variable:

setx OPENARC_API_KEY openarc-api-key

Build Optimum-Intel from source to get the latest support:

pip install "optimum-intel[openvino] @ git+https://github.com/huggingface/optimum-intel"

Tip

  • Avoid setting up the environment from IDE extensions.
  • Try not to use the environment for other ML projects; uv support is coming soon.

OpenWebUI

Note

I'm only going to cover the basics of OpenWebUI here. To learn more and set it up, check out the OpenWebUI docs.

  • From the Connections menu, add a new connection

  • Enter the server address and port where OpenArc is running, followed by /v1. Example: http://0.0.0.0:8000/v1

  • Here you need to set the API key manually

  • When you hit the refresh button, OpenWebUI sends a GET request to the OpenArc server to list models at /v1/models

Serverside logs should report:

"GET /v1/models HTTP/1.1" 200 OK

Other Frontends

OpenArc mostly conforms to the OpenAI API specification. In practice this means other frontends, Python classes and community tooling will be compatible.

Tested:

mikupad: https://github.com/lmg-anon/mikupad

Usage:

  • Load the model you want to use with openarc_cli
  • Select the connection you just created and use the refresh button to update the list of models
  • If you use API keys and have a long list of models, these might be towards the bottom

Convert to OpenVINO IR

There are a few sources of models which can be used with OpenArc:

  • OpenVINO LLM Collection on HuggingFace

  • My HuggingFace repo

    • My repo contains preconverted models for a variety of architectures and usecases
    • OpenArc supports almost all of them
    • Includes NSFW, ERP and "exotic" community finetunes that Intel doesn't host, so take advantage!
    • These get updated regularly so check back often!
    • If you read this here, mention it on Discord and I can quant a model you want to try.
  • Use the Optimum-CLI Conversion documentation to learn how you can convert models to OpenVINO IR; a minimal conversion sketch follows this list.

  • Easily craft conversion commands using my HF Space, Optimum-CLI-Tool_tool, a Gradio application which helps you GUI-ify an often research-intensive process.

  • If you use the CLI tool and get an error about an unsupported architecture or "missing export config", follow the link, open an issue referencing the model card, and the maintainers will get back to you.
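
For those who prefer to stay in Python, here is a minimal conversion sketch using Optimum-Intel's Python API rather than optimum-cli. The model id, quantization settings and output directory are placeholders; see the Optimum-Intel documentation for the full range of options.

# Export a Transformers checkpoint to OpenVINO IR with int4 weight compression.
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
from transformers import AutoTokenizer

model_id = "Qwen/Qwen2.5-3B-Instruct"        # placeholder checkpoint
quant_config = OVWeightQuantizationConfig(bits=4)

model = OVModelForCausalLM.from_pretrained(
    model_id, export=True, quantization_config=quant_config
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

model.save_pretrained("Qwen2.5-3B-Instruct-int4-ov")
tokenizer.save_pretrained("Qwen2.5-3B-Instruct-int4-ov")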

Here are some models to get started:

| Model | Compressed Weights |
|---|---|
| Ministral-3b-instruct-int4_asym-ov | 1.85 GB |
| Hermes-3-Llama-3.2-3B-awq-ov | 1.8 GB |
| Llama-3.1-Tulu-3-8B-int4_asym-ov | 4.68 GB |
| DeepSeek-R1-0528-Qwen3-8B-OpenVINO | |
| Meta-Llama-3.1-8B-SurviveV3-int4_asym-awq-se-wqe-ov | 4.68 GB |
| Rocinante-12B-v1.1-int4_sym-awq-se-ov | 6.92 GB |
| Echo9Zulu/phi-4-int4_asym-awq-ov | 8.11 GB |
| DeepSeek-R1-Distill-Qwen-14B-int4-awq-ov | 7.68 GB |
| Homunculus-OpenVINO | |
| Mistral-Small-24B-Instruct-2501-int4_asym-ov | 12.9 GB |
| gemma-3-4b-it-int8_asym-ov | 3.89 GB |

Note

A naming convention for OpenVINO-converted models is coming soon.

Performance with OpenVINO runtime

Notes on the test:

  • No OpenVINO optimization parameters were used
  • Fixed input length
  • I sent one user message
  • Quant strategies for models are not considered
  • I converted each of these models myself (I'm working on standardizing model cards to share this information more directly)
  • OpenVINO generates a cache on first inference, so metrics are from the second generation
  • Seconds were used for readability

Test System:

CPU: Xeon W-2255 (10c, 20t) @ 3.7 GHz

GPU: 3x Arc A770 16 GB ASRock Phantom

RAM: 128 GB DDR4 ECC 2933 MHz

Disk: 4 TB IronWolf, 1 TB 970 Evo

OS: Ubuntu 24.04

Kernel: 6.9.4-060904-generic

Prompt: "We don't even have a chat template so strap in and let it ride!" max_new_tokens= 128

GPU Performance: 1x Arc A770

| Model | Prompt Processing (sec) | Throughput (t/sec) | Duration (sec) | Size (GB) |
|---|---|---|---|---|
| Phi-4-mini-instruct-int4_asym-gptq-ov | 0.41 | 47.25 | 3.10 | 2.3 |
| Hermes-3-Llama-3.2-3B-int4_sym-awq-se-ov | 0.27 | 64.18 | 0.98 | 1.8 |
| Llama-3.1-Nemotron-Nano-8B-v1-int4_sym-awq-se-ov | 0.32 | 47.99 | 2.96 | 4.7 |
| phi-4-int4_asym-awq-se-ov | 0.30 | 25.27 | 5.32 | 8.1 |
| DeepSeek-R1-Distill-Qwen-14B-int4_sym-awq-se-ov | 0.42 | 25.23 | 1.56 | 8.4 |
| Mistral-Small-24B-Instruct-2501-int4_asym-ov | 0.36 | 18.81 | 7.11 | 12.9 |

CPU Performance: Xeon W-2255

| Model | Prompt Processing (sec) | Throughput (t/sec) | Duration (sec) | Size (GB) |
|---|---|---|---|---|
| Phi-4-mini-instruct-int4_asym-gptq-ov | 1.02 | 20.44 | 7.23 | 2.3 |
| Hermes-3-Llama-3.2-3B-int4_sym-awq-se-ov | 1.06 | 23.66 | 3.01 | 1.8 |
| Llama-3.1-Nemotron-Nano-8B-v1-int4_sym-awq-se-ov | 2.53 | 13.22 | 12.14 | 4.7 |
| phi-4-int4_asym-awq-se-ov | 4 | 6.63 | 23.14 | 8.1 |
| DeepSeek-R1-Distill-Qwen-14B-int4_sym-awq-se-ov | 5.02 | 7.25 | 11.09 | 8.4 |
| Mistral-Small-24B-Instruct-2501-int4_asym-ov | 6.88 | 4.11 | 37.5 | 12.9 |
| Nous-Hermes-2-Mixtral-8x7B-DPO-int4-sym-se-ov | 15.56 | 6.67 | 34.60 | 24.2 |

Currently implemented Optimum-Intel classes:

These dictate what types of models, architectures and tasks are well supported by OpenArc.

OVModelForCausalLM

OVModelForVisualCausalLM
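
For reference, these classes can also be used directly, outside the OpenArc server. Below is a sketch with OVModelForCausalLM; the model path and device are placeholders.

# Load a converted OpenVINO model and generate text with it directly.
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model = OVModelForCausalLM.from_pretrained("path/to/model-ov")
model.to("GPU.0")   # compile for the first GPU; omit this line to stay on CPU
tokenizer = AutoTokenizer.from_pretrained("path/to/model-ov")

inputs = tokenizer("What does OpenVINO do?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))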

If you are interested in implementing support for another task, join our Discord and let me know; we can discuss.

Resources


Learn more about how to leverage your Intel devices for Machine Learning:

openvino_notebooks

Inference with Optimum-Intel

Optimum-Intel Transformers

NPU Devices

Acknowledgments

OpenArc stands on the shoulders of several other projects:

Optimum-Intel

OpenVINO

OpenVINO GenAI

Transformers

FastAPI

click

rich-click

Thank you for your work!!