> [!NOTE]
> OpenArc is under active development. Expect breaking changes.
OpenArc is an inference engine that makes it easier to use Intel devices as accelerators.
Powered by Optimum-Intel, it leverages hardware acceleration on Intel CPUs, GPUs, and NPUs through the OpenVINO runtime, and it integrates closely with Hugging Face Transformers, which keeps the inference work our codebase performs easy to understand.
Under the hood, OpenArc implements a FastAPI layer over a growing collection of classes from Optimum-Intel which cover a wide range of tasks and model architectures.
OpenArc currently supports text generation and text generation with vision over OpenAI API endpoints.
Support for speculative decoding, generating embeddings, speech tasks, image generation, PaddleOCR, and others is planned.
- OpenAI compatible endpoints
- Validated OpenWebUI support, but it should work elsewhere
- Load multiple vision/text models concurrently on multiple devices for hotswap/multi agent workflows
- Most HuggingFace text generation models
- Growing set of vision-capable LLMs:
- Qwen2-VL
- Qwen2.5-VL
- Gemma 3
- Built with click and rich-click
- The OpenArc server is thoroughly documented in the CLI help. Much cleaner!
- Coupled with the official documentation, this makes learning OpenVINO easier.
- ttft: time to generate the first token
- generation_time: time to generate the whole response
- number of tokens: total generated tokens for that request (includes thinking tokens)
- tokens per second: measures throughput
- average token latency: helpful for optimizing zero- or few-shot tasks (see the sketch below for how these quantities relate)
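A tiny illustrative sketch of how these quantities relate. The numbers below are made up, and treating average token latency as generation_time divided by the token count is an assumption for illustration, not taken from OpenArc's code:

```python
# Illustrative only: how the reported metrics relate to one another.
# Values are made up; average token latency is assumed to be
# generation_time / number_of_tokens, which may differ from OpenArc's exact definition.
ttft = 0.35              # seconds until the first token arrives
generation_time = 3.10   # seconds to generate the whole response
number_of_tokens = 142   # total generated tokens (includes thinking tokens)

tokens_per_second = number_of_tokens / generation_time      # throughput
average_token_latency = generation_time / number_of_tokens  # seconds per token

print(f"{tokens_per_second:.2f} t/s, {average_token_latency * 1000:.1f} ms per token")
```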
OpenArc now has a command line application for interfacing with the server!
Gradio has been put to pasture and replaced with a brand-new UX flow meant to make using and learning OpenVINO easier. GitHub, Reddit, and forums everywhere are full of people who learned OpenVINO the hard way.
To get started, run:
python openarc_cli.py --help
Which gives:
> [!NOTE]
> Whenever you get stuck, simply add --help to see documentation.
To launch the server:
python openarc_cli.py serve start
For a more granular networking setup:
python openarc_cli.py serve start --start --openarc-port <your-port>
The host/port configuration is saved to the 'openarc-cli-config.yaml' file.
The CLI always sends commands to the server wherever you start it from, laying the groundwork for easier containerization in the future.
To load a model, open another terminal:
python openarc_cli.py load --help
This menu gives a breakdown of how the many different optimization parameters work and, broadly, how they can be used together.
Here are some example commands with Qwen3 and Qwen2.5-VL on GPU:
To load a Qwen3 model:
python openarc_cli.py load --model path/to/model --model-type TEXT --device GPU.0
To load a Qwen2.5-VL model:
python openarc_cli.py load --model path/to/model --model-type VISION --device GPU.0
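Once a VISION model is loaded, you can send it images through the chat completions endpoint. A minimal sketch, assuming OpenArc accepts the standard OpenAI vision message format and is serving on localhost:8000; the model id, port, and image path are placeholders, not OpenArc defaults:

```python
# Sketch: query a vision model loaded in OpenArc. Assumes the server is on
# localhost:8000 and follows the OpenAI vision message format; the model id
# and image path below are placeholders.
import base64
import os
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key=os.environ.get("OPENARC_API_KEY", "not-set"),
)

with open("example.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="qwen2.5-vl",  # placeholder: use the id reported by /v1/models
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```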
The CLI application will surface C++ errors from the OpenVINO runtime as you tinker; in practice this is a bit like print-debugging your LLM optimizations directly from the engine, and it often leads you into the source code to understand things from the inside.
This helps you get through the sometimes vague documentation, especially for edge cases.
Keep reading to see more about what models can be used with OpenArc and learn about model conversion.
- OpenArc is built on top of the OpenVINO runtime; as a result, it supports the same range of hardware but requires device-specific drivers, which this document will not cover in depth.
- See [OpenVINO System Requirements](https://docs.openvino.ai/2025/about-openvino/release-notes-openvino/system-requirements.html#cpu) for the most up-to-date information.
- If you need help installing drivers:
After setting up the environment, run
python openarc_cli.py tool device-detect
as a sanity test.
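If you want to cross-check what the CLI reports, the same information is available straight from the OpenVINO runtime:

```python
# Cross-check device detection directly against the OpenVINO runtime.
import openvino as ov

core = ov.Core()
print("Available devices:", core.available_devices)
for device in core.available_devices:
    # FULL_DEVICE_NAME is a standard OpenVINO device property with a readable name.
    print(device, "->", core.get_property(device, "FULL_DEVICE_NAME"))
```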
Ubuntu
Create the conda environment:
conda env create -f environment.yaml
Set your API key as an environment variable:
export OPENARC_API_KEY=<you-know-for-search>
Build Optimum-Intel from source to get the latest support:
pip install "optimum-intel[openvino] @ git+https://github.com/huggingface/optimum-intel"
Windows
- Install Miniconda from here
- Navigate to the directory containing the environment.yaml file and create the conda environment:
conda env create -f environment.yaml
Set your API key as an environment variable:
setx OPENARC_API_KEY openarc-api-key
Build Optimum-Intel from source to get the latest support:
pip install "optimum-intel[openvino] @ git+https://github.com/huggingface/optimum-intel"
> [!TIP]
> - Avoid setting up the environment from IDE extensions.
> - Try not to use the environment for other ML projects. Soon we will have uv.
> [!NOTE]
> I'm only going to cover the basics of OpenWebUI here. To learn more and set it up, check out the OpenWebUI docs.
- From the Connections menu, add a new connection.
- Enter the server address and port where OpenArc is running, followed by /v1. Example: http://0.0.0.0:8000/v1
- Here you need to set the API key manually.
- When you hit the refresh button, OpenWebUI sends a GET request to /v1/models on the OpenArc server to fetch the list of models.
Serverside logs should report:
"GET /v1/models HTTP/1.1" 200 OK
OpenArc mostly conforms to the OpenAI API specification. In practice this means other frontends, Python classes, and community tooling will be compatible.
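For example, the official openai Python package works against OpenArc by pointing the client at the server's /v1 base URL. A minimal sketch; the base URL, key handling, and model id are placeholders, not OpenArc defaults:

```python
# Sketch: a plain text chat completion against OpenArc using the openai client.
# The base URL and model id are placeholders; use the id reported by /v1/models.
import os
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key=os.environ.get("OPENARC_API_KEY", "not-set"),
)

response = client.chat.completions.create(
    model="qwen3",  # placeholder
    messages=[{"role": "user", "content": "Summarize what OpenVINO is in one sentence."}],
)
print(response.choices[0].message.content)
```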
Tested:
[mikupad](https://github.com/lmg-anon/mikupad)
- Load the model you want to use from openarc_cli
- Select the connection you just created and use the refresh button to update the list of models
- If you use API keys and have a list of models, these might be towards the bottom.
Convert to OpenVINO IR
There are a few sources of models which can be used with OpenArc:
- My repo contains preconverted models for a variety of architectures and use cases:
  - OpenArc supports almost all of them.
  - Includes NSFW, ERP, and "exotic" community finetunes that Intel doesn't host; take advantage!
  - These get updated regularly, so check back often!
  - If you read this here, mention it on Discord and I can quant a model you want to try.
- Use the Optimum-CLI Conversion documentation to learn how you can convert models to OpenVINO IR (a minimal Python sketch follows this list).
- Easily craft those conversion commands using my HF Space, Optimum-CLI-Tool_tool, a Gradio application which helps you GUI-ify an often research-intensive process.
- If you use the CLI tool and get an error about an unsupported architecture or "missing export config", follow the link, open an issue referencing the model card, and the maintainers will get back to you.
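Besides the Optimum-CLI route above, Optimum-Intel also exposes the same export path from Python, which can be handy for scripting conversions. A minimal sketch; the model id and output directory are placeholders, and weight-compression options are left at their defaults:

```python
# Sketch: convert a Hugging Face checkpoint to OpenVINO IR with Optimum-Intel.
# The model id and output directory below are placeholders.
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "Qwen/Qwen3-8B"          # placeholder Hugging Face model id
output_dir = "qwen3-8b-openvino"    # where the converted IR will be written

# export=True converts the original PyTorch checkpoint to OpenVINO IR on load.
model = OVModelForCausalLM.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
```

The resulting directory should then be loadable with openarc_cli.py load --model, the same as a preconverted download.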
Here are some models to get started:
Model | Compressed Weights |
---|---|
Ministral-3b-instruct-int4_asym-ov | 1.85 GB |
Hermes-3-Llama-3.2-3B-awq-ov | 1.8 GB |
Llama-3.1-Tulu-3-8B-int4_asym-ov | 4.68 GB |
DeepSeek-R1-0528-Qwen3-8B-OpenVINO | |
Meta-Llama-3.1-8B-SurviveV3-int4_asym-awq-se-wqe-ov | 4.68 GB |
Rocinante-12B-v1.1-int4_sym-awq-se-ov | 6.92 GB |
Echo9Zulu/phi-4-int4_asym-awq-ov | 8.11 GB |
DeepSeek-R1-Distill-Qwen-14B-int4-awq-ov | 7.68 GB |
Homunculus-OpenVINO | |
Mistral-Small-24B-Instruct-2501-int4_asym-ov | 12.9 GB |
gemma-3-4b-it-int8_asym-ov | 3.89 GB |
> [!NOTE]
> A naming convention for OpenVINO-converted models is coming soon.
Notes on the test:
- No OpenVINO optimization parameters were used
- Fixed input length
- I sent one user message
- Quant strategies for models are not considered
- I converted each of these models myself (I'm working on standardizing model cards to share this information more directly)
- OpenVINO generates a cache on first inference, so metrics are from the second generation
- Seconds were used for readability
Test System:
CPU: Xeon W-2255 (10c, 20t) @ 3.7 GHz
GPU: 3x Arc A770 16GB ASRock Phantom
RAM: 128 GB DDR4 ECC 2933 MHz
Disk: 4 TB IronWolf, 1 TB 970 Evo
OS: Ubuntu 24.04
Kernel: 6.9.4-060904-generic
Model | Prompt Processing (sec) | Throughput (t/sec) | Duration (sec) | Size (GB) |
---|---|---|---|---|
Phi-4-mini-instruct-int4_asym-gptq-ov | 0.41 | 47.25 | 3.10 | 2.3 |
Hermes-3-Llama-3.2-3B-int4_sym-awq-se-ov | 0.27 | 64.18 | 0.98 | 1.8 |
Llama-3.1-Nemotron-Nano-8B-v1-int4_sym-awq-se-ov | 0.32 | 47.99 | 2.96 | 4.7 |
phi-4-int4_asym-awq-se-ov | 0.30 | 25.27 | 5.32 | 8.1 |
DeepSeek-R1-Distill-Qwen-14B-int4_sym-awq-se-ov | 0.42 | 25.23 | 1.56 | 8.4 |
Mistral-Small-24B-Instruct-2501-int4_asym-ov | 0.36 | 18.81 | 7.11 | 12.9 |
Model | Prompt Processing (sec) | Throughput (t/sec) | Duration (sec) | Size (GB) |
---|---|---|---|---|
Phi-4-mini-instruct-int4_asym-gptq-ov | 1.02 | 20.44 | 7.23 | 2.3 |
Hermes-3-Llama-3.2-3B-int4_sym-awq-se-ov | 1.06 | 23.66 | 3.01 | 1.8 |
Llama-3.1-Nemotron-Nano-8B-v1-int4_sym-awq-se-ov | 2.53 | 13.22 | 12.14 | 4.7 |
phi-4-int4_asym-awq-se-ov | 4 | 6.63 | 23.14 | 8.1 |
DeepSeek-R1-Distill-Qwen-14B-int4_sym-awq-se-ov | 5.02 | 7.25 | 11.09 | 8.4 |
Mistral-Small-24B-Instruct-2501-int4_asym-ov | 6.88 | 4.11 | 37.5 | 12.9 |
Nous-Hermes-2-Mixtral-8x7B-DPO-int4-sym-se-ov | 15.56 | 6.67 | 34.60 | 24.2 |
These dictate which model types, architectures, and tasks are well supported by OpenArc.
If you are interested in implementing support for another task, join our Discord and let me know; we can discuss.
Learn more about how to leverage your Intel devices for Machine Learning:
OpenArc stands on the shoulders of several other projects:
Thank you for your work!!