OpenArc

Discord | Hugging Face

Note

OpenArc is under active development. Expect breaking changes.

OpenArc is an inference engine that makes it easier to use Intel devices as accelerators.

Powered by Optimum-Intel, it leverages hardware acceleration on Intel CPUs, GPUs and NPUs through the OpenVINO runtime, and integrates closely with Hugging Face Transformers, making the inference work our codebase performs easy to understand.

Under the hood OpenArc implements a FastAPI layer over a growing collection of classes from Optimum-Intel which cover a wide range of tasks and model architectures.

OpenArc currently supports text generation and text generation with vision over OpenAI API endpoints.
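
Once a model is loaded, any regular OpenAI client can talk to it. Below is a minimal sketch using the openai Python package; the port, the model id and the API key handling are assumptions, so match them to your serve start options, the ids reported by /v1/models, and the OPENARC_API_KEY you exported.

# Minimal sketch of a chat completion against OpenArc's OpenAI-compatible API.
# The base_url port and the model id are placeholders; adjust to your setup.
import os
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",              # OpenArc server + /v1
    api_key=os.environ.get("OPENARC_API_KEY", ""),
)

response = client.chat.completions.create(
    model="Qwen3-8B-int4_asym-ov",                    # hypothetical model id
    messages=[{"role": "user", "content": "Explain OpenVINO in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)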

Support for speculative decoding, generating embeddings, speech tasks, image generation, PaddleOCR, and others is planned.

Features

  • OpenAI compatible endpoints
  • Validated OpenWebUI support, but it should work elsewhere
  • Load multiple vision/text models concurrently on multiple devices for hotswap/multi agent workflows
  • Most HuggingFace text generation models
  • Growing set of vision capable LLMs:
    • Qwen2-VL
    • Qwen2.5-VL
    • Gemma 3

NEW Command Line Application!

  • Built with click and rich-click
  • OpenArc's server has been thoroughly documented there. Much cleaner!
  • Coupled with the official documentation, this makes learning OpenVINO easier.

Performance metrics on every completion

  • ttft: time to generate the first token
  • generation_time: time to generate the whole response
  • number of tokens: total tokens generated for that request (includes thinking tokens)
  • tokens per second: throughput for the request
  • average token latency: average time per generated token, helpful for optimizing zero- or few-shot tasks
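
The throughput and latency figures follow directly from the other values. Here is a rough sketch of the arithmetic; the variable names are illustrative, not the exact keys OpenArc returns.

# How the per-completion metrics relate to each other (illustrative names only).
generation_time = 3.10   # seconds to generate the whole response
num_tokens = 128         # total generated tokens, including thinking tokens
ttft = 0.41              # seconds until the first token arrives

tokens_per_second = num_tokens / generation_time          # throughput
average_token_latency = generation_time / num_tokens      # seconds per token
print(f"ttft {ttft:.2f} s, {tokens_per_second:.2f} t/s, "
      f"{average_token_latency * 1000:.1f} ms/token")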

Command Line Application

OpenArc now has a command line application for interfacing with the server!

Gradio has been put to pasture and replaced with a brand new UX flow meant to make using and learning OpenVINO easier. GitHub, Reddit and forums everywhere are full of people who learned OpenVINO the hard way.

To get started run

python openarc_cli.py --help

Which gives:

CLI Help Screen

Note

Whenever you get stuck, simply add --help to see the documentation.

Launch Server

To launch the server:

python openarc_cli.py serve start

For a more granular networking setup:

python openarc_cli.py serve start --openarc-port <your-port>

CLI serve Screen

We save the host/port configuration to the 'openarc-cli-config.yaml' file.

The CLI always sends commands to the server wherever you start it from, laying the groundwork for easier containerization in the future.
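
As a rough sketch, the saved file just records where the CLI should reach the server; the exact keys below are an assumption, not the actual schema.

# Hypothetical contents of openarc-cli-config.yaml (key names are assumptions).
host: 0.0.0.0
port: 8000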

Load a Model

To load a model, open another terminal:

python openarc_cli.py load --help

This menu gives a breakdown of how the many different optimization parameters work and, broadly, how they can be used together.

CLI Help Screen

Here are some example commands with Qwen3 and Qwen2.5-VL on GPU.

To load a Qwen3 model:

python openarc_cli.py load --model path/to/model --model-type TEXT --device GPU.0

To load a Qwen-2.5-VL model:

python openarc_cli.py load --model path/to/model --model-type VISION --device GPU.0
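
Once a vision model is loaded, images go through the same OpenAI-style chat endpoint. Below is a minimal sketch that assumes the standard OpenAI image_url message format; the model id and port are placeholders, so use whatever /v1/models reports on your server.

# Sketch of a vision chat completion with a base64-encoded local image.
import base64
import os
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1",
                api_key=os.environ.get("OPENARC_API_KEY", ""))

with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Qwen2.5-VL-7B-int4_asym-ov",   # hypothetical model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
    max_tokens=128,
)
print(response.choices[0].message.content)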

The CLI application will surface C++ errors from the OpenVINO runtime as you tinker; this is a bit like print debugging your LLM optimizations straight from the engine, and it often leads you into the source code to understand things from the inside.

In practice this helps you get through the sometimes vague documentation, especially for edge cases.

Keep reading to see more about what models can be used with OpenArc and learn about model conversion.

System Requirements

After setting up the environment, run

python openarc_cli.py tool device-detect

as a sanity test.

device-detect
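
Under the hood, device detection boils down to asking the OpenVINO runtime what it can see. The sketch below does the same check directly with the openvino Python package, independently of OpenArc's CLI.

# List the devices the OpenVINO runtime can see, with their full names.
import openvino as ov

core = ov.Core()
for device in core.available_devices:
    print(device, "->", core.get_property(device, "FULL_DEVICE_NAME"))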

Environment Setup

Ubuntu

Create the conda environment:

conda env create -f environment.yaml

Set your API key as an environment variable:

export OPENARC_API_KEY=<you-know-for-search>

Build Optimum-Intel from source to get the latest support:

pip install "optimum-intel[openvino] @ git+https://github.com/huggingface/optimum-intel"
Windows

  1. Install Miniconda from here

  2. Navigate to the directory containing the environment.yaml file and create the conda environment:

conda env create -f environment.yaml

Set your API key as an environment variable:

setx OPENARC_API_KEY openarc-api-key

Build Optimum-Intel from source to get the latest support:

pip install "optimum-intel[openvino] @ git+https://github.com/huggingface/optimum-intel"

Tip

  • Avoid setting up the environment from IDE extensions.
  • Try not to use the environment for other ML projects; uv support is coming soon.

OpenWebUI

Note

I'm only going to cover the basics of OpenWebUI here. To learn more and set it up, check out the OpenWebUI docs.

  • From the Connections menu, add a new connection

  • Enter the server address and port where OpenArc is running, followed by /v1. Example: http://0.0.0.0:8000/v1

  • Here you need to set the API key manually

  • When you hit the refresh button, OpenWebUI sends a GET request to the OpenArc server to list models at /v1/models

Serverside logs should report:

"GET /v1/models HTTP/1.1" 200 OK

Other Frontends

OpenArc mostly conforms to the OpenAI API specification. In practice this means other frontends, Python classes and community tooling will be compatible.

Tested:

mikupad: https://github.com/lmg-anon/mikupad

Usage:

  • Load the model you want to use with openarc_cli
  • Select the connection you just created and use the refresh button to update the list of models
  • If you use API keys and have a long list of models, these might be towards the bottom

Convert to OpenVINO IR

There are a few sources of models which can be used with OpenArc:

  • OpenVINO LLM Collection on HuggingFace

  • My HuggingFace repo

    • My repo contains preconverted models for a variety of architectures and usecases
    • OpenArc supports almost all of them
    • Includes NSFW, ERP and "exotic" community finetunes that Intel doesn't host, so take advantage!
    • These get updated regularly so check back often!
    • If you read this here, mention it on Discord and I can quant a model you want to try.
  • Use the Optimum-CLI Conversion documentation to learn how you can convert models to OpenVINO IR; a minimal conversion sketch follows this list.

  • Easily craft conversion commands using my HF Space, Optimum-CLI-Tool_tool, a Gradio application which helps you GUI-ify an often research-intensive process.

  • If you use the CLI tool and get an error about an unsupported architecture or "missing export config", follow the link, open an issue referencing the model card, and the maintainers will get back to you.
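
For those who prefer to stay in Python, here is a minimal conversion sketch using Optimum-Intel's Python API rather than optimum-cli. The model id, quantization settings and output directory are placeholders; see the Optimum-Intel documentation for the full range of options.

# Export a Transformers checkpoint to OpenVINO IR with int4 weight compression.
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
from transformers import AutoTokenizer

model_id = "Qwen/Qwen2.5-3B-Instruct"        # placeholder checkpoint
quant_config = OVWeightQuantizationConfig(bits=4)

model = OVModelForCausalLM.from_pretrained(
    model_id, export=True, quantization_config=quant_config
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

model.save_pretrained("Qwen2.5-3B-Instruct-int4-ov")
tokenizer.save_pretrained("Qwen2.5-3B-Instruct-int4-ov")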

Here are some models to get started:

| Model | Compressed Weights |
|---|---|
| Ministral-3b-instruct-int4_asym-ov | 1.85 GB |
| Hermes-3-Llama-3.2-3B-awq-ov | 1.8 GB |
| Llama-3.1-Tulu-3-8B-int4_asym-ov | 4.68 GB |
| DeepSeek-R1-0528-Qwen3-8B-OpenVINO | |
| Meta-Llama-3.1-8B-SurviveV3-int4_asym-awq-se-wqe-ov | 4.68 GB |
| Rocinante-12B-v1.1-int4_sym-awq-se-ov | 6.92 GB |
| Echo9Zulu/phi-4-int4_asym-awq-ov | 8.11 GB |
| DeepSeek-R1-Distill-Qwen-14B-int4-awq-ov | 7.68 GB |
| Homunculus-OpenVINO | |
| Mistral-Small-24B-Instruct-2501-int4_asym-ov | 12.9 GB |
| gemma-3-4b-it-int8_asym-ov | 3.89 GB |

Note

A naming convention for OpenVINO-converted models is coming soon.

Performance with OpenVINO runtime

Notes on the test:

  • No OpenVINO optimization parameters were used
  • Fixed input length
  • I sent one user message
  • Quant strategies for models are not considered
  • I converted each of these models myself (I'm working on standardizing model cards to share this information more directly)
  • OpenVINO generates a cache on first inference, so metrics are from the second generation
  • Seconds were used for readability

Test System:

CPU: Xeon W-2255 (10c, 20t) @ 3.7 GHz

GPU: 3x Arc A770 16 GB ASRock Phantom

RAM: 128 GB DDR4 ECC 2933 MHz

Disk: 4 TB IronWolf, 1 TB 970 Evo

OS: Ubuntu 24.04

Kernel: 6.9.4-060904-generic

Prompt: "We don't even have a chat template so strap in and let it ride!" max_new_tokens= 128

GPU Performance: 1x Arc A770

| Model | Prompt Processing (sec) | Throughput (t/sec) | Duration (sec) | Size (GB) |
|---|---|---|---|---|
| Phi-4-mini-instruct-int4_asym-gptq-ov | 0.41 | 47.25 | 3.10 | 2.3 |
| Hermes-3-Llama-3.2-3B-int4_sym-awq-se-ov | 0.27 | 64.18 | 0.98 | 1.8 |
| Llama-3.1-Nemotron-Nano-8B-v1-int4_sym-awq-se-ov | 0.32 | 47.99 | 2.96 | 4.7 |
| phi-4-int4_asym-awq-se-ov | 0.30 | 25.27 | 5.32 | 8.1 |
| DeepSeek-R1-Distill-Qwen-14B-int4_sym-awq-se-ov | 0.42 | 25.23 | 1.56 | 8.4 |
| Mistral-Small-24B-Instruct-2501-int4_asym-ov | 0.36 | 18.81 | 7.11 | 12.9 |

CPU Performance: Xeon W-2255

| Model | Prompt Processing (sec) | Throughput (t/sec) | Duration (sec) | Size (GB) |
|---|---|---|---|---|
| Phi-4-mini-instruct-int4_asym-gptq-ov | 1.02 | 20.44 | 7.23 | 2.3 |
| Hermes-3-Llama-3.2-3B-int4_sym-awq-se-ov | 1.06 | 23.66 | 3.01 | 1.8 |
| Llama-3.1-Nemotron-Nano-8B-v1-int4_sym-awq-se-ov | 2.53 | 13.22 | 12.14 | 4.7 |
| phi-4-int4_asym-awq-se-ov | 4 | 6.63 | 23.14 | 8.1 |
| DeepSeek-R1-Distill-Qwen-14B-int4_sym-awq-se-ov | 5.02 | 7.25 | 11.09 | 8.4 |
| Mistral-Small-24B-Instruct-2501-int4_asym-ov | 6.88 | 4.11 | 37.5 | 12.9 |
| Nous-Hermes-2-Mixtral-8x7B-DPO-int4-sym-se-ov | 15.56 | 6.67 | 34.60 | 24.2 |

Currently implemented Optimum-Intel classes:

These dictate what types of models, architectures and tasks are well supported by OpenArc.

OVModelForCausalLM

OVModelForVisualCausalLM
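
For reference, these classes can also be used directly, outside the OpenArc server. Below is a sketch with OVModelForCausalLM; the model path and device are placeholders.

# Load a converted OpenVINO model and generate text with it directly.
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model = OVModelForCausalLM.from_pretrained("path/to/model-ov")
model.to("GPU.0")   # compile for the first GPU; omit this line to stay on CPU
tokenizer = AutoTokenizer.from_pretrained("path/to/model-ov")

inputs = tokenizer("What does OpenVINO do?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))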

If you are interested in implementing support for another task, join our Discord and let me know; we can discuss.

Resources


Learn more about how to leverage your Intel devices for Machine Learning:

openvino_notebooks

Inference with Optimum-Intel

Optimum-Intel Transformers

NPU Devices

Acknowledgments

OpenArc stands on the shoulders of several other projects:

Optimum-Intel

OpenVINO

OpenVINO GenAI

Transformers

FastAPI

click

rich-click

Thank you for your work!!