Rex-Omni is a 3B-parameter Multimodal Large Language Model (MLLM) that redefines object detection and a wide range of other visual perception tasks as a simple next-token prediction problem.
- [2025-10-31] We release the AWQ-quantized version of Rex-Omni, which cuts storage by 50%: Rex-Omni-AWQ
- [2025-10-29] Fine-tuning code is now available.
- [2025-10-17] Evaluation code and dataset are now available.
- [2025-10-15] Rex-Omni is released.
- Add Evaluation Code
- Add Fine-tuning Code
- Add Quantized Rex-Omni
conda create -n rexomni python=3.10 -y
conda activate rexomni
pip install torch==2.6.0 torchvision==0.21.0 --index-url https://download.pytorch.org/whl/cu124
git clone https://github.com/IDEA-Research/Rex-Omni.git
cd Rex-Omni
pip install -v -e .

Test the installation:

CUDA_VISIBLE_DEVICES=1 python tutorials/detection_example/detection_example.py

If the installation is successful, you will find a visualization of the detection results at tutorials/detection_example/test_images/cafe_visualize.jpg.
Below is a minimal example showing how to run object detection using the rex_omni package.
from PIL import Image
from rex_omni import RexOmniWrapper, RexOmniVisualize
# 1) Initialize the wrapper (model loads internally)
rex = RexOmniWrapper(
model_path="IDEA-Research/Rex-Omni", # HF repo or local path
backend="transformers", # or "vllm" for high-throughput inference
# Inference/generation controls (applied across backends)
max_tokens=2048,
temperature=0.0,
top_p=0.05,
top_k=1,
repetition_penalty=1.05,
)
# If you are using the AWQ-quantized version of Rex-Omni, initialize the wrapper like this:
rex = RexOmniWrapper(
model_path="IDEA-Research/Rex-Omni-AWQ",
backend="vllm",
quantization="awq",
max_tokens=2048,
temperature=0.0,
top_p=0.05,
top_k=1,
repetition_penalty=1.05,
)
# 2) Prepare input
image = Image.open("tutorials/detection_example/test_images/cafe.jpg").convert("RGB")
categories = [
"man", "woman", "yellow flower", "sofa", "robot-shope light",
"blanket", "microwave", "laptop", "cup", "white chair", "lamp",
]
# 3) Run detection
results = rex.inference(images=image, task="detection", categories=categories)
result = results[0]
# 4) Visualize
vis = RexOmniVisualize(
image=image,
predictions=result["extracted_predictions"],
font_size=20,
draw_width=5,
show_labels=True,
)
vis.save("tutorials/detection_example/test_images/cafe_visualize.jpg")- model_path: Hugging Face repo ID or a local checkpoint directory for the Rexe-Omni model.
- backend: "transformers" or "vllm".
  - transformers: easy to use, good baseline latency.
  - vllm: high-throughput, low-latency inference. Requires the vllm package and a compatible environment.
- max_tokens: Maximum number of tokens to generate for each output.
- temperature: Sampling temperature; higher values increase randomness (0.0 = deterministic/greedy).
- top_p: Nucleus sampling parameter; model samples from the smallest set of tokens with cumulative probability ≥ top_p.
- top_k: Top-k sampling; restricts sampling to the k most likely tokens.
- repetition_penalty: Penalizes repeated tokens; >1.0 discourages repetition.
- Optional advanced settings (supported via kwargs when constructing the wrapper; see the sketch after this list):
  - Transformers: torch_dtype, attn_implementation, device_map, trust_remote_code, etc.
  - vLLM: tokenizer_mode, limit_mm_per_prompt, max_model_len, gpu_memory_utilization, tensor_parallel_size, trust_remote_code, etc.
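For instance, a minimal sketch of forwarding Transformers-backend settings through these kwargs; the kwarg names come from the list above, while the specific values (bfloat16, SDPA attention, automatic device placement) are illustrative assumptions rather than required settings:

```python
import torch
from rex_omni import RexOmniWrapper

# Sketch: forwarding Transformers-backend kwargs through the wrapper.
# The kwarg names follow the list above; the chosen values are assumptions.
rex = RexOmniWrapper(
    model_path="IDEA-Research/Rex-Omni",
    backend="transformers",
    max_tokens=2048,
    temperature=0.0,
    top_p=0.05,
    top_k=1,
    repetition_penalty=1.05,
    # Advanced Transformers settings forwarded as kwargs:
    torch_dtype=torch.bfloat16,   # assumed dtype; use float16 if bf16 is unsupported
    attn_implementation="sdpa",   # assumed; "flash_attention_2" is another common choice
    device_map="auto",            # let Hugging Face place the model automatically
    trust_remote_code=True,
)
```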
- images: A single PIL.Image.Image or a list of images for batch inference.
- task: One of "detection", "pointing", "visual_prompting", "keypoint", "ocr_box", "ocr_polygon", "gui_grounding", "gui_pointing" (see the sketch after this list).
- categories: List of category names/phrases to detect or extract, e.g., ["person", "cup"]. Used to build task prompts.
- keypoint_type: Type of keypoints for the keypoint detection task. Options: "person", "hand", "animal".
- visual_prompt_boxes: Reference bounding boxes for the visual prompting task. Format: [[x0, y0, x1, y1], ...] in absolute coordinates.
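A sketch of how these arguments combine, reusing the rex wrapper from the quick start. The task and argument names follow the list above; the second image path, the placeholder category, and the box coordinates are made-up values for illustration, and whether categories is required for every task is also an assumption here:

```python
from PIL import Image

# Batch inference: pass a list of images; one result dict is returned per image.
images = [
    Image.open("tutorials/detection_example/test_images/cafe.jpg").convert("RGB"),
    Image.open("path/to/another_image.jpg").convert("RGB"),  # placeholder path
]
det_results = rex.inference(images=images, task="detection", categories=["person", "cup"])

# Keypoint detection: select the skeleton via keypoint_type.
kpt_results = rex.inference(
    images=images[0],
    task="keypoint",
    categories=["person"],
    keypoint_type="person",
)

# Visual prompting: supply reference boxes in absolute [x0, y0, x1, y1] coordinates.
vp_results = rex.inference(
    images=images[0],
    task="visual_prompting",
    categories=["object"],                       # assumed placeholder category
    visual_prompt_boxes=[[100, 150, 300, 400]],  # illustrative box, not from the source
)
```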
Returns a list of dictionaries (one per input image). Each dictionary includes:
- raw_output: The raw text generated by the LLM.
- extracted_predictions: Structured predictions parsed from the raw output, grouped by category (see the parsing helper below).
  - For detection: {category: [{"type": "box", "coords": [x0, y0, x1, y1]}, ...], ...}
  - For pointing: {category: [{"type": "point", "coords": [x0, y0]}, ...], ...}
  - For polygon: {category: [{"type": "polygon", "coords": [x0, y0, ...]}, ...], ...}
  - For keypointing: structured JSON.
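A minimal helper for consuming this structure, written against the box and point formats listed above (the print formatting is arbitrary):

```python
def summarize_predictions(result: dict) -> None:
    """Print every box/point prediction in an inference result, grouped by category."""
    for category, predictions in result["extracted_predictions"].items():
        for pred in predictions:
            if pred["type"] == "box":
                x0, y0, x1, y1 = pred["coords"]
                print(f"{category}: box ({x0}, {y0}, {x1}, {y1})")
            elif pred["type"] == "point":
                x, y = pred["coords"]
                print(f"{category}: point ({x}, {y})")

# For example, with the quick-start output:
summarize_predictions(results[0])
```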
Tips:
- For best performance with vLLM, set backend="vllm" and tune gpu_memory_utilization and tensor_parallel_size according to your GPUs (sketched below).
We provide comprehensive tutorials for each supported task. Each tutorial includes both standalone Python scripts and interactive Jupyter notebooks.
| Task | Applications | Python Example | Notebook |
|---|---|---|---|
| Detection | object detection | code | notebook |
| | object referring | code | notebook |
| | gui grounding | code | notebook |
| | layout grounding | code | notebook |
| Pointing | object pointing | code | notebook |
| | gui pointing | code | notebook |
| | affordance pointing | code | notebook |
| Visual prompting | visual prompting | code | notebook |
| OCR | ocr word box | code | notebook |
| | ocr textline box | code | notebook |
| | ocr polygon | code | notebook |
| Keypointing | person keypointing | code | notebook |
| | animal keypointing | code | notebook |
| Other | batch inference | code | |
Rex-Omni's unified detection framework enables seamless integration with other vision models.
| Application | Description | Documentation |
|---|---|---|
| Rex-Omni + SAM | Combine language-driven detection with pixel-perfect segmentation: Rex-Omni detects objects, then SAM generates precise masks (sketched below). | README |
| Grounding Data Engine | Automatically generate phrase grounding annotations from image captions using spaCy and Rex-Omni. | README |
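As an illustration of the Rex-Omni + SAM pipeline, here is a minimal sketch assuming the segment-anything package, a locally downloaded ViT-H checkpoint (sam_vit_h_4b8939.pth), and the wrapper's default generation settings. This glue code is an assumption, not the repo's own integration; see the linked README for the official version:

```python
import numpy as np
from PIL import Image
from segment_anything import SamPredictor, sam_model_registry
from rex_omni import RexOmniWrapper

# 1) Detect objects with Rex-Omni (language-driven detection).
rex = RexOmniWrapper(model_path="IDEA-Research/Rex-Omni", backend="transformers")
image = Image.open("tutorials/detection_example/test_images/cafe.jpg").convert("RGB")
result = rex.inference(images=image, task="detection", categories=["person", "cup"])[0]

# 2) Feed the predicted boxes to SAM for pixel-level masks.
#    The checkpoint path and model variant below are assumptions.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(np.array(image))  # SAM expects an RGB uint8 array

masks = {}
for category, preds in result["extracted_predictions"].items():
    for pred in preds:
        if pred["type"] != "box":
            continue
        box = np.array(pred["coords"], dtype=np.float32)  # [x0, y0, x1, y1]
        mask, score, _ = predictor.predict(box=box, multimask_output=False)
        masks.setdefault(category, []).append(mask[0])  # (H, W) boolean mask
```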
We provide an interactive Gradio demo that allows you to test all Rex-Omni capabilities through a web interface.
# Launch the demo
CUDA_VISIBLE_DEVICES=0 python demo/gradio_demo.py --model_path IDEA-Research/Rex-Omni
# With custom settings
CUDA_VISIBLE_DEVICES=0 python demo/gradio_demo.py \
--model_path IDEA-Research/Rex-Omni \
--backend vllm \
--server_name 0.0.0.0 \
  --server_port 7890

Command-line arguments:

- --model_path: Model path or Hugging Face repo ID (default: "IDEA-Research/Rex-Omni")
- --backend: Backend to use, "transformers" or "vllm" (default: "transformers")
- --server_name: Server host address (default: "192.168.81.138")
- --server_port: Server port (default: 5211)
- --temperature: Sampling temperature (default: 0.0)
- --top_p: Nucleus sampling parameter (default: 0.05)
- --max_tokens: Maximum tokens to generate (default: 2048)
Please refer to Evaluation for more details.
Please refer to Fine-tuning Rex-Omni for more details.
Rex-Omni is licensed under the IDEA License 1.0, Copyright (c) IDEA. All Rights Reserved. This model is based on Qwen, which is licensed under the Qwen RESEARCH LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved.
Rex-Omni builds on a series of prior works, which are worth a look if you are interested.
@misc{jiang2025detectpointprediction,
title={Detect Anything via Next Point Prediction},
author={Qing Jiang and Junan Huo and Xingyu Chen and Yuda Xiong and Zhaoyang Zeng and Yihao Chen and Tianhe Ren and Junzhi Yu and Lei Zhang},
year={2025},
eprint={2510.12798},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2510.12798},
}

















