Rex-Omni is a 3B-parameter Multimodal Large Language Model (MLLM) that redefines object detection and a wide range of other visual perception tasks as a simple next-token prediction problem.
- [2025-10-31] We release the AWQ-quantized version of Rex-Omni, which cuts storage by 50%: Rex-Omni-AWQ
- [2025-10-29] Fine-tuning code is now available.
- [2025-10-17] Evaluation code and dataset are now available.
- [2025-10-15] Rex-Omni is released.
- Add Evaluation Code
- Add Fine-tuning Code
- Add Quantized Rex-Omni
conda create -n rexomni python=3.10 -y
conda activate rexomni
pip install torch==2.6.0 torchvision==0.21.0 --index-url https://download.pytorch.org/whl/cu124
git clone https://github.com/IDEA-Research/Rex-Omni.git
cd Rex-Omni
pip install -v -e .

Test the installation:

CUDA_VISIBLE_DEVICES=1 python tutorials/detection_example/detection_example.py

If the installation is successful, you will find a visualization of the detection results at tutorials/detection_example/test_images/cafe_visualize.jpg.
Below is a minimal example showing how to run object detection using the rex_omni package.
from PIL import Image
from rex_omni import RexOmniWrapper, RexOmniVisualize
# 1) Initialize the wrapper (model loads internally)
rex = RexOmniWrapper(
model_path="IDEA-Research/Rex-Omni", # HF repo or local path
backend="transformers", # or "vllm" for high-throughput inference
# Inference/generation controls (applied across backends)
max_tokens=2048,
temperature=0.0,
top_p=0.05,
top_k=1,
repetition_penalty=1.05,
)
# If you are using the AWQ-quantized version of Rex-Omni, initialize the wrapper like this:
rex = RexOmniWrapper(
model_path="IDEA-Research/Rex-Omni-AWQ",
backend="vllm",
quantization="awq",
max_tokens=2048,
temperature=0.0,
top_p=0.05,
top_k=1,
repetition_penalty=1.05,
)
# 2) Prepare input
image = Image.open("tutorials/detection_example/test_images/cafe.jpg").convert("RGB")
categories = [
"man", "woman", "yellow flower", "sofa", "robot-shope light",
"blanket", "microwave", "laptop", "cup", "white chair", "lamp",
]
# 3) Run detection
results = rex.inference(images=image, task="detection", categories=categories)
result = results[0]
# 4) Visualize
vis = RexOmniVisualize(
image=image,
predictions=result["extracted_predictions"],
font_size=20,
draw_width=5,
show_labels=True,
)
vis.save("tutorials/detection_example/test_images/cafe_visualize.jpg")- model_path: Hugging Face repo ID or a local checkpoint directory for the Rexe-Omni model.
- backend: "transformers" or "vllm".
  - transformers: easy to use, good baseline latency.
  - vllm: high-throughput, low-latency inference. Requires the vllm package and a compatible environment.
- max_tokens: Maximum number of tokens to generate for each output.
- temperature: Sampling temperature; higher values increase randomness (0.0 = deterministic/greedy).
- top_p: Nucleus sampling parameter; model samples from the smallest set of tokens with cumulative probability ≥ top_p.
- top_k: Top-k sampling; restricts sampling to the k most likely tokens.
- repetition_penalty: Penalizes repeated tokens; >1.0 discourages repetition.
- Optional advanced settings (supported via kwargs when constructing the wrapper; see the sketch after this list):
  - Transformers: torch_dtype, attn_implementation, device_map, trust_remote_code, etc.
  - vLLM: tokenizer_mode, limit_mm_per_prompt, max_model_len, gpu_memory_utilization, tensor_parallel_size, trust_remote_code, etc.
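For instance, a minimal sketch of forwarding Transformers-backend settings through these kwargs; the kwarg names come from the list above, while the specific values (bfloat16, SDPA attention, automatic device placement) are illustrative assumptions rather than required settings:

```python
import torch
from rex_omni import RexOmniWrapper

# Sketch: forwarding Transformers-backend kwargs through the wrapper.
# The kwarg names follow the list above; the chosen values are assumptions.
rex = RexOmniWrapper(
    model_path="IDEA-Research/Rex-Omni",
    backend="transformers",
    max_tokens=2048,
    temperature=0.0,
    top_p=0.05,
    top_k=1,
    repetition_penalty=1.05,
    # Advanced Transformers settings forwarded as kwargs:
    torch_dtype=torch.bfloat16,   # assumed dtype; use float16 if bf16 is unsupported
    attn_implementation="sdpa",   # assumed; "flash_attention_2" is another common choice
    device_map="auto",            # let Hugging Face place the model automatically
    trust_remote_code=True,
)
```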
- images: A single PIL.Image.Image or a list of images for batch inference.
- task: One of "detection", "pointing", "visual_prompting", "keypoint", "ocr_box", "ocr_polygon", "gui_grounding", "gui_pointing" (see the sketch after this list).
- categories: List of category names/phrases to detect or extract, e.g., ["person", "cup"]. Used to build task prompts.
- keypoint_type: Type of keypoints for the keypoint detection task. Options: "person", "hand", "animal".
- visual_prompt_boxes: Reference bounding boxes for the visual prompting task. Format: [[x0, y0, x1, y1], ...] in absolute coordinates.
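A sketch of how these arguments combine, reusing the rex wrapper from the quick start. The task and argument names follow the list above; the second image path, the placeholder category, and the box coordinates are made-up values for illustration, and whether categories is required for every task is also an assumption here:

```python
from PIL import Image

# Batch inference: pass a list of images; one result dict is returned per image.
images = [
    Image.open("tutorials/detection_example/test_images/cafe.jpg").convert("RGB"),
    Image.open("path/to/another_image.jpg").convert("RGB"),  # placeholder path
]
det_results = rex.inference(images=images, task="detection", categories=["person", "cup"])

# Keypoint detection: select the skeleton via keypoint_type.
kpt_results = rex.inference(
    images=images[0],
    task="keypoint",
    categories=["person"],
    keypoint_type="person",
)

# Visual prompting: supply reference boxes in absolute [x0, y0, x1, y1] coordinates.
vp_results = rex.inference(
    images=images[0],
    task="visual_prompting",
    categories=["object"],                       # assumed placeholder category
    visual_prompt_boxes=[[100, 150, 300, 400]],  # illustrative box, not from the source
)
```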
Returns a list of dictionaries (one per input image). Each dictionary includes:
- raw_output: The raw text generated by the LLM.
- extracted_predictions: Structured predictions parsed from the raw output, grouped by category (see the parsing helper below).
  - For detection: {category: [{"type": "box", "coords": [x0, y0, x1, y1]}, ...], ...}
  - For pointing: {category: [{"type": "point", "coords": [x0, y0]}, ...], ...}
  - For polygon: {category: [{"type": "polygon", "coords": [x0, y0, ...]}, ...], ...}
  - For keypointing: structured JSON.
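A minimal helper for consuming this structure, written against the box and point formats listed above (the print formatting is arbitrary):

```python
def summarize_predictions(result: dict) -> None:
    """Print every box/point prediction in an inference result, grouped by category."""
    for category, predictions in result["extracted_predictions"].items():
        for pred in predictions:
            if pred["type"] == "box":
                x0, y0, x1, y1 = pred["coords"]
                print(f"{category}: box ({x0}, {y0}, {x1}, {y1})")
            elif pred["type"] == "point":
                x, y = pred["coords"]
                print(f"{category}: point ({x}, {y})")

# For example, with the quick-start output:
summarize_predictions(results[0])
```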
Tips:
- For best performance with vLLM, set backend="vllm" and tune gpu_memory_utilization and tensor_parallel_size according to your GPUs (sketched below).
We provide comprehensive tutorials for each supported task. Each tutorial includes both standalone Python scripts and interactive Jupyter notebooks.
| Task | Applications | Python Example | Notebook |
|---|---|---|---|
| Detection | object detection | code | notebook |
| | object referring | code | notebook |
| | gui grounding | code | notebook |
| | layout grounding | code | notebook |
| Pointing | object pointing | code | notebook |
| | gui pointing | code | notebook |
| | affordance pointing | code | notebook |
| Visual prompting | visual prompting | code | notebook |
| OCR | ocr word box | code | notebook |
| | ocr textline box | code | notebook |
| | ocr polygon | code | notebook |
| Keypointing | person keypointing | code | notebook |
| | animal keypointing | code | notebook |
| Other | batch inference | code | |
Rex-Omni's unified detection framework enables seamless integration with other vision models.
| Application | Description | Documentation |
|---|---|---|
| Rex-Omni + SAM | Combine language-driven detection with pixel-perfect segmentation: Rex-Omni detects objects, then SAM generates precise masks (sketched below). | README |
| Grounding Data Engine | Automatically generate phrase grounding annotations from image captions using spaCy and Rex-Omni. | README |
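As an illustration of the Rex-Omni + SAM pipeline, here is a minimal sketch assuming the segment-anything package, a locally downloaded ViT-H checkpoint (sam_vit_h_4b8939.pth), and the wrapper's default generation settings. This glue code is an assumption, not the repo's own integration; see the linked README for the official version:

```python
import numpy as np
from PIL import Image
from segment_anything import SamPredictor, sam_model_registry
from rex_omni import RexOmniWrapper

# 1) Detect objects with Rex-Omni (language-driven detection).
rex = RexOmniWrapper(model_path="IDEA-Research/Rex-Omni", backend="transformers")
image = Image.open("tutorials/detection_example/test_images/cafe.jpg").convert("RGB")
result = rex.inference(images=image, task="detection", categories=["person", "cup"])[0]

# 2) Feed the predicted boxes to SAM for pixel-level masks.
#    The checkpoint path and model variant below are assumptions.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(np.array(image))  # SAM expects an RGB uint8 array

masks = {}
for category, preds in result["extracted_predictions"].items():
    for pred in preds:
        if pred["type"] != "box":
            continue
        box = np.array(pred["coords"], dtype=np.float32)  # [x0, y0, x1, y1]
        mask, score, _ = predictor.predict(box=box, multimask_output=False)
        masks.setdefault(category, []).append(mask[0])  # (H, W) boolean mask
```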
We provide an interactive Gradio demo that allows you to test all Rex-Omni capabilities through a web interface.
# Launch the demo
CUDA_VISIBLE_DEVICES=0 python demo/gradio_demo.py --model_path IDEA-Research/Rex-Omni
# With custom settings
CUDA_VISIBLE_DEVICES=0 python demo/gradio_demo.py \
--model_path IDEA-Research/Rex-Omni \
--backend vllm \
--server_name 0.0.0.0 \
  --server_port 7890

Command-line arguments:

- --model_path: Model path or Hugging Face repo ID (default: "IDEA-Research/Rex-Omni")
- --backend: Backend to use, "transformers" or "vllm" (default: "transformers")
- --server_name: Server host address (default: "192.168.81.138")
- --server_port: Server port (default: 5211)
- --temperature: Sampling temperature (default: 0.0)
- --top_p: Nucleus sampling parameter (default: 0.05)
- --max_tokens: Maximum tokens to generate (default: 2048)
Please refer to Evaluation for more details.
Please refer to Fine-tuning Rex-Omni for more details.
Rex-Omni is licensed under the IDEA License 1.0, Copyright (c) IDEA. All Rights Reserved. This model is based on Qwen, which is licensed under the Qwen RESEARCH LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved.
Rex-Omni builds on a series of prior works, which are worth a look if you are interested.
@misc{jiang2025detectpointprediction,
title={Detect Anything via Next Point Prediction},
author={Qing Jiang and Junan Huo and Xingyu Chen and Yuda Xiong and Zhaoyang Zeng and Yihao Chen and Tianhe Ren and Junzhi Yu and Lei Zhang},
year={2025},
eprint={2510.12798},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2510.12798},
}

















