Awesome-Rubrics

A curated reading list on rubric-based evaluation, reward modeling, and post-training for large models.
Rubrics turn expert judgment into structured criteria, auditable LLM judges, and trainable reward signals.

What are Rubrics? · Why Rubrics Matter Now · Repository Map · Table of Contents

Papers with publicly released code or project resources include an inline [[Code](...)] link. Entries without verified repositories omit that link.

Contributions are welcome. If you find missing papers, inaccurate classifications, or newly released code, feel free to update this list.

What are Rubrics?

In the context of LLM evaluation and alignment, a rubric is a structured set of criteria for judging open-ended model outputs. Instead of asking a human or LLM judge for one vague preference, rubrics decompose quality into explicit dimensions, scoring rules, and evidence requirements.

Rubrics make subjective judgment more inspectable:

What to judge: relevance, factuality, completeness, safety, reasoning quality, style, or domain-specific standards.
How to judge: score levels, checklists, pairwise criteria, evidence anchors, or weighted dimensions.
How to use the judgment: evaluation reports, LLM-as-a-Judge protocols, reward models, preference tuning, policy optimization, and curriculum learning.
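As a rough illustration of the three points above, a rubric can be represented as a small data structure of weighted criteria, each with a description and a maximum score level. The sketch below is hypothetical: the names `Criterion`, `Rubric`, and `weighted_score`, the 0 to 4 level scale, and the example weights are assumptions made for illustration and are not taken from any paper in this list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One rubric dimension: what to judge and how to score it."""
    name: str
    description: str
    max_level: int = 4      # scores are integers in [0, max_level] (assumed scale)
    weight: float = 1.0     # relative importance of this dimension


@dataclass(frozen=True)
class Rubric:
    criteria: tuple[Criterion, ...]

    def weighted_score(self, levels: dict[str, int]) -> float:
        """Combine per-criterion levels into one score in [0, 1]."""
        total_weight = sum(c.weight for c in self.criteria)
        score = 0.0
        for c in self.criteria:
            level = levels[c.name]
            if not 0 <= level <= c.max_level:
                raise ValueError(f"{c.name}: level {level} out of range")
            score += c.weight * level / c.max_level
        return score / total_weight


# Illustrative rubric; the weights are arbitrary.
rubric = Rubric((
    Criterion("relevance", "Addresses the question asked.", weight=2.0),
    Criterion("factuality", "Claims are accurate and supported.", weight=2.0),
    Criterion("style", "Clear and well organized.", weight=1.0),
))

levels = {"relevance": 4, "factuality": 2, "style": 3}
print(rubric.weighted_score(levels))  # 0.75
```

Keeping the per-criterion levels alongside the aggregate is what lets the same judgment feed both an evaluation report and a scalar reward.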

Figure 1. Rubrics convert coarse feedback into fine-grained, inspectable reward signals.

Feedback style	Typical signal	Best fit	Main limitation
RLHF / model-based preference	"Output A is better than output B."	Open-ended comparison	Coarse and hard to inspect
RLVR / rule-based reward	Format is correct, answer matches, reasoning token appears, list structure exists	Verifiable tasks	Too rigid for subjective or open-ended tasks
Rubric-based feedback	Relevance, completeness, clarity, safety, each scored separately	Open-ended evaluation and training	Requires careful design and calibration

Rubrics are the middle layer: more structured than model-only preference, more flexible than hard rules.

Why Rubrics Matter Now

"In this new era, evaluation becomes more important than training."

Shunyu Yao, The Second Half (2025)

As large models move from closed-form QA to open-ended reasoning, agents, multimodal generation, and professional domains, progress is increasingly bottlenecked by evaluation and feedback design. Training can optimize only what the system can measure, and many important tasks cannot be reduced to a single scalar reward.

Rubrics help answer	Why it matters
What counts as good behavior?	They define explicit criteria, scoring boundaries, and failure modes.
How can expert judgment scale?	They convert tacit standards into reusable evaluation instructions and datasets.
How can LLM-as-a-Judge become less opaque?	Judges can be required to expose criteria, evidence, scores, and rationales.
How does evaluation become training signal?	Rubric-level feedback can supervise SFT, preference tuning, policy optimization, reward modeling, and curriculum learning.

Rubrics therefore act as a bridge between human standards and machine-optimizable signals. They are not merely annotation templates; they are a control surface for evaluation, reward modeling, and post-training.

Growing Research Momentum

Figure 2. The number of rubric-related papers has grown rapidly, suggesting increasing research attention to structured evaluation and reward design.

The rising trend suggests that rubric-based methods are becoming an important direction for large-model alignment, especially as evaluation, reward modeling, and post-training move toward more structured and auditable feedback.

From Evaluation to Reward

Evaluation is no longer only a post-hoc metric. It is becoming part of the infrastructure of AI systems:

🧑‍⚖️ Expert Standards → 📋 Rubrics → 📊 Evaluation Signals → 🎯 Rewards → 🔁 Training Dynamics

Rubrics are therefore not just for judging model outputs. They are a way to automate parts of expert feedback: experts define criteria, models apply them at scale, and failures reveal where the rubric or judge must be revised. In this sense, evaluation becomes an executable form of domain knowledge.

A Minimal Rubric Example

For the query:

How can cities encourage more people to use public transport?

a rubric does not directly ask "which answer is better?" It decomposes the judgment:

Component	What the judge checks
Relevance	Does the answer address public transport adoption rather than unrelated urban issues?
Clarity	Is the answer easy to understand and well organized?
Completeness	Does it cover affordability, convenience, infrastructure, reliability, and incentives?
Safety / fairness	Does it avoid harmful, biased, or exclusionary suggestions?

This makes the reward more interpretable, decomposable, and actionable.
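The table above can be sketched as a checklist in code. In this minimal, hypothetical example the judge's verdicts are hard-coded booleans; in practice they would come from a human or an LLM judge. The function name `score_checklist` and the equal weighting of criteria are assumptions.

```python
# Checklist form of the public-transport rubric from the table above.
RUBRIC = {
    "relevance": "Addresses public transport adoption, not unrelated urban issues.",
    "clarity": "Easy to understand and well organized.",
    "completeness": "Covers affordability, convenience, infrastructure, reliability, and incentives.",
    "safety_fairness": "Avoids harmful, biased, or exclusionary suggestions.",
}


def score_checklist(verdicts: dict[str, bool]) -> tuple[float, list[str]]:
    """Return (fraction of criteria passed, names of failed criteria)."""
    missing = set(RUBRIC) - set(verdicts)
    if missing:
        raise KeyError(f"no verdict for: {sorted(missing)}")
    failed = [name for name in RUBRIC if not verdicts[name]]
    reward = 1.0 - len(failed) / len(RUBRIC)
    return reward, failed


# Hypothetical verdicts for an answer that only discusses cheaper fares.
verdicts = {
    "relevance": True,
    "clarity": True,
    "completeness": False,
    "safety_fairness": True,
}
reward, failed = score_checklist(verdicts)
print(reward, failed)  # 0.75 ['completeness']
```

The list of failed criteria is the actionable part: it says which dimension lowered the reward, not only that the reward was lower.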

Rubric Generation Strategies

Figure 3. Rubric construction paradigms for large model alignment.

Strategy	Core idea	When it is useful
Expert direct annotation	Experts write criteria explicitly.	High-stakes domains and seed rubrics
Induction from expert QA annotations	Criteria are extracted from annotated examples.	Scaling expert knowledge beyond manual templates
Distillation from teacher demonstrations	Rubrics are derived from high-quality model outputs.	Bootstrapping scalable reward signals

Together, these strategies show how rubric construction moves from manual specification toward data-driven induction and model-driven distillation.

Repository Map

This repository is organized as a conceptual map of rubric-related research. We group papers by the role rubrics play in the large-model pipeline.

This organization helps show rubrics not only as evaluation tools, but also as structured interfaces connecting expert standards, feedback data, reward signals, training objectives, and deployment-time assessment.

Section	Role in the repository
Foundations	Introduces what counts as a rubric, how rubric formats differ from preferences, rules, or scalar scores, and why structured criteria become useful in large-model settings.
Data	Covers how rubrics are collected, generated, refined, and organized into reusable supervision signals through human annotation, synthetic generation, expert labeling, and rubric datasets.
Training	Summarizes how rubric-level judgments can be transformed into SFT data, preference objectives, RL rewards, curriculum signals, and self-improvement loops.
Evaluation	Connects rubrics to LLM-as-a-judge protocols, benchmark design, calibration, reliability analysis, and robustness checks, where explicit and auditable criteria are especially important.
Applications	Shows how rubric-based methods extend beyond text QA to multimodal tasks, agent systems, and professional domains that require domain-specific standards.

Overall, this structure follows the lifecycle of rubric-based large-model alignment:

Define criteria → collect or generate rubric data → train with rubric signals → evaluate with structured judges → apply in domain-specific tasks

Rubrics provide a structured layer for connecting data, training, evaluation, and applications.

Foundations of Rubric-Based Evaluation

Rubric Definitions and Boundaries

Rubrics define structured evaluation dimensions, scoring rules, and judgment boundaries for open-ended model outputs. This section covers work that clarifies what counts as a rubric and how rubrics function as judges or reward criteria.

2025

🌟 [arXiv 2025.10] From Implicit Weights to Explicit Rubrics: A Training-Free Framework for Reward Modeling [Code]
[ICLR 26] Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains [Data]

2024

[Blog 2024.11] Reward Hacking in Reinforcement Learning

Rubric Representation and Scoring Schemas

This section focuses on how rubrics are expressed, including dimensions, levels, weights, and scoring templates. It is useful for understanding the representational form that makes rubric-based supervision reusable and controllable.

2026

[arXiv 2026.03] AutoChecklist: Composable Pipelines for Checklist Generation and Scoring with LLM-as-a-Judge

2025

🌟 [arXiv 2025.10] From Implicit Weights to Explicit Rubrics: A Training-Free Framework for Reward Modeling [Code]

Traditional Domain Usage

No retained papers after full-text justification review.

Why Foundation Models Need Rubrics

2025

🌟 [ICLR 26] Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training [Code]

Rubrics for Foundation-Model Alignment and Evaluation

Rubric Construction and Data Sources

Synthetic Rubric Generation

Synthetic rubric generation uses generated tasks, labels, critiques, or rubric annotations to expand supervision beyond limited human labeling. It is especially useful when rubric-style feedback can be programmatically produced at scale for reward modeling or post-training.

2026

🌟 [arXiv 2026.02] ClinAlign: Scaling Healthcare Alignment from Clinician Preference [Code]
🌟 [arXiv 2026.01] RubricHub: A Comprehensive and Highly Discriminative Rubric Dataset via Automated Coarse-to-Fine Generation [Code]

2025

[arXiv 2025.10] OpenRubrics: Towards Scalable Synthetic Rubric Generation for Reward Modeling and LLM Alignment [Data]
🌟 [ICLR 26] OptimSyn: Influence-Guided Rubrics Optimization for Synthetic Data Generation [Code]
[ICLR 26] Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains [Data]

Human- and Expert-Grounded Rubric Data

Human- and expert-grounded rubric data refers to signals from human preferences, authentic interactions, or domain specialists. These sources are important when alignment targets depend on nuanced standards that are hard to synthesize fully.

2026

🌟 [arXiv 2026.04] Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation [Code]
🌟 [arXiv 2026.03] PresentBench: A Fine-Grained Rubric-Based Benchmark for Slide Generation [Code]
🌟 [arXiv 2026.03] RubricBench: Aligning Model-Generated Rubrics with Human Standards [Code]
🌟 [arXiv 2026.03] PRBench: End-to-end Paper Reproduction in Physics Research [Code]
[arXiv 2026.01] Health-SCORE: Towards Scalable Rubrics for Improving Health-LLMs

2025

[arXiv 2025.10] Benchmarking and Learning Real-World Customer Service Dialogue
🌟 [ICLR 26] MENLO: From Preferences to Proficiency - Evaluating and Modeling Native-like Quality Across 47 Languages [Code]
🌟 [arXiv 2025.05] HealthBench: Evaluating Large Language Models Towards Improved Human Health [Code]
🌟 [arXiv 2025.04] PaperBench: Evaluating AI's Ability to Replicate AI Research [Code]

Rubric-Guided Training and Post-Training

Pre-training

No retained papers after full-text justification review.

Post-training

Rubric-Guided Supervised Fine-Tuning

Rubrics can be used in supervised fine-tuning for filtering data, weighting samples, or imposing structured response preferences. This makes SFT more aligned with multi-dimensional quality targets instead of flat imitation alone.
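A minimal sketch of the filtering use case, assuming each candidate SFT example already carries per-criterion rubric scores in [0, 1]. The function name `filter_sft_examples`, the field names, and both thresholds are hypothetical choices for illustration, not a method from the papers listed here.

```python
def filter_sft_examples(examples, min_per_criterion=0.5, min_mean=0.7):
    """Keep examples whose rubric scores clear both thresholds.

    Each example is a dict with a "scores" mapping from criterion name
    to a score in [0, 1]. An example is kept only if no single criterion
    falls below min_per_criterion and the mean is at least min_mean.
    """
    kept = []
    for ex in examples:
        scores = list(ex["scores"].values())
        if not scores:
            continue  # unscored examples are dropped
        mean = sum(scores) / len(scores)
        if min(scores) >= min_per_criterion and mean >= min_mean:
            kept.append(ex)
    return kept


# Hypothetical scored examples.
examples = [
    {"id": "a", "scores": {"relevance": 0.9, "clarity": 0.8, "safety": 1.0}},
    {"id": "b", "scores": {"relevance": 1.0, "clarity": 1.0, "safety": 0.2}},
    {"id": "c", "scores": {"relevance": 0.6, "clarity": 0.6, "safety": 0.6}},
]
print([ex["id"] for ex in filter_sft_examples(examples)])  # ['a']
```

Example "b" has a high average but fails on one criterion, which is the case a single scalar quality score would tend to miss.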

2025

[ICLR 26] P-GenRM: Personalized Generative Reward Model with Test-time User-based Scaling

Rubric-Guided Preference Tuning

Direct preference-learning methods use rubric signals to construct, weight, or structure preference objectives, including DPO-style and preference-tuning approaches.
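One simple way to construct preference data from rubric signals is to score several candidate responses per prompt and pair them by weighted rubric total. The sketch below is an assumption-laden illustration: `build_preference_pairs`, the field names, the weights, and the `min_margin` cutoff are all hypothetical, and the output is only shaped like the (prompt, chosen, rejected) triples that DPO-style trainers typically consume.

```python
from itertools import combinations


def build_preference_pairs(prompt, candidates, weights, min_margin=0.1):
    """Turn rubric-scored candidates into (chosen, rejected) pairs."""
    def total(scores):
        return sum(weights[name] * scores[name] for name in weights)

    pairs = []
    for a, b in combinations(candidates, 2):
        sa, sb = total(a["scores"]), total(b["scores"])
        if abs(sa - sb) < min_margin:
            continue  # skip near-ties: the rubric does not clearly prefer either
        chosen, rejected = (a, b) if sa > sb else (b, a)
        pairs.append({
            "prompt": prompt,
            "chosen": chosen["text"],
            "rejected": rejected["text"],
            "margin": abs(sa - sb),
        })
    return pairs


# Hypothetical candidates with per-criterion scores in [0, 1].
weights = {"relevance": 0.5, "completeness": 0.5}
candidates = [
    {"text": "A", "scores": {"relevance": 1.0, "completeness": 0.8}},
    {"text": "B", "scores": {"relevance": 1.0, "completeness": 0.7}},
    {"text": "C", "scores": {"relevance": 0.4, "completeness": 0.2}},
]
pairs = build_preference_pairs("example prompt", candidates, weights)
print([(p["chosen"], p["rejected"]) for p in pairs])  # [('A', 'C'), ('B', 'C')]
```

The A versus B pair is dropped because its rubric totals differ by less than the margin; the stored margin could also be used to weight pairs during training.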

2026

🌟 [arXiv 2026.03] AdaRubric: Task-Adaptive Rubrics for LLM Agent Evaluation [Code]

2025

[arXiv 2025.10] OpenRubrics: Towards Scalable Synthetic Rubric Generation for Reward Modeling and LLM Alignment [Model]
[arXiv 2025.08] Are Today's LLMs Ready to Explain Well-Being Concepts?
[ICML-W] Configurable Preference Tuning with Rubric-Guided Synthetic Data

Rubric-Aware Policy Optimization

RL post-training methods modify policy optimization, advantage estimation, exploration, or training stability when rewards are rubric-based or multi-dimensional.
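To make the multi-dimensional case concrete, the sketch below standardizes each rubric dimension across a group of sampled responses before combining them into one advantage per response. This is a generic illustration and not the algorithm of any specific paper listed in this section; the function name `rubric_advantages`, the weights, and the `eps` constant are assumptions.

```python
from statistics import mean, pstdev


def rubric_advantages(group_scores, weights, eps=1e-6):
    """Per-dimension group-normalized advantages for one prompt.

    group_scores: list of dicts (one per sampled response) mapping
    rubric dimension -> raw score. Each dimension is standardized
    across the group before the weighted sum, so a dimension with a
    large raw scale cannot dominate the others.
    """
    advantages = [0.0] * len(group_scores)
    for dim, w in weights.items():
        values = [s[dim] for s in group_scores]
        mu, sigma = mean(values), pstdev(values)
        for i, v in enumerate(values):
            # A dimension on which all responses tie contributes zero.
            advantages[i] += w * (v - mu) / (sigma + eps)
    return advantages


# Hypothetical group of three responses to the same prompt.
group = [
    {"helpfulness": 0.9, "format": 10.0},
    {"helpfulness": 0.5, "format": 10.0},
    {"helpfulness": 0.1, "format": 10.0},
]
adv = rubric_advantages(group, {"helpfulness": 1.0, "format": 1.0})
print([round(a, 2) for a in adv])
```

Here every response gets the same `format` score, so that dimension adds nothing and the advantages are driven by `helpfulness` alone.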

2026

🌟 [arXiv 2026.04] Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks [Code]
[arXiv 2026.03] Listening to the Echo: User-Reaction Aware Policy Optimization via Scalar-Verbal Hybrid Reinforcement Learning
[arXiv 2026.03] Alternating Reinforcement Learning with Contextual Rubric Rewards
🌟 [arXiv 2026.03] Experience is the Best Teacher: Motivating Effective Exploration in Reinforcement Learning for LLMs [Code]
🌟 [arXiv 2026.03] PAPO: Stabilizing Rubric Integration Training via Decoupled Advantage Normalization [Code]
[arXiv 2026.02] Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training [Model]
[arXiv 2026.02] Learning to Self-Verify Makes Language Models Better Reasoners
🌟 [arXiv 2026.02] CM2: Reinforcement Learning with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use [Code]
🌟 [arXiv 2026.02] Open Rubric System: Scaling Reinforcement Learning with Pairwise Adaptive Rubric [Code]
🌟 [arXiv 2026.01] Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards [Code]
[arXiv 2026.01] GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization [Proj]

2025

[arXiv 2025.11] Reward and Guidance through Rubrics: Promoting Exploration to Improve Multi-Domain Reasoning
[arXiv 2025.10] Reinforcement Learning for Tool-Integrated Interleaved Thinking towards Cross-Domain Generalization
[arXiv 2025.09] Quantile Advantage Estimation: Stabilizing RLVR for LLM Reasoning
[arXiv 2025.08] Pareto Multi-Objective Alignment for Language Models
🌟 [arXiv 2025.08] Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning [Code]
[arXiv 2025.06] Direct Reasoning Optimization: Constrained RL with Token-Level Dense Reward and Rubric-Gated Constraints for Open-ended Tasks

Rubric-Based Reward Modeling and Signal Design

Methods in this section design, generate, calibrate, or densify rubric-based reward signals, including reward models, LLM judges, checklist rewards, and rubric-to-token supervision.

2026

🌟 [arXiv 2026.05] Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria [Code]
[Tech Report 2026.04] DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence [Model]
🌟 [arXiv 2026.04] Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks [Code]
🌟 [arXiv 2026.03] CDRRM: Contrast-Driven Rubric Generation for Reliable and Interpretable Reward Modeling [Code]
🌟 [arXiv 2026.03] AutoChecklist: Composable Pipelines for Checklist Generation and Scoring with LLM-as-a-Judge [Code]
[arXiv 2026.03] RubricRAG: Towards Interpretable and Reliable LLM Evaluation via Domain Knowledge Retrieval for Rubric Generation
[arXiv 2026.02] Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training [Model]
🌟 [arXiv 2026.02] Open Rubric System: Scaling Reinforcement Learning with Pairwise Adaptive Rubric [Code]
🌟 [arXiv 2026.02] OMNI-RRM: Advancing Omni Reward Model [Code]
[arXiv 2026.02] Grad2Reward: From Sparse Judgment to Dense Rewards for Improving Open-Ended LLM Reasoning
[arXiv 2026.02] Learning Query-Specific Rubrics from Human Preferences for DeepResearch Report Generation
[arXiv 2026.02] Rethinking Rubric Generation for Improving LLM Judge and Reward Modeling for Open-ended Tasks
[arXiv 2026.02] Features as Rewards: Scalable Supervision for Open-Ended Tasks via Interpretability
[arXiv 2026.02] SibylSense: Adaptive Rubric Learning via Memory Tuning and Adversarial Probing
🌟 [arXiv 2026.01] RubricHub: A Comprehensive and Highly Discriminative Rubric Dataset via Automated Coarse-to-Fine Generation [Code]
🌟 [arXiv 2026.01] Reward Modeling for Scientific Writing Evaluation [Code]
🌟 [arXiv 2026.01] P-Check: Advancing Personalized Reward Models via Learning to Generate Dynamic Checklists [Code]
[arXiv 2026.01] Rewarding Creativity: A Human-Aligned Generative Reward Model for Reinforcement Learning in Storytelling

2025

🌟 [arXiv 2025.12] TableGPT-R1: Advancing Tabular Reasoning Through Reinforcement Learning [Code]
🌟 [arXiv 2025.11] AdvancedIF: Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following [Code]
🌟 [arXiv 2025.10] From Implicit Weights to Explicit Rubrics: A Training-Free Framework for Reward Modeling [Code]
[arXiv 2025.10] OpenRubrics: Towards Scalable Synthetic Rubric Generation for Reward Modeling and LLM Alignment [Data] [Model]
[ICLR 26] RLAC: Reinforcement Learning with Adversarial Critic for Free-Form Generation Tasks [Proj]
[arXiv 2025.08] Reinforcement Learning with Rubric Anchors
🌟 [arXiv 2025.06] AutoRule: Reasoning Chain-of-Thought Extracted Rule-Based Rewards Improve Preference Learning [Code]
[ICLR 26] Robust Reward Modeling via Causal Rubrics
🌟 [NeurIPS 25] Checklists Are Better Than Reward Models For Aligning Language Models [Code]
🌟 [arXiv 2025.05] R3: Robust Rubric-Agnostic Reward Models [Code]

Rubric-Structured Curriculum Learning

Curriculum learning studies how rubric dimensions or difficulty levels can stage training over time. It is relevant when structured feedback is used not only to score outputs but also to organize learning progression.

2026

[arXiv 2026.02] RuCL: Stratified Rubric-Based Curriculum Learning for Multimodal Large Language Model Reasoning

2025

🌟 [ICLR 26] P-GenRM: Personalized Generative Reward Model with Test-time User-based Scaling [Code]

Rubric-Guided Self-Improvement

2026

🌟 [arXiv 2026.02] Reinforcing Chain-of-Thought Reasoning with Self-Evolving Rubrics [Code]

Rubric-Based Evaluation

Evaluation methods focus on how rubrics are used to judge outputs reliably and consistently across tasks. This includes LLM-as-a-judge settings, rubric-aware reward reasoning, and methods that improve interpretability of evaluation.

LLM-as-a-Judge and Reward Reasoning

2026

[arXiv 2026.03] RubricRAG: Towards Interpretable and Reliable LLM Evaluation via Domain Knowledge Retrieval for Rubric Generation [Data]
[arXiv 2026.03] Beyond the Illusion of Consensus: From Surface Heuristics to Knowledge-Grounded Evaluation in LLM-as-a-Judge
🌟 [arXiv 2026.03] CDRRM: Contrast-Driven Rubric Generation for Reliable and Interpretable Reward Modeling [Code]
🌟 [arXiv 2026.03] AutoChecklist: Composable Pipelines for Checklist Generation and Scoring with LLM-as-a-Judge [Code]
[arXiv 2026.03] Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training
🌟 [arXiv 2026.03] AdaRubric: Task-Adaptive Rubrics for LLM Agent Evaluation [Code]
[arXiv 2026.02] Rethinking Rubric Generation for Improving LLM Judge and Reward Modeling for Open-ended Tasks
🌟 [arXiv 2026.02] Rubrics as an Attack Surface: Stealthy Preference Drift in LLM Judges [Code]
🌟 [arXiv 2026.01] RULERS: Locked Rubrics and Evidence-Anchored Scoring for Robust LLM Evaluation [Code]
🌟 [ICLR 26] mR3: Multilingual Rubric-Agnostic Reward Reasoning Models [Code]
[ICLR 26] RM-R1: Reward Modeling as Reasoning [Proj]
[ICLR 26] MENLO: From Preferences to Proficiency - Evaluating and Modeling Native-like Quality Across 47 Languages [Data]
🌟 [ICLR 26] Retro: Optimizing LLMs for Reasoning-Intensive Document Retrieval [Code]

2025

🌟 [arXiv 2025.10] From Implicit Weights to Explicit Rubrics: A Training-Free Framework for Reward Modeling [Code]
🌟 [arXiv 2025.05] R3: Robust Rubric-Agnostic Reward Models [Code]
🌟 [arXiv 2025.01] SedarEval: Automated Evaluation using Self-Adaptive Rubrics [Code]

Statistical and Uncertainty-Aware Evaluation

2025

🌟 [ICLR 26] Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation [Code]

Rubric-Based Evaluation Benchmarks

Benchmark work provides datasets and tasks where rubric-based evaluation can be compared, stress-tested, and standardized. These resources are important for measuring whether rubric-trained or rubric-judged systems generalize across realistic scenarios.

2026

🌟 [arXiv 2026.03] RubricBench: Aligning Model-Generated Rubrics with Human Standards [Code]
🌟 [arXiv 2026.03] Is this Idea Novel? An Automated Benchmark for Judgment of Research Ideas [Code]
[arXiv 2026.03] Beyond Binary Correctness: Scaling Evaluation of Long-Horizon Agents on Subjective Enterprise Tasks
[arXiv 2026.03] MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome [Proj]
[arXiv 2026.03] Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation [Proj]
[arXiv 2026.03] $OneMillion-Bench: How Far are Language Agents from Human Experts?
🌟 [arXiv 2026.02] LiveMedBench: A Contamination-Free Medical Benchmark for LLMs with Automated Rubric Evaluation [Code]
[arXiv 2026.02] When and What to Ask: AskBench and Rubric-Guided RLVR for LLM Clarification
🌟 [arXiv 2026.01] PLawBench: A Rubric-Based Benchmark for Evaluating LLMs in Real-World Legal Practice [Code]
[arXiv 2026.01] Frontier Science: Evaluating AI's Ability to Perform Expert-Level Scientific Tasks [Data]
[arXiv 2026.01] UEval: A Benchmark for Unified Multimodal Generation [Proj]

2025

[arXiv 2025.11] ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents [Proj]
[arXiv 2025.11] Evaluating Legal Reasoning Traces with Legal Issue Tree Rubrics
🌟 [arXiv 2025.11] AdvancedIF: Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following [Code]
[arXiv 2025.10] GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks [Data]
[arXiv 2025.10] MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes [Proj]
🌟 [ICLR 26] ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge [Code]
[ICLR 26] ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists [Proj]
🌟 [arXiv 2025.07] Generalizing Verifiable Instruction Following [Code]
🌟 [arXiv 2025.05] HealthBench: Evaluating Large Language Models Towards Improved Human Health [Code]
🌟 [arXiv 2025.04] PaperBench: Evaluating AI's Ability to Replicate AI Research [Code]

Application Settings of Rubrics

Applications are grouped by modality and domain, highlighting where rubrics help capture quality, safety, and task completion.

Rubrics Across Modalities

Text Modality

This section covers rubric use in text generation, dialogue, and reasoning-heavy language tasks. The emphasis is on how structured criteria guide evaluation or training for open-ended textual outputs.

2026

🌟 [arXiv 2026.02] When and What to Ask: AskBench and Rubric-Guided RLVR for LLM Clarification [Code]
[arXiv 2026.01] Rewarding Creativity: A Human-Aligned Generative Reward Model for Reinforcement Learning in Storytelling

2025

[arXiv 2025.12] Evaluating Legal Reasoning Traces with Legal Issue Tree Rubrics
[arXiv 2025.10] Benchmarking and Learning Real-World Customer Service Dialogue
[arXiv 2025.08] Are Today's LLMs Ready to Explain Well-Being Concepts?
[ICLR 26] QuRL: Rubrics As Judge For Open-Ended Question Answering
🌟 [ICLR 26] The CoT Encyclopedia: Analyzing, Predicting, and Controlling how a Reasoning Model will Think [Code]

Vision Modality

Visual rubric work extends structured judging and reward design to images, videos, and vision-language tasks. It is useful when model quality depends on multiple perceptual and semantic dimensions rather than a single scalar objective.

2026

[arXiv 2026.03] RubiCap: Rubric-Guided Reinforcement Learning for Dense Image Captioning
🌟 [arXiv 2026.03] Rationale Matters: Learning Transferable Rubrics via Proxy-Guided Critique for VLM Reward Models [Code]
[arXiv 2026.03] When Rubrics Fail: Error Enumeration as Reward in Reference-Free RL Post-Training for Virtual Try-On
[arXiv 2026.01] SOCIAL CAPTION: Evaluating Social Understanding in Multimodal Models

2025

[arXiv 2025.11] RubricRL: Simple Generalizable Rewards for Text-to-Image Generation
[arXiv 2025.10] Beyond Seeing: Evaluating Multimodal LLMs on Tool-Enabled Image Perception [Proj]
🌟 [ICLR 26] MicroVerse: A Preliminary Exploration Toward a Micro-World Simulation [Code]

Audio Modality

No retained papers after full-text justification review.

Rubrics Across Domains

Medical

Medical applications use rubrics to capture expert standards, safety expectations, and multi-step clinical reasoning quality. This is important because medical evaluation often cannot be reduced to single-answer correctness.

2026

🌟 [arXiv 2026.03] QuarkMedBench: A Real-World Scenario Driven Benchmark for Evaluating Large Language Models [Code]
[arXiv 2026.03] MedMT-Bench: Can LLMs Memorize and Understand Long Multi-Turn Conversations in Medical Scenarios?
🌟 [arXiv 2026.02] ClinAlign: Scaling Healthcare Alignment from Clinician Preference [Code]
[arXiv 2026.02] LiveMedBench: A Contamination-Free Medical Benchmark for LLMs with Automated Rubric Evaluation
[arXiv 2026.02] Quark Medical Alignment: A Holistic Multi-Dimensional Alignment and Collaborative Optimization Paradigm
[arXiv 2026.01] RubRIX: Rubric-Driven Risk Mitigation in Caregiver-AI Interactions
[arXiv 2026.01] Health-SCORE: Towards Scalable Rubrics for Improving Health-LLMs

2025

🌟 [arXiv 2025.10] InfiMed-ORBIT: Aligning LLMs on Open-Ended Complex Tasks via Rubric-Based Incremental Training [Code]
[arXiv 2025.09] Baichuan-M2: Scaling Medical Capability with Large Verifier System
🌟 [arXiv 2025.05] HealthBench: Evaluating Large Language Models Towards Improved Human Health [Code]

Software Engineering and Code Agents

Code-domain rubric work studies structured evaluation for coding, debugging, and software-agent behavior.

2026

[arXiv 2026.01] Agentic Rubrics as Contextual Verifiers for SWE Agents

Agentic Tasks

Agent settings require rubrics to evaluate long-horizon behavior, tool use, planning, and subjective task completion. This section highlights work where structured criteria are central to assessing or training interactive agents.

2026

🌟 [arXiv 2026.04] Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation [Code]
🌟 [arXiv 2026.03] PresentBench: A Fine-Grained Rubric-Based Benchmark for Slide Generation [Code]
🌟 [arXiv 2026.03] AdaRubric: Task-Adaptive Rubrics for LLM Agent Evaluation [Code]
[arXiv 2026.03] Beyond Binary Correctness: Scaling Evaluation of Long-Horizon Agents on Subjective Enterprise Tasks
🌟 [arXiv 2026.03] PRBench: End-to-end Paper Reproduction in Physics Research [Code]
[arXiv 2026.02] Learning Query-Specific Rubrics from Human Preferences for DeepResearch Report Generation
🌟 [arXiv 2026.02] When and What to Ask: AskBench and Rubric-Guided RLVR for LLM Clarification [Code]
🌟 [arXiv 2026.02] CM2: Reinforcement Learning with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use [Code]
🌟 [arXiv 2026.01] Technical Report Tongyi DeepResearch [Code]
🌟 [arXiv 2026.01] Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards [Code]
🌟 [arXiv 2026.01] Mock Worlds, Real Skills: Building Small Agentic Language Models with Synthetic Tasks, Simulated Environments, and Rubric-Based Rewards [Code]

2025

[arXiv 2025.12] ARCANE: A Multi-Agent Framework for Interpretable and Configurable Alignment
🌟 [arXiv 2025.12] Step-DeepResearch Technical Report [Code]
[arXiv 2025.10] Reinforcement Learning for Tool-Integrated Interleaved Thinking towards Cross-Domain Generalization
[arXiv 2025.10] Beyond Seeing: Evaluating Multimodal LLMs on Tool-Enabled Image Perception
[NeurIPS 25-W] Towards Real-World Evaluation of Agentic Work in Freelance Marketplaces

LICENSE

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

If you have any questions or suggestions, please feel free to contact Hongru Xiao.
