FreedomIntelligence/Awesome-Rubrics

Awesome-Rubrics

A curated reading list on rubric-based evaluation, reward modeling, and post-training for large models.
Rubrics turn expert judgment into structured criteria, auditable LLM judges, and trainable reward signals.


What are Rubrics? · Why Rubrics Matter? · Repository Map · Table of Contents

Papers with publicly released code or project resources include an inline [[Code](...)] link. Entries without verified repositories omit that link.

Contributions are welcome. If you find missing papers, inaccurate classifications, or newly released code, feel free to update this list.

What are Rubrics?

In the context of LLM evaluation and alignment, a rubric is a structured set of criteria for judging open-ended model outputs. Instead of asking a human or LLM judge for one vague preference, rubrics decompose quality into explicit dimensions, scoring rules, and evidence requirements.

Rubrics make subjective judgment more inspectable:

  • What to judge: relevance, factuality, completeness, safety, reasoning quality, style, or domain-specific standards.
  • How to judge: score levels, checklists, pairwise criteria, evidence anchors, or weighted dimensions.
  • How to use the judgment: evaluation reports, LLM-as-a-Judge protocols, reward models, preference tuning, policy optimization, and curriculum learning.
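
As a minimal sketch of these three parts, a rubric can be held in a small data structure. The class and field names below (`Criterion`, `Rubric`, `levels`, `weight`) are illustrative assumptions, not a schema defined by this repository or by any listed paper. The sketch also assumes score levels start at 0 and have a positive maximum.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Criterion:
    """One rubric dimension: what to judge and how to score it."""
    name: str                            # what to judge, e.g. "factuality"
    description: str                     # instruction shown to the judge
    levels: tuple[int, ...] = (0, 1, 2)  # allowed score levels (assumed to start at 0)
    weight: float = 1.0                  # relative importance in the aggregate


@dataclass
class Rubric:
    criteria: list[Criterion] = field(default_factory=list)

    def validate(self, scores: dict[str, int]) -> None:
        """Reject missing criteria or scores outside the declared levels."""
        for c in self.criteria:
            if c.name not in scores:
                raise ValueError(f"missing score for {c.name!r}")
            if scores[c.name] not in c.levels:
                raise ValueError(f"invalid level for {c.name!r}")

    def aggregate(self, scores: dict[str, int]) -> float:
        """Weighted mean of per-criterion scores, normalized to [0, 1]."""
        self.validate(scores)
        total_weight = sum(c.weight for c in self.criteria)
        weighted = sum(
            c.weight * scores[c.name] / max(c.levels) for c in self.criteria
        )
        return weighted / total_weight
```

Keeping the per-criterion scores next to the aggregate is what allows the same judgment to feed an evaluation report and a scalar reward.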

Figure 1. Rubrics convert coarse feedback into fine-grained, inspectable reward signals.

| Feedback style | Typical signal | Best fit | Main limitation |
| --- | --- | --- | --- |
| RLHF / model-based preference | "Output A is better than output B." | Open-ended comparison | Coarse and hard to inspect |
| RLVR / rule-based reward | Format is correct, answer matches, reasoning token appears, list structure exists | Verifiable tasks | Too rigid for subjective or open-ended tasks |
| Rubric-based feedback | Relevance, completeness, clarity, safety, each scored separately | Open-ended evaluation and training | Requires careful design and calibration |

Rubrics are the middle layer: more structured than model-based preferences, more flexible than hard-coded rules.

Why Rubrics Matter Now

"In this new era, evaluation becomes more important than training."

As large models move from closed-form QA to open-ended reasoning, agents, multimodal generation, and professional domains, progress is increasingly bottlenecked by evaluation and feedback design. Training can optimize only what the system can measure, and many important tasks cannot be reduced to a single scalar reward.

| Rubrics help answer | Why it matters |
| --- | --- |
| What counts as good behavior? | They define explicit criteria, scoring boundaries, and failure modes. |
| How can expert judgment scale? | They convert tacit standards into reusable evaluation instructions and datasets. |
| How can LLM-as-a-Judge become less opaque? | Judges can be required to expose criteria, evidence, scores, and rationales. |
| How does evaluation become training signal? | Rubric-level feedback can supervise SFT, preference tuning, policy optimization, reward modeling, and curriculum learning. |

Rubrics therefore act as a bridge between human standards and machine-optimizable signals. They are not merely annotation templates; they are a control surface for evaluation, reward modeling, and post-training.

Growing Research Momentum

Figure 2. The number of rubric-related papers has grown rapidly, suggesting increasing research attention to structured evaluation and reward design.

The rising trend suggests that rubric-based methods are becoming an increasingly important direction for large-model alignment, especially as evaluation, reward modeling, and post-training move toward more structured and auditable feedback.

From Evaluation to Reward

Evaluation is no longer only a post-hoc metric. It is becoming part of the infrastructure of AI systems:

πŸ§‘β€βš–οΈ Expert Standards β†’ πŸ“‹ Rubrics β†’ πŸ“Š Evaluation Signals β†’ 🎯 Rewards β†’ πŸ” Training Dynamics

Rubrics are therefore not just for judging model outputs. They are a way to automate parts of expert feedback: experts define criteria, models apply them at scale, and failures reveal where the rubric or judge must be revised. In this sense, evaluation becomes an executable form of domain knowledge.
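
One way to make the "failures reveal where the rubric or judge must be revised" step concrete is to audit a small sample: compare judge scores with expert scores per criterion and flag the criteria where they disagree too often. The sketch below is an illustration under assumptions. The function names, the exact-match agreement measure, and the 0.8 threshold are hypothetical choices, not a procedure taken from this repository.

```python
def agreement_by_criterion(
    expert_labels: list[dict[str, int]],
    judge_labels: list[dict[str, int]],
) -> dict[str, float]:
    """Fraction of audited items where judge and expert gave the same score."""
    if len(expert_labels) != len(judge_labels) or not expert_labels:
        raise ValueError("need two non-empty label lists of equal length")
    criteria = expert_labels[0].keys()
    return {
        c: sum(e[c] == j[c] for e, j in zip(expert_labels, judge_labels))
        / len(expert_labels)
        for c in criteria
    }


def criteria_to_revise(
    agreement: dict[str, float], threshold: float = 0.8
) -> list[str]:
    """Criteria whose agreement rate falls below the (assumed) threshold."""
    return sorted(c for c, rate in agreement.items() if rate < threshold)
```

A criterion that lands on the revision list may need a clearer description, better evidence anchors, or a different judge. The audit alone does not say which.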

A Minimal Rubric Example

For the query:

How can cities encourage more people to use public transport?

a rubric does not directly ask "which answer is better?" It decomposes the judgment:

| Component | What the judge checks |
| --- | --- |
| Relevance | Does the answer address public transport adoption rather than unrelated urban issues? |
| Clarity | Is the answer easy to understand and well organized? |
| Completeness | Does it cover affordability, convenience, infrastructure, reliability, and incentives? |
| Safety / fairness | Does it avoid harmful, biased, or exclusionary suggestions? |

This makes the reward more interpretable, decomposable, and actionable.
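
A minimal sketch of how the four components above could turn into a reward. The weights and the 0 to 2 score scale are assumptions chosen for illustration, since the example does not specify them, and the scores passed in at the end are made up.

```python
# Hypothetical weights for the four components; they sum to 1.
RUBRIC_WEIGHTS = {
    "relevance": 0.3,
    "clarity": 0.2,
    "completeness": 0.3,
    "safety_fairness": 0.2,
}
MAX_LEVEL = 2  # assumed scale: 0 = fails, 1 = partial, 2 = fully satisfies


def rubric_reward(scores: dict[str, int]) -> tuple[float, list[str]]:
    """Return a scalar reward in [0, 1] and the components scored below max."""
    if set(scores) != set(RUBRIC_WEIGHTS):
        raise ValueError("scores must cover exactly the rubric components")
    reward = sum(
        RUBRIC_WEIGHTS[name] * scores[name] / MAX_LEVEL for name in RUBRIC_WEIGHTS
    )
    weak = [name for name in RUBRIC_WEIGHTS if scores[name] < MAX_LEVEL]
    return reward, weak


# Made-up judge scores for one answer that skips reliability and incentives.
reward, weak = rubric_reward(
    {"relevance": 2, "clarity": 2, "completeness": 1, "safety_fairness": 2}
)
# reward is approximately 0.85 and weak == ["completeness"]
```

The scalar serves optimization, while the list of weak components gives feedback that a single preference label cannot.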

Rubric Generation Strategies

Figure 3. Rubric construction paradigms for large model alignment.

| Strategy | Core idea | When it is useful |
| --- | --- | --- |
| Expert direct annotation | Experts write criteria explicitly. | High-stakes domains and seed rubrics |
| Induction from expert QA annotations | Criteria are extracted from annotated examples. | Scaling expert knowledge beyond manual templates |
| Distillation from teacher demonstrations | Rubrics are derived from high-quality model outputs. | Bootstrapping scalable reward signals |

Together, these strategies show how rubric construction moves from manual specification toward data-driven induction and model-driven distillation.

Repository Map

This repository is organized as a conceptual map of rubric-related research. We group papers by the role rubrics play in the large-model pipeline.

This organization helps show rubrics not only as evaluation tools, but also as structured interfaces connecting expert standards, feedback data, reward signals, training objectives, and deployment-time assessment.

| Section | Role in the repository |
| --- | --- |
| Foundations | Introduces what counts as a rubric, how rubric formats differ from preferences, rules, or scalar scores, and why structured criteria become useful in large-model settings. |
| Data | Covers how rubrics are collected, generated, refined, and organized into reusable supervision signals through human annotation, synthetic generation, expert labeling, and rubric datasets. |
| Training | Summarizes how rubric-level judgments can be transformed into SFT data, preference objectives, RL rewards, curriculum signals, and self-improvement loops. |
| Evaluation | Connects rubrics to LLM-as-a-judge protocols, benchmark design, calibration, reliability analysis, and robustness checks, where explicit and auditable criteria are especially important. |
| Applications | Shows how rubric-based methods extend beyond text QA to multimodal tasks, agent systems, and professional domains that require domain-specific standards. |

Overall, this structure follows the lifecycle of rubric-based large-model alignment:

Define criteria → collect or generate rubric data → train with rubric signals → evaluate with structured judges → apply in domain-specific tasks

Rubrics provide a structured layer for connecting data, training, evaluation, and applications.

Table of Contents

Browse the reading list

Papers with publicly released code are marked with 🌟.

Foundations of Rubric-Based Evaluation

Rubric Definitions and Boundaries

Rubrics define structured evaluation dimensions, scoring rules, and judgment boundaries for open-ended model outputs. This section covers work that clarifies what counts as a rubric and how rubrics function as judges or reward criteria.

2025

  • 🌟 [arXiv 2025.10] From Implicit Weights to Explicit Rubrics: A Training-Free Framework for Reward Modeling [Code]
    Definition Format Reward Modeling and Signal Design LLM-as-a-Judge and Reward Reasoning
  • [ICLR 26] Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains [Data]
    Definition Synthetic Data

2024

Rubric Representation and Scoring Schemas

This section focuses on how rubrics are expressed, including dimensions, levels, weights, and scoring templates. It is useful for understanding the representational form that makes rubric-based supervision reusable and controllable.

2026

  • 🌟 [arXiv 2026.03] AutoChecklist: Composable Pipelines for Checklist Generation and Scoring with LLM-as-a-Judge [Code]
    Format Reward Modeling and Signal Design LLM-as-a-Judge and Reward Reasoning

2025

  • 🌟 [arXiv 2025.10] From Implicit Weights to Explicit Rubrics: A Training-Free Framework for Reward Modeling [Code]
    Definition Format Reward Modeling and Signal Design LLM-as-a-Judge and Reward Reasoning

Traditional Domain Usage

  • No retained papers after full-text justification review.

Why Foundation Models Need Rubrics

2025

  • 🌟 [ICLR 26] Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training [Code]
    Why Rubrics

Rubrics for Foundation-Model Alignment and Evaluation

Rubric Construction and Data Sources

Synthetic Rubric Generation

Synthetic rubric generation uses generated tasks, labels, critiques, or rubric annotations to expand supervision beyond limited human labeling. It is especially useful when rubric-style feedback can be programmatically produced at scale for reward modeling or post-training.

2026
  • 🌟 [arXiv 2026.02] ClinAlign: Scaling Healthcare Alignment from Clinician Preference [Code]
    Synthetic Data Medical
  • 🌟 [arXiv 2026.01] RubricHub: A Comprehensive and Highly Discriminative Rubric Dataset via Automated Coarse-to-Fine Generation [Code]
    Synthetic Data Reward Modeling and Signal Design
2025
  • [arXiv 2025.10] OpenRubrics: Towards Scalable Synthetic Rubric Generation for Reward Modeling and LLM Alignment [Data]
    Synthetic Data Preference Tuning Reward Modeling and Signal Design
  • 🌟 [ICLR 26] OptimSyn: Influence-Guided Rubrics Optimization for Synthetic Data Generation [Code]
    Synthetic Data
  • [ICLR 26] Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains [Data]
    Definition Synthetic Data

Human- and Expert-Grounded Rubric Data

Human- and expert-grounded rubric data refers to signals from human preferences, authentic interactions, or domain specialists. These sources are important when alignment targets depend on nuanced standards that are hard to synthesize fully.

2026
  • 🌟 [arXiv 2026.04] Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation [Code]
    Human Data Evaluation Benchmark Agentic Tasks
  • 🌟 [arXiv 2026.03] PresentBench: A Fine-Grained Rubric-Based Benchmark for Slide Generation [Code]
    Human Data Agentic Tasks
  • 🌟 [arXiv 2026.03] RubricBench: Aligning Model-Generated Rubrics with Human Standards [Code]
    Human Data Evaluation Benchmark
  • 🌟 [arXiv 2026.03] PRBench: End-to-end Paper Reproduction in Physics Research [Code]
    Human Data Agentic Tasks
  • [arXiv 2026.01] Health-SCORE: Towards Scalable Rubrics for Improving Health-LLMs
    Human Data Medical
2025
  • [arXiv 2025.10] Benchmarking and Learning Real-World Customer Service Dialogue
    Human Data Text Modality
  • 🌟 [ICLR 26] MENLO: From Preferences to Proficiency - Evaluating and Modeling Native-like Quality Across 47 Languages [Code]
    Human Data
  • 🌟 [arXiv 2025.05] HealthBench: Evaluating Large Language Models Towards Improved Human Health [Code]
    Human Data Evaluation Benchmark Medical
  • 🌟 [arXiv 2025.04] PaperBench: Evaluating AI's Ability to Replicate AI Research [Code]
    Human Data Evaluation Benchmark

Rubric-Guided Training and Post-Training

Pre-training

  • No retained papers after full-text justification review.

Post-training

Rubric-Guided Supervised Fine-Tuning

Rubrics can be used in supervised fine-tuning for filtering data, weighting samples, or imposing structured response preferences. This makes SFT more aligned with multi-dimensional quality targets instead of flat imitation alone.
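
A minimal sketch of the filtering and weighting idea, assuming each SFT example already carries an aggregate rubric score in [0, 1]. The field names `rubric_score` and `weight`, the 0.6 threshold, and the mean-one weight normalization are illustrative assumptions, not a recipe from any listed paper.

```python
def filter_and_weight(examples: list[dict], min_score: float = 0.6) -> list[dict]:
    """Drop low-scoring examples, then weight the rest in proportion to score.

    Weights are rescaled so that their mean over the kept examples is 1,
    which keeps the overall loss scale comparable to unweighted SFT.
    """
    if min_score <= 0:
        raise ValueError("min_score must be positive")
    kept = [ex for ex in examples if ex["rubric_score"] >= min_score]
    total = sum(ex["rubric_score"] for ex in kept)
    return [
        {**ex, "weight": ex["rubric_score"] * len(kept) / total} for ex in kept
    ]
```

The resulting `weight` would typically multiply the per-example loss in whatever training loop is used.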

2025
  • 🌟 [ICLR 26] P-GenRM: Personalized Generative Reward Model with Test-time User-based Scaling [Code]
    Supervised Fine-Tuning Curriculum Learning

Rubric-Guided Preference Tuning

Direct preference-learning methods use rubric signals to construct, weight, or structure preference objectives, including DPO-style and preference-tuning approaches.
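
As an illustration of constructing preference data from rubric signals, the sketch below pairs sampled responses by their aggregate rubric scores and skips pairs that are too close to call. The function name, the margin value, and the `prompt` / `chosen` / `rejected` field names are assumptions for this example. They follow a layout commonly used for preference datasets, not a format prescribed by this repository.

```python
from itertools import combinations


def build_preference_pairs(
    prompt: str,
    scored_responses: list[tuple[str, float]],
    margin: float = 0.2,
) -> list[dict[str, str]]:
    """Turn (response, rubric_score) tuples into chosen/rejected pairs."""
    pairs = []
    for (resp_a, score_a), (resp_b, score_b) in combinations(scored_responses, 2):
        if abs(score_a - score_b) < margin:
            continue  # ambiguous pair: rubric scores differ by less than the margin
        chosen, rejected = (
            (resp_a, resp_b) if score_a > score_b else (resp_b, resp_a)
        )
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```

The margin trades data volume for label reliability: a larger margin keeps fewer pairs but makes each one less sensitive to judge noise.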

2026
  • 🌟 [arXiv 2026.03] AdaRubric: Task-Adaptive Rubrics for LLM Agent Evaluation [Code]
    Preference Tuning LLM-as-a-Judge and Reward Reasoning Agentic Tasks
2025
  • [arXiv 2025.10] OpenRubrics: Towards Scalable Synthetic Rubric Generation for Reward Modeling and LLM Alignment [Model]
    Synthetic Data Preference Tuning Reward Modeling and Signal Design
  • [arXiv 2025.08] Are Today's LLMs Ready to Explain Well-Being Concepts?
    Preference Tuning Text Modality
  • [ICML-W] Configurable Preference Tuning with Rubric-Guided Synthetic Data
    Preference Tuning

Rubric-Aware Policy Optimization

RL post-training methods modify policy optimization, advantage estimation, exploration, or training stability when rewards are rubric-based or multi-dimensional.
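
To show what changes when the reward is multi-dimensional, here is a generic sketch of group-relative advantage estimation that normalizes each rubric dimension separately before combining them. It is an illustration under assumptions, not the algorithm of any paper listed below. The function name, the weights, and the `eps` constant are hypothetical.

```python
from statistics import mean, pstdev


def per_dimension_advantages(
    group_rewards: list[dict[str, float]],
    weights: dict[str, float],
    eps: float = 1e-6,
) -> list[float]:
    """Advantage per sampled response for one prompt.

    group_rewards holds one dict per response, mapping criterion -> reward.
    Each criterion is standardized within the group, then the standardized
    values are combined with the given weights.
    """
    advantages = [0.0] * len(group_rewards)
    for name, weight in weights.items():
        values = [rewards[name] for rewards in group_rewards]
        mu, sigma = mean(values), pstdev(values)
        for i, value in enumerate(values):
            # eps avoids division by zero when all responses tie on a criterion
            advantages[i] += weight * (value - mu) / (sigma + eps)
    return advantages
```

Standardizing per dimension keeps a criterion with a wide reward range from dominating the combined signal, and a criterion on which all responses tie contributes nothing.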

2026
  • 🌟 [arXiv 2026.04] Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks [Code]
    Policy Optimization Reward Modeling and Signal Design
  • [arXiv 2026.03] Listening to the Echo: User-Reaction Aware Policy Optimization via Scalar-Verbal Hybrid Reinforcement Learning
    Policy Optimization
  • [arXiv 2026.03] Alternating Reinforcement Learning with Contextual Rubric Rewards
    Policy Optimization
  • 🌟 [arXiv 2026.03] Experience is the Best Teacher: Motivating Effective Exploration in Reinforcement Learning for LLMs [Code]
    Policy Optimization
  • 🌟 [arXiv 2026.03] PAPO: Stabilizing Rubric Integration Training via Decoupled Advantage Normalization [Code]
    Policy Optimization
  • [arXiv 2026.02] Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training [Model]
    Policy Optimization Reward Modeling and Signal Design
  • [arXiv 2026.02] Learning to Self-Verify Makes Language Models Better Reasoners
    Policy Optimization
  • 🌟 [arXiv 2026.02] CM2: Reinforcement Learning with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use [Code]
    Policy Optimization Agentic Tasks
  • 🌟 [arXiv 2026.02] Open Rubric System: Scaling Reinforcement Learning with Pairwise Adaptive Rubric [Code]
    Policy Optimization Reward Modeling and Signal Design
  • 🌟 [arXiv 2026.01] Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards [Code]
    Policy Optimization Agentic Tasks
  • [arXiv 2026.01] GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization [Proj]
    Policy Optimization
2025
  • [arXiv 2025.08] Pareto Multi-Objective Alignment for Language Models
    Policy Optimization
  • [arXiv 2025.06] Direct Reasoning Optimization: Constrained RL with Token-Level Dense Reward and Rubric-Gated Constraints for Open-ended Tasks
    Policy Optimization
  • 🌟 [arXiv 2025.08] Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning [Code]
    Policy Optimization
  • [arXiv 2025.09] Quantile Advantage Estimation: Stabilizing RLVR for LLM Reasoning
    Policy Optimization
  • [arXiv 2025.10] Reinforcement Learning for Tool-Integrated Interleaved Thinking towards Cross-Domain Generalization
    Policy Optimization Agentic Tasks
  • [arXiv 2025.11] Reward and Guidance through Rubrics: Promoting Exploration to Improve Multi-Domain Reasoning
    Policy Optimization

Rubric-Based Reward Modeling and Signal Design

Methods in this section design, generate, calibrate, or densify rubric-based reward signals, including reward models, LLM judges, checklist rewards, and rubric-to-token supervision.

2026
  • 🌟 [arXiv 2026.05] Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria [Code]
    Policy Optimization Reward Modeling and Signal Design
  • [Tech Report 2026.04] DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence [Model]
    Reward Modeling and Signal Design
  • 🌟 [arXiv 2026.04] Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks [Code]
    Policy Optimization Reward Modeling and Signal Design
  • 🌟 [arXiv 2026.03] CDRRM: Contrast-Driven Rubric Generation for Reliable and Interpretable Reward Modeling [Code]
    Reward Modeling and Signal Design LLM-as-a-Judge and Reward Reasoning
  • 🌟 [arXiv 2026.03] AutoChecklist: Composable Pipelines for Checklist Generation and Scoring with LLM-as-a-Judge [Code]
    Format Reward Modeling and Signal Design LLM-as-a-Judge and Reward Reasoning
  • [arXiv 2026.03] RubricRAG: Towards Interpretable and Reliable LLM Evaluation via Domain Knowledge Retrieval for Rubric Generation
    Reward Modeling and Signal Design LLM-as-a-Judge and Reward Reasoning
  • [arXiv 2026.02] Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training [Model]
    Policy Optimization Reward Modeling and Signal Design
  • 🌟 [arXiv 2026.02] Open Rubric System: Scaling Reinforcement Learning with Pairwise Adaptive Rubric [Code]
    Policy Optimization Reward Modeling and Signal Design
  • 🌟 [arXiv 2026.02] OMNI-RRM: Advancing Omni Reward Model [Code]
    Reward Modeling and Signal Design
  • [arXiv 2026.02] Grad2Reward: From Sparse Judgment to Dense Rewards for Improving Open-Ended LLM Reasoning
    Reward Modeling and Signal Design
  • [arXiv 2026.02] Learning Query-Specific Rubrics from Human Preferences for DeepResearch Report Generation
    Reward Modeling and Signal Design Agentic Tasks
  • [arXiv 2026.02] Rethinking Rubric Generation for Improving LLM Judge and Reward Modeling for Open-ended Tasks
    Reward Modeling and Signal Design LLM-as-a-Judge and Reward Reasoning
  • [arXiv 2026.02] Features as Rewards: Scalable Supervision for Open-Ended Tasks via Interpretability
    Reward Modeling and Signal Design
  • [arXiv 2026.02] SibylSense: Adaptive Rubric Learning via Memory Tuning and Adversarial Probing
    Reward Modeling and Signal Design
  • 🌟 [arXiv 2026.01] RubricHub: A Comprehensive and Highly Discriminative Rubric Dataset via Automated Coarse-to-Fine Generation [Code]
    Synthetic Data Reward Modeling and Signal Design
  • 🌟 [arXiv 2026.01] Reward Modeling for Scientific Writing Evaluation [Code]
    Reward Modeling and Signal Design
  • 🌟 [arXiv 2026.01] P-Check: Advancing Personalized Reward Models via Learning to Generate Dynamic Checklists [Code]
    Reward Modeling and Signal Design
  • [arXiv 2026.01] Rewarding Creativity: A Human-Aligned Generative Reward Model for Reinforcement Learning in Storytelling
    Reward Modeling and Signal Design Text Modality
2025
  • 🌟 [arXiv 2025.12] TableGPT-R1: Advancing Tabular Reasoning Through Reinforcement Learning [Code]
    Reward Modeling and Signal Design
  • 🌟 [arXiv 2025.11] AdvancedIF: Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following [Code]
    Reward Modeling and Signal Design Evaluation Benchmark
  • 🌟 [arXiv 2025.10] From Implicit Weights to Explicit Rubrics: A Training-Free Framework for Reward Modeling [Code]
    Definition Format Reward Modeling and Signal Design LLM-as-a-Judge and Reward Reasoning
  • [arXiv 2025.10] OpenRubrics: Towards Scalable Synthetic Rubric Generation for Reward Modeling and LLM Alignment [Data] [Model]
    Synthetic Data Preference Tuning Reward Modeling and Signal Design
  • [ICLR 26] RLAC: Reinforcement Learning with Adversarial Critic for Free-Form Generation Tasks [Proj]
    Reward Modeling and Signal Design
  • [arXiv 2025.08] Reinforcement Learning with Rubric Anchors
    Reward Modeling and Signal Design
  • 🌟 [arXiv 2025.06] AutoRule: Reasoning Chain-of-Thought Extracted Rule-Based Rewards Improve Preference Learning [Code]
    Reward Modeling and Signal Design
  • [ICLR 26] Robust Reward Modeling via Causal Rubrics
    Reward Modeling and Signal Design
  • 🌟 [NeurIPS 25] Checklists Are Better Than Reward Models For Aligning Language Models [Code]
    Reward Modeling and Signal Design
  • 🌟 [arXiv 2025.05] R3: Robust Rubric-Agnostic Reward Models [Code]
    Reward Modeling and Signal Design LLM-as-a-Judge and Reward Reasoning

Rubric-Structured Curriculum Learning

Curriculum learning studies how rubric dimensions or difficulty levels can stage training over time. It is relevant when structured feedback is used not only to score outputs but also to organize learning progression.
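
A minimal sketch of staging by difficulty, assuming each example carries a `mean_rubric_score` obtained by scoring a reference policy's outputs, where a lower score means a harder example. Both the field name and this difficulty proxy are assumptions for illustration, not a method taken from the papers below.

```python
def stage_by_difficulty(examples: list[dict], n_stages: int = 3) -> list[list[dict]]:
    """Split examples into n_stages buckets ordered from easiest to hardest."""
    if n_stages < 1:
        raise ValueError("n_stages must be at least 1")
    # Higher mean rubric score is treated as easier, so it comes first.
    ordered = sorted(examples, key=lambda ex: ex["mean_rubric_score"], reverse=True)
    size, extra = divmod(len(ordered), n_stages)
    stages, start = [], 0
    for stage_index in range(n_stages):
        end = start + size + (1 if stage_index < extra else 0)
        stages.append(ordered[start:end])
        start = end
    return stages
```

Training would then iterate over the stages in order. The same bucketing could be applied per rubric dimension instead of on the aggregate score.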

2026
  • [arXiv 2026.02] RuCL: Stratified Rubric-Based Curriculum Learning for Multimodal Large Language Model Reasoning
    Curriculum Learning
2025
  • 🌟 [ICLR 26] P-GenRM: Personalized Generative Reward Model with Test-time User-based Scaling [Code]
    Supervised Fine-Tuning Curriculum Learning

Rubric-Guided Self-Improvement
2026
  • 🌟 [arXiv 2026.02] Reinforcing Chain-of-Thought Reasoning with Self-Evolving Rubrics [Code]
    Self-Improvement

Rubric-Based Evaluation

Evaluation methods focus on how rubrics are used to judge outputs reliably and consistently across tasks. This includes LLM-as-a-judge settings, rubric-aware reward reasoning, and methods that improve interpretability of evaluation.
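
A rubric-aware judge can be sketched as two steps: build a prompt that lists the criteria, then validate the structured reply so that malformed judgments fail loudly. The prompt wording, the 0 to 2 scale, and the JSON layout below are assumptions for illustration. No model is called here, and the reply would come from whichever LLM client is in use.

```python
import json


def build_judge_prompt(question: str, answer: str, criteria: dict[str, str]) -> str:
    """Assemble a judging prompt that names every criterion explicitly."""
    lines = [
        "Score the answer on each criterion from 0 to 2.",
        'Reply with JSON: {"<criterion>": {"score": <int>, "evidence": "<quote>"}}.',
        f"Question: {question}",
        f"Answer: {answer}",
        "Criteria:",
    ]
    lines += [f"- {name}: {description}" for name, description in criteria.items()]
    return "\n".join(lines)


def parse_judge_reply(reply: str, criteria: dict[str, str]) -> dict[str, int]:
    """Return criterion -> score, rejecting out-of-range or unsupported scores."""
    data = json.loads(reply)
    parsed = {}
    for name in criteria:
        entry = data[name]  # raises KeyError if the judge skipped a criterion
        score = int(entry["score"])
        if not 0 <= score <= 2:
            raise ValueError(f"score out of range for {name!r}")
        if not entry.get("evidence"):
            raise ValueError(f"missing evidence for {name!r}")
        parsed[name] = score
    return parsed
```

Requiring an evidence string for every score is one simple way to make the judge's output auditable.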

LLM-as-a-Judge and Reward Reasoning

2026
  • [arXiv 2026.03] RubricRAG: Towards Interpretable and Reliable LLM Evaluation via Domain Knowledge Retrieval for Rubric Generation [Data]
    Reward Modeling and Signal Design LLM-as-a-Judge and Reward Reasoning
  • [arXiv 2026.03] Beyond the Illusion of Consensus: From Surface Heuristics to Knowledge-Grounded Evaluation in LLM-as-a-Judge
    LLM-as-a-Judge and Reward Reasoning
  • 🌟 [arXiv 2026.03] CDRRM: Contrast-Driven Rubric Generation for Reliable and Interpretable Reward Modeling [Code]
    Reward Modeling and Signal Design LLM-as-a-Judge and Reward Reasoning
  • 🌟 [arXiv 2026.03] AutoChecklist: Composable Pipelines for Checklist Generation and Scoring with LLM-as-a-Judge [Code]
    Format Reward Modeling and Signal Design LLM-as-a-Judge and Reward Reasoning
  • [arXiv 2026.03] Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training
    LLM-as-a-Judge and Reward Reasoning
  • 🌟 [arXiv 2026.03] AdaRubric: Task-Adaptive Rubrics for LLM Agent Evaluation [Code]
    Preference Tuning LLM-as-a-Judge and Reward Reasoning Agentic Tasks
  • [arXiv 2026.02] Rethinking Rubric Generation for Improving LLM Judge and Reward Modeling for Open-ended Tasks
    Reward Modeling and Signal Design LLM-as-a-Judge and Reward Reasoning
  • 🌟 [arXiv 2026.02] Rubrics as an Attack Surface: Stealthy Preference Drift in LLM Judges [Code]
    LLM-as-a-Judge and Reward Reasoning
  • 🌟 [arXiv 2026.01] RULERS: Locked Rubrics and Evidence-Anchored Scoring for Robust LLM Evaluation [Code]
    LLM-as-a-Judge and Reward Reasoning
  • 🌟 [ICLR 26] mR3: Multilingual Rubric-Agnostic Reward Reasoning Models [Code]
    LLM-as-a-Judge and Reward Reasoning
  • [ICLR 26] RM-R1: Reward Modeling as Reasoning [Proj]
    LLM-as-a-Judge and Reward Reasoning
  • [ICLR 26] MENLO: From Preferences to Proficiency - Evaluating and Modeling Native-like Quality Across 47 Languages [Data]
    LLM-as-a-Judge and Reward Reasoning
  • 🌟 [ICLR 26] Retro: Optimizing LLMs for Reasoning-Intensive Document Retrieval [Code]
    LLM-as-a-Judge and Reward Reasoning
2025
  • 🌟 [arXiv 2025.10] From Implicit Weights to Explicit Rubrics: A Training-Free Framework for Reward Modeling [Code]
    Definition Format Reward Modeling and Signal Design LLM-as-a-Judge and Reward Reasoning
  • 🌟 [arXiv 2025.05] R3: Robust Rubric-Agnostic Reward Models [Code]
    Reward Modeling and Signal Design LLM-as-a-Judge and Reward Reasoning
  • 🌟 [arXiv 2025.01] SedarEval: Automated Evaluation using Self-Adaptive Rubrics [Code]
    LLM-as-a-Judge and Reward Reasoning

Statistical and Uncertainty-Aware Evaluation

2025
  • 🌟 [ICLR 26] Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation [Code]
    Statistical Evaluation

Rubric-Based Evaluation Benchmarks

Benchmark work provides datasets and tasks where rubric-based evaluation can be compared, stress-tested, and standardized. These resources are important for measuring whether rubric-trained or rubric-judged systems generalize across realistic scenarios.

2026
  • 🌟 [arXiv 2026.03] RubricBench: Aligning Model-Generated Rubrics with Human Standards [Code]
    Reward Modeling and Signal Design Evaluation Benchmark
  • 🌟 [arXiv 2026.03] Is this Idea Novel? An Automated Benchmark for Judgment of Research Ideas [Code]
    Evaluation Benchmark
  • [arXiv 2026.03] Beyond Binary Correctness: Scaling Evaluation of Long-Horizon Agents on Subjective Enterprise Tasks
    Evaluation Benchmark Agentic Tasks
  • [arXiv 2026.03] MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome [Proj]
    Evaluation Benchmark
  • [arXiv 2026.03] Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation [Proj]
    Human Data Evaluation Benchmark Agentic Tasks
  • [arXiv 2026.03] $OneMillion-Bench: How Far are Language Agents from Human Experts?
    Evaluation Benchmark
  • 🌟 [arXiv 2026.02] LiveMedBench: A Contamination-Free Medical Benchmark for LLMs with Automated Rubric Evaluation [Code]
    Evaluation Benchmark Medical
  • 🌟 [arXiv 2026.02] When and What to Ask: AskBench and Rubric-Guided RLVR for LLM Clarification [Code]
    Evaluation Benchmark Text Modality Agentic Tasks
  • 🌟 [arXiv 2026.01] PLawBench: A Rubric-Based Benchmark for Evaluating LLMs in Real-World Legal Practice [Code]
    Evaluation Benchmark
  • [arXiv 2026.01] Frontier Science: Evaluating AI's Ability to Perform Expert-Level Scientific Tasks [Data]
    Evaluation Benchmark
  • [arXiv 2026.01] UEval: A Benchmark for Unified Multimodal Generation [Proj]
    Evaluation Benchmark
2025
  • [arXiv 2025.11] ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents [Proj]
    Evaluation Benchmark
  • [arXiv 2025.11] Evaluating Legal Reasoning Traces with Legal Issue Tree Rubrics
    Evaluation Benchmark Text Modality
  • 🌟 [arXiv 2025.11] AdvancedIF: Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following [Code]
    Reward Modeling and Signal Design Evaluation Benchmark
  • [arXiv 2025.10] GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks [Data]
    Evaluation Benchmark
  • [arXiv 2025.10] MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes [Proj]
    Evaluation Benchmark
  • 🌟 [ICLR 26] ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge [Code]
    Evaluation Benchmark
  • [ICLR 26] ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists [Proj]
    Evaluation Benchmark
  • 🌟 [arXiv 2025.07] Generalizing Verifiable Instruction Following [Code]
    Evaluation Benchmark
  • 🌟 [arXiv 2025.05] HealthBench: Evaluating Large Language Models Towards Improved Human Health [Code]
    Human Data Evaluation Benchmark Medical
  • 🌟 [arXiv 2025.04] PaperBench: Evaluating AI's Ability to Replicate AI Research [Code]
    Human Data Evaluation Benchmark

Application Settings of Rubrics

Applications are grouped by modality and domain, highlighting where rubrics help capture quality, safety, and task completion.

Rubrics Across Modalities

Text Modality

This section covers rubric use in text generation, dialogue, and reasoning-heavy language tasks. The emphasis is on how structured criteria guide evaluation or training for open-ended textual outputs.

2026
  • 🌟 [arXiv 2026.02] When and What to Ask: AskBench and Rubric-Guided RLVR for LLM Clarification [Code]
    Evaluation Benchmark Text Modality Agentic Tasks
  • [arXiv 2026.01] Rewarding Creativity: A Human-Aligned Generative Reward Model for Reinforcement Learning in Storytelling
    Reward Modeling and Signal Design Text Modality
2025
  • [arXiv 2025.12] Evaluating Legal Reasoning Traces with Legal Issue Tree Rubrics
    Evaluation Benchmark Text Modality
  • [arXiv 2025.10] Benchmarking and Learning Real-World Customer Service Dialogue
    Human Data Text Modality
  • [arXiv 2025.08] Are Today's LLMs Ready to Explain Well-Being Concepts?
    Preference Tuning Text Modality
  • [ICLR 26] QuRL: Rubrics As Judge For Open-Ended Question Answering
    Text Modality
  • 🌟 [ICLR 26] The CoT Encyclopedia: Analyzing, Predicting, and Controlling how a Reasoning Model will Think [Code]
    Text Modality

Vision Modality

Visual rubric work extends structured judging and reward design to images, videos, and vision-language tasks. It is useful when model quality depends on multiple perceptual and semantic dimensions rather than a single scalar objective.

2026
  • [arXiv 2026.03] RubiCap: Rubric-Guided Reinforcement Learning for Dense Image Captioning
    Vision Modality
  • 🌟 [arXiv 2026.03] Rationale Matters: Learning Transferable Rubrics via Proxy-Guided Critique for VLM Reward Models [Code]
    Vision Modality
  • [arXiv 2026.03] When Rubrics Fail: Error Enumeration as Reward in Reference-Free RL Post-Training for Virtual Try-On
    Vision Modality
  • [arXiv 2026.01] SOCIAL CAPTION: Evaluating Social Understanding in Multimodal Models
    Vision Modality
2025
  • [arXiv 2025.11] RubricRL: Simple Generalizable Rewards for Text-to-Image Generation
    Vision Modality
  • [arXiv 2025.10] Beyond Seeing: Evaluating Multimodal LLMs on Tool-Enabled Image Perception [Proj]
    Vision Modality Agentic Tasks
  • 🌟 [ICLR 26] MicroVerse: A Preliminary Exploration Toward a Micro-World Simulation [Code]
    Vision Modality

Audio Modality

  • No retained papers after full-text justification review.

Rubrics Across Domains

Medical

Medical applications use rubrics to capture expert standards, safety expectations, and multi-step clinical reasoning quality. This is important because medical evaluation often cannot be reduced to single-answer correctness.

2026
  • 🌟 [arXiv 2026.03] QuarkMedBench: A Real-World Scenario Driven Benchmark for Evaluating Large Language Models [Code]
    Evaluation Benchmark Medical
  • [arXiv 2026.03] MedMT-Bench: Can LLMs Memorize and Understand Long Multi-Turn Conversations in Medical Scenarios?
    Evaluation Benchmark Medical
  • 🌟 [arXiv 2026.02] ClinAlign: Scaling Healthcare Alignment from Clinician Preference [Code]
    Synthetic Data Medical
  • 🌟 [arXiv 2026.02] LiveMedBench: A Contamination-Free Medical Benchmark for LLMs with Automated Rubric Evaluation [Code]
    Evaluation Benchmark Medical
  • [arXiv 2026.02] Quark Medical Alignment: A Holistic Multi-Dimensional Alignment and Collaborative Optimization Paradigm
    Medical
  • [arXiv 2026.01] RubRIX: Rubric-Driven Risk Mitigation in Caregiver-AI Interactions
    Medical
  • [arXiv 2026.01] Health-SCORE: Towards Scalable Rubrics for Improving Health-LLMs
    Human Data Medical
2025
  • 🌟 [arXiv 2025.10] InfiMed-ORBIT: Aligning LLMs on Open-Ended Complex Tasks via Rubric-Based Incremental Training [Code]
    Medical
  • [arXiv 2025.09] Baichuan-M2: Scaling Medical Capability with Large Verifier System
    Medical
  • 🌟 [arXiv 2025.05] HealthBench: Evaluating Large Language Models Towards Improved Human Health [Code]
    Human Data Evaluation Benchmark Medical

Software Engineering and Code Agents

Code-domain rubric work studies structured evaluation for coding, debugging, and software-agent behavior.

2026
  • [arXiv 2026.01] Agentic Rubrics as Contextual Verifiers for SWE Agents
    Code Agents

Agentic Tasks

Agent settings require rubrics to evaluate long-horizon behavior, tool use, planning, and subjective task completion. This section highlights work where structured criteria are central to assessing or training interactive agents.

2026
  • 🌟 [arXiv 2026.04] Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation [Code]
    Human Data Evaluation Benchmark Agentic Tasks
  • 🌟 [arXiv 2026.03] PresentBench: A Fine-Grained Rubric-Based Benchmark for Slide Generation [Code]
    Human Data Agentic Tasks
  • 🌟 [arXiv 2026.03] AdaRubric: Task-Adaptive Rubrics for LLM Agent Evaluation [Code]
    Preference Tuning LLM-as-a-Judge and Reward Reasoning Agentic Tasks
  • [arXiv 2026.03] Beyond Binary Correctness: Scaling Evaluation of Long-Horizon Agents on Subjective Enterprise Tasks
    Evaluation Benchmark Agentic Tasks
  • 🌟 [arXiv 2026.03] PRBench: End-to-end Paper Reproduction in Physics Research [Code]
    Human Data Agentic Tasks
  • [arXiv 2026.02] Learning Query-Specific Rubrics from Human Preferences for DeepResearch Report Generation
    Reward Modeling and Signal Design Agentic Tasks
  • 🌟 [arXiv 2026.02] When and What to Ask: AskBench and Rubric-Guided RLVR for LLM Clarification [Code]
    Evaluation Benchmark Text Modality Agentic Tasks
  • 🌟 [arXiv 2026.02] CM2: Reinforcement Learning with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use [Code]
    Policy Optimization Agentic Tasks
  • 🌟 [arXiv 2026.01] Technical Report Tongyi DeepResearch [Code]
    Agentic Tasks
  • 🌟 [arXiv 2026.01] Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards [Code]
    Policy Optimization Agentic Tasks
  • 🌟 [arXiv 2026.01] Mock Worlds, Real Skills: Building Small Agentic Language Models with Synthetic Tasks, Simulated Environments, and Rubric-Based Rewards [Code]
    Agentic Tasks
2025
  • [arXiv 2025.12] ARCANE: A Multi-Agent Framework for Interpretable and Configurable Alignment
    Agentic Tasks
  • 🌟 [arXiv 2025.12] Step-DeepResearch Technical Report [Code]
    Agentic Tasks
  • [arXiv 2025.10] Reinforcement Learning for Tool-Integrated Interleaved Thinking towards Cross-Domain Generalization
    Policy Optimization Agentic Tasks
  • [arXiv 2025.10] Beyond Seeing: Evaluating Multimodal LLMs on Tool-Enabled Image Perception
    Vision Modality Agentic Tasks
  • [NeurIPS 25-W] Towards Real-World Evaluation of Agentic Work in Freelance Marketplaces
    Agentic Tasks

LICENSE

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

If you have any questions or suggestions, please feel free to contact Hongru Xiao.
