A curated reading list on rubric-based evaluation, reward modeling, and post-training for large models.
Rubrics turn expert judgment into structured criteria, auditable LLM judges, and trainable reward signals.
What are Rubrics? · Why Rubrics Matter? · Repository Map · Table of Contents
Papers with publicly released code or project resources include an inline [[Code](...)] link. Entries without verified repositories omit that link.
Contributions are welcome. If you find missing papers, inaccurate classifications, or newly released code, feel free to update this list.
In the context of LLM evaluation and alignment, a rubric is a structured set of criteria for judging open-ended model outputs. Instead of asking a human or LLM judge for one vague preference, rubrics decompose quality into explicit dimensions, scoring rules, and evidence requirements.
Rubrics make subjective judgment more inspectable:
- What to judge: relevance, factuality, completeness, safety, reasoning quality, style, or domain-specific standards.
- How to judge: score levels, checklists, pairwise criteria, evidence anchors, or weighted dimensions.
- How to use the judgment: evaluation reports, LLM-as-a-Judge protocols, reward models, preference tuning, policy optimization, and curriculum learning.
Figure 1. Rubrics convert coarse feedback into fine-grained, inspectable reward signals.
| Feedback style | Typical signal | Best fit | Main limitation |
|---|---|---|---|
| RLHF / model-based preference | "Output A is better than output B." | Open-ended comparison | Coarse and hard to inspect |
| RLVR / rule-based reward | Format is correct, answer matches, reasoning token appears, list structure exists | Verifiable tasks | Too rigid for subjective or open-ended tasks |
| Rubric-based feedback | Relevance, completeness, clarity, safety, each scored separately | Open-ended evaluation and training | Requires careful design and calibration |
Rubrics are the middle layer: more structured than model-only preference, more flexible than hard rules.
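As a minimal sketch of the "what to judge" and "how to judge" pieces above, a rubric can be represented as a list of weighted criteria on a fixed score scale. The class name, fields, and the 0-to-`max_score` scale are illustrative assumptions, not a format taken from any paper in this list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One rubric dimension (hypothetical schema, for illustration only)."""
    name: str
    description: str
    weight: float = 1.0
    max_score: int = 4  # assumed scale: integer scores from 0 to max_score


def aggregate(criteria: list[Criterion], scores: dict[str, int]) -> float:
    """Weighted mean of per-criterion scores, normalized to [0, 1]."""
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("rubric needs a positive total weight")
    total = 0.0
    for c in criteria:
        score = scores[c.name]  # KeyError if the judge skipped a criterion
        if not 0 <= score <= c.max_score:
            raise ValueError(f"{c.name}: score {score} outside 0..{c.max_score}")
        total += c.weight * score / c.max_score
    return total / total_weight


rubric = [
    Criterion("relevance", "Addresses the question asked.", weight=2.0),
    Criterion("factuality", "Claims are accurate.", weight=2.0),
    Criterion("clarity", "Well organized and easy to follow.", weight=1.0),
]
judge_scores = {"relevance": 4, "factuality": 2, "clarity": 3}
print(aggregate(rubric, judge_scores))  # 0.75
```

Keeping the per-criterion scores alongside the aggregate is what makes the signal inspectable: the scalar can feed a reward, while the breakdown shows which dimension cost the response points.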
"In this new era, evaluation becomes more important than training."
- Shunyu Yao, The Second Half (2025)
As large models move from closed-form QA to open-ended reasoning, agents, multimodal generation, and professional domains, progress is increasingly bottlenecked by evaluation and feedback design. Training can optimize only what the system can measure, and many important tasks cannot be reduced to a single scalar reward.
| Rubrics help answer | Why it matters |
|---|---|
| What counts as good behavior? | They define explicit criteria, scoring boundaries, and failure modes. |
| How can expert judgment scale? | They convert tacit standards into reusable evaluation instructions and datasets. |
| How can LLM-as-a-Judge become less opaque? | Judges can be required to expose criteria, evidence, scores, and rationales. |
| How does evaluation become training signal? | Rubric-level feedback can supervise SFT, preference tuning, policy optimization, reward modeling, and curriculum learning. |
Rubrics therefore act as a bridge between human standards and machine-optimizable signals. They are not merely annotation templates; they are a control surface for evaluation, reward modeling, and post-training.
Figure 2. The number of rubric-related papers has grown rapidly, suggesting increasing research attention to structured evaluation and reward design.
This trend suggests that rubric-based methods are becoming an important direction for large-model alignment, especially as evaluation, reward modeling, and post-training move toward more structured and auditable feedback.
Evaluation is no longer only a post-hoc metric. It is becoming part of the infrastructure of AI systems:
Expert Standards → Rubrics → Evaluation Signals → Rewards → Training Dynamics
Rubrics are therefore not just for judging model outputs. They are a way to automate parts of expert feedback: experts define criteria, models apply them at scale, and failures reveal where the rubric or judge must be revised. In this sense, evaluation becomes an executable form of domain knowledge.
For the query:
How can cities encourage more people to use public transport?
a rubric does not directly ask "which answer is better?" It decomposes the judgment:
| Component | What the judge checks |
|---|---|
| Relevance | Does the answer address public transport adoption rather than unrelated urban issues? |
| Clarity | Is the answer easy to understand and well organized? |
| Completeness | Does it cover affordability, convenience, infrastructure, reliability, and incentives? |
| Safety / fairness | Does it avoid harmful, biased, or exclusionary suggestions? |
This makes the reward more interpretable, decomposable, and actionable.
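One way to put the table above in front of an LLM judge is to render it into a prompt and parse a structured reply. The sketch below is a hedged illustration: the prompt wording, the 1-to-5 scale, the JSON reply format, and the function names are all assumptions, and the reply is hand-written rather than produced by a real model.

```python
import json

# Components copied from the table above; the prompt wording and the
# JSON reply format are assumptions made for this sketch.
RUBRIC = {
    "Relevance": "Does the answer address public transport adoption rather than unrelated urban issues?",
    "Clarity": "Is the answer easy to understand and well organized?",
    "Completeness": "Does it cover affordability, convenience, infrastructure, reliability, and incentives?",
    "Safety / fairness": "Does it avoid harmful, biased, or exclusionary suggestions?",
}


def build_judge_prompt(query: str, answer: str) -> str:
    """Render the rubric into instructions for a judge model."""
    lines = [f"- {name}: {question}" for name, question in RUBRIC.items()]
    return (
        f"Query: {query}\nAnswer: {answer}\n\n"
        "Score each component from 1 to 5 and quote supporting evidence.\n"
        + "\n".join(lines)
        + '\n\nReply with JSON: {"<component>": {"score": int, "evidence": str}}'
    )


def parse_judgment(reply: str) -> dict[str, int]:
    """Extract per-component scores, rejecting missing or out-of-range ones."""
    data = json.loads(reply)
    scores = {}
    for name in RUBRIC:
        score = data[name]["score"]
        if not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"invalid score for {name}: {score!r}")
        scores[name] = score
    return scores


# A hand-written reply standing in for a real judge model's output.
example_reply = json.dumps({
    "Relevance": {"score": 5, "evidence": "Focuses on ridership incentives."},
    "Clarity": {"score": 4, "evidence": "Uses a short numbered list."},
    "Completeness": {"score": 2, "evidence": "Only discusses fare cuts."},
    "Safety / fairness": {"score": 5, "evidence": "No exclusionary proposals."},
})
scores = parse_judgment(example_reply)
weakest = min(scores, key=scores.get)
print(scores, "-> weakest component:", weakest)
```

Here the lowest-scoring component is Completeness, which is the kind of actionable signal a single "A is better than B" judgment would not expose.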
Figure 3. Rubric construction paradigms for large model alignment.
| Strategy | Core idea | When it is useful |
|---|---|---|
| Expert direct annotation | Experts write criteria explicitly. | High-stakes domains and seed rubrics |
| Induction from expert QA annotations | Criteria are extracted from annotated examples. | Scaling expert knowledge beyond manual templates |
| Distillation from teacher demonstrations | Rubrics are derived from high-quality model outputs. | Bootstrapping scalable reward signals |
Together, these strategies show how rubric construction moves from manual specification toward data-driven induction and model-driven distillation.
This repository is organized as a conceptual map of rubric-related research. We group papers by the role rubrics play in the large-model pipeline.
This organization helps show rubrics not only as evaluation tools, but also as structured interfaces connecting expert standards, feedback data, reward signals, training objectives, and deployment-time assessment.
| Section | Role in the repository |
|---|---|
| Foundations | Introduces what counts as a rubric, how rubric formats differ from preferences, rules, or scalar scores, and why structured criteria become useful in large-model settings. |
| Data | Covers how rubrics are collected, generated, refined, and organized into reusable supervision signals through human annotation, synthetic generation, expert labeling, and rubric datasets. |
| Training | Summarizes how rubric-level judgments can be transformed into SFT data, preference objectives, RL rewards, curriculum signals, and self-improvement loops. |
| Evaluation | Connects rubrics to LLM-as-a-judge protocols, benchmark design, calibration, reliability analysis, and robustness checks, where explicit and auditable criteria are especially important. |
| Applications | Shows how rubric-based methods extend beyond text QA to multimodal tasks, agent systems, and professional domains that require domain-specific standards. |
Overall, this structure follows the lifecycle of rubric-based large-model alignment:
Define criteria → collect or generate rubric data → train with rubric signals → evaluate with structured judges → apply in domain-specific tasks
Rubrics provide a structured layer for connecting data, training, evaluation, and applications.
Browse the reading list
Papers with publicly released code are marked with π.
Rubrics define structured evaluation dimensions, scoring rules, and judgment boundaries for open-ended model outputs. This section covers work that clarifies what counts as a rubric and how rubrics function as judges or reward criteria.
- π [arXiv 2025.10] From Implicit Weights to Explicit Rubrics: A Training-Free Framework for Reward Modeling [Code]
- [ICLR 26] Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains [Data]
- [Blog 2024.11] Reward Hacking in Reinforcement Learning
This section focuses on how rubrics are expressed, including dimensions, levels, weights, and scoring templates. It is useful for understanding the representational form that makes rubric-based supervision reusable and controllable.
- [arXiv 2026.03] AutoChecklist: Composable Pipelines for Checklist Generation and Scoring with LLM-as-a-Judge
- π [arXiv 2025.10] From Implicit Weights to Explicit Rubrics: A Training-Free Framework for Reward Modeling [Code]
- No retained papers after full-text justification review.
- π [ICLR 26] Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training [Code]
Synthetic rubric generation uses generated tasks, labels, critiques, or rubric annotations to expand supervision beyond limited human labeling. It is especially useful when rubric-style feedback can be programmatically produced at scale for reward modeling or post-training.
- π [arXiv 2026.02] ClinAlign: Scaling Healthcare Alignment from Clinician Preference [Code]
- π [arXiv 2026.01] RubricHub: A Comprehensive and Highly Discriminative Rubric Dataset via Automated Coarse-to-Fine Generation [Code]
- [arXiv 2025.10] OpenRubrics: Towards Scalable Synthetic Rubric Generation for Reward Modeling and LLM Alignment [Data]
- π [ICLR 26] OptimSyn: Influence-Guided Rubrics Optimization for Synthetic Data Generation [Code]
- [ICLR 26] Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains [Data]
Human- and expert-grounded rubric data refers to signals from human preferences, authentic interactions, or domain specialists. These sources are important when alignment targets depend on nuanced standards that are hard to synthesize fully.
- π [arXiv 2026.04] Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation [Code]
- π [arXiv 2026.03] PresentBench: A Fine-Grained Rubric-Based Benchmark for Slide Generation [Code]
- π [arXiv 2026.03] RubricBench: Aligning Model-Generated Rubrics with Human Standards [Code]
- π [arXiv 2026.03] PRBench: End-to-end Paper Reproduction in Physics Research [Code]
- [arXiv 2026.01] Health-SCORE: Towards Scalable Rubrics for Improving Health-LLMs
- [arXiv 2025.10] Benchmarking and Learning Real-World Customer Service Dialogue
- π [ICLR 26] MENLO: From Preferences to Proficiency - Evaluating and Modeling Native-like Quality Across 47 Languages [Code]
- π [arXiv 2025.05] HealthBench: Evaluating Large Language Models Towards Improved Human Health [Code]
- π [arXiv 2025.04] PaperBench: Evaluating AI's Ability to Replicate AI Research [Code]
- No retained papers after full-text justification review.
Rubrics can be used in supervised fine-tuning for filtering data, weighting samples, or imposing structured response preferences. This makes SFT more aligned with multi-dimensional quality targets instead of flat imitation alone.
- [ICLR 26] P-GenRM: Personalized Generative Reward Model with Test-time User-based Scaling
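As a rough illustration of rubric-based filtering and sample weighting for SFT, the sketch below keeps only examples that clear a per-dimension floor and a mean threshold, then attaches the mean score as a sample weight. The field names, thresholds, and the `[0, 1]` score range are assumptions for illustration, not a recipe from any listed paper.

```python
def select_sft_examples(examples, min_per_dim=0.5, min_mean=0.7):
    """Keep examples that pass every dimension; weight them by mean rubric score."""
    kept = []
    for ex in examples:
        scores = ex["rubric_scores"]  # assumed: dimension name -> score in [0, 1]
        mean = sum(scores.values()) / len(scores)
        if min(scores.values()) >= min_per_dim and mean >= min_mean:
            kept.append({**ex, "weight": mean})
    return kept


candidates = [
    {"id": "a", "rubric_scores": {"factuality": 0.9, "completeness": 0.8, "style": 0.7}},
    # High average, but fails the per-dimension floor on factuality.
    {"id": "b", "rubric_scores": {"factuality": 0.3, "completeness": 1.0, "style": 1.0}},
    # Passes every floor, but the mean is too low.
    {"id": "c", "rubric_scores": {"factuality": 0.6, "completeness": 0.6, "style": 0.6}},
]
selected = select_sft_examples(candidates)
print([ex["id"] for ex in selected])  # ['a']
```

The per-dimension floor is the part a flat quality score cannot express: example `b` has a higher average than the mean threshold but is still dropped for weak factuality.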
Direct preference-learning methods use rubric signals to construct, weight, or structure preference objectives, including DPO-style and preference-tuning approaches.
- π [arXiv 2026.03] AdaRubric: Task-Adaptive Rubrics for LLM Agent Evaluation [Code]
- [arXiv 2025.10] OpenRubrics: Towards Scalable Synthetic Rubric Generation for Reward Modeling and LLM Alignment [Model]
- [arXiv 2025.08] Are Today's LLMs Ready to Explain Well-Being Concepts?
- [ICML-W] Configurable Preference Tuning with Rubric-Guided Synthetic Data
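A simple way to turn rubric scores into preference data is to compare candidates pairwise and keep only pairs separated by a clear margin. The sketch below assumes each candidate already has an aggregate rubric score; the `margin` value, the function name, and the `prompt`/`chosen`/`rejected` record layout are illustrative choices rather than a method from the papers above.

```python
from itertools import combinations


def build_preference_pairs(prompt, candidates, margin=0.1):
    """candidates: list of (response, rubric_score) tuples for one prompt."""
    pairs = []
    for (resp_a, score_a), (resp_b, score_b) in combinations(candidates, 2):
        if abs(score_a - score_b) < margin:
            continue  # too close to call; skip ambiguous pairs
        if score_a > score_b:
            chosen, rejected = resp_a, resp_b
        else:
            chosen, rejected = resp_b, resp_a
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs


scored = [("response A", 0.9), ("response B", 0.5), ("response C", 0.45)]
pairs = build_preference_pairs("example prompt", scored)
for p in pairs:
    print(p["chosen"], ">", p["rejected"])
```

With these scores, A is preferred over both B and C, while the B/C pair is dropped because its 0.05 gap falls under the margin.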
RL post-training methods modify policy optimization, advantage estimation, exploration, or training stability when rewards are rubric-based or multi-dimensional.
- π [arXiv 2026.04] Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks [Code]
- [arXiv 2026.03] Listening to the Echo: User-Reaction Aware Policy Optimization via Scalar-Verbal Hybrid Reinforcement Learning
- [arXiv 2026.03] Alternating Reinforcement Learning with Contextual Rubric Rewards
- π [arXiv 2026.03] Experience is the Best Teacher: Motivating Effective Exploration in Reinforcement Learning for LLMs [Code]
- π [arXiv 2026.03] PAPO: Stabilizing Rubric Integration Training via Decoupled Advantage Normalization [Code]
- [arXiv 2026.02] Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training [Model]
- [arXiv 2026.02] Learning to Self-Verify Makes Language Models Better Reasoners
- π [arXiv 2026.02] CM2: Reinforcement Learning with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use [Code]
- π [arXiv 2026.02] Open Rubric System: Scaling Reinforcement Learning with Pairwise Adaptive Rubric [Code]
- π [arXiv 2026.01] Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards [Code]
- [arXiv 2026.01] GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization [Proj]
- [arXiv 2025.08] Pareto Multi-Objective Alignment for Language Models
- [arXiv 2025.06] Direct Reasoning Optimization: Constrained RL with Token-Level Dense Reward and Rubric-Gated Constraints for Open-ended Tasks
- π [arXiv 2025.08] Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning [Code]
- [arXiv 2025.09] Quantile Advantage Estimation: Stabilizing RLVR for LLM Reasoning
- [arXiv 2025.10] Reinforcement Learning for Tool-Integrated Interleaved Thinking towards Cross-Domain Generalization
- [arXiv 2025.11] Reward and Guidance through Rubrics: Promoting Exploration to Improve Multi-Domain Reasoning
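To make the "multi-dimensional reward" setting concrete, the sketch below standardizes each rubric dimension across a group of sampled responses before combining them with weights. This is a generic illustration of group-relative normalization under assumed names and weights; it is not the algorithm of any specific paper listed here.

```python
from statistics import mean, pstdev


def group_advantages(group_scores, weights, eps=1e-6):
    """group_scores: one dict of rubric-dimension scores per sampled response.

    Each dimension is standardized across the group, then the standardized
    values are combined using the given weights.
    """
    advantages = [0.0] * len(group_scores)
    for dim, w in weights.items():
        values = [s[dim] for s in group_scores]
        mu, sigma = mean(values), pstdev(values)
        for i, v in enumerate(values):
            # eps avoids division by zero when a dimension has no variance.
            advantages[i] += w * (v - mu) / (sigma + eps)
    return advantages


group = [
    {"accuracy": 1.0, "clarity": 0.5},
    {"accuracy": 0.0, "clarity": 0.5},
    {"accuracy": 1.0, "clarity": 1.0},
    {"accuracy": 0.0, "clarity": 0.0},
]
adv = group_advantages(group, weights={"accuracy": 1.0, "clarity": 0.5})
print([round(a, 2) for a in adv])
```

Normalizing per dimension keeps a high-variance criterion from dominating the combined signal; the weights then express how much each criterion should matter.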
Methods in this section design, generate, calibrate, or densify rubric-based reward signals, including reward models, LLM judges, checklist rewards, and rubric-to-token supervision.
- π [arXiv 2026.05] Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria [Code]
- [Tech Report 2026.04] DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence [Model]
- π [arXiv 2026.04] Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks [Code]
- π [arXiv 2026.03] CDRRM: Contrast-Driven Rubric Generation for Reliable and Interpretable Reward Modeling [Code]
- π [arXiv 2026.03] AutoChecklist: Composable Pipelines for Checklist Generation and Scoring with LLM-as-a-Judge [Code]
- [arXiv 2026.03] RubricRAG: Towards Interpretable and Reliable LLM Evaluation via Domain Knowledge Retrieval for Rubric Generation
- [arXiv 2026.02] Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training [Model]
- π [arXiv 2026.02] Open Rubric System: Scaling Reinforcement Learning with Pairwise Adaptive Rubric [Code]
- π [arXiv 2026.02] OMNI-RRM: Advancing Omni Reward Model [Code]
- [arXiv 2026.02] Grad2Reward: From Sparse Judgment to Dense Rewards for Improving Open-Ended LLM Reasoning
- [arXiv 2026.02] Learning Query-Specific Rubrics from Human Preferences for DeepResearch Report Generation
- [arXiv 2026.02] Rethinking Rubric Generation for Improving LLM Judge and Reward Modeling for Open-ended Tasks
- [arXiv 2026.02] Features as Rewards: Scalable Supervision for Open-Ended Tasks via Interpretability
- [arXiv 2026.02] SibylSense: Adaptive Rubric Learning via Memory Tuning and Adversarial Probing
- π [arXiv 2026.01] RubricHub: A Comprehensive and Highly Discriminative Rubric Dataset via Automated Coarse-to-Fine Generation [Code]
- π [arXiv 2026.01] Reward Modeling for Scientific Writing Evaluation [Code]
- π [arXiv 2026.01] P-Check: Advancing Personalized Reward Models via Learning to Generate Dynamic Checklists [Code]
- [arXiv 2026.01] Rewarding Creativity: A Human-Aligned Generative Reward Model for Reinforcement Learning in Storytelling
- π [arXiv 2025.12] TableGPT-R1: Advancing Tabular Reasoning Through Reinforcement Learning [Code]
- π [arXiv 2025.11] AdvancedIF: Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following [Code]
- π [arXiv 2025.10] From Implicit Weights to Explicit Rubrics: A Training-Free Framework for Reward Modeling [Code]
- [arXiv 2025.10] OpenRubrics: Towards Scalable Synthetic Rubric Generation for Reward Modeling and LLM Alignment [Data] [Model]
- [ICLR 26] RLAC: Reinforcement Learning with Adversarial Critic for Free-Form Generation Tasks [Proj]
- [arXiv 2025.08] Reinforcement Learning with Rubric Anchors
- π [arXiv 2025.06] AutoRule: Reasoning Chain-of-Thought Extracted Rule-Based Rewards Improve Preference Learning [Code]
- [ICLR 26] Robust Reward Modeling via Causal Rubrics
- π [NeurIPS 25] Checklists Are Better Than Reward Models For Aligning Language Models [Code]
- π [arXiv 2025.05] R3: Robust Rubric-Agnostic Reward Models [Code]
Curriculum learning studies how rubric dimensions or difficulty levels can stage training over time. It is relevant when structured feedback is used not only to score outputs but also to organize learning progression.
- [arXiv 2026.02] RuCL: Stratified Rubric-Based Curriculum Learning for Multimodal Large Language Model Reasoning
- π [ICLR 26] P-GenRM: Personalized Generative Reward Model with Test-time User-based Scaling [Code]
- π [arXiv 2026.02] Reinforcing Chain-of-Thought Reasoning with Self-Evolving Rubrics [Code]
Evaluation methods focus on how rubrics are used to judge outputs reliably and consistently across tasks. This includes LLM-as-a-judge settings, rubric-aware reward reasoning, and methods that improve interpretability of evaluation.
- [arXiv 2026.03] RubricRAG: Towards Interpretable and Reliable LLM Evaluation via Domain Knowledge Retrieval for Rubric Generation [Data]
- [arXiv 2026.03] Beyond the Illusion of Consensus: From Surface Heuristics to Knowledge-Grounded Evaluation in LLM-as-a-Judge
- π [arXiv 2026.03] CDRRM: Contrast-Driven Rubric Generation for Reliable and Interpretable Reward Modeling [Code]
- π [arXiv 2026.03] AutoChecklist: Composable Pipelines for Checklist Generation and Scoring with LLM-as-a-Judge [Code]
- [arXiv 2026.03] Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training
- π [arXiv 2026.03] AdaRubric: Task-Adaptive Rubrics for LLM Agent Evaluation [Code]
- [arXiv 2026.02] Rethinking Rubric Generation for Improving LLM Judge and Reward Modeling for Open-ended Tasks
- π [arXiv 2026.02] Rubrics as an Attack Surface: Stealthy Preference Drift in LLM Judges [Code]
- π [arXiv 2026.01] RULERS: Locked Rubrics and Evidence-Anchored Scoring for Robust LLM Evaluation [Code]
- π [ICLR 26] mR3: Multilingual Rubric-Agnostic Reward Reasoning Models [Code]
- [ICLR 26] RM-R1: Reward Modeling as Reasoning [Proj]
- [ICLR 26] MENLO: From Preferences to Proficiency - Evaluating and Modeling Native-like Quality Across 47 Languages [Data]
- π [ICLR 26] Retro: Optimizing LLMs for Reasoning-Intensive Document Retrieval [Code]
- π [arXiv 2025.10] From Implicit Weights to Explicit Rubrics: A Training-Free Framework for Reward Modeling [Code]
- π [arXiv 2025.05] R3: Robust Rubric-Agnostic Reward Models [Code]
- π [arXiv 2025.01] SedarEval: Automated Evaluation using Self-Adaptive Rubrics [Code]
Benchmark work provides datasets and tasks where rubric-based evaluation can be compared, stress-tested, and standardized. These resources are important for measuring whether rubric-trained or rubric-judged systems generalize across realistic scenarios.
- π [arXiv 2026.03] RubricBench: Aligning Model-Generated Rubrics with Human Standards [Code]
- π [arXiv 2026.03] Is this Idea Novel? An Automated Benchmark for Judgment of Research Ideas [Code]
- [arXiv 2026.03] Beyond Binary Correctness: Scaling Evaluation of Long-Horizon Agents on Subjective Enterprise Tasks
- [arXiv 2026.03] MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome [Proj]
- [arXiv 2026.03] Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation [Proj]
- [arXiv 2026.03] $OneMillion-Bench: How Far are Language Agents from Human Experts?
- π [arXiv 2026.02] LiveMedBench: A Contamination-Free Medical Benchmark for LLMs with Automated Rubric Evaluation [Code]
- [arXiv 2026.02] When and What to Ask: AskBench and Rubric-Guided RLVR for LLM Clarification
- π [arXiv 2026.01] PLawBench: A Rubric-Based Benchmark for Evaluating LLMs in Real-World Legal Practice [Code]
- [arXiv 2026.01] Frontier Science: Evaluating AI's Ability to Perform Expert-Level Scientific Tasks [Data]
- [arXiv 2026.01] UEval: A Benchmark for Unified Multimodal Generation [Proj]
- [arXiv 2025.11] ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents [Proj]
- [arXiv 2025.11] Evaluating Legal Reasoning Traces with Legal Issue Tree Rubrics
- π [arXiv 2025.11] AdvancedIF: Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following [Code]
- [arXiv 2025.10] GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks [Data]
- [arXiv 2025.10] MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes [Proj]
- π [ICLR 26] ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge [Code]
- [ICLR 26] ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists [Proj]
- π [arXiv 2025.07] Generalizing Verifiable Instruction Following [Code]
- π [arXiv 2025.05] HealthBench: Evaluating Large Language Models Towards Improved Human Health [Code]
- π [arXiv 2025.04] PaperBench: Evaluating AI's Ability to Replicate AI Research [Code]
Applications are grouped by modality and domain, highlighting where rubrics help capture quality, safety, and task completion.
This section covers rubric use in text generation, dialogue, and reasoning-heavy language tasks. The emphasis is on how structured criteria guide evaluation or training for open-ended textual outputs.
- π [arXiv 2026.02] When and What to Ask: AskBench and Rubric-Guided RLVR for LLM Clarification [Code]
- [arXiv 2026.01] Rewarding Creativity: A Human-Aligned Generative Reward Model for Reinforcement Learning in Storytelling
- [arXiv 2025.12] Evaluating Legal Reasoning Traces with Legal Issue Tree Rubrics
- [arXiv 2025.10] Benchmarking and Learning Real-World Customer Service Dialogue
- [arXiv 2025.08] Are Today's LLMs Ready to Explain Well-Being Concepts?
- [ICLR 26] QuRL: Rubrics As Judge For Open-Ended Question Answering
- π [ICLR 26] The CoT Encyclopedia: Analyzing, Predicting, and Controlling how a Reasoning Model will Think [Code]
Visual rubric work extends structured judging and reward design to images, videos, and vision-language tasks. It is useful when model quality depends on multiple perceptual and semantic dimensions rather than a single scalar objective.
- [arXiv 2026.03] RubiCap: Rubric-Guided Reinforcement Learning for Dense Image Captioning
- π [arXiv 2026.03] Rationale Matters: Learning Transferable Rubrics via Proxy-Guided Critique for VLM Reward Models [Code]
- [arXiv 2026.03] When Rubrics Fail: Error Enumeration as Reward in Reference-Free RL Post-Training for Virtual Try-On
- [arXiv 2026.01] Social Caption: Evaluating Social Understanding in Multimodal Models
- [arXiv 2025.11] RubricRL: Simple Generalizable Rewards for Text-to-Image Generation
- [arXiv 2025.10] Beyond Seeing: Evaluating Multimodal LLMs on Tool-Enabled Image Perception [Proj]
- π [ICLR 26] MicroVerse: A Preliminary Exploration Toward a Micro-World Simulation [Code]
- No retained papers after full-text justification review.
Medical applications use rubrics to capture expert standards, safety expectations, and multi-step clinical reasoning quality. This is important because medical evaluation often cannot be reduced to single-answer correctness.
- π [arXiv 2026.03] QuarkMedBench: A Real-World Scenario Driven Benchmark for Evaluating Large Language Models [Code]
- [arXiv 2026.03] MedMT-Bench: Can LLMs Memorize and Understand Long Multi-Turn Conversations in Medical Scenarios?
- π [arXiv 2026.02] ClinAlign: Scaling Healthcare Alignment from Clinician Preference [Code]
- [arXiv 2026.02] LiveMedBench: A Contamination-Free Medical Benchmark for LLMs with Automated Rubric Evaluation
- [arXiv 2026.02] Quark Medical Alignment: A Holistic Multi-Dimensional Alignment and Collaborative Optimization Paradigm
- [arXiv 2026.01] RubRIX: Rubric-Driven Risk Mitigation in Caregiver-AI Interactions
- [arXiv 2026.01] Health-SCORE: Towards Scalable Rubrics for Improving Health-LLMs
- π [arXiv 2025.10] InfiMed-ORBIT: Aligning LLMs on Open-Ended Complex Tasks via Rubric-Based Incremental Training [Code]
- [arXiv 2025.09] Baichuan-M2: Scaling Medical Capability with Large Verifier System
- π [arXiv 2025.05] HealthBench: Evaluating Large Language Models Towards Improved Human Health [Code]
Code-domain rubric work studies structured evaluation for coding, debugging, and software-agent behavior.
- [arXiv 2026.01] Agentic Rubrics as Contextual Verifiers for SWE Agents
Agent settings require rubrics to evaluate long-horizon behavior, tool use, planning, and subjective task completion. This section highlights work where structured criteria are central to assessing or training interactive agents.
- π [arXiv 2026.04] Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation [Code]
- π [arXiv 2026.03] PresentBench: A Fine-Grained Rubric-Based Benchmark for Slide Generation [Code]
- π [arXiv 2026.03] AdaRubric: Task-Adaptive Rubrics for LLM Agent Evaluation [Code]
- [arXiv 2026.03] Beyond Binary Correctness: Scaling Evaluation of Long-Horizon Agents on Subjective Enterprise Tasks
- π [arXiv 2026.03] PRBench: End-to-end Paper Reproduction in Physics Research [Code]
- [arXiv 2026.02] Learning Query-Specific Rubrics from Human Preferences for DeepResearch Report Generation
- π [arXiv 2026.02] When and What to Ask: AskBench and Rubric-Guided RLVR for LLM Clarification [Code]
- π [arXiv 2026.02] CM2: Reinforcement Learning with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use [Code]
- π [arXiv 2026.01] Technical Report Tongyi DeepResearch [Code]
- π [arXiv 2026.01] Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards [Code]
- π [arXiv 2026.01] Mock Worlds, Real Skills: Building Small Agentic Language Models with Synthetic Tasks, Simulated Environments, and Rubric-Based Rewards [Code]
- [arXiv 2025.12] ARCANE: A Multi-Agent Framework for Interpretable and Configurable Alignment
- π [arXiv 2025.12] Step-DeepResearch Technical Report [Code]
- [arXiv 2025.10] Reinforcement Learning for Tool-Integrated Interleaved Thinking towards Cross-Domain Generalization
- [arXiv 2025.10] Beyond Seeing: Evaluating Multimodal LLMs on Tool-Enabled Image Perception
- [NeurIPS 25-W] Towards Real-World Evaluation of Agentic Work in Freelance Marketplaces
This project is licensed under the MIT License - see the LICENSE file for details.
If you have any questions or suggestions, please feel free to contact Hongru Xiao.


