feat(guardrails): add Cleanlab TLM hallucination/trustworthiness guardrail by dni138 · Pull Request #170 · mozilla-ai/any-guardrail

dni138 · 2026-05-22T19:04:26Z

Summary

Adds a CleanlabTlm guardrail that wraps Cleanlab's Trustworthy Language Model (TLM), a hosted scoring service that returns a trustworthiness score for any (prompt, response) pair. It is designed for hallucination detection in RAG and agent pipelines.

Research backing

Chen & Mueller, Quantifying Uncertainty in Answers from any Language Model and Enhancing their Trustworthiness (ACL 2024) — BSDetector method underlying TLM.
Cleanlab agent-architecture hallucination benchmark — trust scores detect incorrect RAG responses with ~3× the precision of RAGAS/groundedness baselines.

Why commercial-only

TLM combines black-box intrinsic + extrinsic uncertainty signals over arbitrary LLM APIs as a hosted scoring service. No OSS package replicates the full pipeline (the underlying BSDetector method is published, but the production scoring service with its calibration and multi-signal fusion is closed-source).

Access

$5 of free credits at signup.
Signup: https://tlm.cleanlab.ai/ → API key in dashboard.
Env var: CLEANLAB_TLM_API_KEY.
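As a minimal sketch of the access step, the snippet below checks for the key before any scoring call is attempted. Only the variable name `CLEANLAB_TLM_API_KEY` comes from this PR; the helper `load_api_key` and its error message are hypothetical and are not part of any-guardrail or the Cleanlab client.

```python
import os
from typing import Mapping, Optional

ENV_VAR = "CLEANLAB_TLM_API_KEY"  # name taken from the PR description


def load_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the TLM API key, or fail early with a readable error.

    Hypothetical helper for illustration; `env` defaults to os.environ
    and can be replaced with a plain dict in tests.
    """
    source = os.environ if env is None else env
    key = source.get(ENV_VAR, "").strip()
    if not key:
        raise RuntimeError(
            f"{ENV_VAR} is not set; sign up at https://tlm.cleanlab.ai/ "
            "and copy the key from the dashboard."
        )
    return key
```

Failing at construction time rather than on the first `validate` call keeps a missing key from surfacing as an opaque HTTP error in the middle of a pipeline.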

Triage notes

Not yet integration-tested — awaiting API key procurement.
Adds new optional extra: pip install 'any-guardrail[cleanlab-tlm]'.
Output semantics are inverted: a higher score means more trustworthy, and therefore more valid. Default threshold=0.7.
Inherits from Guardrail (not ThreeStageGuardrail) because validate(prompt, response) doesn't match the standard (input_text) signature.
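The threshold semantics and the two-argument signature in the notes above can be sketched as follows. This is an illustration only, not the code in this PR: `TrustScoreGuardrailSketch`, `SketchOutput`, and the injected `scorer` callable are hypothetical names, and the use of `>=` (rather than `>`) at the threshold and the 0 to 1 score range are assumptions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SketchOutput:
    """Hypothetical result type: validity flag plus the raw trust score."""
    valid: bool
    score: float


class TrustScoreGuardrailSketch:
    """Illustrative stand-in for a (prompt, response) guardrail."""

    def __init__(
        self,
        scorer: Callable[[str, str], float],
        threshold: float = 0.7,  # default threshold stated in the PR
    ) -> None:
        # Assumption: trust scores fall in [0, 1].
        if not 0.0 <= threshold <= 1.0:
            raise ValueError("threshold must be between 0 and 1")
        self.scorer = scorer
        self.threshold = threshold

    def validate(self, prompt: str, response: str) -> SketchOutput:
        # Higher score = more trustworthy = more valid.
        score = self.scorer(prompt, response)
        return SketchOutput(valid=score >= self.threshold, score=score)


# Usage with a stub scorer; a real scorer would call the hosted service.
guardrail = TrustScoreGuardrailSketch(scorer=lambda prompt, response: 0.42)
result = guardrail.validate("Who wrote Hamlet?", "Christopher Marlowe.")
```

Because `validate` needs both the prompt and the response, it cannot be called through an interface that passes a single `input_text`, which is the reason given above for inheriting from `Guardrail` directly.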

Test plan

Provision a Cleanlab TLM API key
Add integration test under tests/integration/ with @pytest.mark.e2e using a known hallucinated response
Cookbook entry showing RAG-grounding use case

🤖 Generated with Claude Code

…drail

Wraps Cleanlab's Trustworthy Language Model (TLM) as a hosted scoring guardrail for detecting hallucinations and low-confidence answers in RAG/agent pipelines.

Inherits from Guardrail (not ThreeStageGuardrail) because validate() takes (prompt, response) instead of the standard (input_text) signature. Output semantics are inverted: higher trustworthiness score means more valid.

Adds a new optional extra `cleanlab-tlm` (also pulled in by `all`). Unit tests mock the `cleanlab_tlm.TLM` client so no real API calls are made; integration testing is deferred until an API key is provisioned.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
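The mocking approach described in the commit message could look roughly like the sketch below, which uses only the standard library. The helper `is_trustworthy` is hypothetical, and the client method name `get_trustworthiness_score` and the `trustworthiness_score` result key are assumptions about the `cleanlab_tlm` client that should be checked against its documentation; the actual tests in this PR may be structured differently.

```python
from unittest.mock import MagicMock


def is_trustworthy(client, prompt: str, response: str, threshold: float = 0.7) -> bool:
    """Hypothetical helper: score a (prompt, response) pair and apply the threshold."""
    # Assumed client call and result shape; verify against the cleanlab_tlm docs.
    result = client.get_trustworthiness_score(prompt, response)
    return result["trustworthiness_score"] >= threshold


# Stand-in for cleanlab_tlm.TLM, so no network call or API key is needed.
fake_client = MagicMock()
fake_client.get_trustworthiness_score.return_value = {"trustworthiness_score": 0.12}

prompt = "What is the capital of France?"
response = "The capital of France is Berlin."
flagged = not is_trustworthy(fake_client, prompt, response)

# Confirm the guardrail logic forwarded both arguments to the client.
fake_client.get_trustworthiness_score.assert_called_once_with(prompt, response)
```

A mock of this kind only verifies the threshold logic and argument plumbing; it says nothing about how the hosted service scores a real hallucination, which is what the deferred integration test is for.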

dni138 wants to merge 1 commit into main from cleanlab-tlm
