
feat(guardrails): add Cleanlab TLM hallucination/trustworthiness guardrail#170

Draft
dni138 wants to merge 1 commit into main from cleanlab-tlm

dni138 commented May 22, 2026

Summary

Adds a CleanlabTlm guardrail that wraps Cleanlab's Trustworthy Language Model (TLM), a hosted scoring service that returns a trustworthiness score for any (prompt, response) pair. It is designed for hallucination detection in RAG and agent pipelines.

Research backing

Why commercial-only

TLM combines black-box intrinsic + extrinsic uncertainty signals over arbitrary LLM APIs as a hosted scoring service. No OSS package replicates the full pipeline (the underlying BSDetector method is published, but the production scoring service with its calibration and multi-signal fusion is closed-source).

Access

  • Free $5 of credits at signup.
  • Signup: https://tlm.cleanlab.ai/ → API key in dashboard.
  • Env var: CLEANLAB_TLM_API_KEY.
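
A minimal sketch of how the key could be read from the environment with a clear failure message. The helper name `load_api_key` and the error wording are illustrative assumptions, not code from this PR; only the variable name `CLEANLAB_TLM_API_KEY` comes from the description above.

```python
import os

ENV_VAR = "CLEANLAB_TLM_API_KEY"  # name taken from the PR description


def load_api_key(env=None):
    """Return the TLM API key, or raise a clear error if it is missing.

    `env` is any mapping; it defaults to os.environ so tests can pass a dict.
    This helper is hypothetical and only illustrates the lookup.
    """
    env = os.environ if env is None else env
    key = env.get(ENV_VAR, "").strip()
    if not key:
        raise RuntimeError(f"{ENV_VAR} is not set; create a key in the Cleanlab dashboard.")
    return key
```

Failing early at construction time, rather than on the first scoring call, keeps a missing key from surfacing as an opaque HTTP error later in the pipeline.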

Triage notes

  • Not yet integration-tested — awaiting API key procurement.
  • Adds new optional extra: pip install 'any-guardrail[cleanlab-tlm]'.
  • Output semantics inverted: higher score = more trustworthy = more valid. Default threshold=0.7.
  • Inherits from Guardrail (not ThreeStageGuardrail) because validate(prompt, response) doesn't match the standard (input_text) signature.
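
The inverted semantics can be sketched as a single comparison. This is an illustration under stated assumptions: the names `TrustResult` and `judge` are hypothetical, and treating a score exactly equal to the threshold as valid (`>=`) is a guess, since the description does not say how ties are handled.

```python
from dataclasses import dataclass

DEFAULT_THRESHOLD = 0.7  # default stated in the PR description


@dataclass(frozen=True)
class TrustResult:
    """Hypothetical result type; the PR's actual output class may differ."""
    valid: bool
    score: float


def judge(score: float, threshold: float = DEFAULT_THRESHOLD) -> TrustResult:
    # Higher score = more trustworthy = more valid, so the check is
    # "score at or above threshold" (the >= for ties is an assumption).
    return TrustResult(valid=score >= threshold, score=score)
```

For example, `judge(0.9).valid` is `True` and `judge(0.2).valid` is `False`, which is the opposite direction from a guardrail whose score measures risk.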

Test plan

  • Provision a Cleanlab TLM API key
  • Add integration test under tests/integration/ with @pytest.mark.e2e using a known hallucinated response
  • Cookbook entry showing RAG-grounding use case

🤖 Generated with Claude Code

…drail

Wraps Cleanlab's Trustworthy Language Model (TLM) as a hosted scoring guardrail
for detecting hallucinations and low-confidence answers in RAG/agent pipelines.
Inherits from Guardrail (not ThreeStageGuardrail) because validate() takes
(prompt, response) instead of the standard (input_text) signature. Output
semantics are inverted: higher trustworthiness score means more valid.

Adds a new optional extra `cleanlab-tlm` (also pulled in by `all`). Unit tests
mock the `cleanlab_tlm.TLM` client so no real API calls are made; integration
testing is deferred until an API key is provisioned.
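
As a rough sketch of the mocking approach described above, the client can be replaced with a `unittest.mock.MagicMock` so that no network call is made. `SketchGuardrail` is a hypothetical stand-in for the PR's class, and the client method name `get_trustworthiness_score` and the `"trustworthiness_score"` result key are assumptions about the `cleanlab_tlm` client that should be checked against its documentation.

```python
from unittest.mock import MagicMock


class SketchGuardrail:
    """Hypothetical stand-in for the PR's CleanlabTlm class."""

    def __init__(self, client, threshold=0.7):
        self.client = client
        self.threshold = threshold

    def validate(self, prompt, response):
        # Assumed client call and result shape; not confirmed by this PR.
        result = self.client.get_trustworthiness_score(prompt, response)
        return result["trustworthiness_score"] >= self.threshold


def make_mock_client(score):
    """Build a fake client that always reports the given score."""
    client = MagicMock()
    client.get_trustworthiness_score.return_value = {"trustworthiness_score": score}
    return client


def test_low_score_is_rejected():
    client = make_mock_client(0.12)
    guardrail = SketchGuardrail(client)
    assert guardrail.validate("What is the capital of France?", "Berlin") is False
    client.get_trustworthiness_score.assert_called_once_with(
        "What is the capital of France?", "Berlin"
    )


def test_high_score_passes():
    guardrail = SketchGuardrail(make_mock_client(0.95))
    assert guardrail.validate("What is the capital of France?", "Paris") is True
```

Injecting the client through the constructor is a design choice of this sketch; patching `cleanlab_tlm.TLM` at import time, as the commit message describes, achieves the same isolation.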

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>