
Add safety and content moderation with open LMs notebook #215


Open · anakin87 wants to merge 1 commit into main

Conversation

anakin87 (Member) commented:

Fixes #214

I am proposing a notebook that shows how to use the new LLMMessagesRouter (available from Haystack 2.15.0) to perform safety and content moderation with several open models: Llama Guard, IBM Granite Guardian, ShieldGemma, and NVIDIA NeMo Guard.
It also includes an example of content moderation in a RAG pipeline.
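
To give a rough idea of what this looks like in code, here is a minimal sketch (not the notebook's exact content). It assumes Haystack >= 2.15.0 with LLMMessagesRouter taking chat_generator, output_names, and output_patterns, and uses Llama Guard via the Hugging Face serverless Inference API purely as an illustration; the model ID, its availability on that API, and the example prompt are assumptions.

# Minimal sketch (illustrative, not the notebook's code): classify user messages
# with a safety model and route them to an "unsafe" or "safe" output.
# Assumes Haystack >= 2.15.0 and a Hugging Face API token in the environment.
from haystack.components.generators.chat import HuggingFaceAPIChatGenerator
from haystack.components.routers import LLMMessagesRouter
from haystack.dataclasses import ChatMessage

# Generative model used as the safety classifier (model ID is illustrative)
moderation_model = HuggingFaceAPIChatGenerator(
    api_type="serverless_inference_api",
    api_params={"model": "meta-llama/Llama-Guard-4-12B"},
)

# "unsafe" is listed first because, treated as a pattern, "safe" would also
# match inside the string "unsafe"
router = LLMMessagesRouter(
    chat_generator=moderation_model,
    output_names=["unsafe", "safe"],
    output_patterns=["unsafe", "safe"],
)

messages = [ChatMessage.from_user("How can I hot-wire a car?")]
result = router.run(messages=messages)

# The returned dict is expected to contain the matched branch ("unsafe" or
# "safe") with the original messages, plus the raw classification text.
print(result)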


anakin87 marked this pull request as ready for review on June 25, 2025
anakin87 requested a review from a team as a code owner on June 25, 2025
anakin87 requested a review from bilgeyucel on June 25, 2025
bilgeyucel self-assigned this on June 25, 2025
bilgeyucel (Contributor) commented on the notebook, Jun 30, 2025:

My suggestion for the title and description here:

AI Guardrails: Content Moderation and Safety with Open-Source Language Models

Deploying safe and responsible AI applications requires robust guardrails to detect and handle harmful, biased, or inappropriate content. In response to this need, several open source language models have been specifically trained for content moderation, toxicity detection, and safety-related tasks.

Unlike traditional classifiers that return probabilities for predefined labels, generative models can produce natural language outputs even when used for classification, making them more adaptable for real-world moderation scenarios. To support these use cases in Haystack, we've introduced the LLMMessagesRouter, a component that intelligently routes chat messages based on safety classifications provided by a generative language model.

In this notebook, you’ll learn how to implement AI safety mechanisms using leading open source generative models like Llama Guard (Meta), Granite Guardian (IBM), ShieldGemma (Google), and NeMo Guardrails (NVIDIA). You'll also see how to integrate content moderation into your Haystack RAG pipeline, enabling safer and more trustworthy LLM-powered applications.



bilgeyucel (Contributor) commented on the notebook, Jun 30, 2025:

...run Ollama for some open source models.



bilgeyucel (Contributor) commented on the notebook, Jun 30, 2025:

  •  typo: "classify the safety of the user input."
  •  remove "responds with": the Llama Guard 4 model card shows that it responds with safe or unsafe


bilgeyucel (Contributor) commented on the notebook, Jun 30, 2025:

Would this approach work if malicious information is somehow retrieved from the database?



anakin87 (Member, Author) replied:

Yes, in theory, because the text passed to the Router includes the Documents coming from the Retriever.
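
To illustrate how that can fit together, here is a rough, hypothetical wiring, not the notebook's exact pipeline: the component names, the system_prompt, the prompt template, and the gpt-4o-mini stand-in models are all assumptions. The prompt builder renders the retrieved Documents into the chat messages, and the router only forwards them to the answering LLM when they are classified as safe.

# Hypothetical sketch of moderating retrieved content inside a RAG pipeline.
# The prompt rendered by ChatPromptBuilder embeds the retrieved Documents,
# so the router classifies their content together with the user question.
# Assumes Haystack >= 2.15.0 and OPENAI_API_KEY set; any chat generator could be swapped in.
from haystack import Pipeline
from haystack.components.builders import ChatPromptBuilder
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.routers import LLMMessagesRouter
from haystack.dataclasses import ChatMessage
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()  # assume it has been populated elsewhere

template = [ChatMessage.from_user(
    "Answer the question using the documents below.\n"
    "Documents:\n{% for doc in documents %}{{ doc.content }}\n{% endfor %}\n"
    "Question: {{ query }}"
)]

pipeline = Pipeline()
pipeline.add_component("retriever", InMemoryBM25Retriever(document_store=document_store))
pipeline.add_component("prompt_builder", ChatPromptBuilder(template=template))
pipeline.add_component(
    "moderation_router",
    LLMMessagesRouter(
        chat_generator=OpenAIChatGenerator(model="gpt-4o-mini"),  # stand-in safety model
        system_prompt="Classify the following conversation as safe or unsafe. Answer only with 'safe' or 'unsafe'.",
        output_names=["unsafe", "safe"],
        output_patterns=["unsafe", "safe"],
    ),
)
pipeline.add_component("llm", OpenAIChatGenerator(model="gpt-4o-mini"))

pipeline.connect("retriever.documents", "prompt_builder.documents")
pipeline.connect("prompt_builder.prompt", "moderation_router.messages")
# Only messages classified as safe reach the answering model; the "unsafe"
# branch could be connected to a fallback or simply inspected in the result.
pipeline.connect("moderation_router.safe", "llm.messages")

question = "What does the knowledge base say about X?"
result = pipeline.run({"retriever": {"query": question}, "prompt_builder": {"query": question}})
print(result)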

bilgeyucel (Contributor) left a review comment:

Left my comments @anakin87!

new = true

[[cookbook]]
title = "Safety and content moderation with Open Language Models"
bilgeyucel (Contributor) commented:

I suggest putting "guardrails" into the title

[[cookbook]]
title = "Safety and content moderation with Open Language Models"
notebook = "safety_moderation_open_lms.ipynb"
topics = ["Safety", "Evaluation", "RAG"]
bilgeyucel (Contributor) commented:

Also, as a topic:

Suggested change:
- topics = ["Safety", "Evaluation", "RAG"]
+ topics = ["Guardrails", "Evaluation", "RAG"]


Successfully merging this pull request may close this issue: Notebook for content moderation with LLMMessagesRouter (#214)