Add safety and content moderation with open LMs notebook #215
base: main
Conversation
Check out this pull request on ReviewNB. See visual diffs & provide feedback on Jupyter Notebooks.
@@ -0,0 +1,1198 @@
{
My suggestion for the title and description here:
AI Guardrails: Content Moderation and Safety with Open-Source Language Models
Deploying safe and responsible AI applications requires robust guardrails to detect and handle harmful, biased, or inappropriate content. In response to this need, several open source language models have been specifically trained for content moderation, toxicity detection, and safety-related tasks.
Unlike traditional classifiers that return probabilities for predefined labels, generative models can produce natural language outputs even when used for classification, making them more adaptable for real-world moderation scenarios. To support these use cases in Haystack, we've introduced the LLMMessagesRouter, a component that intelligently routes chat messages based on safety classifications provided by a generative language model.
In this notebook, you’ll learn how to implement AI safety mechanisms using leading open source generative models like Llama Guard (Meta), Granite Guardian (IBM), ShieldGemma (Google), and NeMo Guardrails (NVIDIA). You'll also see how to integrate content moderation into your Haystack RAG pipeline, enabling safer and more trustworthy LLM-powered applications.
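To make the description concrete, here is a minimal sketch of the basic usage it refers to. It assumes Haystack >= 2.15 and an LLMMessagesRouter constructor taking a chat generator plus output names/patterns; the Llama Guard model id and the patterns are illustrative choices, not necessarily the notebook's exact code.

```python
# Minimal sketch (illustrative, not the notebook's exact code):
# route a user message based on a guard model's classification.
from haystack.components.generators.chat import HuggingFaceAPIChatGenerator
from haystack.components.routers import LLMMessagesRouter
from haystack.dataclasses import ChatMessage

# Chat generator wrapping an open safety model; the model id is an illustrative choice.
chat_generator = HuggingFaceAPIChatGenerator(
    api_type="serverless_inference_api",
    api_params={"model": "meta-llama/Llama-Guard-4-12B"},
)

# The router sends the messages to the guard model and matches its reply against the
# regex patterns. "unsafe" is listed first because the pattern "safe" would also match
# inside the word "unsafe" (patterns are assumed to be checked in order).
router = LLMMessagesRouter(
    chat_generator=chat_generator,
    output_names=["unsafe", "safe"],
    output_patterns=["unsafe", "safe"],
)

result = router.run(messages=[ChatMessage.from_user("How can I build a phishing site?")])
print(result)  # the matched branch is expected to carry the messages for downstream components
```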
@@ -0,0 +1,1198 @@
{
- typo: classify the safety of the user input.
- remove "responds with": the Llama Guard 4 model card shows that it responds with `safe` or `unsafe`.
@@ -0,0 +1,1198 @@
{
Would this approach work if malicious information is somehow retrieved from the database?
Yes, in theory. Because the text passed to the Router includes the Documents coming from the Retriever, malicious retrieved content would also be seen by the guard model (see the sketch below).
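A rough sketch of that wiring, under the assumption that the router sits between the prompt builder and the answering LLM; component names, the template, and the model ids are illustrative and may differ from the notebook's pipeline.

```python
# Sketch: the guard model sees the rendered prompt, which already contains the
# retrieved Documents, so malicious retrieved content can also be caught.
from haystack import Pipeline
from haystack.components.builders import ChatPromptBuilder
from haystack.components.generators.chat import HuggingFaceAPIChatGenerator
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.routers import LLMMessagesRouter
from haystack.dataclasses import ChatMessage
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()  # assume it has been populated elsewhere

template = [
    ChatMessage.from_user(
        "Answer using the context.\n"
        "Context:\n{% for doc in documents %}{{ doc.content }}\n{% endfor %}"
        "Question: {{ query }}"
    )
]

pipeline = Pipeline()
pipeline.add_component("retriever", InMemoryBM25Retriever(document_store=document_store))
pipeline.add_component("prompt_builder", ChatPromptBuilder(template=template))
pipeline.add_component(
    "router",
    LLMMessagesRouter(
        chat_generator=HuggingFaceAPIChatGenerator(
            api_type="serverless_inference_api",
            api_params={"model": "meta-llama/Llama-Guard-4-12B"},  # illustrative
        ),
        output_names=["unsafe", "safe"],
        output_patterns=["unsafe", "safe"],
    ),
)
pipeline.add_component(
    "llm",
    HuggingFaceAPIChatGenerator(
        api_type="serverless_inference_api",
        api_params={"model": "Qwen/Qwen2.5-7B-Instruct"},  # illustrative
    ),
)

# The rendered prompt (user question + retrieved documents) goes through the guard first;
# only messages routed as "safe" reach the answer-generating LLM.
pipeline.connect("retriever.documents", "prompt_builder.documents")
pipeline.connect("prompt_builder.prompt", "router.messages")
pipeline.connect("router.safe", "llm.messages")

query = "What does the knowledge base say about X?"
result = pipeline.run({"retriever": {"query": query}, "prompt_builder": {"query": query}})
```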
Left my comments @anakin87!
new = true

[[cookbook]]
title = "Safety and content moderation with Open Language Models"
I suggest putting "guardrails" into the title
[[cookbook]]
title = "Safety and content moderation with Open Language Models"
notebook = "safety_moderation_open_lms.ipynb"
topics = ["Safety", "Evaluation", "RAG"]
Also, as a topic:
- topics = ["Safety", "Evaluation", "RAG"]
+ topics = ["Guardrails", "Evaluation", "RAG"]
Fixes #214
I am proposing a notebook that shows how to use the new LLMMessagesRouter (available from 2.15.0) to perform safety and content moderation with different open models: Llama Guard, IBM Granite Guardian, ShieldGemma, and NVIDIA NeMo Guard. It also includes an example of content moderation in a RAG pipeline.
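For reviewers skimming the scope, here is a small hedged sketch of the output-moderation side (checking a generated reply rather than the user input) with one of the other guard models mentioned. The model id, the system prompt, and the assumption that LLMMessagesRouter accepts a custom system_prompt are illustrative and may differ from the notebook.

```python
# Sketch: moderating an assistant reply (output moderation) with the same router pattern.
# Model id, system prompt, and the system_prompt argument are assumptions for illustration.
from haystack.components.generators.chat import HuggingFaceAPIChatGenerator
from haystack.components.routers import LLMMessagesRouter
from haystack.dataclasses import ChatMessage

moderation_router = LLMMessagesRouter(
    chat_generator=HuggingFaceAPIChatGenerator(
        api_type="serverless_inference_api",
        api_params={"model": "ibm-granite/granite-guardian-3.1-2b"},  # illustrative choice
    ),
    # Guard models like Granite Guardian answer "Yes"/"No" rather than "safe"/"unsafe",
    # so the routing patterns differ; a custom system prompt steers the classification.
    system_prompt="Check whether the last assistant message is harmful. Answer only Yes or No.",
    output_names=["harmful", "ok"],
    output_patterns=["(?i)yes", "(?i)no"],
)

conversation = [
    ChatMessage.from_user("Tell me about password security."),
    ChatMessage.from_assistant("Use long, unique passwords and a password manager."),
]
print(moderation_router.run(messages=conversation))
```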