bug: streaming on different llm providers behavior and api format #1008
Comments
Hi @Wildshire, thanks for opening such a complete issue 👍🏻 I will look into it in detail later, but I'd like you to try out the following. Preferably, first try it without updating your endpoint implementation:

```python
from nemoguardrails import RailsConfig, LLMRails

config = RailsConfig.from_path("PATH/TO/CONFIG")
rails = LLMRails(config)

messages = [{"role": "user", "content": "what can you do?"}]

async for chunk in rails.stream_async(messages=messages):
    print(f"CHUNK:{chunk}")
```

Let me know how it works.
Hi @Pouyanpi, thanks a lot for the quick response. I put that code into a small script like this:

```python
import asyncio

from nemoguardrails import RailsConfig, LLMRails


async def demo():
    config = RailsConfig.from_path("./config", verbose=True)
    rails = LLMRails(config)
    messages = [{"role": "user", "content": "hi there"}]
    async for chunk in rails.stream_async(messages=messages):
        print(f"CHUNK:{chunk}")


if __name__ == "__main__":
    asyncio.run(demo())
```

without changing the LLM endpoint. From my trials I got a single `CHUNK: (a very long text)` after waiting a couple of seconds, i.e. the full response arrived as one chunk. I also tried disabling the RAG action just in case, but got the same results.

Not sure if this can help, but from the NIM I get these OpenAI invocation params:

```
{'model_name': 'meta/llama-3.3-70b-instruct', 'temperature': 0.7, 'top_p': 0.9,
 'frequency_penalty': 0, 'presence_penalty': 0, 'n': 1, 'seed': None, 'logprobs': None,
 'max_tokens': 200, '_type': 'openai', 'stop': None}
```

Maybe the streaming parameter is not injected properly?
Hi @Wildshire, please make sure that streaming is enabled in your config. You can also do `config.streaming = True` and then pass the config to `LLMRails`.
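In code, a minimal sketch of that suggestion (reusing the `./config` path from the demo script above):

```python
# Minimal sketch: enable streaming on the config object before building the rails.
from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_path("./config")
config.streaming = True  # same effect as `streaming: true` in config.yml
rails = LLMRails(config)
# rails.stream_async(...) can then be consumed exactly as in the demo script above.
```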
Hello again @Pouyanpi, I have done more testing and found some more interesting insights. My original config.yaml looks like this:

```yaml
colang_version: "1.0"
streaming: true

instructions:
  - type: general
    content: |
      `general instructions`

models:
  - type: main
    engine: openai
    model: meta/llama-3.3-70b-instruct
    parameters:
      base_url: some_url
      max_tokens: 200
      api_key: "-"
      top_p: 0.9

# models:
#   - type: main
#     engine: nim
#     model: meta/llama-3.1-8b-instruct
#     parameters:
#       base_url: some_url
#       max_tokens: 200

core:
  embedding_search_provider:
    name: default
    parameters:
      embedding_engine: FastEmbed
      embedding_model: BAAI/bge-small-en-v1.5
      use_batching: true
      max_batch_size: 10
      max_batch_hold: 0.01
    cache:
      enabled: true
      key_generator: md5
      store: filesystem

rails:
  input:
    flows:
      - self check input

# Collection of all the prompts
prompts:
  - task: general
    models:
      - llama3
      - llama-3
    messages:
      - type: system
        content: |
          a long text

  # Prompt for detecting the user message canonical form.
  - task: generate_user_intent
    models:
      - llama3
      - llama-3
    messages:
      - type: system
        content: |
          A long text
    output_parser: "verbose_v1"

  # Prompt for generating the next steps.
  - task: generate_next_steps
    models:
      - llama3
      - llama-3
    messages:
      - type: system
        content: |
          a long text
    output_parser: "verbose_v1"

  # Prompt for generating the bot message from a canonical form.
  - task: generate_bot_message
    models:
      - llama3
      - llama-3
    messages:
      - type: system
        content: |
          a long text
    max_length: 50000
    output_parser: "verbose_v1"

  # Prompt for generating the value of a context variable.
  - task: generate_value
    models:
      - llama3
      - llama-3
    messages:
      - type: system
        content: |
          a long text
    output_parser: "verbose_v1"

  # Prompt for checking if user input is ethical and legal
  - task: self_check_input
    messages:
      - type: system
        content: |-
          a long text
    max_tokens: 5
```

and my rails were like this (input_check.co):

```colang
define flow self check input
  $allowed = execute self_check_input
  if not $allowed
    bot do smth
    stop
```

I noticed that if I commented … As I wanted my custom templates, I looked at this example and did another demo with my custom config (all uncommented) plus:
```yaml
rails:
  input:
    flows:
      - self check input
  dialog:
    user_messages:
      embeddings_only: True
```

And the provided rail:

```colang
define user ask question
  "..."

define flow
  user ...
  # Here we call the custom action which will
  $result = execute call_llm(user_query=$user_message)
  # In this case, we also return the result as the final message.
  # This is optional.
  bot $result
```

And everything was working fine with the chunking.
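The `call_llm` custom action referenced by `execute call_llm(user_query=$user_message)` is not shown in the thread; a minimal, hypothetical sketch of what such an action could look like (following the general custom-action pattern, not the actual implementation from this setup):

```python
# Hypothetical sketch of the `call_llm` custom action (e.g. in actions.py next
# to config.yml); the real implementation from the thread is not shown.
from langchain_core.language_models import BaseChatModel

from nemoguardrails.actions import action


@action(is_system_action=True)
async def call_llm(user_query: str, llm: BaseChatModel) -> str:
    """Call the configured main LLM directly with the raw user query."""
    # `llm` is injected by NeMo Guardrails when the action is executed.
    response = await llm.ainvoke(user_query)
    # Chat models return a message object; plain LLM engines return a string.
    return getattr(response, "content", str(response))
```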
So, in a nutshell:
Did you check docs and existing issues?
Python version (python --version)
Python 3.10
Operating system/version
Linux
NeMo-Guardrails version (if you must use a specific version and not the latest)
0.11.0
Describe the bug
The issue
I have a custom FastAPI server integrated with NeMo Guardrails, and I took notice of the streaming feature. I have been trying to integrate it but failed, without knowing exactly what I am doing wrong.
I have reviewed the documentation here, as well as issues like #893, #459 or #546, and I was still not able to get proper streaming in my server.
This is how I have set up the server (the different versions are different ways of trying to make it work; I even took inspiration from your own nemoguardrails server here):
V1
V2
V3
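The V1–V3 snippets above are collapsed attachments and are not reproduced here. As a rough, hypothetical sketch of the kind of /stream endpoint being described (the app structure, request model, and config path below are assumptions, not the actual V1–V3 code):

```python
# Hypothetical sketch of a FastAPI endpoint that forwards NeMo Guardrails
# streaming chunks to the client as they are produced.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

from nemoguardrails import LLMRails, RailsConfig

app = FastAPI()

config = RailsConfig.from_path("./config")
config.streaming = True
rails = LLMRails(config)


class ChatRequest(BaseModel):
    messages: list[dict]


@app.post("/stream")
async def stream_endpoint(body: ChatRequest):
    async def token_generator():
        # Each chunk is yielded to the client as soon as Guardrails emits it.
        async for chunk in rails.stream_async(messages=body.messages):
            yield chunk

    return StreamingResponse(token_generator(), media_type="text/plain")
```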
Except in V2 (which bypasses the guardrails and chats with the LLM directly), whenever there is a response, the final message seems to be returned only after it has been fully generated, which means (I think) that there is no streaming, even though the response format looks chunked (transfer-encoding: chunked). In V2, the tokens appear on my terminal one after the other. It seems to me that the text is being streamed but buffered somewhere until it is fully generated, or maybe something in my configuration is preventing the stream of tokens.
Speaking of configuration, I have used two different providers and two different LLMs:
In both V1 and V3 things seem to work, but with the issue explained above; V2 only works with openai (with nim I got

AttributeError: 'AIMessageChunk' object has no attribute 'encode'

but that error is for another time). I'm bringing up the LLM providers because I can dynamically switch between them, and I was wondering whether having different LLM providers means I have to program my /stream endpoint differently in each case.

Other considerations
- `streaming: true` tag
- custom rag action registered on the system

What I think could be the issue
If anything else is needed to solve the issue, please feel free to ask.
Steps To Reproduce
Expected Behavior
Expected chunking behavior like LangChain's llm.stream(), token by token.
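For reference, this is the kind of token-by-token streaming being compared against, as a rough sketch (the model name, base_url, and api_key below are placeholders taken from the config above, not a working setup):

```python
# Token-by-token streaming directly from a LangChain chat model, for comparison.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="meta/llama-3.3-70b-instruct", base_url="some_url", api_key="-")
for chunk in llm.stream("hi there"):
    # Each chunk is an AIMessageChunk carrying a small piece of the answer.
    print(chunk.content, end="", flush=True)
```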
Actual Behavior
Text appears only after it has finished being generated.
The final chunk is the whole generated text.