Description
Did you check docs and existing issues?
- I have read all the NeMo-Guardrails docs
- I have updated the package to the latest version before submitting this issue
- (optional) I have used the develop branch
- I have searched the existing issues of NeMo-Guardrails
Python version (python --version)
Python 3.10
Operating system/version
Linux
NeMo-Guardrails version (if you must use a specific version and not the latest)
0.11.0
Describe the bug
The issue
I have a custom FastAPI server integrated with NeMo Guardrails, and I took notice of the streaming feature. I have been trying to integrate it, but it fails and I don't know exactly what I am doing wrong.
I have reviewed the documentation here, as well as issues like #893, #459 and #546, and I still was not able to get proper streaming working in my server.
This is how I have set up the server (the different versions below are different ways I tried to make it work; I even took inspiration from your own nemoguardrails server here):
V1
```python
import asyncio

from fastapi.responses import StreamingResponse
from nemoguardrails.streaming import StreamingHandler


@app.post("/stream", tags=["chat"])
async def stream(request):
    rails = app.rails  # LLMRails instance
    messages = request.messages  # Chat history

    async def token_generator():
        streaming_handler = StreamingHandler()
        asyncio.create_task(
            rails.generate_async(messages=messages, streaming_handler=streaming_handler)
        )
        async for chunk in streaming_handler:
            yield chunk

    headers = {"X-Content-Type-Options": "nosniff"}
    return StreamingResponse(token_generator(), headers=headers, media_type="text/plain")
```
V2
```python
@app.post("/stream", tags=["chat"])
async def stream(request):
    rails = app.rails  # LLMRails instance
    messages = request.messages  # Chat history
    message = messages[-1]["content"]  # Last message

    async def llm_generator():
        for chunk in rails.llm.stream(message):
            yield chunk

    headers = {"X-Content-Type-Options": "nosniff"}
    return StreamingResponse(llm_generator(), headers=headers, media_type="text/plain")
```
V3
```python
from nemoguardrails.context import streaming_handler_var


@app.post("/stream", tags=["chat"])
async def stream(request):
    rails = app.rails  # LLMRails instance
    messages = request.messages  # Chat history

    streaming_handler = StreamingHandler()
    streaming_handler_var.set(streaming_handler)
    streaming_handler.disable_buffer()
    asyncio.create_task(
        rails.generate_async(messages=messages, streaming_handler=streaming_handler)
    )

    headers = {"X-Content-Type-Options": "nosniff"}
    return StreamingResponse(streaming_handler, headers=headers, media_type="text/plain")
```
In V1 and V3, when there is a response, the final message seems to be returned only after it has been fully generated, meaning (I think) that there is no real streaming, even though the response format looks right (transfer-encoding: chunked). In V2 (which bypasses the guardrails and chats with the LLM directly), the tokens do appear on my terminal one after the other. It seems to me that the text is being streamed but buffered somewhere until it is fully generated, or maybe something in my configuration is preventing the stream of tokens.
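To narrow down where the buffering happens, this is a small sketch of how I plan to check whether the StreamingHandler itself yields incrementally or only at the end (the timing/logging part is just my own debugging idea, not NeMo-Guardrails API):

```python
import asyncio
import time

from nemoguardrails.streaming import StreamingHandler


async def debug_token_generator(rails, messages):
    # Same pattern as V1, but logging when each chunk arrives.
    streaming_handler = StreamingHandler()
    asyncio.create_task(
        rails.generate_async(messages=messages, streaming_handler=streaming_handler)
    )
    start = time.monotonic()
    async for chunk in streaming_handler:
        # If these timestamps are spread out, the handler streams fine and the
        # buffering happens further downstream (StreamingResponse, proxy, client).
        # If they all arrive at the very end, the handler itself is buffering.
        print(f"[{time.monotonic() - start:6.2f}s] chunk: {chunk!r}")
        yield chunk
```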
Speaking of configuration, I have used two different providers and two different LLMs:
```yaml
models:
  - type: main
    engine: nim
    model: meta/llama-3.1-8b-instruct
    parameters:
      base_url: `a url/v1`
      max_tokens: 200
      stop:
        - "\n"
        - "User message:"
```
```yaml
models:
  - type: main
    engine: openai
    model: meta/llama-3.3-70b-instruct
    parameters:
      base_url: `a url/v1`
      max_tokens: 200
      api_key: "-"
      top_p: 0.9
```
V1 and V3 both seem to work with either provider (with the issue explained above). V2 only works with openai (with nim I get `AttributeError: 'AIMessageChunk' object has no attribute 'encode'`, but that error is for another time).
I bring up the LLM providers because I can switch between them dynamically, and I was wondering whether using different providers means I have to implement my `/stream` endpoint differently in each case.
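For reference, this is the provider-agnostic shape I would have expected to work. It assumes `LLMRails.stream_async(messages=...)` is available in 0.11.0 and yields text chunks regardless of the configured engine, which I have not verified:

```python
from fastapi.responses import StreamingResponse


@app.post("/stream", tags=["chat"])
async def stream(request):
    rails = app.rails  # LLMRails instance
    messages = request.messages  # Chat history

    async def token_generator():
        # Assumption: stream_async yields text chunks for both nim and openai engines.
        async for chunk in rails.stream_async(messages=messages):
            yield chunk

    headers = {"X-Content-Type-Options": "nosniff"}
    return StreamingResponse(token_generator(), headers=headers, media_type="text/plain")
```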
Other considerations
- Using input check rails
- The YAML has the `streaming: true` flag (see the snippet after this list)
- Using a custom RAG action registered on the system
- Working with Colang 1.0
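For completeness, this is roughly how the relevant part of my config.yml looks (the input rail flow name below is a placeholder for my actual one):

```yaml
streaming: true

rails:
  input:
    flows:
      - self check input
```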
What I think could be the issue
- I am not setting up the StreamingHandler + StreamingResponse correctly in the API
- I am using custom actions that could block the streaming functionality
- The model is not supported
- The LLM provider is not supported with streaming
If anything else is needed to diagnose the issue, please feel free to ask.
Steps To Reproduce
- Create the YAML and Colang files
- Create the FastAPI server
- Launch the server and call the `/stream` endpoint (a rough client sketch is below)
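This is the client I use to check whether chunks actually arrive incrementally (the URL, port, and request body shape are just my local setup, not part of NeMo-Guardrails):

```python
import asyncio

import httpx


async def check_stream():
    payload = {"messages": [{"role": "user", "content": "Tell me a long story."}]}
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream("POST", "http://localhost:8000/stream", json=payload) as resp:
            async for piece in resp.aiter_text():
                # With real streaming this prints many small pieces over time,
                # not a single blob at the end.
                print(repr(piece))


asyncio.run(check_stream())
```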
Expected Behavior
Token-by-token chunking, like LangChain's `llm.stream()`.
Actual Behavior
The text appears only after generation has finished; the final chunk is the whole generated text.