bug: streaming on different llm providers behavior and api format #1008

Open
@Wildshire

Description

Did you check docs and existing issues?

  • I have read all the NeMo-Guardrails docs
  • I have updated the package to the latest version before submitting this issue
  • (optional) I have used the develop branch
  • I have searched the existing issues of NeMo-Guardrails

Python version (python --version)

Python 3.10

Operating system/version

Linux

NeMo-Guardrails version (if you must use a specific version and not the latest)

0.11.0

Describe the bug

The issue

I have a custom FastAPI server integrated with NeMo Guardrails and took notice of the streaming feature. I have been trying to integrate it, but it fails and I don't know exactly what I am doing wrong.

I have reviewed the documentation here and issues like #893, #459 or #546, and I still was not able to get proper streaming in my server.

This is how I have set up the server (the different versions are different attempts at making it work; I even took inspiration from your own nemoguardrails server here):

V1

import asyncio

from fastapi.responses import StreamingResponse
from nemoguardrails.streaming import StreamingHandler


@app.post("/stream", tags=["chat"])
async def stream(request):
    rails = app.rails             # LLMRails instance
    messages = request.messages   # Chat history

    async def token_generator():
        streaming_handler = StreamingHandler()
        # Run generation in the background; tokens should be pushed to the handler.
        asyncio.create_task(rails.generate_async(
            messages=messages, streaming_handler=streaming_handler))
        async for chunk in streaming_handler:
            yield chunk

    headers = {'X-Content-Type-Options': 'nosniff'}
    return StreamingResponse(token_generator(), headers=headers, media_type='text/plain')
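
For reference, here is a minimal sketch of a simpler variant of V1 that uses LLMRails.stream_async (assuming that method is available in this version); it yields string chunks directly, so no manual StreamingHandler wiring is needed. The endpoint path below is a placeholder:

# A minimal sketch, assuming LLMRails.stream_async is available in this version.
@app.post("/stream_simple", tags=["chat"])   # hypothetical endpoint path
async def stream_simple(request):
    rails = app.rails
    messages = request.messages

    async def token_generator():
        # stream_async returns an async generator of string chunks.
        async for chunk in rails.stream_async(messages=messages):
            yield chunk

    headers = {'X-Content-Type-Options': 'nosniff'}
    return StreamingResponse(token_generator(), headers=headers, media_type='text/plain')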

V2

@app.post("/stream", tags=["chat"])
async def stream(request):
    rails = app.rails             # LLMRails instance
    messages = request.messages   # Chat history
    message = messages[-1]["content"]  # Last user message

    async def llm_generator():
        # Bypasses the guardrails: streams directly from the underlying LLM.
        for chunk in rails.llm.stream(message):
            yield chunk

    headers = {'X-Content-Type-Options': 'nosniff'}
    return StreamingResponse(llm_generator(), headers=headers, media_type='text/plain')

V3

@app.post("/stream", tags=["chat"])
async def stream(request):
    rails = app.rails             # LLMRails instance
    messages = request.messages   # Chat history
    streaming_handler = StreamingHandler()
    streaming_handler_var.set(streaming_handler)   # context var from nemoguardrails
    streaming_handler.disable_buffer()

    asyncio.create_task(rails.generate_async(
        messages=messages, streaming_handler=streaming_handler))

    headers = {'X-Content-Type-Options': 'nosniff'}
    return StreamingResponse(streaming_handler, headers=headers, media_type='text/plain')

Except in V2 (which bypasses the guardrails and chats with the LLM directly), whenever there is a response it looks like the final message is returned only after it has been fully generated, meaning (I think) that there is no actual streaming, even though the response is sent with transfer-encoding: chunked. In V2, the tokens do appear on my terminal one after the other. It seems to me that the text is being streamed but buffered somewhere until generation finishes, or maybe something in my configuration is preventing the stream of tokens.
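
To separate server-side buffering from client-side buffering, a small script along these lines (a sketch; the URL and payload are placeholders for my setup) prints each chunk with its arrival time. If all chunks arrive in one burst at the end, the buffering happens on the server:

# A minimal sketch to check when chunks actually arrive at the client.
# The URL and payload are placeholders.
import asyncio
import time

import httpx


async def main():
    payload = {"messages": [{"role": "user", "content": "Hello!"}]}
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream("POST", "http://localhost:8000/stream", json=payload) as response:
            start = time.monotonic()
            async for chunk in response.aiter_text():
                print(f"[{time.monotonic() - start:6.2f}s] {chunk!r}")


asyncio.run(main())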

Speaking of configuration, I have used two different providers and two different LLMs:

NIM:

models:
  - type: main
    engine: nim
    model: meta/llama-3.1-8b-instruct
    parameters:
      base_url: `a url/v1`
      max_tokens: 200
      stop:
        - "\n"
        - "User message:"

OpenAI-compatible:

models:
  - type: main
    engine: openai
    model: meta/llama-3.3-70b-instruct
    parameters:
      base_url: `a url/v1`
      max_tokens: 200
      api_key: "-"
      top_p: 0.9

V1 and V3 both run, but with the issue explained above; V2 only works with openai (with nim I got an AttributeError: 'AIMessageChunk' object has no attribute 'encode', but that error is for another time).

I mention the LLM providers because I can switch between them dynamically, and I was wondering whether using different providers means I have to program my /stream endpoint differently for each one.

Other considerations

  • Using input check rails
  • The YAML config has streaming: True set (see the check sketched after this list)
  • Using a custom RAG action registered on the system
  • Working with Colang 1.0
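
As a sanity check for the streaming flag, something like the following sketch (the config content is a placeholder) confirms that streaming: True is picked up when it sits at the top level of config.yml, next to (not inside) the models section:

# A minimal sketch, assuming RailsConfig.from_content accepts yaml_content.
from nemoguardrails import RailsConfig

YAML_CONFIG = """
models:
  - type: main
    engine: openai
    model: meta/llama-3.3-70b-instruct
    parameters:
      base_url: http://localhost:8000/v1   # placeholder
      api_key: "-"

streaming: True
"""

config = RailsConfig.from_content(yaml_content=YAML_CONFIG)
print(config.streaming)  # expected: True if the flag is at the right level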

What I think could be the issue

  • I am not setting up the StreamingHandler + StreamingResponse correctly in the API
  • I am using custom actions that could block the streaming functionality
  • The model is not supported
  • The llm_provider does not support streaming (a standalone check for the last two points is sketched after this list)
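
For the last two points, a standalone check outside FastAPI (a sketch; the config path is a placeholder) can show whether each provider streams tokens at all, and what type of chunk it yields:

# A minimal sketch to check each provider's raw streaming support.
# The config path is a placeholder for my config directory.
import asyncio

from nemoguardrails import LLMRails, RailsConfig

rails = LLMRails(RailsConfig.from_path("./config"))


async def main():
    # rails.llm is the LangChain model object Guardrails built from config.yml.
    async for chunk in rails.llm.astream("Hello!"):
        # Chat models yield AIMessageChunk objects (use .content); completion-style
        # models yield plain strings -- which would also explain the 'encode' error in V2.
        text = getattr(chunk, "content", chunk)
        print(repr(text), flush=True)


asyncio.run(main())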

If anything else is needed to solve the issue, please feel free to ask.

Steps To Reproduce

  1. Create yaml and colang files
  2. Create FastAPI server
  3. Launch and try

Expected Behavior

Token-by-token chunking behavior, like LangChain's llm.stream().

Actual Behavior

The text appears only after generation has finished.
The final chunk is the whole generated text.
