bug: streaming on different llm providers behavior and api format #1008

Open · 3 of 4 tasks
Wildshire opened this issue Feb 20, 2025 · 4 comments

Labels: bug (Something isn't working) · status: needs info (Issues that require more information from the reporter to proceed) · status: needs triage (New issues that have not yet been reviewed or categorized) · status: waiting confirmation (Issue is waiting confirmation whether the proposed solution/workaround works)

@Wildshire

Did you check docs and existing issues?

  • I have read all the NeMo-Guardrails docs
  • I have updated the package to the latest version before submitting this issue
  • (optional) I have used the develop branch
  • I have searched the existing issues of NeMo-Guardrails

Python version (python --version)

Python 3.10

Operating system/version

Linux

NeMo-Guardrails version (if you must use a specific version and not the latest)

0.11.0

Describe the bug

The issue

I have a custom FastAPI server integrated with NeMo Guardrails and I took notice of the streaming feature. I have been trying to integrate it but failed, without knowing exactly what I am doing wrong.

I have reviewed the documentation here and issues like #893, #459 or #546, and I was still not able to get proper streaming in my server.

This is how I have set up the server (the different versions below are different attempts at making it work; I also took inspiration from your own nemoguardrails server here):

V1

# Shared imports for V1–V3 below
import asyncio

from fastapi.responses import StreamingResponse
from nemoguardrails.streaming import StreamingHandler


@app.post("/stream", tags=["chat"])
async def stream(request):
    rails = app.rails  # LLMRails instance
    messages = request.messages  # Chat history

    async def token_generator():
        streaming_handler = StreamingHandler()
        # Kick off generation in the background; tokens arrive via the handler
        asyncio.create_task(rails.generate_async(
            messages=messages, streaming_handler=streaming_handler))
        async for chunk in streaming_handler:
            yield chunk

    headers = {'X-Content-Type-Options': 'nosniff'}
    return StreamingResponse(token_generator(), headers=headers, media_type='text/plain')

V2

@app.post("/stream", tags=["chat"])
async def stream(request):
    rails = app.rails  # LLMRails instance
    messages = request.messages  # Chat history
    message = messages[-1]["content"]  # Last user message

    async def llm_generator():
        # Bypasses the guardrails and streams directly from the underlying LLM
        for chunk in rails.llm.stream(message):
            yield chunk

    headers = {'X-Content-Type-Options': 'nosniff'}
    return StreamingResponse(llm_generator(), headers=headers, media_type='text/plain')

V3

# streaming_handler_var is the context variable the nemoguardrails server uses
from nemoguardrails.context import streaming_handler_var


@app.post("/stream", tags=["chat"])
async def stream(request):
    rails = app.rails  # LLMRails instance
    messages = request.messages  # Chat history
    streaming_handler = StreamingHandler()
    streaming_handler_var.set(streaming_handler)
    streaming_handler.disable_buffer()

    asyncio.create_task(rails.generate_async(
        messages=messages, streaming_handler=streaming_handler))

    headers = {'X-Content-Type-Options': 'nosniff'}
    return StreamingResponse(streaming_handler, headers=headers, media_type='text/plain')

Except in V2 (which bypasses the guardrails and chats with the LLM directly), whenever there is a response, the final message seems to be returned only after it has been fully generated, meaning (I think) that there is no real streaming, even though the response uses transfer-encoding: chunked. In V2, the tokens do appear on my terminal one after the other. It seems to me that the text is being streamed but buffered somewhere until it is fully generated, or maybe something in my configuration is preventing the stream of tokens.
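
As a side note, a quick client-side check like the following can help tell buffered responses apart from real streaming (a minimal sketch; httpx and the exact request payload are assumptions based on the endpoint above):

import asyncio
import httpx

async def check_stream():
    payload = {"messages": [{"role": "user", "content": "hi there"}]}
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream("POST", "http://localhost:8000/stream", json=payload) as resp:
            # With real streaming, chunks print incrementally; with buffering,
            # everything arrives at once just before the response closes.
            async for chunk in resp.aiter_text():
                print(repr(chunk))

asyncio.run(check_stream())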

Speaking of configuration, I have used two different providers and two different LLMs:

models:
  - type: main
    engine: nim
    model: meta/llama-3.1-8b-instruct
    parameters:
      base_url: `a url/v1`
      max_tokens: 200
      stop:
        - "\n"
        - "User message:"
models:
  - type: main
    engine: openai
    model: meta/llama-3.3-70b-instruct
    parameters:
      base_url:  `a url/v1`
      max_tokens: 200
      api_key: "-"
      top_p: 0.9

Both providers behave the same in V1 and V3, with the issue explained above. V2 only works with openai (I got an AttributeError: 'AIMessageChunk' object has no attribute 'encode' with nim, but that error is for another time).

I bring up the LLM providers because I can switch between them dynamically, and I was wondering whether using different providers means I have to implement my /stream endpoint differently in each case.

Other considerations

  • Using input check rails
  • The YAML has the streaming: true flag
  • Using a custom RAG action registered on the system
  • Working with Colang 1.0

What I think could be the issue

  • I am not setting up the StreamingHandler + StreamingResponse correctly in the API
  • I am using custom actions that could block the streaming functionality
  • The model is not supported
  • The LLM provider is not supported with streaming

If anything else is needed to help solve the issue, please feel free to ask.

Steps To Reproduce

  1. Create yaml and colang files
  2. Create FastAPI server
  3. Launch and try

Expected Behavior

Expected token-by-token chunking behavior, like LangChain's llm.stream().

Actual Behavior

The text appears only after generation has finished.
The final chunk is the whole generated text.

@Wildshire Wildshire added bug Something isn't working status: needs triage New issues that have not yet been reviewed or categorized. labels Feb 20, 2025
@Pouyanpi
Collaborator

Pouyanpi commented Feb 21, 2025

Hi @Wildshire, thanks for opening such a complete issue 👍🏻

I will look into it in detail later, but I'd like you to try out the following:

async for chunk in rails.stream_async(messages=messages):
    print(f"CHUNK:{chunk}")

Preferably, first try it without updating your endpoint implementation.

from nemoguardrails import RailsConfig, LLMRails
config = RailsConfig.from_path("PATH/TO/CONFIG")
rails = LLMRails(config)
messages = [{"role": "user", "content": "what can you do?"}]
async for chunk in rails.stream_async(messages=messages):
    print(f"CHUNK:{chunk}")

Let me know how it works.

@Pouyanpi Pouyanpi self-assigned this Feb 21, 2025
@Wildshire
Author

Hi @Pouyanpi, thanks a lot for the quick response.

I did a small script with that code like this:

import asyncio
from nemoguardrails import RailsConfig, LLMRails


async def demo():
    config = RailsConfig.from_path("./config", verbose=True)
    rails = LLMRails(config)

    messages = [{"role": "user", "content": "hi there"}]
    async for chunk in rails.stream_async(messages=messages):
        print(f"CHUNK:{chunk}")

if __name__ == "__main__":
    asyncio.run(demo())

without changing the LLM endpoint. From my trials I got

CHUNK: (A very long text)

after waiting a couple of seconds (the full LLM call).

I also tried disabling the RAG action just in case, but got the same results.

Not sure if this helps, but these are the invocation params I got in the generate_bot_message step:

NIM

Invocation Params {'_type': 'chat-nvidia-ai-playground', 'stop': None}

OPENAI

Invocation Params {'model_name': 'meta/llama-3.3-70b-instruct', 'temperature': 0.7, 'top_p': 0.9,
'frequency_penalty': 0, 'presence_penalty': 0, 'n': 1, 'seed': None, 'logprobs': None, 'max_tokens': 200, '_type': 'openai',
'stop': None}

Maybe the streaming parameter is not injected properly?
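
As a quick sanity check on my side (a minimal sketch; it only verifies that the flag made it into the loaded RailsConfig, not that it reaches the LLM):

from nemoguardrails import RailsConfig

config = RailsConfig.from_path("./config")
# Should print True if the streaming flag from config.yml was picked up
print(config.streaming)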

@Pouyanpi
Collaborator

Pouyanpi commented Feb 21, 2025

Hi @Wildshire,

Please make sure that streaming: True is set in the config.yml.

You can also do

config.streaming = True

and then pass it to LLMRails.
@Pouyanpi Pouyanpi added status: needs info Issues that require more information from the reporter to proceed. status: waiting confirmation Issue is waiting confirmation whether the proposed solution/workaround works. labels Feb 27, 2025
@Wildshire
Author

Wildshire commented Feb 28, 2025

Hello again @Pouyanpi

I have done more testing and found some interesting insights. In my original config.yaml file, I was overriding some of the internal prompts to fully customize the guardrails:

config.yaml

colang_version: "1.0"
streaming: true

instructions:
- type: general
  content: |
    `general instructions`

models:
  - type: main
    engine: openai
    model: meta/llama-3.3-70b-instruct
    parameters:
      base_url: some_url
      max_tokens: 200
      api_key: "-"
      top_p: 0.9

# models:
#   - type: main
#     engine: nim
#     model: meta/llama-3.1-8b-instruct
#     parameters:
#       base_url: some_url
#       max_tokens: 200

core:
  embedding_search_provider:
    name: default
    parameters:
      embedding_engine: FastEmbed
      embedding_model: BAAI/bge-small-en-v1.5
      use_batching: true
      max_batch_size: 10
      max_batch_hold: 0.01
    cache:
      enabled: true
      key_generator: md5
      store: filesystem

rails:
  input:
    flows:
    - self check input


# Collection of all the prompts
prompts:
  - task: general
    models:
      - llama3
      - llama-3
    messages:
      - type: system
        content: |
          a long text


  # Prompt for detecting the user message canonical form.
  - task: generate_user_intent
    models:
      - llama3
      - llama-3
    messages:
      - type: system
        content: |
          A long text
    output_parser: "verbose_v1"


  # Prompt for generating the next steps.
  - task: generate_next_steps
    models:
      - llama3
      - llama-3
    messages:
      - type: system
        content: |
          a long text

    output_parser: "verbose_v1"


  # Prompt for generating the bot message from a canonical form.
  - task: generate_bot_message
    models:
      - llama3
      - llama-3
    messages:
      - type: system
        content: |
            a long text
    max_length: 50000
    output_parser: "verbose_v1"



  # Prompt for generating the value of a context variable.
  - task: generate_value
    models:
      - llama3
      - llama-3
    messages:
      - type: system
        content: |
          a long text

    output_parser: "verbose_v1"


  # Prompt for checking if user input is ethical and legal
  - task: self_check_input
    messages:
      - type: system
        content: |-
          a long text
    max_tokens: 5

and my rails were like this:

input_check.co

define flow self check input
  $allowed = execute self_check_input

  if not $allowed
    bot do smth
    stop

I noticed that if I commented out generate_next_steps and generate_bot_message (so the system uses the default prompts), the chunking was fine with both models.

As I wanted my custom templates, I looked at this example and did another demo with my custom config (all prompts uncommented) plus:

rails:
  input:
    flows:
    - self check input
  dialog:
    user_messages:
      embeddings_only: True

And the rail provided there:

define user ask question
    "..."

define flow
    user ...
    # Here we call the custom action, which will call the LLM with the user query
    $result = execute call_llm(user_query=$user_message)

    # In this case, we also return the result as the final message.
    # This is optional.
    bot $result

And everything was working fine with the chunking.

So, in a nutshell:

  • Can you confirm whether customizing the internal prompts may be what is blocking or altering the streaming?
  • For the moment, as the second setup suits my case, I am going to adapt it into the server to see whether the StreamingHandler + StreamingResponse works as intended (one possible shape is sketched below), and I will keep you posted.
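
A minimal sketch of one way that adaptation could look, using rails.stream_async as suggested above instead of a manual StreamingHandler (names mirror the earlier versions and are illustrative, not a final implementation):

@app.post("/stream", tags=["chat"])
async def stream(request):
    rails = app.rails  # LLMRails instance
    messages = request.messages  # Chat history

    async def token_generator():
        # stream_async drives generation and yields chunks as they are produced
        async for chunk in rails.stream_async(messages=messages):
            yield chunk

    headers = {'X-Content-Type-Options': 'nosniff'}
    return StreamingResponse(token_generator(), headers=headers, media_type='text/plain')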
