bug: streaming on different llm providers behavior and api format #1008

Open · 3 of 4 tasks
Wildshire opened this issue Feb 20, 2025 · 4 comments

Labels: bug (Something isn't working) · status: needs info (Issues that require more information from the reporter to proceed) · status: needs triage (New issues that have not yet been reviewed or categorized) · status: waiting confirmation (Issue is waiting confirmation whether the proposed solution/workaround works)

@Wildshire

Did you check docs and existing issues?

  • I have read all the NeMo-Guardrails docs
  • I have updated the package to the latest version before submitting this issue
  • (optional) I have used the develop branch
  • I have searched the existing issues of NeMo-Guardrails

Python version (python --version)

Python 3.10

Operating system/version

Linux

NeMo-Guardrails version (if you must use a specific version and not the latest)

0.11.0

Describe the bug

The issue

I have a custom FastAPI server integrated with NeMo Guardrails and I took notice of the streaming feature. I have been trying to integrate it but failed, without knowing exactly what I am doing wrong.

I have reviewed the documentation here and issues like #893, #459 or #546, and I was still not able to get proper streaming in my server.

This is how I have set up the server (the different versions below are different attempts at making it work; I also took inspiration from your own nemoguardrails server here):

V1

# Shared imports for V1–V3 below
import asyncio

from fastapi.responses import StreamingResponse
from nemoguardrails.streaming import StreamingHandler


@app.post("/stream", tags=["chat"])
async def stream(request):
    rails = app.rails  # LLMRails instance
    messages = request.messages  # Chat history

    async def token_generator():
        streaming_handler = StreamingHandler()
        # Kick off generation in the background; tokens arrive via the handler
        asyncio.create_task(rails.generate_async(
            messages=messages, streaming_handler=streaming_handler))
        async for chunk in streaming_handler:
            yield chunk

    headers = {'X-Content-Type-Options': 'nosniff'}
    return StreamingResponse(token_generator(), headers=headers, media_type='text/plain')

V2

@app.post("/stream", tags=["chat"])
async def stream(request):
    rails = app.rails  # LLMRails instance
    messages = request.messages  # Chat history
    message = messages[-1]["content"]  # Last user message

    async def llm_generator():
        # Bypasses the guardrails and streams directly from the underlying LLM
        for chunk in rails.llm.stream(message):
            yield chunk

    headers = {'X-Content-Type-Options': 'nosniff'}
    return StreamingResponse(llm_generator(), headers=headers, media_type='text/plain')

V3

# streaming_handler_var is the context variable the nemoguardrails server uses
from nemoguardrails.context import streaming_handler_var


@app.post("/stream", tags=["chat"])
async def stream(request):
    rails = app.rails  # LLMRails instance
    messages = request.messages  # Chat history
    streaming_handler = StreamingHandler()
    streaming_handler_var.set(streaming_handler)
    streaming_handler.disable_buffer()

    asyncio.create_task(rails.generate_async(
        messages=messages, streaming_handler=streaming_handler))

    headers = {'X-Content-Type-Options': 'nosniff'}
    return StreamingResponse(streaming_handler, headers=headers, media_type='text/plain')

Except in V2 (which bypasses the guardrails and chats with the LLM directly), whenever there is a response, the final message seems to be returned only after it has been fully generated, meaning (I think) that there is no real streaming, even though the response uses transfer-encoding: chunked. In V2, the tokens do appear on my terminal one after the other. It seems to me that the text is being streamed but buffered somewhere until it is fully generated, or maybe something in my configuration is preventing the stream of tokens.
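
As a side note, a quick client-side check like the following can help tell buffered responses apart from real streaming (a minimal sketch; httpx and the exact request payload are assumptions based on the endpoint above):

import asyncio
import httpx

async def check_stream():
    payload = {"messages": [{"role": "user", "content": "hi there"}]}
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream("POST", "http://localhost:8000/stream", json=payload) as resp:
            # With real streaming, chunks print incrementally; with buffering,
            # everything arrives at once just before the response closes.
            async for chunk in resp.aiter_text():
                print(repr(chunk))

asyncio.run(check_stream())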

Speaking of configuration, I have used two different providers and two different LLMs:

models:
  - type: main
    engine: nim
    model: meta/llama-3.1-8b-instruct
    parameters:
      base_url: `a url/v1`
      max_tokens: 200
      stop:
        - "\n"
        - "User message:"
models:
  - type: main
    engine: openai
    model: meta/llama-3.3-70b-instruct
    parameters:
      base_url:  `a url/v1`
      max_tokens: 200
      api_key: "-"
      top_p: 0.9

Both providers behave the same in V1 and V3, with the issue explained above. V2 only works with openai (I got an AttributeError: 'AIMessageChunk' object has no attribute 'encode' with nim, but that error is for another time).

I bring up the LLM providers because I can switch between them dynamically, and I was wondering whether using different providers means I have to implement my /stream endpoint differently in each case.

Other considerations

  • Using input check rails
  • The YAML has the streaming: true flag
  • Using a custom RAG action registered on the system
  • Working with Colang 1.0

What I think could be the issue

  • I am not setting up the StreamingHandler + StreamingResponse correctly in the API
  • I am using custom actions that could block the streaming functionality
  • The model is not supported
  • The LLM provider is not supported with streaming

If anything else is needed to help solve the issue, please feel free to ask.

Steps To Reproduce

  1. Create yaml and colang files
  2. Create FastAPI server
  3. Launch and try

Expected Behavior

Expected token-by-token chunking behavior, like LangChain's llm.stream().

Actual Behavior

The text appears only after generation has finished.
The final chunk is the whole generated text.

@Wildshire Wildshire added bug Something isn't working status: needs triage New issues that have not yet been reviewed or categorized. labels Feb 20, 2025
@Pouyanpi
Collaborator

Pouyanpi commented Feb 21, 2025

Hi @Wildshire, thanks for opening such a complete issue 👍🏻

I will look into it in detail later, but I'd like you to try out the following:

async for chunk in rails.stream_async(messages=messages):
    print(f"CHUNK:{chunk}")

Preferably, first try it without updating your endpoint implementation.

from nemoguardrails import RailsConfig, LLMRails
config = RailsConfig.from_path("PATH/TO/CONFIG")
rails = LLMRails(config)
messages = [{"role": "user", "content": "what can you do?"}]
async for chunk in rails.stream_async(messages=messages):
    print(f"CHUNK:{chunk}")

Let me know how it works.

@Pouyanpi Pouyanpi self-assigned this Feb 21, 2025
@Wildshire
Author

Hi @Pouyanpi, thanks a lot for the quick response.

I did a small script with that code like this:

import asyncio
from nemoguardrails import RailsConfig, LLMRails


async def demo():
    config = RailsConfig.from_path("./config", verbose=True)
    rails = LLMRails(config)

    messages = [{"role": "user", "content": "hi there"}]
    async for chunk in rails.stream_async(messages=messages):
        print(f"CHUNK:{chunk}")

if __name__ == "__main__":
    asyncio.run(demo())

without changing the LLM endpoint. From my trials I got

CHUNK: (A very long text)

after waiting a couple of seconds (the full LLM call).

I also tried disabling the RAG action just in case, but got the same results.

Not sure if this helps, but these are the invocation params I got in the generate_bot_message step:

NIM

Invocation Params {'_type': 'chat-nvidia-ai-playground', 'stop': None}

OPENAI

Invocation Params {'model_name': 'meta/llama-3.3-70b-instruct', 'temperature': 0.7, 'top_p': 0.9,
'frequency_penalty': 0, 'presence_penalty': 0, 'n': 1, 'seed': None, 'logprobs': None, 'max_tokens': 200, '_type': 'openai',
'stop': None}

Maybe the streaming parameter is not injected properly?
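
As a quick sanity check on my side (a minimal sketch; it only verifies that the flag made it into the loaded RailsConfig, not that it reaches the LLM):

from nemoguardrails import RailsConfig

config = RailsConfig.from_path("./config")
# Should print True if the streaming flag from config.yml was picked up
print(config.streaming)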

@Pouyanpi
Collaborator

Pouyanpi commented Feb 21, 2025

Hi @Wildshire,

Please make sure that streaming: True is set in the config.yml.

You can also do

config.streaming = True

and then pass it to LLMRails.
@Pouyanpi Pouyanpi added status: needs info Issues that require more information from the reporter to proceed. status: waiting confirmation Issue is waiting confirmation whether the proposed solution/workaround works. labels Feb 27, 2025
@Wildshire
Author

Wildshire commented Feb 28, 2025

Hello again @Pouyanpi

I have done more testing and found some interesting insights. In my original config.yaml file, I was overriding some of the internal prompts to fully customize the guardrails:

config.yaml

colang_version: "1.0"
streaming: true

instructions:
- type: general
  content: |
    `general instructions`

models:
  - type: main
    engine: openai
    model: meta/llama-3.3-70b-instruct
    parameters:
      base_url: some_url
      max_tokens: 200
      api_key: "-"
      top_p: 0.9

# models:
#   - type: main
#     engine: nim
#     model: meta/llama-3.1-8b-instruct
#     parameters:
#       base_url: some_url
#       max_tokens: 200

core:
  embedding_search_provider:
    name: default
    parameters:
      embedding_engine: FastEmbed
      embedding_model: BAAI/bge-small-en-v1.5
      use_batching: true
      max_batch_size: 10
      max_batch_hold: 0.01
    cache:
      enabled: true
      key_generator: md5
      store: filesystem

rails:
  input:
    flows:
    - self check input


# Collection of all the prompts
prompts:
  - task: general
    models:
      - llama3
      - llama-3
    messages:
      - type: system
        content: |
          a long text


  # Prompt for detecting the user message canonical form.
  - task: generate_user_intent
    models:
      - llama3
      - llama-3
    messages:
      - type: system
        content: |
          A long text
    output_parser: "verbose_v1"


  # Prompt for generating the next steps.
  - task: generate_next_steps
    models:
      - llama3
      - llama-3
    messages:
      - type: system
        content: |
          a long text

    output_parser: "verbose_v1"


  # Prompt for generating the bot message from a canonical form.
  - task: generate_bot_message
    models:
      - llama3
      - llama-3
    messages:
      - type: system
        content: |
            a long text
    max_length: 50000
    output_parser: "verbose_v1"



  # Prompt for generating the value of a context variable.
  - task: generate_value
    models:
      - llama3
      - llama-3
    messages:
      - type: system
        content: |
          a long text

    output_parser: "verbose_v1"


  # Prompt for checking if user input is ethical and legal
  - task: self_check_input
    messages:
      - type: system
        content: |-
          a long text
    max_tokens: 5

and my rails were like this:

input_check.co

define flow self check input
  $allowed = execute self_check_input

  if not $allowed
    bot do smth
    stop

I noticed that if I commented out generate_next_steps and generate_bot_message (so the system uses the default prompts), the chunking was fine with both models.

As I wanted my custom templates, I looked at this example and did another demo with my custom config (all prompts uncommented) plus:

rails:
  input:
    flows:
    - self check input
  dialog:
    user_messages:
      embeddings_only: True

And the rail provided there:

define user ask question
    "..."

define flow
    user ...
    # Here we call the custom action, which will call the LLM with the user query
    $result = execute call_llm(user_query=$user_message)

    # In this case, we also return the result as the final message.
    # This is optional.
    bot $result

And everything was working fine with the chunking.

So, in a nutshell:

  • Can you confirm whether customizing the internal prompts may be what is blocking or altering the streaming?
  • For the moment, as the second setup suits my case, I am going to adapt it into the server to see whether the StreamingHandler + StreamingResponse works as intended (one possible shape is sketched below), and I will keep you posted.
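
A minimal sketch of one way that adaptation could look, using rails.stream_async as suggested above instead of a manual StreamingHandler (names mirror the earlier versions and are illustrative, not a final implementation):

@app.post("/stream", tags=["chat"])
async def stream(request):
    rails = app.rails  # LLMRails instance
    messages = request.messages  # Chat history

    async def token_generator():
        # stream_async drives generation and yields chunks as they are produced
        async for chunk in rails.stream_async(messages=messages):
            yield chunk

    headers = {'X-Content-Type-Options': 'nosniff'}
    return StreamingResponse(token_generator(), headers=headers, media_type='text/plain')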
