Library for uploading blobs as part of instrumentation #GenAI #MultiModal #3065

# What problem do you want to solve?

Many (if not most) observability backends are not built to accept arbitrarily large blobs.

GenAI observability, however, introduces a use case where users want to log full prompt/response data. In "multi-modal" use cases that involve large files (PDFs, PNGs, JPEGs, etc.), users still want this data to be recorded. For example, a user prompts an LLM with a PDF, asks a question about it, and requests an infographic summarizing the essential details; the LLM responds with summary text as well as a JPEG infographic covering the salient points. The GenAI semantic conventions are moving towards recording prompt/response details in event bodies (which end up in logs), but many logging backends cannot receive such large blob payloads.

In [OpenLLMetry](https://github.com/traceloop/openllmetry), this has been addressed with an [`ImageUploader`](https://github.com/traceloop/openllmetry/blob/8b1f29249e7094dcfd7319a85766a8ac6557f698/packages/traceloop-sdk/traceloop/sdk/images/image_uploader.py), which is very specific to the Traceloop backend. Generic instrumentation of GenAI frameworks (such as when moving frameworks from OpenLLMetry to the OTel Python repo) will likely require a more generic alternative that provides this capability.

# Describe the solution you'd like

At a high level, I would like to separate this out into:

 1. **A consumption interface** that is aimed at implementors of instrumentation packages.
 2. **A provider interface/library** that is aimed at those trying to provide a storage backend for such blobs.
 3. **Default implementations** that use the above to provide useful out-of-the-box functionality.
 4. **Conventions** for actually making use of this mechanism.

## Common

### NOT_UPLOADED

A constant used as a response when failing to upload.

```
NOT_UPLOADED = '/dev/null'
```

### Blob

Encapsulates the raw payload together with associated properties:

```
from typing import Optional


class Blob:
    """Raw payload plus the properties that describe it."""

    def __init__(self, raw_bytes: bytes, content_type: Optional[str] = None, labels: Optional[dict] = None):
        ...

    # A classmethod (not a staticmethod), since it receives `cls`.
    @classmethod
    def from_data_uri(cls, uri: str, labels: Optional[dict] = None) -> "Blob":
        ...

    @property
    def raw_bytes(self) -> bytes:
        ...

    @property
    def content_type(self) -> Optional[str]:
        ...

    @property
    def labels(self) -> dict:
        ...
```

The `from_data_uri` function can construct a `Blob` from a URI like `data:image/jpeg;base64,...` (or other, similar, base64-encoded data URIs).
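
As a rough illustration of what `from_data_uri` might do, here is a minimal sketch. It assumes only base64-encoded data URIs need to be handled and that malformed input should raise `ValueError`; neither behavior is specified by this proposal, and the private attribute names are placeholders.

```python
import base64
from typing import Optional


class Blob:
    """Illustrative stand-in for the proposed Blob class."""

    def __init__(self, raw_bytes: bytes, content_type: Optional[str] = None,
                 labels: Optional[dict] = None):
        self._raw_bytes = raw_bytes
        self._content_type = content_type
        self._labels = dict(labels or {})

    @classmethod
    def from_data_uri(cls, uri: str, labels: Optional[dict] = None) -> "Blob":
        # Expected shape: data:<content-type>;base64,<payload>
        if not uri.startswith("data:"):
            raise ValueError("not a data URI")
        header, sep, payload = uri[len("data:"):].partition(",")
        if not sep:
            raise ValueError("data URI is missing the ',' separator")
        if not header.endswith(";base64"):
            # Assumption: non-base64 data URIs are out of scope for this sketch.
            raise ValueError("only base64-encoded data URIs are handled")
        content_type = header[: -len(";base64")] or None
        return cls(base64.b64decode(payload), content_type=content_type, labels=labels)

    @property
    def raw_bytes(self) -> bytes:
        return self._raw_bytes

    @property
    def content_type(self) -> Optional[str]:
        return self._content_type

    @property
    def labels(self) -> dict:
        return self._labels
```

With this sketch, `Blob.from_data_uri("data:text/plain;base64,aGVsbG8=")` yields a blob whose `raw_bytes` is `b"hello"` and whose `content_type` is `"text/plain"`.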

## Consumption Interface

### get_blob_uploader

A function used in instrumentation to retrieve the uploader for a certain context/usage:

```
def get_blob_uploader(use_case: Optional[str] = None) -> BlobUploader:
    ...
```

This allows different uploaders to be configured for different contexts/usages (similar to how `logging.getLogger` hands out named loggers in Python). It always returns an uploader of some kind, falling back to a default that reports an error and drops the data.
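
One way this lookup could work is a small registry keyed by `use_case`. The sketch below is only an assumption about the mechanics: `set_blob_uploader` and `_DroppingUploader` are hypothetical names that do not appear in this proposal, and the fallback order (use-case-specific, then default, then dropping) is a guess.

```python
import logging
from abc import ABC, abstractmethod
from typing import Dict, Optional

NOT_UPLOADED = "/dev/null"
_logger = logging.getLogger(__name__)


class BlobUploader(ABC):
    @abstractmethod
    def upload_async(self, blob) -> str:
        ...


class _DroppingUploader(BlobUploader):
    """Hypothetical fallback: reports an error and drops the data."""

    def upload_async(self, blob) -> str:
        _logger.error("No blob uploader configured; dropping blob")
        return NOT_UPLOADED


_uploaders: Dict[Optional[str], BlobUploader] = {}
_fallback = _DroppingUploader()


def set_blob_uploader(use_case: Optional[str], uploader: BlobUploader) -> None:
    """Hypothetical registration hook (not part of the proposal text)."""
    _uploaders[use_case] = uploader


def get_blob_uploader(use_case: Optional[str] = None) -> BlobUploader:
    # Prefer a use-case-specific uploader, then the default (None), then the fallback.
    return _uploaders.get(use_case) or _uploaders.get(None) or _fallback
```

Because the function never returns `None`, instrumentation can call `get_blob_uploader(...).upload_async(...)` unconditionally.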

### BlobUploader

Provides a way to upload data asynchronously, getting the URL to which the data is expected to land.

```
from abc import ABC, abstractmethod


class BlobUploader(ABC):

    @abstractmethod
    def upload_async(self, blob: Blob) -> str:
        ...
```

The `upload_async` function returns quickly with the URL where the data is expected to land and attempts to write the data to that URL in the background. It may return `NOT_UPLOADED` if uploading is disabled.

### Helpers: `detect_content_type`, `generate_labels_for_X`

If instrumentation does not know the content type, it may use the following to generate it:

````
   def detect_content_type(raw_bytes: bytes) -> str:
      ...
````
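
A minimal sketch of such detection, assuming a check of well-known file signatures ("magic bytes") is sufficient and that `application/octet-stream` is an acceptable fallback. The proposal does not say how detection should work, so the set of formats here is illustrative only.

```python
def detect_content_type(raw_bytes: bytes) -> str:
    # Well-known leading byte signatures for a few common formats.
    signatures = [
        (b"\x89PNG\r\n\x1a\n", "image/png"),
        (b"\xff\xd8\xff", "image/jpeg"),
        (b"%PDF-", "application/pdf"),
        (b"GIF87a", "image/gif"),
        (b"GIF89a", "image/gif"),
    ]
    for prefix, content_type in signatures:
        if raw_bytes.startswith(prefix):
            return content_type
    # Assumption: unknown payloads are reported as generic binary data.
    return "application/octet-stream"
```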

Instrumentation should use helpers such as the following to populate a minimum set of labels:

```
    def generate_labels_for_span(trace_id, span_id) -> dict:
       ...

    def generate_labels_for_span_event(trace_id, span_id, event_name, event_index) -> dict:
       ...

    ...
```

The above ensures that `labels` includes a minimum subset of metadata needed to tie back to the original source.
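
For illustration, these helpers might look like the sketch below. The label keys (`otel_type`, `trace_id`, `span_id`, `event_name`, `event_index`) are placeholders that this proposal does not define. The sketch also assumes the IDs arrive as integers and are rendered as 32- and 16-character lowercase hex strings, the usual textual form of OpenTelemetry trace and span IDs.

```python
def generate_labels_for_span(trace_id: int, span_id: int) -> dict:
    return {
        "otel_type": "span",  # hypothetical key identifying the source signal
        "trace_id": format(trace_id, "032x"),
        "span_id": format(span_id, "016x"),
    }


def generate_labels_for_span_event(trace_id: int, span_id: int,
                                   event_name: str, event_index: int) -> dict:
    # Start from the span labels so an event can always be tied to its span.
    labels = generate_labels_for_span(trace_id, span_id)
    labels.update({
        "otel_type": "span_event",
        "event_name": event_name,
        "event_index": str(event_index),
    })
    return labels
```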


## Provider Interface

### SimpleBlobUploader

Base class that providers can use to implement blob uploading without dealing with asynchronous processing, queuing, and other associated details.

```
class SimpleBlobUploader(ABC):

   @abstractmethod
   def generate_destination_uri(self, blob: Blob) -> str:
      ...

   @abstractmethod
   def upload_sync(self, uri: str, blob: Blob):
      ...
```
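
To show how little a provider would need to write, here is a sketch of a provider that stores blobs in a local directory. `LocalDirectoryUploader` is a hypothetical example and not one of the proposed default implementations; `Blob` is reduced to a stand-in that carries only `raw_bytes`.

```python
import uuid
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname


@dataclass
class Blob:
    """Minimal stand-in for the proposed Blob class."""
    raw_bytes: bytes


class SimpleBlobUploader(ABC):
    @abstractmethod
    def generate_destination_uri(self, blob: Blob) -> str:
        ...

    @abstractmethod
    def upload_sync(self, uri: str, blob: Blob) -> None:
        ...


class LocalDirectoryUploader(SimpleBlobUploader):
    """Hypothetical provider that writes blobs into a local directory."""

    def __init__(self, directory: str):
        self._directory = Path(directory).resolve()

    def generate_destination_uri(self, blob: Blob) -> str:
        # A random name avoids collisions without inspecting the payload.
        return (self._directory / uuid.uuid4().hex).as_uri()

    def upload_sync(self, uri: str, blob: Blob) -> None:
        # Blocking write; the async wrapper is expected to call this in the background.
        path = Path(url2pathname(urlparse(uri).path))
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(blob.raw_bytes)
```

Splitting URI generation from the write lets the caller learn the destination before any I/O happens, which is what allows `upload_async` to return quickly.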

### blob_uploader_from_simple_blob_uploader

Function that constructs a `BlobUploader` given the simpler interface:

```
def blob_uploader_from_simple_blob_uploader(simple_uploader: SimpleBlobUploader) -> BlobUploader:
    ...
```
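
A possible shape for this adapter is sketched below, assuming a thread pool is an acceptable way to run uploads in the background. `_ThreadedBlobUploader`, the `max_workers` default, and the choice to return `NOT_UPLOADED` when URI generation fails are all assumptions; the proposal leaves queue size, thread count, and threads-versus-processes open.

```python
import logging
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor

NOT_UPLOADED = "/dev/null"
_logger = logging.getLogger(__name__)


class BlobUploader(ABC):
    @abstractmethod
    def upload_async(self, blob) -> str:
        ...


class SimpleBlobUploader(ABC):
    @abstractmethod
    def generate_destination_uri(self, blob) -> str:
        ...

    @abstractmethod
    def upload_sync(self, uri: str, blob) -> None:
        ...


class _ThreadedBlobUploader(BlobUploader):
    """Hypothetical adapter that runs upload_sync on a background thread pool."""

    def __init__(self, simple_uploader: SimpleBlobUploader, max_workers: int = 2):
        self._simple = simple_uploader
        # Note: ThreadPoolExecutor's work queue is unbounded; a real
        # implementation would likely want to cap it.
        self._executor = ThreadPoolExecutor(max_workers=max_workers)

    def upload_async(self, blob) -> str:
        try:
            uri = self._simple.generate_destination_uri(blob)
        except Exception:
            _logger.exception("Failed to generate a destination URI")
            return NOT_UPLOADED
        self._executor.submit(self._upload, uri, blob)
        return uri  # returned before the upload has completed

    def _upload(self, uri: str, blob) -> None:
        try:
            self._simple.upload_sync(uri, blob)
        except Exception:
            # Failures are logged rather than raised into instrumented code.
            _logger.exception("Failed to upload blob to %s", uri)


def blob_uploader_from_simple_blob_uploader(
        simple_uploader: SimpleBlobUploader) -> BlobUploader:
    return _ThreadedBlobUploader(simple_uploader)
```

Keeping this logic in one place is what would let queue and thread settings be configured centrally rather than per provider.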

## Default implementations

Ideally, it should be possible to specify the following environment variables:

 - `BLOB_UPLOAD_ENABLED`
 - `BLOB_UPLOAD_URI_PREFIX`

... and to have things work "out of the box" where `BLOB_UPLOAD_ENABLED` is true and where `BLOB_UPLOAD_URI_PREFIX` starts with any of the following:

   -  "gs://GCS_BUCKET_NAME" (Google Cloud Storage)
   -  "s3://S3_BUCKET_NAME" (Amazon S3)
   -  "azblob://ACCOUNT/CONTAINER" (Azure Blob)

It should be easy for additional providers to be added, in support of additional prefixes.
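
The prefix-based dispatch could be sketched roughly as follows. `register_provider` and `uploader_from_env` are hypothetical names, the set of values treated as "true" is an assumption, and the factories are placeholders for whatever actually constructs a GCS, S3, or Azure uploader.

```python
import os
from typing import Callable, Dict, Mapping, Optional

# Maps a URI scheme prefix (e.g. "gs://") to a factory that builds an uploader.
_factories: Dict[str, Callable[[str], object]] = {}


def register_provider(scheme_prefix: str, factory: Callable[[str], object]) -> None:
    """Hypothetical hook for adding support for a new URI prefix."""
    _factories[scheme_prefix] = factory


def uploader_from_env(environ: Optional[Mapping[str, str]] = None) -> Optional[object]:
    environ = os.environ if environ is None else environ
    # Assumption: these spellings count as "enabled"; the proposal does not say.
    enabled = environ.get("BLOB_UPLOAD_ENABLED", "").strip().lower() in ("1", "true", "yes")
    if not enabled:
        return None
    uri_prefix = environ.get("BLOB_UPLOAD_URI_PREFIX", "")
    for scheme_prefix, factory in _factories.items():
        if uri_prefix.startswith(scheme_prefix):
            return factory(uri_prefix)
    # No registered provider understands this prefix.
    return None
```

A provider package would then call something like `register_provider("gs://", make_gcs_uploader)` at import time, where `make_gcs_uploader` is that package's own (here hypothetical) factory.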

## Conventions

The above proposal is independent of conventions for actually putting this into use/practice. 

I had originally envisioned a more expansive set of conventions for this ([Blob Reference Properties](https://github.com/open-telemetry/semantic-conventions/pull/1521)), but I think it's also fine for narrower-scoped conventions to be defined where needed (e.g. within GenAI, for a specific GenAI event type, proposing that a specific event body field be a URI referencing content that has been uploaded). This proposal is focused primarily on the technical details of _how_ such uploading should be performed and less on _when_ to do so.

I'd ideally like to keep these two things orthogonal: this proposal focuses on enabling the capability for such properties to come into existence, and when we take advantage of that capability can be defined separately.

# Describe alternatives you've considered

## Not separating out "SimpleBlobUploader" and "blob_uploader_from_simple_blob_uploader"

This would likely lead to duplicated effort in handling async processing. It might also make it difficult to centrally control configuration related to the async processing, such as the size of the queue, the number of threads to dedicate to this background processing, whether to use threads or processes, etc.

## Not providing "labels"

This would make it hard to include metadata in the upload that makes it possible to link from a blob to the observability that generated it. For example, suppose that GCS, S3, Azure Blob, or some other provider wanted to enable navigation from the blob in that system to the trace, span, event, etc. that generated the blob; this would be difficult without metadata.

## Naming this "ImageUploader"

Although this is what OpenLLMetry calls it, the intended usage is broader in scope.

## Naming this something that includes "GenAI"

Although GenAI instrumentation is the motivating use case, this could in theory be useful for other uses. Additionally, while this is for GenAI observability, it does not make use of GenAI in its implementation, which such a name might suggest.

## Making it strictly synchronous

This is not likely going to fly in instrumentation, where blocking on the upload would introduce excessive latency.

## Not having a `use_case` parameter to `get_blob_uploader`

In the short term, this would be OK. But it could preclude future enhancements such as dispatching to different uploader implementations depending on the caller/user of the library. For example, when debugging, it may be useful to capture only a portion of uploads to a local folder while otherwise not injecting/modifying uploads in general. Having this kind of parameter may help to implement such a use case down the line.

## Raising an exception instead of returning `NOT_UPLOADED`

This is a viable alternative approach. However, I imagine that this could lead to accidental failure to handle the exception, whereas recording "/dev/null" as the value of a property indicating that the data never got uploaded is more likely to be OK. In this case, I'm erring on the side of failing gracefully rather than failing loudly (since the severity of the failure is not significant and can mostly be ignored).

## Providing a way to be notified when the upload completes or if it fails

I think this could easily be added in the future without breaking changes, but it creates extra complexity. I'd prefer to omit this piece from an initial solution, since I think this complexity will have little return-on-investment.


### Additional Context

_No response_

### Would you like to implement a fix?

Yes
