Skip to content

Support blob descriptor conversion for INSERT INTO SELECT across tables #6372

@jackye1995

Description

@jackye1995

Use Case

When performing INSERT INTO SELECT operations where the source table contains blob columns, users may want to:

  1. Select blobs from table A (possibly with complex transforms like JOINs, filters, projections)
  2. Insert the selected data into table B
  3. Defer blob materialization until the final insert time

This requires converting blob descriptors from table A into a format that table B can consume as external blob references, and then materializing them during the write.

Example scenario:

INSERT INTO target_table
SELECT 
    t1.id, 
    t1.video_blob,
    t2.metadata
FROM source_table t1
JOIN metadata_table t2 ON t1.id = t2.source_id
WHERE t2.category = 'important'

Relationship to OSS-665

This is complementary to OSS-665 (materializing external blob references on write):

  • OSS-665: Converts external URIs (e.g., s3://bucket/file.mp4) → Lance-managed storage
  • This issue: Converts Lance table A blob descriptors → Lance table B storage

Both features share the goal of materializing blob data into the target table's storage.

Proposed Behavior

When inserting blobs from one Lance table into another:

  1. Convert blob descriptors: Transform the source table's blob descriptors into external blob references that point to the source table's storage
  2. Materialize on insert: Fetch the actual blob bytes from the source table and store them in the target table's storage
  3. Route through size-based logic: Apply the standard inline/packed/dedicated storage decision based on blob size

Why Always Materialize?

It is always beneficial to materialize the blob in this case because:

  • Cleanup complexity: If table B references blobs stored in table A, deleting or compacting table A becomes hazardous — we'd need cross-table dependency tracking
  • Lifecycle independence: Each table should fully own its data for independent versioning, compaction, and deletion
  • Storage isolation: Avoids subtle bugs where table B becomes corrupted if table A is modified or deleted

API Considerations

The blob descriptor conversion could be exposed as:

Internal/Automatic

During INSERT INTO SELECT, Lance automatically detects blob columns from other tables and handles the conversion + materialization transparently.

Explicit API

# Convert blob descriptors for cross-table insert
converted_blobs = lance.convert_blob_descriptors(
    source_table="./source.lance",
    blob_column="video_blob",
    row_ids=[1, 2, 3],  # or a filter expression
)

Implementation Considerations

  • Lazy evaluation: The blob bytes should not be fetched until the actual write occurs
  • Batch processing: Support converting multiple blob descriptors efficiently
  • Error handling: Clear errors when source table blobs are inaccessible
  • Version awareness: Handle cases where source table version changes between SELECT and INSERT

Related

  • OSS-665 - Support materializing external blob references on write

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions