-
Notifications
You must be signed in to change notification settings - Fork 610
Support blob descriptor conversion for INSERT INTO SELECT across tables #6372
Description
Use Case
When performing INSERT INTO SELECT operations where the source table contains blob columns, users may want to:
- Select blobs from table A (possibly with complex transforms like JOINs, filters, projections)
- Insert the selected data into table B
- Defer blob materialization until the final insert time
This requires converting blob descriptors from table A into a format that table B can consume as external blob references, and then materializing them during the write.
Example scenario:
INSERT INTO target_table
SELECT
t1.id,
t1.video_blob,
t2.metadata
FROM source_table t1
JOIN metadata_table t2 ON t1.id = t2.source_id
WHERE t2.category = 'important'Relationship to OSS-665
This is complementary to OSS-665 (materializing external blob references on write):
- OSS-665: Converts external URIs (e.g.,
s3://bucket/file.mp4) → Lance-managed storage - This issue: Converts Lance table A blob descriptors → Lance table B storage
Both features share the goal of materializing blob data into the target table's storage.
Proposed Behavior
When inserting blobs from one Lance table into another:
- Convert blob descriptors: Transform the source table's blob descriptors into external blob references that point to the source table's storage
- Materialize on insert: Fetch the actual blob bytes from the source table and store them in the target table's storage
- Route through size-based logic: Apply the standard inline/packed/dedicated storage decision based on blob size
Why Always Materialize?
It is always beneficial to materialize the blob in this case because:
- Cleanup complexity: If table B references blobs stored in table A, deleting or compacting table A becomes hazardous — we'd need cross-table dependency tracking
- Lifecycle independence: Each table should fully own its data for independent versioning, compaction, and deletion
- Storage isolation: Avoids subtle bugs where table B becomes corrupted if table A is modified or deleted
API Considerations
The blob descriptor conversion could be exposed as:
Internal/Automatic
During INSERT INTO SELECT, Lance automatically detects blob columns from other tables and handles the conversion + materialization transparently.
Explicit API
# Convert blob descriptors for cross-table insert
converted_blobs = lance.convert_blob_descriptors(
source_table="./source.lance",
blob_column="video_blob",
row_ids=[1, 2, 3], # or a filter expression
)Implementation Considerations
- Lazy evaluation: The blob bytes should not be fetched until the actual write occurs
- Batch processing: Support converting multiple blob descriptors efficiently
- Error handling: Clear errors when source table blobs are inaccessible
- Version awareness: Handle cases where source table version changes between SELECT and INSERT
Related
- OSS-665 - Support materializing external blob references on write