Proposal: Add storage_options to BasePath #6307
Replies: 3 comments 4 replies
I wonder if we can maybe just call it
Question: is your goal just to have a place to put these kinds of options, or is it to have a standard place that all Lance implementations understand? For example, do you expect LanceDB, Spark, and Trino to all parse these options consistently?
If what you want is the former, then I think you could do this without a format change. We already have a place for table-level config here: Lines 172 to 177 in 7f6fd0e. You could put some JSON in here under the key
If you are aiming for consistency across engines, then I think we need to define a detailed spec for what options there are and how engines should interpret them. This is a lot more work, but I think it's necessary if you want to bake this into the table format spec. If we go this route, it's worth researching what different engines expect when it comes to configuring and mounting different storage systems.
If we did put the metadata into the existing map inside `Manifest`, we would have to prefix it with the storage directory index so that we could understand which base it applied to. So, for example, we could have `lance.storage_options.base0.azure_storage_account`, `lance.storage_options.base1.azure_storage_account`, etc. Another option that we previously discussed was using Azure paths of type
It seems to me like the spec here is actually controlled mostly by the storage providers (Microsoft, Amazon, etc.). But I suppose we would have to write down how specific keys in the format map to specific object storage parameters, which would involve some work.
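To make the prefixed-key idea concrete, here is a minimal Python sketch of how a reader might group such entries by base index. The key layout follows the example above; the function name, the plain `dict` standing in for the manifest's config map, and the account values are assumptions made for illustration, not part of Lance.

```python
import re
from collections import defaultdict

# Key layout taken from the example above (hypothetical, not a Lance spec):
#   lance.storage_options.base<N>.<option_name>
_KEY_PATTERN = re.compile(r"^lance\.storage_options\.base(\d+)\.(.+)$")


def group_storage_options(config):
    """Group prefixed config entries by base index.

    `config` stands in for the string-to-string config map in the manifest.
    Keys that do not match the prefix are ignored.
    """
    per_base = defaultdict(dict)
    for key, value in config.items():
        match = _KEY_PATTERN.match(key)
        if match:
            base_index = int(match.group(1))
            option_name = match.group(2)
            per_base[base_index][option_name] = value
    return dict(per_base)


# Illustrative values only.
config = {
    "lance.storage_options.base0.azure_storage_account": "accounta",
    "lance.storage_options.base1.azure_storage_account": "accountb",
    "some.other.setting": "true",
}
options = group_storage_options(config)
```

One cost of this layout is that every engine has to agree on the prefix and on how the index maps back to a base path, which is the consistency concern raised earlier in the thread.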
Lance supports multi-base tables, in which objects are sharded across multiple base paths. Currently, base paths are defined as follows:
Unfortunately, this is not enough information to uniquely identify base paths in several scenarios. For example, in Azure, a user may have several containers with the same name in different storage accounts. In essence, the storage account is a meaningful component of the location.
Another example is that a user may have some data in Cloudflare R2, and some data in Amazon S3. Both of them use the s3 storage provider, but with different "s3 endpoint URLs."
In order to address these scenarios, I propose to add a storage_options map to the protocol.
We could then add designated storage options to each base path for disambiguation purposes.
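As a rough sketch of the disambiguation this would enable, the Python below treats a base path's identity as the path plus its storage options. `BasePathRef`, `make_ref`, the paths, and the account names are all hypothetical stand-ins; this is not the Lance definition of a base path.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class BasePathRef:
    """Hypothetical stand-in for a base path entry, used only for illustration."""

    path: str
    # Stored as sorted (key, value) pairs so equality and hashing ignore key order.
    storage_options: Tuple[Tuple[str, str], ...] = ()


def make_ref(path: str, storage_options: Optional[Dict[str, str]] = None) -> BasePathRef:
    """Build a reference whose identity covers both the path and its options."""
    items = tuple(sorted((storage_options or {}).items()))
    return BasePathRef(path, items)


# Same container name and path, different storage accounts (illustrative values).
base_a = make_ref("az://container/table", {"azure_storage_account": "accounta"})
base_b = make_ref("az://container/table", {"azure_storage_account": "accountb"})

# With the options included, the two locations no longer collide.
distinct = base_a != base_b
```

Without the options, `base_a` and `base_b` would compare equal on the path alone, which is the ambiguity described above.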
Note that this mechanism is not intended to store secrets, or options such as the number of retries. It is strictly intended for storage options that affect the identity of the path itself. In order to enforce this, we could set up a list of allowed storage keys (the number is not that large).
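A minimal sketch of what such an allowlist check could look like in Python. The key names in `ALLOWED_STORAGE_KEYS` are placeholders chosen for this example (the actual list would need to be agreed on), and `validate_storage_options` is a hypothetical helper, not an existing Lance API.

```python
# Hypothetical allowlist: only options that change which location a path refers to.
ALLOWED_STORAGE_KEYS = frozenset({"azure_storage_account", "s3_endpoint"})


def validate_storage_options(storage_options):
    """Return a copy of the options, or raise if any key is not on the allowlist.

    Secrets and tuning knobs (for example a retry count) are rejected because
    they do not affect the identity of the path.
    """
    rejected = sorted(set(storage_options) - ALLOWED_STORAGE_KEYS)
    if rejected:
        raise ValueError(f"storage options not allowed on a base path: {rejected}")
    return dict(storage_options)


# Accepted: the option disambiguates the location (illustrative value).
accepted = validate_storage_options({"azure_storage_account": "accounta"})
```

Rejecting unknown keys outright, rather than silently dropping them, makes it harder for a secret to end up persisted in table metadata by accident.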