Clearly separate Lance Namespace and Lance Catalog #6109
Love it. I’m also thinking it might make sense to add global lock support at the catalog layer. What do you think?
Since the partition spec is a feature of the Lance Directory Catalog, does that imply that LanceDirectoryCatalog needs to add partition-related methods, such as resolve_or_create_partition_table, plan_scan, etc.? At the engine integration layer (for example, in lance-spark), partitioned tables would then ultimately be handled through a LanceDirectoryCatalog instance. For example:
Let me first explain my understanding:
However, there is one thing I haven’t fully figured out yet. The abstraction of lance-namespace is that it supports an arbitrarily deep namespace hierarchy, where all leaf nodes are tables. After splitting out lance-catalog, how should the hierarchical abstractions of lance-namespace and lance-catalog be defined?
The definition of lance-namespace will probably remain unchanged. It still needs to support unlimited namespace depth to remain flexible and to support concepts such as metalake. For lance-catalog, I see a few options:
Should lance-catalog be standardized to a two-level abstraction of database and table?
Or should lance-catalog, like lance-namespace, also support arbitrary intermediate nodes with tables as leaf nodes?
Or, in the lance-namespace tree structure, would the upper-level nodes be managed by lance-namespace, with some intermediate nodes becoming lance-catalog nodes and the entire subtree starting from each lance-catalog node managed by lance-catalog?
From the perspective of the partition spec, if we restrict lance-catalog to only database and table levels, there may be some conflicts. Since we have decided that partitioning is a feature of DirectoryNamespace, does that mean:
Semantically, this feels slightly awkward. If we use namespace, it is easier to understand because namespace is an abstract concept. However, if we use database to represent the partition spec namespace, it may introduce some confusion.
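To make the hierarchy question concrete, here is a minimal Python sketch of the third option, where a lance-catalog node is mounted somewhere inside a lance-namespace tree. Everything in it is hypothetical and for illustration only: `split_at_catalog`, `fits_two_level`, and the example identifiers are not part of lance-namespace or any proposed lance-catalog API.

```python
def split_at_catalog(object_id, catalog_roots):
    """Split an identifier into (namespace-managed prefix, catalog-managed rest).

    object_id: list of path components, e.g. ["metalake", "cat1", "db", "tbl"].
    catalog_roots: set of tuples marking nodes where a catalog takes over.
    Both names are hypothetical; this only models the idea in the comment above.
    """
    for depth in range(1, len(object_id) + 1):
        prefix = tuple(object_id[:depth])
        if prefix in catalog_roots:
            return list(prefix), list(object_id[depth:])
    # No catalog node on the path: the whole identifier stays in the namespace tree.
    return list(object_id), []


def fits_two_level(catalog_part):
    """True if the catalog-managed part is exactly (database, table)."""
    return len(catalog_part) == 2


roots = {("metalake", "cat1")}

ns_part, cat_part = split_at_catalog(["metalake", "cat1", "db", "tbl"], roots)
print(ns_part, cat_part, fits_two_level(cat_part))

# One extra level below the catalog node (as a nested namespace would add)
# no longer fits a strict database/table model.
_, deeper = split_at_catalog(["metalake", "cat1", "db", "tbl", "part0"], roots)
print(deeper, fits_two_level(deeper))
```

Under these assumptions, a strict two-level catalog rejects any identifier with more than two components below the catalog node, which is where the tension with a namespace-based partition layout would show up.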
The Lance Namespace spec has been evolving very rapidly and now covers a lot of ground. For very good reasons, readers have found it confusing what exactly it is. At this point, it has become a basket of different things, as listed in the spec itself:
The Lance Namespace spec consists of four main parts:
Client Spec: A consistent abstraction that adapts to various catalog specs,
allowing users to access and operate on a collection of tables in a multimodal lakehouse.
This is the core reason why we call it "Namespace" rather than "Catalog" -
namespace can mean catalog, schema, metastore, database, metalake, etc.,
and the spec provides a unified interface across all of them.
Native Catalog Specs: Natively maintained catalog specs that are compliant with the Lance Namespace client spec:
tables are organized directly on storage (local filesystem, S3, GCS, Azure, etc.)
their own custom handling in their specific enterprise environments.
Implementation Specs: Defines how a given catalog spec integrates with the client spec.
It details how an object in a Lance Namespace maps to an object in the specific catalog spec,
and how each operation in Lance Namespace is fulfilled by the catalog spec.
The implementation specs for Directory and REST namespaces are part of the native Lance Namespace spec.
Implementation specs for other catalog specs
(e.g. Apache Polaris, Unity Catalog, Apache Hive Metastore, Apache Iceberg REST Catalog)
are considered integrations - anyone can provide additional implementation specs outside Lance Namespace,
and they can be owned by external parties without needing to go through the Lance community voting process to be adopted.
Partitioning Spec: Defines a storage format for partitioned namespaces built on the Directory Namespace.
It enables organizing data into physically separated units (partitions) that share a common schema,
with support for partition evolution, pruning, and multi-partition transactions.
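As a rough illustration of the split the quoted spec describes, the sketch below models a client-side abstraction with two toy implementations behind it: one storage-only, one adapting an external catalog. All class and method names here (`NamespaceClient`, `ToyDirectoryNamespace`, `ToyExternalCatalogNamespace`, `list_tables`, `table_location`) are hypothetical and are not the real lance-namespace API; the `<root>/<name>.lance` path layout is likewise an assumption made for the example.

```python
from abc import ABC, abstractmethod


class NamespaceClient(ABC):
    """Hypothetical client-side abstraction; NOT the real lance-namespace API."""

    @abstractmethod
    def list_tables(self, namespace_id):
        """Return the table names under the given namespace identifier."""

    @abstractmethod
    def table_location(self, table_id):
        """Return the storage location of the given table identifier."""


class ToyDirectoryNamespace(NamespaceClient):
    """Toy storage-only catalog: tables live directly under a root path."""

    def __init__(self, root, tables):
        self.root = root.rstrip("/")
        self._tables = sorted(tables)

    def list_tables(self, namespace_id):
        # This toy only has a root namespace, identified by an empty list.
        return list(self._tables) if not namespace_id else []

    def table_location(self, table_id):
        # Assumed layout for illustration: <root>/<table name>.lance
        return f"{self.root}/{table_id[-1]}.lance"


class ToyExternalCatalogNamespace(NamespaceClient):
    """Toy adapter over an external catalog modeled as {database: {table: location}}."""

    def __init__(self, catalog):
        self._catalog = catalog

    def list_tables(self, namespace_id):
        (database,) = namespace_id  # exactly one level: the database
        return sorted(self._catalog[database])

    def table_location(self, table_id):
        database, table = table_id
        return self._catalog[database][table]


def describe(client, namespace_id):
    """Engine-side code sees only the shared interface, not the backing catalog."""
    return {
        name: client.table_location(list(namespace_id) + [name])
        for name in client.list_tables(namespace_id)
    }
```

The point of the sketch is only that `describe` works unchanged against either implementation, which is the role the client spec plays, while the two toy classes stand in for what the proposal would call catalogs.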
Why it is what it is today
This developed organically. The original goal of the namespace spec was indeed to connect to all the existing catalogs, metastores, metalakes, and so on. Along the way, we also developed a storage-only catalog and a REST version for customer adoption and customization, and so far we have simply kept them in the namespace spec itself.
At the same time, because we use OpenAPI to generate data models for the namespace clients, the REST model serves a double purpose: it is used both for the REST catalog and for the namespace client.
Proposal
I propose we separate out the concept of Lance Catalog and Lance Namespace, in the following architecture:
So we clearly separate out what Lance Catalog and Lance Namespace mean:
From a project structure perspective, the directory and REST implementations are in the lance repo directly, so I don't think we have to change much there.
A few implications:
I think this would help avoid a lot of confusion and make the whole lakehouse format architecture clearer.
Curious what people think about this.