Support persisting TableMetadata in the metastore #433

Open
wants to merge 41 commits into
base: main

eric-maynard (Contributor) commented Nov 7, 2024

Description

This adds a new flag METADATA_CACHE_MAX_BYTES which allows the catalog to store table metadata in the metastore and vend it from there when loadTable is called.

Entries are cached based on the metadata location. Currently, the entire metadata.json content is cached.
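To illustrate the caching behavior described above, here is a minimal sketch rather than the code in this PR. The class `MetadataCacheSketch`, its in-memory `metastore` map, and the `storageReader` function are hypothetical stand-ins, and the sketch assumes that METADATA_CACHE_MAX_BYTES acts as a per-entry size cap.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of a location-keyed metadata cache; not the Polaris implementation. */
final class MetadataCacheSketch {
  // Assumption: stands in for METADATA_CACHE_MAX_BYTES, treated here as a per-entry cap.
  private final long maxBytes;
  // Assumption: an in-memory map stands in for the metastore.
  private final Map<String, String> metastore = new HashMap<>();
  // Hypothetical reader that fetches metadata.json content from object storage.
  private final Function<String, String> storageReader;
  // Counts reads that had to go to storage, so the caching effect is observable.
  int storageReads = 0;

  MetadataCacheSketch(long maxBytes, Function<String, String> storageReader) {
    this.maxBytes = maxBytes;
    this.storageReader = storageReader;
  }

  /** Returns the metadata.json content, preferring the cached copy keyed by location. */
  String loadMetadataJson(String metadataLocation) {
    String cached = metastore.get(metadataLocation);
    if (cached != null) {
      return cached;
    }
    storageReads++;
    String content = storageReader.apply(metadataLocation);
    // Cache the whole metadata.json content only if it fits under the cap.
    if (content.getBytes(StandardCharsets.UTF_8).length <= maxBytes) {
      metastore.put(metadataLocation, content);
    }
    return content;
  }
}
```

Because the key is the metadata location, a commit that writes a new metadata.json naturally misses the old entry in this sketch; removing stale entries is a separate concern.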


Features not included in this PR:

  1. Support for updating the cache when a table is updated
  2. Support for invalidating cache entries in the background, rather than waiting for loadTable to be called
  3. Structured storage for table metadata

There is partial support for (1) here and I want to extend it, but the goal is to structure things in a way that will allow us to implement (2) and (3) in the future as well.

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • Documentation update
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Existing tests vend table metadata correctly when caching is enabled.

Added a small test in BasePolarisCatalogTest to cover the basic semantics of caching.

Manual testing with EclipseLink: I observed the entities being created in Postgres and saw large metadata being cached:

db=# select length(internalproperties), substring(internalproperties, 1, 1000) from entities where id = 152;
...
 768691 | {"metadata_location":"file:/tmp/quickstart/ns/tn1731005976265/metadata/00000-e77a2576-5efa-4b7a-b948-121813d713f8.metadata.json","content":"{\"format ...

With MySQL, small metadata is persisted:

mysql> SELECT length(internalproperties), substring(internalproperties, 1, 1000) FROM ENTITIES WHERE id = (SELECT MAX(id) FROM ENTITIES WHERE typecode = 10);
. . .
8159 | {"metadata_location":"file:/tmp/quickstart/ns/t2/metadata/00000-64f975bd-c3a8-4069-bb56-f282003e9157.metadata.json","content":"{\"format-version\"

However, large metadata may cause internalproperties to exceed the size limit, in which case nothing is cached. Calls still return safely.
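The "nothing is cached, calls still return safely" behavior can be sketched as a best-effort write. This is a sketch under assumptions, not the code in this PR: `BestEffortCacheWrite` and the `writer` callback are hypothetical, and it assumes an oversized value surfaces as a runtime exception from the persistence layer.

```java
import java.util.function.BiConsumer;

/** Hypothetical sketch of a best-effort cache write; not the Polaris implementation. */
final class BestEffortCacheWrite {
  /** Attempts to persist the entry; returns true only when the write succeeded. */
  static boolean tryPersist(
      BiConsumer<String, String> writer, String metadataLocation, String content) {
    try {
      writer.accept(metadataLocation, content);
      return true;
    } catch (RuntimeException e) {
      // Assumption: a value that exceeds the size limit fails here.
      // The failure is swallowed so that the caller can still serve the request.
      return false;
    }
  }

  /** Returns the content read from storage whether or not caching it succeeded. */
  static String loadWithBestEffortCache(
      BiConsumer<String, String> writer, String metadataLocation, String contentFromStorage) {
    tryPersist(writer, metadataLocation, contentFromStorage);
    return contentFromStorage;
  }
}
```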

flyrain (Contributor) commented Nov 7, 2024

This will potentially reduce a lot of I/O overhead! Thanks for working on it.

dennishuo (Contributor) left a comment

Thanks for taking a stab at this! I think it's worth discussing whether metadata contents could be better stored within the TableLikeEntity itself.

}

// if the metadata is not changed, return early
if (base == metadata) {
Member


should we fix this too? if base.path != metadata.path
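One possible reading of this suggestion, sketched with hypothetical names: `base == metadata` only detects the same object instance, whereas comparing metadata locations also treats two separately loaded copies of the same metadata.json as unchanged. `Meta` and its `location` field are stand-ins for illustration, not the real TableMetadata type.

```java
import java.util.Objects;

/** Hypothetical sketch contrasting reference equality with a location comparison. */
final class MetadataChangeCheck {
  /** Stand-in for table metadata; only the metadata file location matters here. */
  static final class Meta {
    final String location;

    Meta(String location) {
      this.location = location;
    }
  }

  /** The check in the diff: true only when both variables point at the same object. */
  static boolean unchangedByReference(Meta base, Meta metadata) {
    return base == metadata;
  }

  /** The variant suggested in the comment, as interpreted here: compare locations. */
  static boolean unchangedByLocation(Meta base, Meta metadata) {
    if (base == null || metadata == null) {
      return base == metadata;
    }
    return Objects.equals(base.location, metadata.location);
  }
}
```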


4 participants