-
Notifications
You must be signed in to change notification settings - Fork 4.7k
[WIP] HIVE-27473: Rewrite MetaStoreClients to be composable #5771
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
[WIP] HIVE-27473: Rewrite MetaStoreClients to be composable #5771
Conversation
The content in Google Doc is perfectly aligned with my memory. I'm still reading the proposed changes. I'd appreciate it if the latest dependencies between MetaStoreClients somewhere were written somewhere |
* It helps to reduce the time spent in compilation by using HS2 memory more effectively, and it allows to | ||
* improve HMS throughput for multi-tenant workloads by reducing the number of calls it needs to serve. | ||
*/ | ||
public class HiveMetaStoreClientWithLocalCache extends NoopHiveMetaStoreClientDelegator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that could cause confusion as we introduce classes with exactly the same name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if we call it CachingMetaStoreClientProxy/Decorator
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was checking the same, the name should be different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have renamed it to LocalCachingMetaStoreClientProxy
.
* This class provides two features: query-level cache and Session level transaction. | ||
* SG:FIXME, is "Session level transaction" correct explanation? | ||
*/ | ||
public class HiveMetaStoreClientWithSessionFeature extends NoopHiveMetaStoreClientDelegator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SessionMetaStoreClientProxy/Decorator
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have renamed it to SessionMetaStoreClientProxy
.
new org.apache.hadoop.hive.ql.metadata.client.HiveMetaStoreClientWithLocalCache(conf, thriftClient); | ||
IMetaStoreClient sessionLevelClient = | ||
new HiveMetaStoreClientWithSessionFeature(conf, clientWithLocalCache); | ||
IMetaStoreClient clientWithTmpTable = new HiveMetaStoreClientWithTmpTable(conf, sessionLevelClient); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be great if we could extract this logic into MetaStoreClientFactory
that implements IMetaStoreClientFactory
so users could define constructors for custom clients
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the suggested extraction would be more appropriately addressed in HIVE-12679, which is the origin of this PR.
(BTW, I'm particularly interested in supporting custom clients that connect to external catalog such as AWS Glue or the Iceberg REST catalog, and I agree that it would be nice if we provide IMetaStoreClientFactory
as an entrypoint of writing a custom client.)
import java.util.List; | ||
import java.util.Map; | ||
|
||
public class NoopHiveMetaStoreClientDelegator implements IMetaStoreClient { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe MetaStoreClientProxy/Decorator
+ should be abstract
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have renamed it to BaseMetaStoreClientProxy
and made it abstract.
import static org.apache.hadoop.hive.metastore.utils.MetaStoreUtils.getPvals; | ||
import static org.apache.hadoop.hive.metastore.utils.MetaStoreUtils.isExternalTable; | ||
|
||
public class HiveMetaStoreClientWithTmpTable extends NoopHiveMetaStoreClientDelegator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i am not sure we need to extract this since we have just 1 internal usage of it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I integrate this layer with session feature layer. Now SessionMetaStoreClientProxy
handles both temporary tables and session-level features.
@okumin, I have added an Appendix section to the design document, which summarizes the dependencies between MetaStoreClients as well as which classes in the PR codebase provide specific features compared to the master branch. I'm not entirely sure if this aligns with your request, so please let me know if it doesn't address what you were looking for. |
@deniskuzZ, Thank you for looking over the PR. I think the suggested class names make sense, and I'll update the PR accordingly. Regarding |
import java.util.Map; | ||
|
||
public class NoopHiveMetaStoreClientDelegator implements IMetaStoreClient { | ||
// effectively final |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can it be final?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it could. I changed it to final.
CacheKey cacheKey = new CacheKey(KeyType.TABLE, req); | ||
Table r = (Table) mscLocalCache.getIfPresent(cacheKey); | ||
if (r == null) { | ||
r = getDelegate().getTable(req); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use this.delegate instead of getDelegate().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my opinion, since we cannot remove the public accessor getDelegate()
due to HiveMetaStoreClient
, it would be better to keep using getDelegate()
in inherited classes as well as HiveMetaStoreClient
. I'll check HiveMetaStoreClient
again and remove the public accessor if we can safely decouple it from underlying client.
this.delegate = delegate; | ||
} | ||
|
||
public IMetaStoreClient getDelegate() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why should this be available outside of this class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the review. I made it possible to set and access delegate
externally because it was necessary during the early implementation. Since this PR code is directly come from the PoC stage, the code is quite rough, but the next commit will likely be an improved version that addresses your feedback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, HiveMetaStoreClient
needs the public accessor. I didn't investigate the code deeply yet, and I'll check whether we can decouple HiveMetaStoreCleint
from its delegate and finally remove this public accessor.
return new GetPartitionsByNamesResult(result); | ||
private static IMetaStoreClient createUnderlyingClient(Configuration conf, HiveMetaHookLoader hookLoader, | ||
Boolean allowEmbedded) throws MetaException { | ||
IMetaStoreClient thriftClient = new ThriftHiveMetaStoreClient(conf, allowEmbedded); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we have ThriftHiveMetaStoreClient here it means we can not reuse the functionality of this class with other than ThriftHiveMetaStoreClient e.g. RESTClient for Iceberg REST catalog can not reuse the functionality of ql/HiveMetaStoreClientWithLocalCache. This should be an input as delegate I guess.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't consider custom catalogs in this PR as it is not a consideration of HIVE-27473. In my opinion, HIVE-12679 and HIVE-28658 could address custom catalog issues based on the separated HMSClient layers introduced in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked the dropTable-related methods, and they should work. I also think removing the zig-zag dependencies helps our maintainability of the Hive project.
Instead, we have to maintain lots of @Override
in any classes, and third-party developers also have to do it. It could be a negative point we want to resolve someday.
public void createTable(Table tbl) throws MetaException, NoSuchObjectException, TException { | ||
CreateTableRequest request = new CreateTableRequest(tbl); | ||
createTable(request); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may also need createTable(Table tbl, EnvironmentContext envContext)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't add createTable(Table tbl, EnvironmentContext envContext)
since it is not a method of IMetaStoreClient
.
It seems that only the test code uses this method, so I think we can safely skip this method by modifying the test to use createTable(CreateTableRequest request)
, which is a method of IMetaStoreClient
and can be used to pass both Table
and EnvironmentContext
.
@Override | ||
public void createTable(CreateTableRequest request) | ||
throws MetaException, NoSuchObjectException, TException { | ||
Table tbl = request.getTable(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be more consistent if we set the default catalog name when it's empty
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that your suggestion makes the code more consistent.
However, I have a question about the use of the default catalog and where the responsibility for setting it should lie. Currently, we use getDefaultCatalog()
in many places to fill in empty catalog names. In my opinion, there should ideally be a single point of responsibility, or at least as few classes as possible, where the default catalog is defined and applied.
I also think we should minimize the behavior of proxy classes whenever possible. Rather than setting the default catalog within proxy classes, it might be better to pass null and let the underlying class, or even HMSHandler
, handle it.
I'm not entirely sure this is a better approach, and I'm fine with continuing the current strategy of applying getDefaultCatalog()
wherever needed. It would be helpful if you could share your opinion on this issue.
ignoreUnknownTable); | ||
// SG:FIXME, introduce IMetaStoreClient.dropTable :: DropTableRequest -> void | ||
// We need a new method that takes catName and checks tbl.isSetTxnId() | ||
// Or does TxN stuff meaningful only if we are in default catalog? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to invoke the hook.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have enclosed the dropTable
call with the hook. I'll check this method again as well as the noted issue when I revisit HookMetaStoreClientProxy
.
e620366
to
147329d
Compare
@okumin , @deniskuzZ , @zratkai, thank you for your review. I've updated the branch to address your feedback and improve code quality. The main changes are summarized as follows:
There are still several remaining issues, including |
I found Hive 4.0.0 has around 20 deprecated methods. Are we allowed to discontinue them before HIVE-27473? |
Can we raise the proposal to the mailing list and tie-break the best approaches at this point? People in #4444 should be very interested. Then, we can invest our resources in reviewing this pull request. The Option 2 of my original document might make integration clean. We will add a small change to Apache Hive, and third-party vendors implement the minimal number of methods. However, this approach doesn't allow existing integrations to run as they are. If we can make Option 1 clean and the community people agree with it, I think many people would be happy. Additionally, the HiveMetaStoreClient family with complicated dependencies will be decomposed! |
I think removing the deprecated methods would be helpful for simplifying the work in HIVE-27473. However, I'm not entirely sure about the standard process for removing deprecated methods in Hive. If their removal is acceptable at this stage, I would be supportive of proceeding in that direction and able to work for it. |
Sure, it would be nice to make an intermediate design decision at this point. |
I know Glue Data Catalog depends on the unmerged HIVE-12679, and I can infer that Databricks also uses or has used it from their release notes. Keeping the compatibility might help those users(it is also possible that they prefer a cleaner interface). I will involve them once we start the discussion. |
I happened to find Apache Ranger also relies on IMetaStoreClient. I believe HIVE-27473 does not impact the feature, but I am leaving a comment nonetheless. |
@okumin, I understand now. Thank you for the explanation. |
i think we can drop the deprecated methods in 4.1 and right now review and mark methods that we want to discontinue |
|
@@ -3411,6 +3411,11 @@ public boolean listPartitionsByExpr(String catName, String db_name, String tbl_n | |||
throw new UnsupportedOperationException(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class is being removed.
#5812
My review progress is visible here. @ngsg @deniskuzZ @dengzhhu653 While reviewing the pull request, I found MetaStoreFilterHook is used in 40+ places, and I guess we no longer need it. Do you think it is a reasonable option to discontinue it immediately? Reason: MetaStoreFilterHook is declared as LimitedPrivate for Apache Sentry, which is already retired. At least, I don't see any use cases in Apache Hive(and Apache Ranger as well). It would simplify the entire dependency and reduce the risks where inconsistencies can unexpectedly cause a problem. % git grep metastore.filter.hook
common/src/java/org/apache/hadoop/hive/conf/HiveConf.java: METASTORE_FILTER_HOOK("hive.metastore.filter.hook", "org.apache.hadoop.hive.metastore.DefaultMetaStoreFilterHookImpl",
data/files/rfc5424-hs2.log:<14>1 2019-03-21T07:06:19.799Z hiveserver2-99ddccbd8-49sbv hiveserver2 1 cb6c1a29-4ba7-11e9-83d0-02c17ec35ac0 [mdc@18060 class="metastore.HiveMetaStoreClient" level="INFO" thread="main"] Mestastore configuration metastore.filter.hook changed from org.apache.hadoop.hive.ql.security.authorization.plugin.AuthorizationMetaStoreFilterHook to org.apache.hadoop.hive.metastore.DefaultMetaStoreFilterHookImpl
itests/util/src/main/java/org/apache/hadoop/hive/ql/QTestUtil.java: conf.set("hive.metastore.filter.hook", "org.apache.hadoop.hive.metastore.DefaultMetaStoreFilterHookImpl");
ql/src/test/queries/clientpositive/explainanalyze_3.q:set hive.metastore.filter.hook=org.apache.hadoop.hive.metastore.DefaultMetaStoreFilterHookImpl;
ql/src/test/queries/clientpositive/explainuser_3.q:set hive.metastore.filter.hook=org.apache.hadoop.hive.metastore.DefaultMetaStoreFilterHookImpl;
standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/conf/MetastoreConf.java: FILTER_HOOK("metastore.filter.hook", "hive.metastore.filter.hook",
% |
@Override | ||
public List<String> getTables(String dbname, String tablePattern, TableType tableType) throws MetaException { | ||
try { | ||
return getTables(getDefaultCatalog(conf), dbname, tablePattern, tableType); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this needs to be getDelegate().getTables(dbname, tablePattern, tableType)
. That's because SessionMetaStoreClientProxy
has getTables(String dbname, String tablePattern, TableType tableType)
but doesn't have getTables(String catName, String dbname, String tablePattern, TableType tableType)
.
public List<TableMeta> getTableMeta(String dbPatterns, String tablePatterns, List<String> tableTypes) | ||
throws MetaException { | ||
try { | ||
return getTableMeta(getDefaultCatalog(conf), dbPatterns, tablePatterns, tableTypes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to call getDeletage().getTableMeta(String dbPatterns, String tablePatterns, List<String> tableTypes)
@Override | ||
public List<String> getAllTables(String dbname) throws MetaException { | ||
try { | ||
return getAllTables(getDefaultCatalog(conf), dbname); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably, this needs to use getDelegate().getAllTables(dbname)
I suspended the review because there are some cases where |
Sorry, MetaStoreFilterHook is used as HiveMetaStoreAuthorizer or AuthorizationMetaStoreFilterHook. I was searching only for the standalone-metastore project. My bad |
} | ||
|
||
@Override | ||
public Partition appendPartition(String catName, String dbName, String tableName, List<String> partVals) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have to override appendPartition(String dbName, String tableName, List<String> partVals)
. The original HiveMetaStoreClient#appendPartition(String dbName, String tableName, List<String> partVals)
invokes SessionHiveMetaStoreClient#appendPartition(String dbName, String tableName, List<String> partVals)
.
} | ||
|
||
@Override | ||
public Partition appendPartition(String catName, String dbName, String tableName, String partitionName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have to override appendPartition(String dbName, String tableName, String name)
} | ||
|
||
@Override | ||
public int getNumPartitionsByFilter(String catName, String dbName, String tableName, String filter) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
int getNumPartitionsByFilter(String dbName, String tableName, String filter)
also needs to handle tmp tables.
Current we need this in the client side, IMO it's better to put the filter on the ThriftHiveMetaStoreClient to enforce the check |
I prefer the way to enhance the client through each lawyer. |
|
||
abstract public class BaseMetaStoreClientProxy implements IMetaStoreClient { | ||
|
||
private final IMetaStoreClient delegate; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: private -> protected
public class SessionHiveMetaStoreClient extends HiveMetaStoreClientWithLocalCache implements IMetaStoreClient { | ||
private static final Logger LOG = LoggerFactory.getLogger(SessionHiveMetaStoreClient.class); | ||
|
||
public class SessionHiveMetaStoreClient extends BaseMetaStoreClientProxy implements IMetaStoreClient { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about make the constructor private, and create a static method newSessionHiveMetaStoreClient(Configuration conf, HiveMetaHookLoader hookLoader, Boolean allowEmbedded)
new LocalCachingMetaStoreClientProxy(conf, thriftClient); | ||
IMetaStoreClient sessionLevelClient = | ||
new SessionMetaStoreClientProxy(conf, clientWithLocalCache); | ||
IMetaStoreClient clientWithHook = new HookMetaStoreClientProxy(conf, hookLoader, sessionLevelClient); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check if hookLoader
is null before creating the HookMetaStoreClientProxy
?
} | ||
|
||
@Override | ||
public void truncateTable(String dbName, String tableName, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we keep these methods in this class?
|
||
@Override | ||
public void dropDatabase(DropDatabaseRequest req) throws TException { | ||
if (req.isCascade()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this logic both on this class and ThriftHiveMetaStoreClient?
} | ||
|
||
@Override | ||
public boolean dropPartition(String catName, String dbName, String tblName, List<String> partVals, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we have to override all dropPartition variants so that all of them handle tmp tables.
} | ||
|
||
@Override | ||
public void alter_partition(String catName, String dbName, String tblName, Partition newPart, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we have to override all variants.
I reviewed half of the methods and commented on where I found issues. My impression is that the number of methods is huge, and it is not trivial to find mistakes. I found we typically have two types of problems. To avoid them, I am wondering if we can have an abstract class that enforces two rules. This is a sample implementation.
From the point of view of vendors or Hive maintainers, they don't want to implement and maintain 5 types of In the new foundation, only the most generic method works as a bridge, and the other variants work just as aliases, which are correct in any layer. We can asynchronously discontinue legacy variants. P.S. Anyway, I'm still reviewing the current PR, as it might not be easy to reach a consensus to add one more refactoring ticket. |
public List<String> getTables(String catName, String dbName, String tablePattern) throws TException { | ||
// SG:FIXME, inefficient | ||
List<String> tableNames = getDelegate().getTables(catName, dbName, tablePattern); | ||
// tables is filtered by FilterHook. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use filterTableNamesIfEnabled
? Probably, the answer is no.
hook.preTruncateTable(table, context, partNames); | ||
} | ||
// SG:FIXME | ||
getDelegate().truncateTable(dbName, tableName, partNames, validWriteIds, writeId, deleteData); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably have to propagate catName.
} | ||
|
||
@Override | ||
public List<Table> getTableObjectsByName(String dbName, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this is not reachable. I also wonder what values this method is adding. I feel getTableObjectsByName can be an alias of getTableObjectsByName.
I checked all methods and commented on all places where I had questions |
Thank you for your exhaustive reviews. Based on your comments, I summarize the remaining tasks as follows:
Unfortunately, I won’t be able to make significant progress on HIVE-27473 in the near future. I’ll do my best to return to it as soon as possible. Thanks again for your valuable reviews. |
I also thought of a similar approach when I was writing the PoC, and I agree that the two rules you stated must be upheld. Let me check the code again and try to implement your idea in the next commit. |
Design document: Google docs
What changes were proposed in this pull request?
This patch refactors
HiveMetaStoreClient
and its subclasses in composable classes.Why are the changes needed?
By making
SessionHiveMetaStoreClient
composable, this refactoring enables users to create and use custom Catalogs with Hive's session-level features, such as temporary tables.Does this PR introduce any user-facing change?
No
Is the change a dependency upgrade?
No
How was this patch tested?
This patch is currently in draft state, so no tests have been added yet.