-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Generalize engine definition for Iceberg tables #675
Conversation
This is an automated comment for commit 9d1a3c4 with description of existing statuses. It's updated for the latest CI running ❌ Click here to open a full report in a separate page
Successful checks
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First pass is done, will do another one later
@@ -75,6 +93,7 @@ class FunctionSecretArgumentsFinder | |||
{ | |||
if (index >= function->arguments->size()) | |||
return; | |||
index = function->arguments->getRealIndex(index); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please create a new variable and give it a different name. Some comments on what "real index" means would be cool as well.
std::string collection_name; | ||
if (function->arguments->at(0)->tryGetString(&collection_name, true)) | ||
{ | ||
NamedCollectionPtr collection = NamedCollectionFactory::instance().tryGet(collection_name); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In which case would NamedCollectionFactory::instance().tryGet
fail to get a named collection? I am asking because I assumed isNamedCollectionName(0)
was already making sure the named collection existed.
If there is a chance this will fail, shouldn't you throw?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isNamedCollectionName actually only check that this argument is an identifier. This code is only for masking passwords and other sensitive info in logs etc., and getting storage_type from named collection here is only because S3 and Azure have passwords in different positions. I believe that if this code can't detect type correctly it should not break query execution more than it was broken before, because is used for example when ClickHouse writes query when it already catch another exception.
@@ -144,15 +150,235 @@ class DataLakeConfiguration : public BaseStorageConfiguration, public std::enabl | |||
using StorageS3IcebergConfiguration = DataLakeConfiguration<StorageS3Configuration, IcebergMetadata>; | |||
# endif | |||
|
|||
#if USE_AZURE_BLOB_STORAGE | |||
# if USE_AZURE_BLOB_STORAGE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Undo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is inside other #if
#if USE_AVRO
...
# if USE_AZURE_BLOB_STORAGE
....
# endif
....
#endif
@@ -167,8 +167,7 @@ class StorageObjectStorage::Configuration | |||
using Path = std::string; | |||
using Paths = std::vector<Path>; | |||
|
|||
static void initialize( | |||
Configuration & configuration, | |||
virtual void initialize( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a strong reason on why you made this a member instead of a static method?
I don't have a preference, perhaps a member is indeed better, but makes the review process way harder. There are a few changes in this PR, and the other ones I reviewed, that seem kind of unrelated.
I think it would be best if you did the refactorings in different PRs and kept changes to a minimum, that would make the review process much easier
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to override it in StorageIcebergConfiguration, and virtual member can't be static.
virtual void fromNamedCollection(const NamedCollection & collection, ContextPtr context) = 0; | ||
virtual void fromAST(ASTs & args, ContextPtr context, bool with_structure) = 0; | ||
|
||
virtual ObjectStorageType extractDynamicStorageType(ASTs & /* args */, ContextPtr /* context */, ASTPtr * /* type_arg */ = nullptr) const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a static storage type as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it returns S3
, Azure
, HDFS
, Local
for different storage_type
parameter if it exists.
}); | ||
|
||
# if USE_AWS_S3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Undo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same - inside #if USE_AVRO
using StorageHDFSIcebergConfiguration = DataLakeConfiguration<StorageHDFSConfiguration, IcebergMetadata>; | ||
# endif | ||
|
||
using StorageLocalIcebergConfiguration = DataLakeConfiguration<StorageLocalConfiguration, IcebergMetadata>; | ||
|
||
|
||
class StorageIcebergConfiguration : public StorageObjectStorage::Configuration, public std::enable_shared_from_this<StorageObjectStorage::Configuration> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps a comment mentioning it works for s3, azure and etc
"DataLake can have only one key-value argument: storage_type=()."); | ||
} | ||
|
||
auto value = type_ast_function->arguments->children[1]->as<ASTLiteral>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't checked the code, but doesn't as
throw if cast is invalid?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It throws exception only for references, for pointers it returns nullptr
when can't cast to type.
https://github.com/ClickHouse/ClickHouse/blob/master/src/Common/typeid_cast.h#L38
(typeid_cast
is used inside at
)
|
||
if (name && name->name() == storage_type_name) | ||
{ | ||
if (type_it != args.end()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would use an extra boolean variable for clarity: bool found_storage_type_argument
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used the same way as in StorageURL::evalArgsAndCollectHeaders
already.
https://github.com/ClickHouse/ClickHouse/blob/master/src/Storages/StorageURL.cpp#L1503
It's a good idea to refactor it later and make a common method to extract key-value arguments, let's make this change in that method.
void skipArgument(size_t n) { skipped_indexes.insert(n); } | ||
void unskipArguments() { skipped_indexes.clear(); } | ||
size_t getRealIndex(size_t n) const | ||
{ | ||
for (auto idx : skipped_indexes) | ||
{ | ||
if (n < idx) | ||
break; | ||
++n; | ||
} | ||
return n; | ||
} | ||
size_t skippedSize() const { return skipped_indexes.size(); } | ||
private: | ||
std::set<size_t> skipped_indexes; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is kind of confusing. I get the idea, but not entirely and why it has to be done this way.
Why do you have to manually call getRealIndex
in some places knowing you already implemented that in ARgumentsAST:
std::unique_ptr<Argument> at(size_t n) const override
{ /// n is relative index, some can be skipped
return std::make_unique<ArgumentTreeNode>(arguments->at(getRealIndex(n)).get());
}
Why unskip?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In markSecretArgument
and maskAzureConnectionString
index goes outside in code, which works with raw index, not through this local Arguments
class.
Like here - https://github.com/ClickHouse/ClickHouse/blob/master/src/Analyzer/Resolve/QueryAnalyzer.cpp#L2932
@@ -287,6 +317,48 @@ class FunctionSecretArgumentsFinder | |||
markSecretArgument(url_arg_idx + 4); | |||
} | |||
|
|||
std::string findIcebergStorageType() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you have findIcebergStorageType
and extractDynamicStorageType
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't you implement the same remove argument approach instead of doing the "skipIndices" thing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
findIcebergStorageType
is used in logic to mask secrets, different storage types has secrets on different position, so need to detect type to choose proper method (findS3TableEngineSecretArguments
or findAzureTableEngineSecretArguments
). I use "skip" thing because can't remove argument - this code is used to writing in logs, and add arguments must be there.
In my opinion here hard to reuse code, and other old methods like findS3TableEngineSecretArguments
don't reuse code, have a logic duplicate instead. Code reusing makes code more complex here.
9d1a3c4
to
2f4960c
Compare
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Use Iceberg function/engine with storage type as parameter instead of IcebergS3/IcebergAzure/IcebergHDFS
Documentation entry for user-facing changes
ClickHouse has several table functions and table engines to work with Iceberg tables, separate function/engine for each available storage type -
icebergS3
,icebergAzure
,icebergHDFS
.This PR allows to use single table function
iceberg
and table engineIceberg
with named parameterstorage_type
.Syntax before:
Syntax after:
Also if named collection is used to store access parameters, field
storage_type
can be placed in same named collection:By default
storage_type
iss3
for backward compatibility with table functioniceberg
- now it is an alias oficebergS3
.