Generalize engine definition for Iceberg tables #675

Merged: 8 commits merged into antalya from feature/antalya-iceberg-table-all-storages on Apr 8, 2025

Conversation

ianton-ru (Author):

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Use Iceberg function/engine with storage type as parameter instead of IcebergS3/IcebergAzure/IcebergHDFS

Documentation entry for user-facing changes

ClickHouse has several table functions and table engines for working with Iceberg tables: a separate function/engine for each available storage type (icebergS3, icebergAzure, icebergHDFS).

This PR allows using a single table function iceberg and a single table engine Iceberg with a named parameter storage_type.

Syntax before:

SELECT * FROM icebergS3('http://minio1:9000/root/table_data', 'minio', 'minio123', 'Parquet');
SELECT * FROM icebergAzureCluster('mycluster', 'http://azurite1:30000/devstoreaccount1', 'cont', '/table_data', 'devstoreaccount1', 'Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==', 'Parquet');
CREATE TABLE mytable ENGINE=IcebergHDFS('/table_data', 'Parquet');

Syntax after:

SELECT * FROM iceberg(storage_type='s3', 'http://minio1:9000/root/table_data', 'minio', 'minio123', 'Parquet');
SELECT * FROM icebergCluster('mycluster', storage_type='azure', 'http://azurite1:30000/devstoreaccount1', 'cont', '/table_data', 'devstoreaccount1', 'Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==', 'Parquet');
CREATE TABLE mytable ENGINE=Iceberg('/table_data', 'Parquet', storage_type='hdfs');

Also, if a named collection is used to store access parameters, the storage_type field can be placed in the same named collection:

<named_collections>
    <s3>
        <url>http://minio1:9001/root/</url>
        <access_key_id>minio</access_key_id>
        <secret_access_key>minio123</secret_access_key>
        <storage_type>s3</storage_type>
    </s3>
</named_collections>
SELECT * FROM iceberg(s3, filename='table_data');

By default storage_type is s3, for backward compatibility with the existing iceberg table function, which is currently an alias of icebergS3.

altinity-robot (Collaborator) commented Mar 7, 2025:

This is an automated comment for commit 9d1a3c4 with a description of existing statuses. It is updated for the latest CI run.

Failed checks

Check name | Description | Status
Integration tests | The integration tests report. In parenthesis the package type is given, and in square brackets are the optional part/total tests | ❌ error
Regression aarch64 Tiered Storage s3amazon | No description for the check yet; please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ❌ failure
Regression aarch64 Tiered Storage s3gcs | No description for the check yet; please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ❌ failure
Sign aarch64 | No description for the check yet; please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ❌ error
Sign release | No description for the check yet; please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ❌ error
Stateless tests | Runs stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc | ❌ error
Stress test | Runs stateless functional tests concurrently from several clients to detect concurrency-related errors | ❌ error
Successful checks
Check name | Description | Status
Builds | No description for the check yet; please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success
Compatibility check | Checks that clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help | ✅ success
Docker keeper image | The check to build and optionally push the mentioned image to docker hub | ✅ success
Docker server image | The check to build and optionally push the mentioned image to docker hub | ✅ success
Install packages | Checks that the built packages are installable in a clear environment | ✅ success
Ready for release | No description for the check yet; please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success
Stateful tests | Runs stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc | ✅ success
Regression aarch64 (all ✅ success, no descriptions yet): Alter attach partition, Alter move partition, Alter replace partition, Benchmark aws_s3, Benchmark gcs, Benchmark minio, Clickhouse Keeper SSL, LDAP authentication, LDAP external_user_directory, LDAP role_mapping, Parquet, Parquet aws_s3, Parquet minio, S3 azure, S3 gcs, S3 minio, Tiered Storage minio, aes_encryption, atomic_insert, base_58, clickhouse_keeper, data_types, datetime64_extended_range, disk_level_encryption, dns, engines, example, extended_precision_data_types, kafka, kerberos, key_value, lightweight_delete, memory, part_moves_between_shards, selects, session_timezone, tiered_storage, window_functions
Regression release (all ✅ success, no descriptions yet): Alter attach partition, Alter move partition, Benchmark aws_s3, Benchmark gcs, Benchmark minio, Clickhouse Keeper SSL, aes_encryption, atomic_insert, base_58, data_types, datetime64_extended_range, dns, engines, example, extended_precision_data_types, kafka, kerberos, key_value, part_moves_between_shards

arthurpassos (Collaborator) left a comment:

First pass is done, will do another one later

@@ -75,6 +93,7 @@ class FunctionSecretArgumentsFinder
{
if (index >= function->arguments->size())
return;
index = function->arguments->getRealIndex(index);
Collaborator:

Please create a new variable and give it a different name. Some comments on what "real index" means would be cool as well.

std::string collection_name;
if (function->arguments->at(0)->tryGetString(&collection_name, true))
{
NamedCollectionPtr collection = NamedCollectionFactory::instance().tryGet(collection_name);
Collaborator:

In which case would NamedCollectionFactory::instance().tryGet fail to get a named collection? I am asking because I assumed isNamedCollectionName(0) was already making sure the named collection existed.

If there is a chance this will fail, shouldn't you throw?

ianton-ru (Author):

isNamedCollectionName actually only checks that this argument is an identifier. This code is only for masking passwords and other sensitive info in logs etc., and storage_type is read from the named collection here only because S3 and Azure have passwords in different positions. I believe that if this code can't detect the type correctly, it should not break query execution any more than it was already broken, because it is used, for example, when ClickHouse writes out the query after it has already caught another exception.
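For illustration, a minimal sketch of that defensive fallback, assuming NamedCollection exposes has()/get() accessors (names here are illustrative, not the exact PR code):

/// Sketch: while masking secrets for logging, resolve storage_type from the named
/// collection if possible; on any failure fall back to the default ("s3") instead
/// of throwing, so error reporting never breaks query logging.
std::string storage_type = "s3";  /// backward-compatible default

std::string collection_name;
if (function->arguments->at(0)->tryGetString(&collection_name, true))
{
    if (auto collection = NamedCollectionFactory::instance().tryGet(collection_name))
    {
        if (collection->has("storage_type"))
            storage_type = collection->get<std::string>("storage_type");
    }
    /// else: collection not found -- keep the default, do not throw.
}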

@@ -144,15 +150,235 @@ class DataLakeConfiguration : public BaseStorageConfiguration, public std::enabl
using StorageS3IcebergConfiguration = DataLakeConfiguration<StorageS3Configuration, IcebergMetadata>;
# endif

#if USE_AZURE_BLOB_STORAGE
# if USE_AZURE_BLOB_STORAGE
Collaborator:

Undo

ianton-ru (Author):

It is inside another #if:

#if USE_AVRO
...
#    if USE_AZURE_BLOB_STORAGE
....
#    endif
....
#endif

@@ -167,8 +167,7 @@ class StorageObjectStorage::Configuration
using Path = std::string;
using Paths = std::vector<Path>;

static void initialize(
Configuration & configuration,
virtual void initialize(
Collaborator:

Is there a strong reason why you made this a member instead of a static method?

I don't have a preference; perhaps a member is indeed better, but it makes the review process way harder. There are a few changes in this PR, and in the other ones I reviewed, that seem kind of unrelated.

I think it would be best if you did the refactorings in separate PRs and kept changes to a minimum; that would make the review process much easier.

ianton-ru (Author):

I need to override it in StorageIcebergConfiguration, and a virtual member can't be static.
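A stripped-down sketch of the constraint being discussed (parameter list and class members simplified from the diff above, not the actual declarations):

class Configuration
{
public:
    virtual ~Configuration() = default;

    /// Was: static void initialize(Configuration & configuration, ASTs & args, ...);
    /// A static member function cannot be virtual, so making it overridable
    /// requires a non-static member:
    virtual void initialize(ASTs & args, ContextPtr context, bool with_structure);
};

class StorageIcebergConfiguration : public Configuration
{
public:
    /// Override that first resolves storage_type, then delegates to the
    /// matching per-storage configuration.
    void initialize(ASTs & args, ContextPtr context, bool with_structure) override;
};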

virtual void fromNamedCollection(const NamedCollection & collection, ContextPtr context) = 0;
virtual void fromAST(ASTs & args, ContextPtr context, bool with_structure) = 0;

virtual ObjectStorageType extractDynamicStorageType(ASTs & /* args */, ContextPtr /* context */, ASTPtr * /* type_arg */ = nullptr) const
Collaborator:

Is there a static storage type as well?

ianton-ru (Author):

No, it returns S3, Azure, HDFS, or Local depending on the storage_type parameter, if it exists.
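Roughly, the mapping described here could look like the following sketch (string spellings assumed; handling of the type_arg out-parameter is omitted):

ObjectStorageType storageTypeFromString(const std::string & type)
{
    if (type == "s3")    return ObjectStorageType::S3;
    if (type == "azure") return ObjectStorageType::Azure;
    if (type == "hdfs")  return ObjectStorageType::HDFS;
    if (type == "local") return ObjectStorageType::Local;
    throw Exception(ErrorCodes::BAD_ARGUMENTS, "Unknown storage_type '{}'", type);
}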

});

# if USE_AWS_S3
Collaborator:

Undo

ianton-ru (Author):

The same: it is inside #if USE_AVRO.

using StorageHDFSIcebergConfiguration = DataLakeConfiguration<StorageHDFSConfiguration, IcebergMetadata>;
# endif

using StorageLocalIcebergConfiguration = DataLakeConfiguration<StorageLocalConfiguration, IcebergMetadata>;


class StorageIcebergConfiguration : public StorageObjectStorage::Configuration, public std::enable_shared_from_this<StorageObjectStorage::Configuration>
Collaborator:

Perhaps add a comment mentioning that it works for s3, azure, etc.

"DataLake can have only one key-value argument: storage_type=().");
}

auto value = type_ast_function->arguments->children[1]->as<ASTLiteral>();
Collaborator:

I haven't checked the code, but doesn't as throw if the cast is invalid?

ianton-ru (Author) commented Mar 28, 2025:

It throws an exception only for references; for pointers it returns nullptr when it can't cast to the type.
https://github.com/ClickHouse/ClickHouse/blob/master/src/Common/typeid_cast.h#L38
(typeid_cast is used inside as)
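A small illustration of that distinction, using the linked typeid_cast directly (illustrative only, not PR code):

IAST * node = type_ast_function->arguments->children[1].get();

/// Pointer form: returns nullptr when the node is not an ASTLiteral,
/// so the caller can report a readable error itself.
if (auto * literal = typeid_cast<ASTLiteral *>(node); literal == nullptr)
{
    /// handle "storage_type value is not a literal"
}

/// Reference form: would throw on a failed cast instead of returning nullptr.
/// auto & literal_ref = typeid_cast<ASTLiteral &>(*node);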


if (name && name->name() == storage_type_name)
{
if (type_it != args.end())
Collaborator:

I would use an extra boolean variable for clarity: bool found_storage_type_argument

ianton-ru (Author):

I used the same approach as in StorageURL::evalArgsAndCollectHeaders.
https://github.com/ClickHouse/ClickHouse/blob/master/src/Storages/StorageURL.cpp#L1503
It would be good to refactor this later and make a common method for extracting key-value arguments; let's make this change in that method.
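For reference, a condensed sketch of that iterator-based pattern (modeled on the StorageURL code linked above; the AST shape and names are assumptions):

/// A key-value argument such as storage_type='s3' arrives as an "equals" function
/// over an identifier and a literal; the iterator doubles as the "already seen" flag.
auto type_it = args.end();
for (auto it = args.begin(); it != args.end(); ++it)
{
    const auto * func = (*it)->as<ASTFunction>();
    if (!func || func->name != "equals")
        continue;

    const auto * name = func->arguments->children[0]->as<ASTIdentifier>();
    if (name && name->name() == storage_type_name)
    {
        if (type_it != args.end())
            throw Exception(ErrorCodes::BAD_ARGUMENTS,
                "DataLake can have only one key-value argument: storage_type=().");
        type_it = it;  /// remember the position instead of a separate bool flag
    }
}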

Comment on lines +35 to +49
void skipArgument(size_t n) { skipped_indexes.insert(n); }
void unskipArguments() { skipped_indexes.clear(); }
size_t getRealIndex(size_t n) const
{
    for (auto idx : skipped_indexes)
    {
        if (n < idx)
            break;
        ++n;
    }
    return n;
}
size_t skippedSize() const { return skipped_indexes.size(); }

private:
    std::set<size_t> skipped_indexes;
Collaborator:

This is kind of confusing. I get the idea, but not entirely, and I don't see why it has to be done this way.

Why do you have to manually call getRealIndex in some places when you already implemented that in ArgumentsAST:

std::unique_ptr<Argument> at(size_t n) const override
{ /// n is relative index, some can be skipped
    return std::make_unique<ArgumentTreeNode>(arguments->at(getRealIndex(n)).get());
}

Why unskip?

ianton-ru (Author) commented Apr 4, 2025:

In markSecretArgument and maskAzureConnectionString the index goes out to code that works with the raw index, not through this local Arguments class.
Like here: https://github.com/ClickHouse/ClickHouse/blob/master/src/Analyzer/Resolve/QueryAnalyzer.cpp#L2932
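To make the translation concrete, a self-contained illustration of what getRealIndex does, with example values (not PR code):

#include <cstddef>
#include <set>

int main()
{
    /// Suppose storage_type= was argument #1 and has been skipped.
    std::set<size_t> skipped_indexes = {1};

    /// n is the "relative" index, counted with skipped arguments removed.
    auto get_real_index = [&](size_t n)
    {
        for (auto idx : skipped_indexes)
        {
            if (n < idx)
                break;
            ++n;  /// shift past every skipped position at or before n
        }
        return n;
    };

    /// get_real_index(0) == 0, get_real_index(1) == 2, get_real_index(2) == 3.
    /// Secrets found at relative positions must be mapped back to these raw AST
    /// positions before being handed to markSecretArgument or analyzer code.
    return 0;
}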

@@ -287,6 +317,48 @@ class FunctionSecretArgumentsFinder
markSecretArgument(url_arg_idx + 4);
}

std::string findIcebergStorageType()
Collaborator:

Why do you have findIcebergStorageType and extractDynamicStorageType?

Collaborator:

Couldn't you implement the same remove argument approach instead of doing the "skipIndices" thing?

ianton-ru (Author):

findIcebergStorageType is used in the secret-masking logic; different storage types have their secrets at different positions, so the type has to be detected to choose the proper method (findS3TableEngineSecretArguments or findAzureTableEngineSecretArguments). I use the "skip" approach because the argument can't be removed: this code is used when writing queries to logs, and all arguments must stay there.
In my opinion it is hard to reuse code here, and the other old methods like findS3TableEngineSecretArguments don't reuse code either; they duplicate the logic instead. Reusing code would make it more complex here.
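A rough sketch of that dispatch (the two finder names are taken from the comment above; the wrapping function and its name are assumptions):

void findIcebergTableEngineSecretArguments()
{
    /// Detect the backing storage first; secrets sit at different argument
    /// positions for S3 and Azure, so the existing per-storage finders are reused.
    const std::string storage_type = findIcebergStorageType();  /// "s3" when not specified

    if (storage_type == "azure")
        findAzureTableEngineSecretArguments();
    else
        findS3TableEngineSecretArguments();
}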

arthurpassos previously approved these changes on Apr 5, 2025
ianton-ru force-pushed the feature/antalya-iceberg-table-all-storages branch from 9d1a3c4 to 2f4960c on April 7, 2025 10:40
ianton-ru changed the base branch from antalya to antalya-25.2 on April 7, 2025 11:46
ianton-ru changed the base branch from antalya-25.2 to antalya on April 7, 2025 12:13
MyroTk added the antalya-25.2.2 (Planned for 25.2.2 release) label on Apr 7, 2025
Enmk merged commit 1f6e189 into antalya on Apr 8, 2025
233 of 332 checks passed
Labels: antalya-25.2.2 (Planned for 25.2.2 release)
5 participants