Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Block the creation of catalog with WASB[S] locations #376

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

tzuan16
Copy link
Contributor

@tzuan16 tzuan16 commented Oct 16, 2024

Description

Polaris supports creating catalogs with the deprecated wasbs:// scheme (as opposed to the newer abfss://). However, the current implementation of Iceberg SDK's default ResolvingFileIO would fallback to using the HadoopFileIO when it's provided a wasbs:// path, where the fileIO does not support using vended credentials to read/write to Azure location, as opposed to the ADLSFileIO which supports credential vending and is used for abfss:// paths. This results in when external engines (e.g. Spark) are provided by Polaris a wasbs:// path and a token to access the location, the token cannot be accepted by the fallback HadoopFileIO.

In short, despite Polaris being capable of vending credentials for wasbs:// paths, external engines using the Iceberg SDK for fileIO implementation (which generally is the case) couldn't pick up the vended credentials, and users would need to provide the external engines their own credentials to perform I/O on the location.

This holds until the proposed fix from the Iceberg community is released and users adopt the Iceberg SDK version that contains the fix.

Until then, Polaris would provide a configurable parameter to decide whether to support the creation of wasbs:// pathed catalogs, and the default value would be False (no support).

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • Documentation update
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

  • PolarisAdminServiceValidationTest: scenarios on param on/off

Test Configuration:

  • Hardware:
  • Toolchain:
  • SDK:

Checklist:

Please delete options that are not relevant.

  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • If adding new functionality, I have discussed my implementation with the community using the linked GitHub issue

@tzuan16 tzuan16 force-pushed the block-wasbs-catalog-creation branch from e1189cb to 12b8245 Compare October 16, 2024 05:14
Comment on lines +33 to +37
public static final String ABFS_SCHEME = "abfs://";
public static final String ABFSS_SCHEME = "abfss://";
public static final String WASB_SCHEME = "wasb://";
public static final String WASBS_SCHEME = "wasbs://";

Copy link
Contributor

@eric-maynard eric-maynard Oct 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typically the scheme is not considered to include the ://. I also think there might be a more extensible way to capture this than 4 constants.

Furthermore, isn't this duplicative of the constants in WasbTranslatingFileIO as well as StorageType?

@@ -191,4 +191,12 @@ public static <T> Builder<T> builder() {
"If set to true, allows tables to be dropped with the purge parameter set to true.")
.defaultValue(true)
.build();

public static final PolarisConfiguration<Boolean> SUPPORT_WASB_CATALOG =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should re-examine this approach. It's not clear to me how this pattern can extend to having more than 2 schemes -- for example S3 has s3n, s3a, and s3. Also, conceivably, you might want to allow just abfss and not abfs or vice versa.

Copy link
Contributor Author

@tzuan16 tzuan16 Oct 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that the current approach is not at all extensible. I was hoping we have a test annotation like @WithConfig(configName="true") that could directly modify the config store, but that would require a larger test framework change that would fall out of scope of this PR. Currently the config store only contains getter methods and there's also no way to modify its content via code via a setter method, so I had to proceed with this pattern of initiating two DropwizardAppExtension instances.

Do you think we could wait until the test framework to be more interactable with the config store that we refactor this test?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually thought that test with the two DropwizardAppExtension was pretty good given what we have now. The suggestion you made for an annotation sounds great, but I definitely don't think that's a blocker for this PR.

My comment here is specifically about having a single boolean config specific to WASB (+WASBS) and how that doesn't scale.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know that we need to extend this configuration to support s3n, etc. I do think we could push the check into the StorageLocation class so that it can be responsible for tracking which scheme(s) it supports and which should be rejected at creation. So, e.g., the AzureLocation class can implement a canCreateNew() method or something and it can, internally, determine whether wasb is supported or not. Some future S3StorageLocation can track s3x URLs, though AFAIK there's no reason to explicitly add/deny support for any of those variations are supported by the existing Iceberg S3FileIO.

Copy link
Contributor

@eric-maynard eric-maynard Oct 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we probably should support it. Indeed, don't we already block http paths?

Also, why is this specific to catalog creation? The approach here just seems very specific to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that rationalizing this with the StorageLocation or FileIO or StorageConfigurationInfo or StorageType types is a good idea. Right now it seems like the logic for dealing with different schemes and file systems is not centralized enough.

.key("SUPPORT_WASB_CATALOG")
.description(
"If set to true, allows the creation of catalogs with storage locations using Azure WASB prefix.")
.defaultValue(false)
Copy link
Contributor

@eric-maynard eric-maynard Oct 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering we support FILE storage by default, I don't think we need to be so strict so as to not support WASB by default.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think FILE is different, as it works out of the box. wasb does not. In my mind, opting in to something that doesn't work quite right is different from opting out of something you don't intend to work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it the case that WASB does work out of the box for Polaris just like FILE, but the concern is that some clients might not be able to use credential vending with WASB (until the next Iceberg release)? Clients also can't use credential vending with FILE.

@@ -533,6 +534,24 @@ private boolean catalogOverlapsWithExistingCatalog(CatalogEntity catalogEntity)
});
}

private boolean catalogLocationsContainWasb(CatalogEntity catalogEntity) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quite a thin method and only used in one place. If we are going to make it its own method, can we at least make it more visible to test it?

@@ -903,7 +903,9 @@ components:

AzureStorageConfigInfo:
type: object
description: azure storage configuration info
description:
azure storage configuration info; base location supports abfs[s] schemes and only supports wasb[s] schemes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the code, it's not just the base location

@@ -28,9 +28,13 @@ public class AzureLocation extends StorageLocation {
private static final Pattern URI_PATTERN = Pattern.compile("^(abfss?|wasbs?)://([^/?#]+)(.*)?$");

public static final String ADLS_ENDPOINT = "dfs.core.windows.net";

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: spurious change?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants