-
Notifications
You must be signed in to change notification settings - Fork 125
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Block the creation of catalog with WASB[S] locations #376
base: main
Are you sure you want to change the base?
Conversation
6b6438d
to
2e10e01
Compare
e1189cb
to
12b8245
Compare
public static final String ABFS_SCHEME = "abfs://"; | ||
public static final String ABFSS_SCHEME = "abfss://"; | ||
public static final String WASB_SCHEME = "wasb://"; | ||
public static final String WASBS_SCHEME = "wasbs://"; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typically the scheme
is not considered to include the ://
. I also think there might be a more extensible way to capture this than 4 constants.
Furthermore, isn't this duplicative of the constants in WasbTranslatingFileIO
as well as StorageType
?
@@ -191,4 +191,12 @@ public static <T> Builder<T> builder() { | |||
"If set to true, allows tables to be dropped with the purge parameter set to true.") | |||
.defaultValue(true) | |||
.build(); | |||
|
|||
public static final PolarisConfiguration<Boolean> SUPPORT_WASB_CATALOG = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should re-examine this approach. It's not clear to me how this pattern can extend to having more than 2 schemes -- for example S3 has s3n
, s3a
, and s3
. Also, conceivably, you might want to allow just abfss
and not abfs
or vice versa.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree that the current approach is not at all extensible. I was hoping we have a test annotation like @WithConfig(configName="true")
that could directly modify the config store, but that would require a larger test framework change that would fall out of scope of this PR. Currently the config store only contains getter methods and there's also no way to modify its content via code via a setter method, so I had to proceed with this pattern of initiating two DropwizardAppExtension
instances.
Do you think we could wait until the test framework to be more interactable with the config store that we refactor this test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually thought that test with the two DropwizardAppExtension
was pretty good given what we have now. The suggestion you made for an annotation sounds great, but I definitely don't think that's a blocker for this PR.
My comment here is specifically about having a single boolean config specific to WASB (+WASBS) and how that doesn't scale.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know that we need to extend this configuration to support s3n
, etc. I do think we could push the check into the StorageLocation
class so that it can be responsible for tracking which scheme(s) it supports and which should be rejected at creation. So, e.g., the AzureLocation
class can implement a canCreateNew()
method or something and it can, internally, determine whether wasb
is supported or not. Some future S3StorageLocation
can track s3x
URLs, though AFAIK there's no reason to explicitly add/deny support for any of those variations are supported by the existing Iceberg S3FileIO
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we probably should support it. Indeed, don't we already block http
paths?
Also, why is this specific to catalog creation? The approach here just seems very specific to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that rationalizing this with the StorageLocation
or FileIO
or StorageConfigurationInfo
or StorageType
types is a good idea. Right now it seems like the logic for dealing with different schemes and file systems is not centralized enough.
.key("SUPPORT_WASB_CATALOG") | ||
.description( | ||
"If set to true, allows the creation of catalogs with storage locations using Azure WASB prefix.") | ||
.defaultValue(false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Considering we support FILE
storage by default, I don't think we need to be so strict so as to not support WASB by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think FILE
is different, as it works out of the box. wasb
does not. In my mind, opting in to something that doesn't work quite right is different from opting out of something you don't intend to work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't it the case that WASB does work out of the box for Polaris just like FILE
, but the concern is that some clients might not be able to use credential vending with WASB (until the next Iceberg release)? Clients also can't use credential vending with FILE
.
@@ -533,6 +534,24 @@ private boolean catalogOverlapsWithExistingCatalog(CatalogEntity catalogEntity) | |||
}); | |||
} | |||
|
|||
private boolean catalogLocationsContainWasb(CatalogEntity catalogEntity) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quite a thin method and only used in one place. If we are going to make it its own method, can we at least make it more visible to test it?
@@ -903,7 +903,9 @@ components: | |||
|
|||
AzureStorageConfigInfo: | |||
type: object | |||
description: azure storage configuration info | |||
description: | |||
azure storage configuration info; base location supports abfs[s] schemes and only supports wasb[s] schemes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the code, it's not just the base location
@@ -28,9 +28,13 @@ public class AzureLocation extends StorageLocation { | |||
private static final Pattern URI_PATTERN = Pattern.compile("^(abfss?|wasbs?)://([^/?#]+)(.*)?$"); | |||
|
|||
public static final String ADLS_ENDPOINT = "dfs.core.windows.net"; | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: spurious change?
Description
Polaris supports creating catalogs with the deprecated wasbs:// scheme (as opposed to the newer abfss://). However, the current implementation of Iceberg SDK's default ResolvingFileIO would fallback to using the HadoopFileIO when it's provided a wasbs:// path, where the fileIO does not support using vended credentials to read/write to Azure location, as opposed to the ADLSFileIO which supports credential vending and is used for abfss:// paths. This results in when external engines (e.g. Spark) are provided by Polaris a wasbs:// path and a token to access the location, the token cannot be accepted by the fallback HadoopFileIO.
In short, despite Polaris being capable of vending credentials for wasbs:// paths, external engines using the Iceberg SDK for fileIO implementation (which generally is the case) couldn't pick up the vended credentials, and users would need to provide the external engines their own credentials to perform I/O on the location.
This holds until the proposed fix from the Iceberg community is released and users adopt the Iceberg SDK version that contains the fix.
Until then, Polaris would provide a configurable parameter to decide whether to support the creation of wasbs:// pathed catalogs, and the default value would be False (no support).
Type of change
Please delete options that are not relevant.
How Has This Been Tested?
Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration
Test Configuration:
Checklist:
Please delete options that are not relevant.