-
Notifications
You must be signed in to change notification settings - Fork 238
Remove Java URI validations for Blob Storage providers #1604
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Can you elaborate on when |
I'm also confused on having multiple PRs for the same issue. |
Please see #1586 (comment) and #1586 (comment) |
Blob Storage providers like S3 and GCS do not recognize ".." as a path traversal. If you store an object in S3 with the path "s3://abcd/a/b/../c", S3 stores the object exactly as such - it will not normalize the path into "s3://abcd/a/c". While it may be an anti-pattern to make such a location, we should not deviate from what the blob storage provider allows/does. The same can be said about multiple forward slashes together ("s3://abcd/a/b//c" is not the same as "s3://abcd/a/b/c"). As a result, we should only allow ".." and "." normalization in local filesystem calls where these can be resolved - and those are the code paths where Java URI is being used in this PR. WDYT? |
@@ -44,8 +44,10 @@ protected StorageLocation(@Nonnull String location) { | |||
this.location = URI.create(location.replaceFirst("file:/+", LOCAL_PATH_PREFIX)).toString(); | |||
} else if (location.startsWith("/")) { | |||
this.location = URI.create(location.replaceFirst("/+", LOCAL_PATH_PREFIX)).toString(); | |||
} else { | |||
} else if (location.startsWith(LOCAL_PATH_PREFIX)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if location is FILE:////////tmp/smth
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call out - fixed!
@@ -44,8 +44,10 @@ protected StorageLocation(@Nonnull String location) { | |||
this.location = URI.create(location.replaceFirst("file:/+", LOCAL_PATH_PREFIX)).toString(); | |||
} else if (location.startsWith("/")) { | |||
this.location = URI.create(location.replaceFirst("/+", LOCAL_PATH_PREFIX)).toString(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have tests for multiple leading /
chars?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, we do not, but am introducing in revision. Thanks!
In ADLS Sure, in S3 + GCS the path is a plain string. But are patterns like |
I've heard stories about |
Assertions.assertThat(manySlashLeadingLocation.equals(standardLocation)).isTrue(); | ||
Assertions.assertThat(fileSingleSlashLocation.equals(standardLocation)).isTrue(); | ||
Assertions.assertThat(fileTooManySlashesLocation.equals(standardLocation)).isTrue(); | ||
Assertions.assertThat(standardLocation instanceof AzureLocation).isFalse(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the meaning of this assertion? Is it not preferable to assert exact expected type rather that asserting what the type is not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good question! AzureLocation
is an extension of StorageLocation
(due to Azure's very specific way of defining locations). All other locations remain instances of StorageLocation
. So if I test that any of these locations (including an instance of AzureLocation) are an instance of StorageLocation
, they'll always return true.
This test is basically just testing that we are not creating AzureLocation
subclassed objects for these paths.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cf. assertThat(...).isExactlyInstanceOf(...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR looks like it provides incremental values as it avoids interpreting storage URIs as RFC-compliant URI. I'd be fine merging it.
I have one comment about test code, though.
Agreed with @dimas-b here. Although users should probably not be doing this, I'd prefer we not make Polaris opinionated on this (since we have no good reason to do so). |
This is a retry at #1586, which I'll close if this PR is on the right direction instead.
Java URI does not actually apply any normalization to URIs if we do not call
URI.normalize()
(which we currently do not). Additionally, blob storage providers like S3 and GCS can provide ".." and "." as valid fragments in URLs - which Java URI would attempt to normalize incorrectly. As a result, attempting to validate and/or normalize URIs for blob storage providers using the Java URI class is the incorrect behavior. While we may want to add location validation via regex later, removing it first should at least unblock the bug we see in #1545.