Skip to content

Remove Java URI validations for Blob Storage providers #1604

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

adnanhemani
Copy link
Collaborator

@adnanhemani adnanhemani commented May 17, 2025

This is a retry at #1586, which I'll close if this PR is on the right direction instead.

Java URI does not actually apply any normalization to URIs if we do not call URI.normalize() (which we currently do not). Additionally, blob storage providers like S3 and GCS can provide ".." and "." as valid fragments in URLs - which Java URI would attempt to normalize incorrectly. As a result, attempting to validate and/or normalize URIs for blob storage providers using the Java URI class is the incorrect behavior. While we may want to add location validation via regex later, removing it first should at least unblock the bug we see in #1545.

@snazy
Copy link
Member

snazy commented May 19, 2025

Can you elaborate on when .. would be a valid in a path? For me, .. is a path-traversal in the presence of it raises the question "why is this not the source of a path-traversal-attack".

@snazy
Copy link
Member

snazy commented May 19, 2025

I'm also confused on having multiple PRs for the same issue.

@adnanhemani
Copy link
Collaborator Author

I'm also confused on having multiple PRs for the same issue.

Please see #1586 (comment) and #1586 (comment)

@adnanhemani
Copy link
Collaborator Author

Can you elaborate on when .. would be a valid in a path? For me, .. is a path-traversal in the presence of it raises the question "why is this not the source of a path-traversal-attack".

Blob Storage providers like S3 and GCS do not recognize ".." as a path traversal. If you store an object in S3 with the path "s3://abcd/a/b/../c", S3 stores the object exactly as such - it will not normalize the path into "s3://abcd/a/c". While it may be an anti-pattern to make such a location, we should not deviate from what the blob storage provider allows/does. The same can be said about multiple forward slashes together ("s3://abcd/a/b//c" is not the same as "s3://abcd/a/b/c").

As a result, we should only allow ".." and "." normalization in local filesystem calls where these can be resolved - and those are the code paths where Java URI is being used in this PR. WDYT?

@@ -44,8 +44,10 @@ protected StorageLocation(@Nonnull String location) {
this.location = URI.create(location.replaceFirst("file:/+", LOCAL_PATH_PREFIX)).toString();
} else if (location.startsWith("/")) {
this.location = URI.create(location.replaceFirst("/+", LOCAL_PATH_PREFIX)).toString();
} else {
} else if (location.startsWith(LOCAL_PATH_PREFIX)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if location is FILE:////////tmp/smth?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call out - fixed!

@@ -44,8 +44,10 @@ protected StorageLocation(@Nonnull String location) {
this.location = URI.create(location.replaceFirst("file:/+", LOCAL_PATH_PREFIX)).toString();
} else if (location.startsWith("/")) {
this.location = URI.create(location.replaceFirst("/+", LOCAL_PATH_PREFIX)).toString();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have tests for multiple leading / chars?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we do not, but am introducing in revision. Thanks!

@snazy
Copy link
Member

snazy commented May 20, 2025

In ADLS / is a literally a path separator, so I guess that has to be normalized.

Sure, in S3 + GCS the path is a plain string. But are patterns like /../ or /// in S3/GCS locations used in practice? I've seen usages of special characters like ", but not /../ or ///.

@dimas-b
Copy link
Contributor

dimas-b commented May 20, 2025

I've heard stories about /../ being used in S3 URI. I think it is preferable for Polaris not to interpret storage URI paths beyond determining "prefix" matches.

Assertions.assertThat(manySlashLeadingLocation.equals(standardLocation)).isTrue();
Assertions.assertThat(fileSingleSlashLocation.equals(standardLocation)).isTrue();
Assertions.assertThat(fileTooManySlashesLocation.equals(standardLocation)).isTrue();
Assertions.assertThat(standardLocation instanceof AzureLocation).isFalse();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the meaning of this assertion? Is it not preferable to assert exact expected type rather that asserting what the type is not?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good question! AzureLocation is an extension of StorageLocation (due to Azure's very specific way of defining locations). All other locations remain instances of StorageLocation. So if I test that any of these locations (including an instance of AzureLocation) are an instance of StorageLocation, they'll always return true.

This test is basically just testing that we are not creating AzureLocation subclassed objects for these paths.

Copy link
Contributor

@dimas-b dimas-b May 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cf. assertThat(...).isExactlyInstanceOf(...)

Copy link
Contributor

@dimas-b dimas-b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR looks like it provides incremental values as it avoids interpreting storage URIs as RFC-compliant URI. I'd be fine merging it.

I have one comment about test code, though.

@adnanhemani
Copy link
Collaborator Author

But are patterns like /../ or /// in S3/GCS locations used in practice?

Agreed with @dimas-b here. Although users should probably not be doing this, I'd prefer we not make Polaris opinionated on this (since we have no good reason to do so).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants