Spec: Add SigV4 Auth Support for Catalog Federation #1506
Conversation
I have the same comments as @dimas-b and @singhpk234 - but this should be good for now!
Hi folks, any other concerns about this spec change? There is one remaining thing we haven't made a decision on:
1. From env vars or server config, e.g.:
2. Configured via the Polaris Management API: Stick to
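For illustration, option 1 could look roughly like the snippet below; every key name here is hypothetical and not something this PR defines:

```yaml
# Hypothetical server-side configuration for the Polaris service identity.
# None of these keys are defined by this PR; they only illustrate option 1.
polaris-service-identity:
  aws-iam:
    # Long-lived credential representing the Polaris service itself,
    # typically injected from environment variables.
    access-key-id: ${POLARIS_SERVICE_IDENTITY_ACCESS_KEY_ID}
    secret-access-key: ${POLARIS_SERVICE_IDENTITY_SECRET_ACCESS_KEY}
```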
@XJDKC - sorry, I know I had previously approved this proposal, but reading through some of the comments, I'm questioning something: why do we require an IAM User for Polaris? My understanding is that today, Polaris already has access to some IAM Role credentials (through the Default Credential Provider) - why do we need to introduce the new complexity of an IAM User? If we assume that Polaris has access to the IAM Role, then we no longer need the workaround for cross-account access either. Can you clarify this bit? On non-STS SigV4 authN, I don't really have a strong preference - and I don't think adopting any of the options makes this PR a one-way door. So I'm okay to punt on it for now.
Hey @adnanhemani, there is an ongoing discussion about this PR: https://lists.apache.org/thread/rlbxvw0xmzvlfm7pdh97bs3xvq7o8lmy This thread also contains some details about it: #1506 (comment) In short, to get IAM Role credentials, Polaris needs to use a long-lived credential (representing the Polaris service identity) to assume the role (provided by Polaris users) and obtain a temporary credential. This is the solution AWS IAM provides for AWS account A (Polaris users) to grant permissions to AWS account B (the Polaris service provider).
I believe the STS credentials concern is not specific to this PR. We can probably deal with that in another PR.
LGTM given the consistency with StorageConfig, though I agree we should probably refine the way we convey "output-only" concepts like the userArn, and get agreement with other reviewers on how best to convey that distinction before merging this PR (maybe simply annotating with the openApi field readOnly is enough, or we could go further and move the readOnly fields into a separate tuple such as vendorInfo like you suggested elsewhere).
Hi folks, I've updated the spec to extract the Polaris service identity info into a nested object. Right now, I've added support for two types of service identities:
The service identity info is part of the SigV4AuthenticationParameters; it's read-only and will be surfaced to Polaris users in the response of
Looking ahead, we may introduce additional service identity types (e.g. GCP Service Account) and aim to unify the behavior between storage config and connection config for consistency. I have drafted a proposal for credential management in Apache Polaris and would love to hear your thoughts, feedback, and suggestions. cc: @dennishuo @dimas-b
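As a rough, non-normative sketch of that nesting (the spec file in this PR is the source of truth; the allOf wrapper is just one common way to mark a $ref'ed property as readOnly in OpenAPI):

```yaml
# Sketch only; see spec/polaris-management-service.yml in this PR for the
# authoritative schema.
SigV4AuthenticationParameters:
  type: object
  properties:
    roleArn:
      type: string
      description: The AWS IAM role that Polaris assumes to sign requests
    roleSessionName:
      type: string
    externalId:
      type: string
    signingRegion:
      type: string
    signingName:
      type: string
    serviceIdentity:
      readOnly: true
      allOf:
        - $ref: '#/components/schemas/ServiceIdentityInfo'
```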
LGTM after a quick glance. I'll take a more thorough read on Mon. Hope there's no rush :)
Just wanted to give a warm heads-up on this PR: should we consider making serviceIdentityInfo a top-level property of
spec/polaris-management-service.yml
serviceIdentity:
  $ref: '#/components/schemas/ServiceIdentityInfo'
As @XJDKC mentioned in the main comment thread, it may be worth taking this to a higher level...
Do we envision one ServiceIdentityInfo for the whole Realm, or perhaps one ServiceIdentityInfo for each Catalog?
From my POV the latter case is possible, but I wonder what other people think about its practical usefulness.
I suppose the former case is more relevant, so it does make sense to expose ServiceIdentityInfo as a top-level entity... Perhaps we can have a separate Management API endpoint for it (e.g. "/configuration")?
Hey Dmitri, I discussed the proposal of adding a separate set of management APIs for handling service identity info in my design doc (topic 2, proposal 3): Apache Polaris Creds Management Proposal
Here are the pros and cons:
- Pros
  - Clean separation of concerns: Identity info isn't mixed into catalog or storage config anymore.
  - Simplifies catalog schema: Keeps it focused strictly on user input.
  - More extensible: We can evolve the service identity model without touching the catalog schema or breaking clients.
  - Flexible: Works well for both self-managed and SaaS-style deployments.
- Cons
  - Requires extra coordination: Users must make an additional API call to fetch identity info and coordinate across two APIs for full setup.
  - Less cohesive: Identity context is no longer colocated with the catalog entity. This assumes Polaris uses the same identity across all catalogs in a given realm.
  - Not aligned with current behavior: Today, identity fields are visible in the catalog response; this approach would change that.
  - Dynamic fields challenge: Some service-managed fields like consentUrl depend on user-provided input (e.g., Azure tenant ID) and can't be precomputed globally; they need to be generated after Polaris receives the config.
Also, some specific vendors/users may want to use different service identities to access different external services,
e.g. use SigV4 auth to access Glue / Amazon API Gateway (hosting Polaris behind the API Gateway), while the table data is stored in Azure Blob (or S3-compatible storage).
In that case, would it be better to include serviceIdentityInfo as a top-level field in both the storage config and the connection config (or inside the authentication params object of the connection config)? That way, the scope is clearly limited to the relevant config, reducing the blast radius and keeping things more modular.
Hey Dmitri, just a warm heads up, given the cons and potential issues I posted, if moving the service identity to the top level is the final request, I'll go ahead and make that update. Let me know if you have any thoughts!
Thanks for the analysis, @XJDKC !
The "Dynamic fields challenge" is very interesting. If I understand correctly, the user first creates an external catalog with some connection config details. The server processes that information and returns its identity information.
WDYT about making ServiceIdentityInfo a read-only sibling property of connectionConfigInfo in ExternalCatalog?
From my POV that approach still achieves separation of concerns between user-provided and server-provided data (server-provided data is not embedded inside user-provided data). At the same time, it allows for different server identities for each catalog.
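If I'm reading the suggestion right, the shape would be roughly the following sketch; the schema names mirror the existing ones, but the details are only illustrative:

```yaml
# Illustrative sketch of the suggested shape; not the exact schema text.
ExternalCatalog:
  type: object
  properties:
    connectionConfigInfo:
      $ref: '#/components/schemas/ConnectionConfigInfo'
    serviceIdentityInfo:
      readOnly: true
      allOf:
        - $ref: '#/components/schemas/ServiceIdentityInfo'
```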
Hey @dimas-b, there's another concern to consider here: for a specific catalog, Polaris might use different service identities to access different external services, e.g.
use SigV4 auth to access Glue or Amazon API Gateway (hosting Polaris behind the gateway), while the actual table data is stored in Azure Blob or S3-compatible storage.
In this case, Polaris would use one AWS identity (e.g., an IAM user) to access the remote catalog server, and a different identity to access the storage.
To support this kind of setup cleanly, it would make sense to define separate service identities for both the storage config and the connection config. This also helps maintain backward compatibility with the existing design.
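To make that scenario concrete, a purely hypothetical catalog payload with one identity per config might look like this (field names and enum values are illustrative, not part of this PR):

```yaml
# Purely hypothetical catalog payload illustrating one identity per config;
# field names and enum values here are not normative.
catalog:
  name: federated_catalog
  connectionConfigInfo:
    authenticationParameters:
      authenticationType: SIGV4
      roleArn: arn:aws:iam::111122223333:role/remote-catalog-access-role
    serviceIdentity:                      # identity for the remote catalog
      identityType: AWS_IAM
      userArn: arn:aws:iam::444455556666:user/polaris-service-user
  storageConfigInfo:
    storageType: AZURE                    # table data in Azure Blob
    serviceIdentity:                      # separate identity for storage
      identityType: AZURE_SERVICE_PRINCIPAL
```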
Nope, the serviceInfoAuthLocatorUrn is not provided by the end user. All the properties in serviceIdentity should be provided by Polaris itself. The difference is:
- serviceArn is surfaced to Polaris users so that they can add it (the IAM user ARN) to the trust relationship of their IAM role.
- serviceInfoAuthLocatorUrn is used to look up the IAM user's credential from the secret manager. It's owned by the Polaris service; Polaris will assign an IAM user to a specific Catalog entity for multi-tenant Polaris deployments. It won't be surfaced to users.
In this case, should serviceInfoAuthLocatorUrn be exposed in the user-facing API?
As I understand, it does not expose actual credentials, just a reference to them, but this information seems to be irrelevant for users of the API 🤔
We could still keep it in ServiceIdentityDpo in the Polaris Server, just not expose it in the API. WDYT?
Yes, that's my plan: serviceInfoAuthLocatorUrn will only exist in ServiceIdentityDpo and won't be included in the ServiceIdentityInfo in the API response.
Since this is purely internal and only Polaris needs it to retrieve the service identity's credentials (e.g., from a secret manager or credential store), it's best to keep it hidden from end users. That way, we reduce exposure of implementation details and avoid potential security risks or confusion.
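For clarity, a rough sketch of the split, with illustrative field values:

```yaml
# Sketch: what the API response could surface vs. what stays server-side.
# Field names and values are illustrative.
serviceIdentity:
  identityType: AWS_IAM
  userArn: arn:aws:iam::111122223333:user/polaris-service-user
# Kept only in ServiceIdentityDpo on the server, never returned by the API:
#   serviceInfoAuthLocatorUrn -> reference used to fetch the IAM user's
#   credential from the secret manager / credential store
```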
Thanks for the detailed explanations, much appreciated!
Given that the location URN is not surfaced to users, I think that this option (2) would have my preference. But we should maybe look into migrating the storage config to use the same system eventually.
Yes, in the long term, we should use the same system for both storage config and connection config.
I discussed this in the proposal, feel free to take a look if you're interested (though it's a bit outdated based on the latest discussions, and I'll update it soon): Apache Polaris Creds Management Proposal
spec change LGTM. Thanks @XJDKC !
Let's keep the PR open for a few more days in case other reviewers have more comments.
LGTM
Thanks everyone for reviewing this PR! I really appreciate the thoughtful comments, which helped me think more thoroughly about the problem and refine the proposal to cover more use cases. I'll leave this PR open until Monday to allow any final feedback. After that, I'll start implementing based on the discussions and the updated spec. The initial implementation will focus on SigV4 Auth support for Catalog Federation. Longer term, I hope we can move toward a unified model for how Polaris handles both storage and connection credential management across deployment modes.
I propose to merge on May 20
Context
Prior Catalog Federation work:
Reference
Below are all the references mentioned in this PR:
polaris/polaris-core/src/main/java/org/apache/polaris/core/storage/aws/AwsCredentialsStorageIntegration.java
Line 68 in ca35bba
Description
This PR adds SigV4 Auth support so that Polaris can federate to Glue IRC / S3Tables / a catalog server hosted behind the AWS API Gateway.
This PR is based on this new proposal for creds management in Polaris:
Apache Polaris Creds Management Proposal
Details
User-provided properties:
- roleArn: the AWS IAM role ARN; Polaris will assume the role to get a temporary AWS session credential
- roleSessionName: the role session name to be used by the SigV4 protocol for signing requests
- externalId: an optional external ID used to establish a trust relationship with AWS in the trust policy
- signingRegion: the region to be used by the SigV4 protocol for signing requests
- signingName: the service name to be used by the SigV4 protocol for signing requests; the default signing name is "execute-api" if not provided. Examples:
  - glue
  - s3tables
  - execute-api
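A minimal, hypothetical request fragment using these properties (values are placeholders):

```yaml
# Hypothetical user-provided SigV4 authentication parameters; values are
# placeholders, not defaults.
authenticationParameters:
  authenticationType: SIGV4
  roleArn: arn:aws:iam::111122223333:role/polaris-federation-role
  roleSessionName: polaris-federation-session
  externalId: my-external-id        # optional
  signingRegion: us-west-2
  signingName: glue                 # defaults to execute-api if omitted
```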
Service-provided properties:
- serviceIdentity: it's a nested object. In this PR, I only added two identity types:
  - userArn: the AWS user ARN used to assume the AWS role; this represents the Polaris service itself. Overall steps:
- rest.signing-region
- rest.signing-name
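And, under the same assumptions, a hypothetical response fragment showing the service-provided part alongside the signing properties handed to the federated REST client:

```yaml
# Hypothetical response fragment: the read-only identity added by Polaris,
# plus the signing properties that end up on the federated REST client.
authenticationParameters:
  authenticationType: SIGV4
  serviceIdentity:
    identityType: AWS_IAM
    userArn: arn:aws:iam::111122223333:user/polaris-service-user
# Iceberg REST client properties derived from the config (illustrative):
#   rest.signing-region: us-west-2
#   rest.signing-name: glue
```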