Skip to content

Spec: Add SigV4 Auth Support for Catalog Federation #1506

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
May 20, 2025

Conversation

XJDKC
Copy link
Member

@XJDKC XJDKC commented May 1, 2025

Context

Prior Catalog Federation work:

Reference

Below are all the reference mentioned in this PR:

Description

This PR adds the SigV4 Auth Support so that Polaris can federate to Glue IRC / S3Tables / catalog server hosted behind the AWS API Gateway.

This PR is based on this new proposal for creds management in Polaris:

Apache Polaris Creds Management Proposal

Details

User-provided properties:

  • roleArn: the AWS IAM role urn, polaris will assume the role to get a tmp aws session credential
  • roleSessionName: The role session name to be used by the SigV4 protocol for signing requests
  • externalId: An optional external id used to establish a trust relationship with AWS in the trust policy
  • signingRegion: Region to be used by the SigV4 protocol for signing requests
  • signingName: The service name to be used by the SigV4 protocol for signing requests, the default signing name is "execute-api" is if not provided
    • AWS Glue: glue
    • AWS S3Tables: s3tables
    • AWS API Gateway: execute-api

Service-provided properties:

  • serviceIdentity: it's a nested object. In this PR, I only added two identity types:
    • AWS_IAM_USER: Polaris is configured with long-lived AWS credentials from an IAM user. It uses these credentials to assume the IAM role provided by Polaris users, then accesses the remote catalog using the subscoped temp credentials.
      • userArn: The aws user arn used to assume the aws role, this represents the polaris service itself, overall steps:
        • Step 1: Polaris will use an user aws credential to assume the IAM role, this aws credential is configured for polaris, usually polaris will read it from environment variable
        • Step 2: After assuming the role, polaris will get the tmp aws credential, polaris will put these credentials into the prop map, this prop map will be used to initialize the REST Catalog Client, properties can be found in AwsProperties:
          • rest.signing-region
          • rest.signing-name
        • Step 3: RESTSigV4Signer will sign the requests based on SigV4 protocol
    • AWS_IAM_ROLE: Polaris retrieves temporary credentials via Amazon ECS container credentials or Amazon EC2 instance IAM role credentials, then assumes the polaris user-provided IAM role to get new temporary credentials to access the remote catalog.

Copy link
Collaborator

@adnanhemani adnanhemani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the same comments as @dimas-b and @singhpk234 - but this should be good for now!

@XJDKC
Copy link
Member Author

XJDKC commented May 5, 2025

Hi folks, any other concerns about this spec change?

There is one remaining thing we haven't made a decision:
For non-STS based SigV4 authentication, how do we plan to support it?

1. From env vars or server config, e.g.:

  • POLARIS_IAM_USER_AWS_ACCESS_KEY_ID
  • POLARIS_IAM_USER_AWS_SECRET_ACCESS_KEY
  • POLARIS_IAM_USER_ARN
    In this case, roleArn would not be required.

2. Configured via the Polaris Management API: Stick to SigV4AuthenticationParameters

If we stick with the existing SigV4AuthenticationParameters type, we could:

  • Make roleArn optional
  • Add iamUserAwsAccessKeyId and iamUserAwsSecretAccessKey as optional fields

3. Configured via the Polaris Management API: Add new auth type

We could create a new type to distinguish clearly:

  • New AuthenticationType enum: SIGV4_STS, SIGV4_STATIC_CREDS

4. Configured via the Polaris Management API: Add new auth types

We could create a new sub type to distinguish clearly:
e.g. new subtype under SigV4AuthenticationParameters: STS, CREDS

Do you have any perference or is the current proposal good to go?

@adnanhemani
Copy link
Collaborator

@XJDKC - sorry, I know I had previously approved this proposal but reading through some of the comments, I'm questioning something:

Why do we require a IAM User for Polaris? My understanding is that today, Polaris already has access to some IAM Role credentials (through the Default Credential Provider) - why do we need to introduce the new complexity of an IAM User? If we assume that Polaris has access to the IAM Role, then we no longer need the workaround for cross-account access as well.

Can you clarify on this bit?

On non-STS SigV4 authN, I don't really have a strong preference - and I don't think adopting any of the options constitutes this PR as a one-way door. So I'm okay to punt on it for now.

@XJDKC
Copy link
Member Author

XJDKC commented May 5, 2025

@XJDKC - sorry, I know I had previously approved this proposal but reading through some of the comments, I'm questioning something:

Why do we require a IAM User for Polaris? My understanding is that today, Polaris already has access to some IAM Role credentials (through the Default Credential Provider) - why do we need to introduce the new complexity of an IAM User? If we assume that Polaris has access to the IAM Role, then we no longer need the workaround for cross-account access as well.

Can you clarify on this bit?

On non-STS SigV4 authN, I don't really have a strong preference - and I don't think adopting any of the options constitutes this PR as a one-way door. So I'm okay to punt on it for now.

Hey @adnanhemani, there is an on-going discussion about this PR: https://lists.apache.org/thread/rlbxvw0xmzvlfm7pdh97bs3xvq7o8lmy

This thread also contains some details about it: #1506 (comment)

In short, to get an IAM Role credentials, polaris needs to use a long-lived credential (represents the polaris service identity) to assume the role (provided by polaris users) and get a temp credential. This is the solution provided by AWS IAM to grant permissions to AWS account B (polaris service provider) from AWS account A (polaris users).

@dimas-b
Copy link
Contributor

dimas-b commented May 5, 2025

I believe the STS credentials concern is not specific to this PR. We can probably deal with that in another PR.

dennishuo
dennishuo previously approved these changes May 6, 2025
Copy link
Contributor

@dennishuo dennishuo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM given the consistency with StorageConfig, though I agree we should probably refine the way we convey "output-only" concepts like the userArn, and get agreement with other reviewers on how best to convey that distinction before merging this PR (maybe simply annotating with the openApi field readOnly is enough, or we could go further and move the readOnly fields into a separate tuple such as vendorInfo like you suggested elsewhere).

@github-project-automation github-project-automation bot moved this from PRs In Progress to Ready to merge in Basic Kanban Board May 6, 2025
@XJDKC
Copy link
Member Author

XJDKC commented May 9, 2025

Hi folks, I've updated the spec to extract the Polaris service identity info into a nested object.

Right now, I've added support for two types of service identities:

  • AWS_IAM_USER: Polaris is configured with long-lived AWS credentials from an IAM user. It uses these credentials to assume the IAM role provided by Polaris users, then accesses the remote catalog using the subscoped temp credentials.
  • AWS_IAM_ROLE: Polaris retrieves temporary credentials via Amazon ECS container credentials or Amazon EC2 instance IAM role credentials, then assumes the polaris user-provided IAM role to get new temporary credentials to access the remote catalog.

The service identity info is part of the SigV4AuthenticationParameters, it's read only and will be surfaced to polaris users in the response of getCatalog / listCatalogs / updateCatalog

Looking ahead, we may introduce additional service identity types (e.g. GCP Service Account) and aim to unify the behavior between storage config and connection config for consistency.

I have drafted a proposal for the credential management in Apache Polaris, would love to hear your thoughts, feedback, and suggestions.
Apache Polaris Creds Management Proposal

cc: @dennishuo @dimas-b

Copy link
Contributor

@dimas-b dimas-b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM after a quick glance. I'll take a more thorough read on Mon. Hope there's no rush :)

@XJDKC
Copy link
Member Author

XJDKC commented May 13, 2025

Just wanted to give a warm heads-up on this PR: should we consider making serviceIdentityInfo a top-level property of ConnectionConfigInfo / AuthenticationParameters / StorageConfigInfo? Right now, I put it inside the SigV4AuthenticationParameters.

Comment on lines 969 to 970
serviceIdentity:
$ref: '#/components/schemas/ServiceIdentityInfo'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @XJDKC mentioned in the main comment thread, it may be worth taking this to a higher-level...

Do we envision on ServiceIdentityInfo for the whole Realm, or perhaps one ServiceIdentityInfo for each Catalog?

From my POV the latter case is possible, but I wonder what other people this about its practical usefulness.

I suppose the former case is more relevant, so it does make sense to expose ServiceIdentityInfo as a top-level entity... Perhaps we can have a separate Management API endpoint for it (e.g. "/configuration")?

Copy link
Member Author

@XJDKC XJDKC May 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Dmitri, I discussed the proposal of adding a separate set of management APIs for handling service identity info in my design doc (topic 2, proposal 3): Apache Polaris Creds Management Proposal

Here is the pros and cons:

  • Pros
    • Clean separation of concerns: Identity info isn’t mixed into catalog or storage config anymore.
    • Simplifies catalog schema: Keeps it focused strictly on user input.
    • More extensible: We can evolve the service identity model without touching the catalog schema or breaking clients.
    • Flexible: Works well for both self-managed and SaaS-style deployments.
  • Cons
    • Requires extra coordination: Users must make an additional API call to fetch identity info and coordinate across two APIs for full setup.
    • Less cohesive: Identity context is no longer colocated with the catalog entity. This assumes Polaris uses the same identity across all catalogs in a given realm.
    • Not aligned with current behavior: Today, identity fields are visible in the catalog response, this approach would change that.
    • Dynamic fields challenge: Some service-managed fields like consentUrl depend on user-provided input (e.g., Azure tenant ID) and can’t be precomputed globally, they need to be generated after Polaris receives the config.

Also, some specific vendors/users may want to use different service identities to access different external services.
e.g. use SIGV4 auth to access Glue/Amazon API Gateway (host polaris behind the API Gateway), but the table is stored in azure blob (or s3 comp storage).

In that case, would it be better to include serviceIdentityInfo as a top-level field in both the storage config and connection config (or inside the authentication params object of connection config)? That way, the scope is clearly limited to the relevant config, reducing the blast radius and keeping things more modular.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Dmitri, just a warm heads up, given the cons and potential issues I posted, if moving the service identity to the top level is the final request, I'll go ahead and make that update. Let me know if you have any thoughts!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the analysis, @XJDKC !

The "Dynamic fields challenge" is very interesting. If I understand correctly, the user first creates an external catalog with some connection config details. The server processes that information and returns its identity information.

WDYT about making ServiceIdentityInfo a read-only sibling property to connectionConfigInfo in ExternalCatalog?

From my POV that approach still achieves separation of concerns between user-provided and server-provided data (server-provided in not embedded inside user-provided). At the same time, it allows for different server identities for each catalog.

Copy link
Member Author

@XJDKC XJDKC May 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @dimas-b, there's another concern to consider here: for a specific catalog, Polaris might use different service identities to access different external services. e.g.

Use SIGV4 auth to access Glue or Amazon API Gateway (hosting Polaris behind the gateway), while the actual table data is stored in Azure Blob or an S3-compatible storage.

In this case, Polaris would use one AWS identity (e.g., an IAM user) to access the remote catalog server, and a different identity to access the storage.

To support this kind of setup cleanly, it would make sense to define separate service identities for both the storage config and the connection config. This also helps maintain backward compatibility with the existing design.

Copy link
Member Author

@XJDKC XJDKC May 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, the serviceInfoAuthLocatorUrn is not provided by end user. All the properties in serviceIdentity should be provided by polaris itself. The different is:
serviceArn is surfaced to polaris users so that they can add it (IAM user arn) to their trust relationship of IAM role.
serviceInfoAuthLocatorUrn: is used to look up the IAM user's credential from the secret manager, it's owned by polaris services, polaris will assign an IAM user for a specific Catalog Entity for multi-tenant polaris deployment. It won't be surfaced to users.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case should serviceInfoAuthLocatorUrn be exposed in the user-facing API?

As I understand, it does not expose actual credentials, just a reference to them, but this information seems to be irrelevant for users of the API 🤔

We could still keep it in ServiceIdentityDpo in the Polaris Server, just not expose in API. WDYT?

Copy link
Member Author

@XJDKC XJDKC May 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's my plan, serviceInfoAuthLocatorUrn will only exist in ServiceIdentityDpo and won't be included in the ServiceIdentityInfo in the API response.

Since this is purely internal and only Polaris needs it to retrieve service identity's credentials (e.g., from a secret manager or credential store), it's best to keep it hidden from end users. That way, we reduce exposure of implementation details and avoid potential security risks or confusion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed explanations, much appreciated!

Given that the location URN is not surfaced to users, I think that this option (2) would have my preference. But we should maybe look into migrating the storage config to use the same system eventually.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in the long term, we should use the same system for both storage config and connection config.

I discussed this in the proposal, feel free to take a look if you're interested (though it's a bit outdated based on the latest discussions, and I'll update it soon).: Apache Polaris Creds Management Proposal

Copy link
Contributor

@dimas-b dimas-b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spec change LGTM. Thanks @XJDKC !

Let's keep the PR open for a few more days in case other reviews have more comments.

Copy link
Contributor

@dennishuo dennishuo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@XJDKC
Copy link
Member Author

XJDKC commented May 16, 2025

Thanks everyone for reviewing this PR! I really appreciate the thoughtful comments which helped me think more thoroughly about the problem and refine the proposal to cover more use cases.

I'll leave this PR open until Monday to allow any final feedback. After that, I'll start implementing based on the discussions and the updated spec.

The initial implementation will focus on SigV4 Auth support for Catalog Federation. Longer term, I hope we can move toward a unified model for how Polaris handles both storage and connection credential management across deployment modes.

@dimas-b
Copy link
Contributor

dimas-b commented May 19, 2025

I propose to merge on May 20

@dimas-b dimas-b merged commit 24315cf into apache:main May 20, 2025
6 checks passed
@github-project-automation github-project-automation bot moved this from Ready to merge to Done in Basic Kanban Board May 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants