Two-token mechanism for task execution to prevent token expiration while tasks wait in executor queues #60108

Open
anishgirianish wants to merge 11 commits into apache:main from anishgirianish:fix/token-expiration-worker

@anishgirianish
Contributor

@anishgirianish anishgirianish commented Jan 4, 2026


Summary

Tasks waiting in executor queues (Celery, Kubernetes) can have their JWT tokens expire before execution starts, causing auth failures on the Execution API. This is a real problem in production: when queues back up or workers are slow to pick up tasks, the original short-lived token expires and the worker gets a 403 when it finally tries to start the task.

Fixes: #53713
Related: #59553
closes: #62129

Approach

Two-token mechanism: a long-lived workload token (24h default, configurable) travels with the task through the queue, and a short-lived execution token is issued when the task actually starts running.

The workload token carries a scope: "workload" claim and is restricted to the /run endpoint only, enforced via FastAPI SecurityScopes and a custom ExecutionAPIRoute. When /run succeeds, it returns an execution token via Refreshed-API-Token header. The SDK client picks it up and uses it for all subsequent API calls. The existing JWTReissueMiddleware handles refreshing execution tokens near expiry and skips workload tokens.
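To make the two token kinds concrete, here is a minimal stdlib-only sketch of signing and verifying tokens that differ only in their scope claim and lifetime. It is not the PR's JWTGenerator: the `issue_token` and `decode_token` helpers, the `SECRET` key, the HS256 signing, and the 600 s execution lifetime are all assumptions made for illustration.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

SECRET = b"demo-secret"  # hypothetical signing key, for illustration only


def _b64(data: bytes) -> str:
    """URL-safe base64 without padding, as used in compact JWTs."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _unb64(text: str) -> bytes:
    """Restore the stripped padding before decoding."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str) -> str:
    return _b64(hmac.new(SECRET, signing_input.encode(), hashlib.sha256).digest())


def issue_token(sub: str, scope: str, valid_for: int, now: Optional[float] = None) -> str:
    """Sign a compact HS256 token carrying sub, scope and exp claims."""
    now = time.time() if now is None else now
    header = _b64(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    claims = _b64(json.dumps({"sub": sub, "scope": scope, "exp": int(now + valid_for)}).encode())
    return f"{header}.{claims}.{_sign(f'{header}.{claims}')}"


def decode_token(token: str, now: Optional[float] = None) -> dict:
    """Verify signature and expiry, then return the claims."""
    now = time.time() if now is None else now
    header, claims, signature = token.split(".")
    if not hmac.compare_digest(signature, _sign(f"{header}.{claims}")):
        raise ValueError("bad signature")
    payload = json.loads(_unb64(claims))
    if payload["exp"] <= now:
        raise ValueError("token expired")
    return payload
```

With this sketch, `issue_token("ti-1", "workload", valid_for=86400)` would still decode an hour later, while `issue_token("ti-1", "execution", valid_for=600)` issued at the same moment would raise `ValueError`. That is the gap the workload token is meant to cover while a task sits in the queue.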

For dag.test() / InProcessExecutionAPI, auth is bypassed and a stub JWTGenerator with a random secret is used so no signing key configuration is needed.

New config: execution_api.jwt_workload_token_expiration_time (default 86400s)

Built on @ashb's SecurityScopes foundation.

Security considerations

Even if a workload token is intercepted, it can only call /run, which already guards against running a task more than once (it returns 409 if the task isn't in QUEUED/RESTARTING state). All other endpoints reject workload tokens because they require the execution scope. The execution token issued by /run is short-lived and automatically refreshed, keeping the existing security posture for all API calls during task execution.
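The two guards described above can be sketched roughly as follows. This is a plain-Python illustration rather than the FastAPI SecurityScopes wiring used in the PR: the `authorize` and `run_task` functions, the scope table, the example paths, and the 403 status for a rejected scope are assumptions, as is the choice to let execution tokens call /run as well.

```python
# Hypothetical scope table: only /run accepts the long-lived workload scope.
ALLOWED_SCOPES = {"/run": {"workload", "execution"}}
DEFAULT_SCOPES = {"execution"}

# States from which /run may start a task, per the PR description.
RUNNABLE_STATES = {"queued", "restarting"}


def authorize(path: str, scope: str) -> int:
    """Return 200 if the token scope may call this path, else 403 (assumed status)."""
    allowed = ALLOWED_SCOPES.get(path, DEFAULT_SCOPES)
    return 200 if scope in allowed else 403


def run_task(state: str, scope: str) -> tuple:
    """Apply the scope check first, then the existing state guard."""
    if authorize("/run", scope) != 200:
        return 403, state
    if state not in RUNNABLE_STATES:
        # A replayed workload token lands here once the task has already started.
        return 409, state
    return 200, "running"
```

In this sketch a replayed workload token is harmless in both directions: `run_task("running", "workload")` yields 409, and `authorize("/connections/my_conn", "workload")` yields 403.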

Testing

Tested end-to-end with CeleryExecutor in Breeze: triggered a DAG and confirmed that tasks completed successfully, with the token swap happening transparently. Unit tests cover token generation, scope enforcement (accepted on /run, rejected elsewhere), invalid scope handling, execution token header in response, SDK client token swap and priority, and registry teardown to prevent test pollution.



@boring-cyborg boring-cyborg bot added the area:API (Airflow's REST/HTTP API) and area:task-sdk labels Jan 4, 2026
@anishgirianish anishgirianish force-pushed the fix/token-expiration-worker branch from b183c74 to 9c31417 Compare January 4, 2026 21:05
@anishgirianish anishgirianish force-pushed the fix/token-expiration-worker branch 3 times, most recently from c707ddc to 4ef9dfe Compare January 4, 2026 22:45
@eladkal eladkal added this to the Airflow 3.1.6 milestone Jan 6, 2026
@tirkarthi
Contributor

As per my understanding, this was removed in #55506 in favor of a middleware that refreshes the token. Are you running an instance with only the Execution API, separate from the api-server? Could this middleware approach be extended for task-sdk calls too?

cc: @vincbeck @pierrejeambrun

@anishgirianish
Contributor Author

Hi @tirkarthi,
Thanks for pointing out the middleware approach from #55506 - that's helpful context.

I took a stab at extending that pattern in #60197, handling expired tokens transparently in JWTBearer + middleware so no client-side changes are needed. Would love your thoughts on it.

Totally happy to go with whichever approach the team feels is better!

cc: @vincbeck @pierrejeambrun

@vincbeck
Contributor

vincbeck commented Jan 7, 2026

> Hi @tirkarthi, Thanks for pointing out the middleware approach from #55506 - that's helpful context.
>
> I took a stab at extending that pattern in #60197, handling expired tokens transparently in JWTBearer + middleware so no client-side changes are needed. Would love your thoughts on it.
>
> Totally happy to go with whichever approach the team feels is better!
>
> cc: @vincbeck @pierrejeambrun

Would love to hear @ashb's or @amoghrajesh's opinion on this one

Member

@ashb ashb left a comment

We can't do this approach. It lets any Execution API token be resurrected which fundamentally breaks lots of security assumptions -- it amounts to having tokens not expire. That is bad.

Instead, what we should do is generate a new kind of token (i.e. one with an extra/different set of JWT claims) that is only valid for the /run endpoint and valid for longer (say 24 hours, make it configurable), and this is what gets sent in the workload.

The run endpoint would then set the header to give the running task a "short lived" token (the one we have right now, basically) that is usable on the rest of the Execution API. This approach is safer as the existing controls in the /run endpoint already prevent a task being run more than once, which should also protect against "resurrecting" an expired token and using it to access things like connections etc. And we should validate that the token used on all endpoints but run explicitly lacks this new claim.

Member

@ashb ashb left a comment

Much better approach, and on the right track, thanks.

Some changes though:

  • "queue" is not the right thing to use, as these tokens could be used for executing other workloads soon (for instance we have already talked about wanting Dag level callbacks to be executed on the workers, not in the dag processor, which would be done by having a new type from the ExecuteTaskWorkload).

    so maybe we have "scope": "ExecuteTaskWorkload"?

  • A little bit of refactoring is needed before we are ready to merge this.

Comment thread airflow-core/src/airflow/api_fastapi/auth/tokens.py Outdated
Comment thread airflow-core/src/airflow/api_fastapi/execution_api/deps.py Outdated
Comment thread airflow-core/src/airflow/api_fastapi/execution_api/app.py Outdated
Comment thread airflow-core/src/airflow/api_fastapi/execution_api/app.py Outdated
Comment thread airflow-core/src/airflow/api_fastapi/execution_api/app.py Outdated
Comment thread airflow-core/src/airflow/api_fastapi/execution_api/deps.py
Comment thread airflow-core/src/airflow/api_fastapi/execution_api/routes/task_instances.py Outdated
Comment thread airflow-core/src/airflow/config_templates/config.yml Outdated
@ashb ashb self-requested a review January 9, 2026 12:09
@anishgirianish anishgirianish force-pushed the fix/token-expiration-worker branch from e7e3ae1 to e879863 Compare January 9, 2026 23:52
@anishgirianish anishgirianish changed the title from "Add token refresh mechanism for Execution API (#59553)" to "Two-token mechanism for task execution to prevent token expiration while tasks wait in executor queues (#59553)" Jan 10, 2026
@anishgirianish anishgirianish force-pushed the fix/token-expiration-worker branch from b511b8f to 57ac225 Compare January 10, 2026 07:07
@kaxil kaxil modified the milestones: Airflow 3.1.9, Airflow 3.2.1 Mar 26, 2026
Comment thread airflow-core/tests/unit/api_fastapi/execution_api/conftest.py
Comment thread airflow-core/src/airflow/api_fastapi/execution_api/routes/task_instances.py Outdated
Comment thread airflow-core/src/airflow/api_fastapi/auth/tokens.py
Comment thread airflow-core/src/airflow/api_fastapi/execution_api/app.py
Comment thread airflow-core/src/airflow/api_fastapi/execution_api/app.py Outdated
@anishgirianish anishgirianish force-pushed the fix/token-expiration-worker branch 7 times, most recently from bb96119 to 9eaf6dd Compare March 27, 2026 18:37
Contributor

Copilot AI left a comment

Pull request overview

Implements a two-token JWT flow for the Execution API so tasks can sit in executor queues without their auth expiring: a long-lived workload token is embedded in the queued workload, and a short-lived execution token is issued when the worker successfully calls PATCH /run and then refreshed normally during execution.

Changes:

  • Add configurable workload-token lifetime ([execution_api] jwt_workload_token_expiration_time) and generate workload-scoped tokens in executor workloads.
  • Allow workload-scoped tokens on PATCH /run and return an execution token in the Refreshed-API-Token response header.
  • Extend JWTGenerator to support per-token valid_for overrides; add/update unit tests and documentation to cover scope behavior and token swapping.
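On the client side, the swap amounts to replacing the stored token whenever a response carries the Refreshed-API-Token header. The sketch below is a hypothetical stand-in, not the Task SDK client: the `TokenSwappingClient` class and its method names are invented for illustration, and only the header name comes from this PR.

```python
from typing import Mapping


class TokenSwappingClient:
    """Hypothetical stand-in for the SDK client; not the real Task SDK class."""

    def __init__(self, workload_token: str) -> None:
        # The client starts with the long-lived workload token from the queue.
        self.token = workload_token

    def auth_headers(self) -> dict:
        """Headers to attach to the next Execution API request."""
        return {"Authorization": f"Bearer {self.token}"}

    def on_response(self, headers: Mapping[str, str]) -> None:
        """Adopt a refreshed token if the server sent one."""
        # HTTP header names are case-insensitive, so normalise before lookup.
        lowered = {name.lower(): value for name, value in headers.items()}
        refreshed = lowered.get("refreshed-api-token")
        if refreshed:
            self.token = refreshed
```

After a successful /run, calling `on_response` with the response headers would leave the client sending the execution token on every later request; responses without the header leave the current token untouched.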

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| devel-common/src/tests_common/test_utils/mock_executor.py | Tighten JWTGenerator mocking via spec to avoid brittle tests. |
| airflow-core/tests/unit/executors/test_workloads.py | Add tests validating workload token scope + expiry selection. |
| airflow-core/tests/unit/api_fastapi/execution_api/versions/head/test_task_instances.py | Add tests for /run token header and scope enforcement across endpoints. |
| airflow-core/tests/unit/api_fastapi/execution_api/conftest.py | Register a mocked JWTGenerator for execution API tests. |
| airflow-core/tests/unit/api_fastapi/auth/test_tokens.py | Add coverage for valid_for override and scope claim behavior. |
| airflow-core/src/airflow/executors/workloads/base.py | Generate workload-scoped tokens with workload-specific validity. |
| airflow-core/src/airflow/config_templates/config.yml | Introduce jwt_workload_token_expiration_time configuration entry. |
| airflow-core/src/airflow/api_fastapi/execution_api/routes/task_instances.py | Permit workload tokens on /run and issue execution token via response header. |
| airflow-core/src/airflow/api_fastapi/execution_api/app.py | Skip refresh for workload tokens; add in-process container override for dag.test(). |
| airflow-core/src/airflow/api_fastapi/auth/tokens.py | Add workload_valid_for and valid_for override support in JWT generation. |
| airflow-core/docs/security/jwt_token_authentication.rst | Document the two scopes, token delivery, swap on /run, and refresh behavior. |

Comment thread airflow-core/src/airflow/api_fastapi/execution_api/app.py Outdated
Comment thread airflow-core/src/airflow/api_fastapi/execution_api/app.py Outdated
@anishgirianish
Contributor Author

Hi @ashb @kaxil @amoghrajesh. Thank you very much for the last review. I have addressed all the feedback and updated the PR. Could you take another look whenever you get a chance? Thank you so much.

@ashb
Member

ashb commented Apr 13, 2026

Will look when I have a spare cycle, right now I'm digging in to issue 65010 though.

Contributor

@jscheffl jscheffl left a comment

Looks pretty good to me, and it is a very good improvement in my view! Two comments, not deal breakers but something to consider. Other feedback would be welcome as well.

try:
    generator: JWTGenerator = services.get(JWTGenerator)
    execution_token = generator.generate(extras={"sub": str(task_instance_id), "scope": "execution"})
except Exception:
Contributor

Do we need a broad exception here or can we point to specific one(s)?

  type: integer
  example: ~
  default: "600"
jwt_workload_token_expiration_time:
Contributor

I thought about this for a moment and am wondering: do we need a config parameter for this? There is already a task_queued_timeout parameter (see https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#task-queued-timeout); once it elapses, tasks that starve in the queue are marked as failed.

Would it also be an idea to make the JWT expiration equal to the queue timeout (maybe a few seconds more as a buffer)? I see no reason to default this JWT to 24h if the task is evicted from the queued state after 10 min by default (a value that is usually a good idea to raise in a loaded environment).

Member

I think you are right Jens, good catch.

Member

@ashb ashb left a comment

LGTM, but as Jens pointed out we should remove the extra new validity we used and use the existing queued timeout.

Oh, though I guess that could be not set, in which case those tokens would be valid for ever which probably isn't the correct behaviour. Hmmmm.

@jscheffl
Contributor

> LGTM, but as Jens pointed out we should remove the extra new validity we used and use the existing queued timeout.
>
> Oh, though I guess that could be not set, in which case those tokens would be valid for ever which probably isn't the correct behaviour. Hmmmm.

The queued timeout defaults to float(600.0), so it should not be endless. (I believe the default of only 10 min is way too short: today, if you do your first setup and hit the first real load, it takes a long time to understand where the problem lies...)

So we should point this out in the docs with this PR that JWT token lifetime is also influenced by this parameter.
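One way the suggestion in this thread could look, purely as a sketch: derive the workload token lifetime from the queued timeout plus a small buffer, and fall back to a fixed value if the timeout is unset or non-positive. The `workload_token_validity` function, the 30 s buffer, and the 24h fallback are assumptions for illustration; the thread has not settled on any of them.

```python
import math
from typing import Optional


def workload_token_validity(
    task_queued_timeout: Optional[float],
    buffer_seconds: float = 30.0,     # assumed safety margin, not from the PR
    fallback_seconds: float = 86400.0,  # assumed fallback when the timeout is unset
) -> int:
    """Return a workload token lifetime in whole seconds."""
    if task_queued_timeout is None or task_queued_timeout <= 0:
        # Avoid tokens that never expire when no queue timeout is configured.
        return math.ceil(fallback_seconds)
    return math.ceil(task_queued_timeout + buffer_seconds)
```

With the default queued timeout of 600.0 this gives a 630 s workload token, so raising the queue timeout in a loaded environment would lengthen the token lifetime along with it.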

Labels

area:task-sdk, ready for maintainer review (set after triaging when all criteria pass)

Development

Successfully merging this pull request may close these issues.

ExecuteTask activity token can expire before the task starts running