Skip to content

Edge worker sometimes fails on startup with 405: Method not allowed #60261

@ron-gaist

Description

@ron-gaist

Apache Airflow Provider(s)

edge3

Versions of Apache Airflow Providers

apache-airflow-providers-celery==3.13.0
apache-airflow-providers-common-compat==1.8.3
apache-airflow-providers-common-io==1.6.4
apache-airflow-providers-common-sql=1.28.2
apache-airflow-providers-fab==3.0.1
apache-airflow-providers-standard==1.9.1
apache-airflow-providers-postgres==6.4.0
apache-airflow-providers-edge3==1.4.0

Apache Airflow version

3.1.2

Operating System

Debian GNU/Linux 12 (bookworm)

Deployment

Official Apache Airflow Helm Chart

Deployment details

Setup scale: (yes we know this is very large scale)
8 api servers x 64 worker processes each
9 Edge workers attempting to connect continuously (and continuously failing)
etc... (rest are not relevant)

What happened

Upon edge worker startup we get the following error:

405 Client Error: Method not allowed for url: http:///edge_worker/v1/worker/

Possible relevant values in the helm chart values.yaml:

apiSecretKey: <VALUE>
jwtSecret: <VALUE>
config:
    edge:
         api_enabled: 'True'
         api_url: ''http://<AIRFLOW-API-SERVER-URL>/edge_worker/v1/rcpapi'

This behavior seems to come and go - workers can fail on startup and after some time start up normally.

What you think should happen instead

No response

How to reproduce

Reproducing this seems to be tricky. Even for us this behavior sometimes exists and other times not. The best information we can give for reproducing this seems to be the airflow config.

Anything else

405 Method not allowed seems to suggest that the API servers did not start with the necessary portion of the edge plugin.
This is very odd, and is the kind of behavior we'd expect if the config lacked edge: api_enabled: 'True', api_url: '...'
However we double and triple checked that these configurations are set in the api servers (and on the edge worker too, for what it's worth)

Additional note (could be related):
Upon startup of each api-server we get an error related to edge plugin import:

[error] Failed to import plugin edge_executor [airflow.plugins_manager] loc=plugins_manager.py:268
....
psycop2.errors.LockNotAvailable: canceling statement due to lock timeout

We checked the lock timeout on our postgres db and it is set to 0 (unlimited)

We aren't pointing to this as the root cause because previously edge has worked for us while this plugin has been similarly not imported. But this might point in the right direction
We know this isn't related to the amount of api_servers (we tried with 1) but it might be related to the amount of worker processes - we don't know for sure.

Thank you for the time considering our issue!!

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

Metadata

Metadata

Assignees

No one assigned

    Labels

    area:providerskind:bugThis is a clearly a bugneeds-triagelabel for new issues that we didn't triage yet

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions