Add flow-level SLA property #5857

Open
anna-geller opened this issue Nov 8, 2024 · 1 comment · May be fixed by #5907
Labels: area/backend (Needs backend code changes), enhancement (New feature or request), kind/customer-request (Requested by one or more customers)

Comments

@anna-geller
Member

anna-geller commented Nov 8, 2024

Feature description

In the first iteration, we'll only add MAX_DURATION to satisfy use cases such as this one:

I have a flow that usually takes 10-15 min to run. I see that it has been running for 4 days! Is there a way to kill a flow if it takes more than 20 min?

sla:
  - type: MAX_DURATION
    maxDuration: PT20M
    behavior: KILL # optional; defaults to WARN
    # CANCEL: cancel the execution gracefully, i.e. without terminating currently running tasks
    # KILL: kill the execution immediately once maxDuration is reached
    labels: # optional: add labels to easily identify all executions with missed SLA
      sla: missed
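
As a rough illustration of how one entry of the list above could be read, here is a minimal Python sketch. It is not Kestra's implementation: the names `Behavior`, `MaxDurationSla`, and `parse_duration` are hypothetical, and the duration parser only handles the time part of ISO 8601 durations (hours, minutes, seconds). The only details taken from the proposal are the property names and the WARN default.

```python
import re
from dataclasses import dataclass, field
from datetime import timedelta
from enum import Enum


class Behavior(Enum):
    WARN = "WARN"      # default, per the proposal
    CANCEL = "CANCEL"  # graceful: running tasks are not terminated
    KILL = "KILL"      # immediate once maxDuration is reached


# Assumption: only PT..H..M..S durations are supported in this sketch.
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def parse_duration(value: str) -> timedelta:
    """Parse a simple ISO 8601 duration such as PT20M or PT1H30M."""
    match = _DURATION.match(value)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {value!r}")
    hours, minutes, seconds = (int(g or 0) for g in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


@dataclass
class MaxDurationSla:
    max_duration: timedelta
    behavior: Behavior = Behavior.WARN
    labels: dict = field(default_factory=dict)

    @classmethod
    def from_dict(cls, raw: dict) -> "MaxDurationSla":
        """Build an SLA from one already-parsed entry of the `sla` list."""
        if raw.get("type") != "MAX_DURATION":
            raise ValueError(f"expected MAX_DURATION, got {raw.get('type')!r}")
        return cls(
            max_duration=parse_duration(raw["maxDuration"]),
            behavior=Behavior(raw.get("behavior", "WARN")),
            labels=dict(raw.get("labels", {})),
        )
```

Omitting `behavior` in the input dictionary yields `Behavior.WARN`, which mirrors the "optional property" comment in the YAML.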

Context

Short Q&A to explain why this design was chosen.

Why is it a list?

A list leaves room for more than one SLA per flow. In the future, sla may evolve to trigger a specific behavior when given outputs or metrics violate the specified SLA requirements. Example:

sla:
  - type: OUTPUT_CONDITION
    condition: "{{ outputs.mytask.status_code == 200 }}"
    behavior: CANCEL
    labels: # optional arbitrary KV for filtering
      sla: badOutput

Why is it an ENUM rather than a plugin class?

The SLA is evaluated by the executor, so it shouldn't be a plugin that processes data directly. Any processing still needs to be done by a worker; if a plugin emits outputs or metrics, we'll be able to capture them within the SLA.
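
To make the enum idea concrete, the following Python sketch shows an evaluator that dispatches on a fixed set of SLA types instead of loading plugin classes. It is only an assumption about how such a check might look; `SlaType` and `violated_slas` are hypothetical names, and each entry is assumed to be a dictionary with an already-parsed `max_duration`.

```python
from datetime import timedelta
from enum import Enum


class SlaType(Enum):
    MAX_DURATION = "MAX_DURATION"
    # Further types (e.g. an output-based one) would be added here later.


def violated_slas(slas: list, elapsed: timedelta) -> list:
    """Return the entries of the `sla` list that the execution has breached."""
    violated = []
    for sla in slas:
        sla_type = SlaType(sla["type"])  # unknown types raise ValueError
        if sla_type is SlaType.MAX_DURATION and elapsed > sla["max_duration"]:
            violated.append(sla)
    return violated
```

Because the set of types is closed, the evaluator needs no plugin loading; it only compares values it already has, such as the elapsed time.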

Why is maxDuration counted from CREATED state?

Some executions never reach the RUNNING state, for various reasons. Counting maxDuration from the moment the execution is CREATED covers both of these use cases at once:

  1. "KILL this execution if it fails to move into a RUNNING state within 1 hour after creation".
  2. "KILL this execution if it doesn't finish within 1 hour after creation".
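
The two use cases above reduce to a single comparison when the clock starts at creation. The Python sketch below illustrates this under stated assumptions: `max_duration_breached` is a hypothetical helper, and the terminal state names are illustrative rather than an authoritative list.

```python
from datetime import datetime, timedelta, timezone

# Assumption: illustrative terminal states; a finished execution cannot breach.
TERMINAL_STATES = {"SUCCESS", "FAILED", "KILLED", "CANCELLED"}


def max_duration_breached(
    created_at: datetime, state: str, now: datetime, max_duration: timedelta
) -> bool:
    """True if a non-terminal execution has outlived max_duration since creation."""
    if state in TERMINAL_STATES:
        return False
    # The current state is otherwise irrelevant: an execution stuck in CREATED
    # and one still RUNNING are both measured from the creation timestamp.
    return now - created_at > max_duration


created = datetime(2024, 11, 8, 12, 0, tzinfo=timezone.utc)
two_hours_later = created + timedelta(hours=2)
stuck = max_duration_breached(created, "CREATED", two_hours_later, timedelta(hours=1))
slow = max_duration_breached(created, "RUNNING", two_hours_later, timedelta(hours=1))
```

Both `stuck` and `slow` evaluate to `True` here, which is the point of counting from CREATED rather than from RUNNING.
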
@anna-geller
Member Author

@loicmathieu we have a big prospect that needs this feature so bumped to P0 🙏

loicmathieu added a commit that referenced this issue Nov 13, 2024
@loicmathieu loicmathieu linked a pull request Nov 13, 2024 that will close this issue
Projects
Status: Backlog