Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(flow): consider to fail a parent not in waiting-children #3098

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

roggervalf
Copy link
Collaborator

@roggervalf roggervalf commented Feb 22, 2025

Why

  1. Why is this change necessary? When failing a parent we are only considering that it can only be in waiting-children. If it's in another state we do not fail it. This is in the case where failParentOnFailure is true. But we can dynamically add jobs when a parent is active. We should consider to fail a parent later.

How

  1. How did you implement this? A child can be in 4 different stages:
  • unprocessed - dependencies key
  • processed successfully - processed key
  • processed unsuccessfully and ignored - failed key
  • processed unsuccessfully and not ignored - (new key being added as unsuccessful)

When a child with failParentOnFailure is true and this child fail. As it's required to pass a job to waiting-children to verify that all children are completed to continue, we can validate if unsuccessful key has a len greater than 0 to move it to failed instead of waiting-children. Take in count that if users do not use moveToWaitingChildren method, parent would fail anyway as not all children will be completed.

Additional Notes (Optional)

Any extra info here.

@roggervalf roggervalf changed the title fix(flow): consider to fail a parent not in waiting-children fix(flow): consider to fail a parent not in waiting-children [DRAFT] Feb 22, 2025
@roggervalf roggervalf changed the title fix(flow): consider to fail a parent not in waiting-children [DRAFT] fix(flow): consider to fail a parent not in waiting-children Feb 22, 2025
Copy link
Contributor

@manast manast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@manast
Copy link
Contributor

manast commented Mar 3, 2025

Code Review for BullMQ PR

Why is this Change Necessary?

In the current implementation of BullMQ, when a child job fails with the failParentOnFailure option set to true, the parent job is only failed if it is in the "waiting-children" state. This limitation poses a problem because parent jobs can dynamically add child jobs while in other states, such as "active." If a child job fails in such scenarios, the parent job does not fail as expected, leading to inconsistent behavior. This change is necessary to ensure that a parent job fails when a child with failParentOnFailure set to true fails, regardless of the parent's current state. This improves the reliability and correctness of job failure handling in BullMQ, addressing cases where dynamic job addition occurs.

How Was This Implemented?

To address this issue, the PR introduces a new mechanism to track failed child jobs and fail the parent accordingly. Here's how it was implemented:

  1. Introduction of the unsuccessful Set:

    • A new Redis hash set, unsuccessful, is added for each job to track child jobs that have failed with failParentOnFailure set to true. This set complements the existing keys that track child job states:
      • Unprocessed: Stored in the dependencies key.
      • Processed Successfully: Stored in the processed key.
      • Processed Unsuccessfully and Ignored: Stored in the failed key.
      • Processed Unsuccessfully and Not Ignored: Stored in the new unsuccessful key.
  2. Tracking Failed Children:

    • When a child job fails and failParentOnFailure is true, it adds itself to the parent's unsuccessful set. This is implemented in scripts like moveToFinished-14.lua and moveStalledJobsToWait-9.lua.
  3. Failing the Parent:

    • The moveToWaitingChildren-8.lua script, which replaces the older version 5, now checks the unsuccessful set before moving a job to the "waiting-children" state. If the set has any entries (i.e., its length is greater than 0), the script fails the parent job by moving it to the "failed" state instead of "waiting-children." This ensures that the parent fails even if it was in a state like "active" when the child failed.
    • For users not explicitly calling moveToWaitingChildren, the parent would still fail eventually because not all children will complete successfully, but this change makes the failure explicit and immediate when attempting to transition to "waiting-children."
  4. Key Changes in the Code:

    • Python (scripts.py):
      • Updated moveToWaitingChildren to use version 8 of the script.
      • Modified moveToWaitingChildrenArgs to include new keys: dependencies, unsuccessful, failed, and events.
    • TypeScript (scripts.ts):
      • Updated moveToWaitingChildrenArgs to include the same new keys, ensuring consistency with the Lua script's expectations.
    • Lua Scripts:
      • moveToWaitingChildren-8.lua:
        • Checks jobUnsuccessfulKey length (HLEN) and fails the job if non-zero, adding it to the failed key with a reason of "children are failed."
        • Propagates failure to grandparents if fpof is true for the parent.
      • moveToFinished-14.lua:
        • Adds failed child to the parent's unsuccessful set when fpof is true.
      • moveStalledJobsToWait-9.lua:
        • Similarly updates the unsuccessful set for stalled jobs.
      • moveParentFromWaitingChildrenToFailed.lua:
        • Propagates failure to grandparents by adding the parent to the grandparent's unsuccessful set.
      • removeJobKeys.lua:
        • Deletes the unsuccessful key during job removal to prevent memory leaks.
      • removeJob-3.lua:
        • Recursively removes jobs listed in the unsuccessful set when removing a parent job.
  5. Test Case:

    • A new test in test_flow.ts verifies the behavior:
      • Adds a parent job with steps (Initial, Second, Third, Finish).
      • Dynamically adds a child with failParentOnFailure: true in the Initial step.
      • Fails the child job in a separate queue.
      • In the Third step, attempts to move the parent to "waiting-children," expecting it to fail due to the unsuccessful child.
      • Validates that the parent fails with the reason "children are failed" and transitions from "active" to "failed."

Additional Notes

  • Consistency and Cleanup:

    • The unsuccessful set is cleaned up in removeJob-3.lua and removeJobKeys.lua, ensuring no stale entries remain after job removal.
    • The failure propagation to grandparents ensures consistency in multi-level job hierarchies when fpof is enabled.
  • Performance Considerations:

    • Adding the unsuccessful set increases memory usage slightly for jobs with failed children. However, Redis handles hash sets efficiently, and BullMQ is designed for large-scale job processing, so the impact should be minimal.
  • Edge Cases:

    • The implementation handles cases where the parent is active during child failure. However, additional tests could verify scenarios like simultaneous child failures or state transitions post-failure.
  • Versioning:

    • The script version jumped from 5 to 8 for moveToWaitingChildren, indicating significant changes. The keys and arguments in Python and TypeScript match the Lua script's expectations (8 keys, 5 arguments), ensuring compatibility.

Conclusion

This PR effectively addresses the limitation in BullMQ's failure handling by introducing the unsuccessful set to track failed children and failing the parent when transitioning to "waiting-children" if any children have failed with failParentOnFailure set to true. The changes are consistent across Lua, Python, and TypeScript, and the test case validates the new behavior. This enhancement makes BullMQ more robust for dynamic job workflows, ensuring parent jobs fail as expected regardless of their state.

@manast
Copy link
Contributor

manast commented Mar 3, 2025

Are we sure we want to use a plain SET instead of a ZSET for the new unsuccessful state?

@roggervalf
Copy link
Collaborator Author

Are we sure we want to use a plain SET instead of a ZSET for the new unsuccessful state?

this is something I though in the begining but them in case of inspecting why one of this child failed we should get child record. I though it can be handled similar to get ignored failures where it's also a set that saves these failures to be easily inspected. Something to take in count is that if child is auto removed, failure will still be saved in this set, so a parent. Wdyt? Later I can add a method to retrieve these failures in a paginated way

@roggervalf
Copy link
Collaborator Author

after an internal conversation we are trying the approach of using a zset for saving those unsuccessful jobs for now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants