Problems
The applier/destroyer pipelines interleave Apply/Prune tasks with Wait tasks, waiting for each apply/prune group to become Current (after apply) or NotFound (after prune). Combined with complex resource sets that have multiple dependency branches, this behavior has the following impact (a minimal sketch follows the list):
1. apply/prune can be blocked for some objects even if their dependencies are ready
2. a Wait task blocks on the slowest reconciliation in the previous apply phase, even if it doesn't time out
3. because a wait timeout currently causes the pipeline to terminate, any object that causes a wait timeout blocks all objects in subsequent phases from being applied/pruned
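
Here's a minimal sketch of that shape, using hypothetical simplified types rather than the real cli-utils task runner API: the queue is flat, every Wait task gates everything behind it, and a wait error terminates the whole run.

```go
package main

import (
	"fmt"
	"time"
)

// Task is a stand-in for the pipeline's task abstraction.
type Task interface{ Run() error }

// ApplyTask applies one group of objects.
type ApplyTask struct{ Objects []string }

func (t ApplyTask) Run() error {
	fmt.Println("applying:", t.Objects)
	return nil
}

// WaitTask blocks until every object in the previous apply group is
// Current, so one slow object delays the whole next phase (problems 1 and 2).
type WaitTask struct {
	Objects []string
	Timeout time.Duration
}

func (t WaitTask) Run() error {
	// Real code would poll status here; pretend one object never
	// reconciles, so the timeout fires.
	return fmt.Errorf("timed out waiting for %v after %s", t.Objects, t.Timeout)
}

func main() {
	queue := []Task{
		ApplyTask{Objects: []string{"cluster1", "cluster2"}},
		WaitTask{Objects: []string{"cluster1", "cluster2"}, Timeout: time.Minute},
		// Never reached if the wait above times out (problem 3),
		// even though node pool2 only depends on cluster2.
		ApplyTask{Objects: []string{"node pool1", "node pool2"}},
	}
	for _, task := range queue {
		if err := task.Run(); err != nil {
			fmt.Println("pipeline terminated:", err)
			return // all subsequent tasks are skipped
		}
	}
}
```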
 
Example 1
For example, here's an infra dependency chain with branches (just deploying two GKE clusters with a CC cluster):
- namespace
- rbac
- GKE cluster1
  - node pool1 (depends-on cluster1)
- GKE cluster2
  - node pool2 (depends-on cluster2)
If any cluster fails to apply (ex: error in config), then both node pools are blocked.
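
For reference, here's roughly what the explicit edge above looks like as a depends-on annotation, wrapped in Go so it's runnable. The value format for namespaced objects is `<group>/namespaces/<namespace>/<kind>/<name>`; the Config Connector group/kind/apiVersion here are illustrative assumptions, not verified.

```go
package main

import "fmt"

// nodePool1 shows the cli-utils depends-on annotation that creates the
// explicit edge from node pool1 to cluster1. The Config Connector
// group/kind/apiVersion below are illustrative, not verified.
const nodePool1 = `apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerNodePool
metadata:
  name: node-pool1
  namespace: infra
  annotations:
    config.kubernetes.io/depends-on: container.cnrm.cloud.google.com/namespaces/infra/ContainerCluster/cluster1
`

func main() { fmt.Print(nodePool1) }
```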
Example 2
Another example is using the same apply for multiple namespaces, or for CRDs plus the namespaces and workloads that use them:
- CRD
- namespace1
  - deployment1
- namespace2
  - deployment2
If any CRD or namespace fails to apply (ex: blocked by a policy webhook, or a config error), then all the deployments are blocked, along with everything else in those namespaces.
Possible solutions
Continue on wait timeout
- This helps with problems 1 and 3, but not problem 2, and has the consequence of making failure take longer, because all resources will be applied even when we already know their dependencies aren't ready (see the sketch below).
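
Reusing the hypothetical Task/WaitTask types from the first sketch, "continue on wait timeout" would roughly mean downgrading wait errors from fatal to reportable:

```go
// runContinueOnTimeout reuses the hypothetical Task/WaitTask types from
// the sketch above. Wait timeouts are reported but no longer terminate
// the pipeline, so later phases still run (helps problems 1 and 3).
func runContinueOnTimeout(queue []Task) {
	for _, task := range queue {
		if err := task.Run(); err != nil {
			if _, isWait := task.(WaitTask); isWait {
				fmt.Println("wait timeout, continuing:", err)
				continue // downside: we apply objects whose deps we know aren't ready
			}
			fmt.Println("pipeline terminated:", err)
			return
		}
	}
}
```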
 
Dependency filter
- This would help remediate the side effect of "continue on wait timeout" by skipping apply/prune for objects with unreconciled dependencies (a sketch follows this list)
- This would need to be used on both apply and prune
- This would need to share logic with the graph sorter, which currently handles identifying dependencies (depends-on, apply-time-mutation, CRDs, and namespaces)
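
A rough sketch of what such a filter could look like; the interface is hypothetical, not the real cli-utils filter API:

```go
package filter

import "fmt"

// DependencyFilter skips apply/prune for objects whose dependencies
// have not reconciled. Hypothetical shape, not the real cli-utils API.
type DependencyFilter struct {
	// Reconciled reports whether an object reached its desired state:
	// Current after apply, NotFound after prune.
	Reconciled func(id string) bool
	// DependenciesOf should reuse the graph sorter's logic (depends-on,
	// apply-time-mutation, CRDs, namespaces).
	DependenciesOf func(id string) []string
}

// Skip returns true, with a reason, if the object should not be
// applied/pruned in this run.
func (f DependencyFilter) Skip(id string) (bool, string) {
	for _, dep := range f.DependenciesOf(id) {
		if !f.Reconciled(dep) {
			return true, fmt.Sprintf("dependency %q not reconciled", dep)
		}
	}
	return false, ""
}
```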
 
Parallel apply
- Applying objects in parallel within each Apply Task (and deleting in parallel within Prune Tasks) would speed up end-to-end apply time significantly, helping to mitigate the cost of dependencies, but not actually solving problem 1 or problem 2 (see the sketch below)
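
A sketch of what parallel apply inside one task could look like, assuming a hypothetical per-object applyObject function. Objects grouped into the same task have no edges between each other, so running them concurrently is safe:

```go
package apply

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// applyParallel applies every object in one Apply Task concurrently.
// applyObject is a hypothetical per-object apply function.
func applyParallel(ctx context.Context, objs []string,
	applyObject func(context.Context, string) error) error {
	g, ctx := errgroup.WithContext(ctx)
	for _, obj := range objs {
		obj := obj // capture loop variable (needed before Go 1.22)
		g.Go(func() error { return applyObject(ctx, obj) })
	}
	return g.Wait()
}
```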
 
Async (graph) apply
- Building a branching pipeline, instead of flattening everything into a single synchronous pipeline, would isolate dependencies so that they only block the things that depend on them (see the sketch after this list).
- This is probably the best solution, but also the most complex and risky.
- This would probably require changing how events are tested/used. Consuming them in a single for loop might not be the best strategy anymore; async listeners might be better.
- This would make both Dependency filter and Parallel apply obsolete.
- This might also make Continue on wait timeout obsolete, because there may be more than one task branch executing, and one branch could terminate while the others continue.
- This is the only proposal that solves problem 2, which is going to become a bigger problem as soon as people start using depends-on and apply-time-mutation on infra resources (like GKE clusters) that take a long time to reconcile.
- This solution might also make it easier to add support for advanced features, like blocking on Job completion or lifecycle directives (ex: delete before apply).
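
A minimal sketch of the branching idea, with hypothetical types: every object gets its own goroutine and blocks only on its direct dependencies, so a failed branch skips its own descendants without stopping anything else.

```go
package graph

import "sync"

// node is one object in the dependency graph.
type node struct {
	name string
	deps []string
	done chan struct{} // closed once this node finishes (success or failure)
	ok   bool
}

// run executes the whole graph concurrently. apply is a hypothetical
// apply-and-wait-for-reconcile function for one object.
func run(nodes map[string]*node, apply func(name string) bool) {
	var wg sync.WaitGroup
	for _, n := range nodes {
		wg.Add(1)
		go func(n *node) {
			defer wg.Done()
			defer close(n.done) // runs before wg.Done (defers are LIFO)
			for _, d := range n.deps {
				dep := nodes[d]
				<-dep.done // block only on this node's own dependencies
				if !dep.ok {
					return // dependency failed: skip this branch only
				}
			}
			n.ok = apply(n.name)
		}(n)
	}
	wg.Wait()
}
```

Each node's done channel would be created with make(chan struct{}) when the graph is built; closing it unblocks all dependents at once, whether the node succeeded or failed.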
 