Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LifecycleGroup concurrent start and start timeout #34634

Closed
fralalonde opened this issue Mar 21, 2025 · 11 comments
Closed

LifecycleGroup concurrent start and start timeout #34634

fralalonde opened this issue Mar 21, 2025 · 11 comments
Assignees
Labels
in: core Issues in core modules (aop, beans, core, context, expression) type: enhancement A general enhancement
Milestone

Comments

@fralalonde
Copy link

fralalonde commented Mar 21, 2025

Context

Our app communicates with multiple (physical) machines, each being driven by a separate bean. When the app starts, each driver bean resets its machine to a known state. This reset can take a few minutes before which the rest of the app (UI, etc) should not be available. The driver beans implement SmartLifecycle, hardware reset is performed on start().

All machines can be reset at the same time, so to minimize startup time I'm currently using a custom LifecycleProcessor forked off the DefaultLifecycleProcessor , for which I've implemented concurrent start for beans within the same phase (in LifecycleGroup::start()). This makes the LifecycleGroup start take as long as the slowest machine's reset, rather than sequentially resetting each one which would be unnecessarily slow.

Moreover, I've duplicated the existing per-phase (stop) timeout parameter to a start timeout, to make sure the app fails if starting takes more than some expected limit rather than remain in start limbo forever.

Proposal

My main assumption is that situations where Lifecycle beans depend on external resources happen frequently, not just in our case. Spring applications could gain both in performance and resilience by making it easier to use concurrent start / stop operations.

To do so would involve either:

  • Integrate concurrent operation + timeout in the existing DefaultLifecycleProcessor, using an opt-in switch per phase (default behavior would remain sequential start with no timeout) (e.g. .setConcurrentStart(phase, true))

OR

  • Make it possible to override the LifecycleGroup class with a custom class to allow implementing aforementioned concurrent start mechanism without having to duplicate the entire DefaultLifecycleProcessor.

Thoughts

  • Beans within the same phase should be safe to start concurrently, since by design they should not have inter-dependencies. While I know this is true for my app, this is not a part of the current Lifecycle contract and thus the default should remain sequential, non-timeout start.

  • The underlying support for concurrency could vary a lot from case to case (virtual threads? limit number of concurrent processes?), so maybe the extendable DefaultLifecycleProcessor option is more interesting.

@spring-projects-issues spring-projects-issues added the status: waiting-for-triage An issue we've not yet triaged or decided on label Mar 21, 2025
@jhoeller jhoeller self-assigned this Mar 21, 2025
@jhoeller jhoeller added the in: core Issues in core modules (aop, beans, core, context, expression) label Mar 21, 2025
@jhoeller
Copy link
Contributor

On a related note: As of 6.2, the framework comes with a first-class option for background bootstrapping: #19487. However, that's at the bean creation level, not at the lifecycle start/stop level which comes as a final step in application startup. That background bootstrap mechanism is rather about using multiple CPU cores for expensive configuration and bean construction efforts on startup.

It's interesting to see that there is also a benefit in concurrent invocations of Lifecycle.start. Our assumption was that start() would usually be implemented in a quick enough fashion, doing most of the work asynchronously. Whereas in your scenario, the start() calls actually seem to block, so I assume that the component-level start() implementations do not hand off anything to a thread pool themselves? In your custom LifecycleProcessor variant, are you using a processor-level thread pool to manage concurrent calls to component-level start() implementations and eventually joining for their collective completion (per lifecycle group)?

@fralalonde
Copy link
Author

I've considered it, but wouldn't asynchronous initialization defeat the purpose of having Lifecycle phases? I assume that if component B initializes in a later phase than component A, it can rely on it having completed it's initialization, which wouldn't be guaranteed without a callback from A saying it's done starting() (...or failed) or the LifecycleProcessor polling beans of a phase until they're all running() before moving on to the next phase.

My goal in binding bean initialization to the context's lifecycle is to guarantee the integrity of the app. If a bean fails to start() (or takes too long), the context fails to refresh(), making the error obvious rather than having to deal with dirty intermediate states. The blocking behavior of start() is a benefit because it prevents accessing the app before the system is fully up.

I was thinking that "ready-flag" bean could be declared as the sole bean of the last lifecycle phase and serve as an indicator to a RequestFilter that the system is now in a stable state. This way the webserver would be accessible early, giving out polite error messages until the context init either completes, fails or times out.

@jhoeller
Copy link
Contributor

Sounds sensible. I don't think there is anything wrong with this approach, and we can certainly open up DefaultLifecycleProcessor for better extensibility and consider concurrent starting as a first-class option.

Can you elaborate a bit on the behavior that you are currently implementing in a custom LifecycleProcessor variant?

@fralalonde
Copy link
Author

fralalonde commented Mar 21, 2025

Nothing fancy, I copied the existing DefaultLifecycleProcessor as a basis, then stripped out the CRaC stuff because I don't have the required dependencies.

The core of my changes are in ConcurrentLifecycleProcessor.LifecycleGroup::start()

        public void start() {
            if (this.members.isEmpty()) {
                return;
            }
            if (logger.isDebugEnabled()) {
                logger.debug("Starting beans in phase " + this.phase);
            }

            List<CompletableFuture<Void>> startFutures = this.members.stream()
                    .map(member -> CompletableFuture.runAsync(() -> doStart(this.lifecycleBeans, member.name, this.autoStartupOnly)))
                    .toList();

            try {
                CompletableFuture.allOf(startFutures.toArray(new CompletableFuture[0]))
                        .get(startTimeout, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                throw new IllegalStateException("Timeout exceeded starting beans in phase " + this.phase, e);
            } catch (InterruptedException | ExecutionException e) {
                throw new IllegalStateException("Error starting beans in phase " + this.phase, e);
            }
        }

The only changes to the containing ConcurrentLifecycleProcessor class are about the now-split start/stop timeout config fields:

    private final Map<Integer, Long> timeoutsForStartPhases = new ConcurrentHashMap<>();

    private volatile long timeoutPerStartPhase = 10000;

    private final Map<Integer, Long> timeoutsForShutdownPhases = new ConcurrentHashMap<>();

    private volatile long timeoutPerShutdownPhase = 10000;

 [...]

    private long determineStartTimeout(int phase) {
        Long timeout = this.timeoutsForStartPhases.get(phase);
        return (timeout != null ? timeout : this.timeoutPerStartPhase);
    }

    private long determineStopTimeout(int phase) {
        Long timeout = this.timeoutsForShutdownPhases.get(phase);
        return (timeout != null ? timeout : this.timeoutPerShutdownPhase);
    }

Keep in mind this is code from my proof on concept and so may not cover all the bases. It ran ok when I tested it, both beans started() at the same time and total init time was equivalent to the longest delay of the two beans. I'm comfortable enough with it to ship it as part of the app but I'm conscious it may not be up to the standards of a reusable library.

Notably missing here are the boolean flags to enable parallel operation per phase, which I do not require.

Cursory analysis makes me think that making DefaultLifecycleProcessor itself extensible might not be possible because of its LifecycleGroup private "associated" class, which is where the core changes go. Opening up the mechanism may require a parent AbstractLifecycleProcessor of some sort as it is done elsewhere.

I'm happy to keep discussing this but I'm sure you have preferences about how this should be done so I'll let you go first.

@jhoeller jhoeller added type: enhancement A general enhancement and removed status: waiting-for-triage An issue we've not yet triaged or decided on labels Mar 21, 2025
@jhoeller jhoeller modified the milestones: 7.0.x, 7.0.0-M4 Mar 21, 2025
fralalonde added a commit to fralalonde/spring-framework that referenced this issue Mar 25, 2025
Includes a demonstration ConcurrentLifecycleProcessor class

As per spring-projectsgh-34634
fralalonde added a commit to fralalonde/spring-framework that referenced this issue Mar 25, 2025
Includes a demonstration ConcurrentLifecycleProcessor class

As per spring-projectsgh-34634
@fralalonde
Copy link
Author

A proof of concept about opening up DefaultLifecycleProcessor enough for a derivative ConcurrentLifecycleProcessor to be implemented. The changes were less involved than I thought.

Note that this is not a proposition to make ConcurrentLifecycleProcessor official in any way, it is included here for demonstration purposes only.

@jhoeller
Copy link
Contributor

jhoeller commented Apr 1, 2025

On review, I'm actually in favor of adding this capability to DefaultLifecycleProcessor itself. I've introduced setTimeout(s)ForStartupPhase(s) which allows for opting into a concurrent startup phase with a timeout rather than the sequential startup without a timeout. -1 is the default and indicates the latter. Specifying a concrete timeout for a given phase implies that the beans in that phase will be started concurrently, cancelling the startup if the timeout is not met for that phase.

There is intentionally no general setTimeoutPerStartupPhase method since most phases would not benefit from asynchronous bootstrapping. In a typical application, there will be TaskExecutor/TaskScheduler beans or DefaultMessageListenerContainer beans, all of which implement SmartLifecycle but with very quick start/stop callbacks. I would rather expect custom expensive startup beans to be grouped into a certain phase or certain phases, with only those phases configured for concurrent startup.

One more important note: We never use the JVM's default common pool in framework facilities. As a consequence, for an actual concurrent startup, a bootstrap Executor needs to be set for the application context, typically through a "bootstrapExecutor" bean. This is the same executor that is also used for Spring Framework 6.2's background bootstrapping at the bean level.

I assume this would work for your purposes? It should also make sense for applications that are fine-tuning their concurrent startup through Spring Framework 6.2's background bootstrapping already, letting them opt into concurrency for certain lifecycle phases as well. As straightforward as I have it here, I'd be happy to roll this into our 6.2.6 release in April (making it into Boot 3.5).

@jhoeller jhoeller modified the milestones: 7.0.0-M4, 6.2.6 Apr 1, 2025
@fralalonde
Copy link
Author

Yes, this would certainly work for us and seems general purpose enough to be useful to others. Reusing the bootstrap executor makes sense.

One note: The main feature here is (IMO) the parallel startup, with the timeout being an added safety constraint. I might reflect this in the method name, e.g. setConcurrentStartForPhase(timeout) (or similar). This would highlight the difference from the existing sequential stopTimeout, which might otherwise create expectations of "symmetricality". Then again, this is mere cosmetics and I will be happy to use the feature no matter its name.

Thank you.

@jhoeller
Copy link
Contributor

jhoeller commented Apr 1, 2025

Good point, trying to model this after the shutdown phase timeouts creates a bit of a false impression since it is not actually symmetric indeed (as also indicated by the lack of a setTimeoutPerStartupPhase method). I've switched to setConcurrentStartupForPhase(int phase, long timeout) and setConcurrentStartupForPhases(Map<Integer, Long>) accordingly.

@jhoeller
Copy link
Contributor

jhoeller commented Apr 2, 2025

@fralalonde feel free to give this an early try against the latest 6.2.6 snapshot, just to make sure that it will actually work for your scenario...

@fralalonde
Copy link
Author

@jhoeller After testing, I confirm this new version fulfills my expectations:

  • Lifecycle beans of the designated phase start concurrently
  • Context refresh() throws an exception if the timeout is exceeded
  • A custom LifecycleProcessor is no longer required 👍

@jhoeller
Copy link
Contributor

jhoeller commented Apr 9, 2025

Great to hear! Thanks for the pre-release testing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
in: core Issues in core modules (aop, beans, core, context, expression) type: enhancement A general enhancement
Projects
None yet
Development

No branches or pull requests

3 participants