Skip to content

Refine Spring Boot startup message when restoring a Spring Boot application with CRaC #37084

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
sdeleuze opened this issue Aug 23, 2023 · 3 comments
Assignees
Labels
theme: checkpoint-restore Issues related to coordinated restore at checkpoint type: enhancement A general enhancement
Milestone

Comments

@sdeleuze
Copy link
Contributor

sdeleuze commented Aug 23, 2023

When using JVM Checkpoint/Restore with Spring Boot, I can see log messages like:

2023-08-23T06:53:41.292Z  INFO 405 --- [           main] o.s.c.support.DefaultLifecycleProcessor  : Restarting Spring-managed lifecycle beans after JVM restore
2023-08-23T06:53:41.294Z  INFO 405 --- [           main] o.s.c.support.DefaultLifecycleProcessor  : Spring-managed lifecycle restart completed in 2 ms
2023-08-23T06:53:41.340Z  INFO 405 --- [           main] o.s.b.w.embedded.tomcat.TomcatWebServer  : Tomcat started on port 8080 (http) with context path ''
2023-08-23T06:53:41.353Z  INFO 405 --- [           main] o.s.s.petclinic.PetClinicApplication     : Started PetClinicApplication in 330.329 seconds (process running for 330.592)

2 observations:

  • The very long delay seems to be computed leveraging the application startup during the checkpoint as a basis, so the delay printed can be huge since the application can be restored much later
  • Spring Framework already prints a CRaC related metric which is the time taken by the application context to restore

I guess we see this behavior due to the fact that CRaC support in Spring Framework is stopping and restarting the application context lifecycle and the related listener is still active.

I am wondering if it could be possible to either:

  • Provide a log message with the delay to restore the application (we may not have a relevant information for the process, but maybe for the JVM part we have)
  • If we can't find a way to print a relevant metric, do not display this confusing log message by making sure it is only printed one time at startup.

The first proposal would be my preferred option as it would allow use to compare cold startup and restoration time.

@spring-projects-issues spring-projects-issues added the status: waiting-for-triage An issue we've not yet triaged label Aug 23, 2023
@scottfrederick scottfrederick added type: enhancement A general enhancement status: pending-design-work Needs design work before any code can be developed theme: checkpoint-restore Issues related to coordinated restore at checkpoint and removed status: waiting-for-triage An issue we've not yet triaged labels Aug 23, 2023
@scottfrederick scottfrederick added this to the 3.x milestone Aug 23, 2023
@wilkinsona
Copy link
Member

I guess we see this behavior due to the fact that CRaC support in Spring Framework is stopping and restarting the application context lifecycle and the related listener is still active.

I don't think that's the cause as there's no listener involved here. The code's all in SpringApplication.run:

long startTime = System.nanoTime();
DefaultBootstrapContext bootstrapContext = createBootstrapContext();
ConfigurableApplicationContext context = null;
configureHeadlessProperty();
SpringApplicationRunListeners listeners = getRunListeners(args);
listeners.starting(bootstrapContext, this.mainApplicationClass);
try {
ApplicationArguments applicationArguments = new DefaultApplicationArguments(args);
ConfigurableEnvironment environment = prepareEnvironment(listeners, bootstrapContext, applicationArguments);
Banner printedBanner = printBanner(environment);
context = createApplicationContext();
context.setApplicationStartup(this.applicationStartup);
prepareContext(bootstrapContext, context, environment, listeners, applicationArguments, printedBanner);
refreshContext(context);
afterRefresh(context, applicationArguments);
Duration timeTakenToStartup = Duration.ofNanos(System.nanoTime() - startTime);
if (this.logStartupInfo) {
new StartupInfoLogger(this.mainApplicationClass).logStarted(getApplicationLog(), timeTakenToStartup);
}

I believe what's happening is:

  1. startTime is set to System.nanoTime()
  2. The context is refreshed which causes the checkpoint to be taken
  3. The checkpoint is restored some time later and processing continues
  4. timeTakenToStartup is determined based on the original startTime
  5. The startup info is logged

To behave differently, I think we'd need SpringApplication to be become aware of CRaC and to somehow determine that this was a restore rather than a normal startup.

@sdeleuze
Copy link
Contributor Author

This is not yet exposed via the org.crac API, but shortly this CRaCMXBean API should be available and could allow:

  • To know if there as been a restoration by testing if CRaCMXBean.getRestoreTime() == -1 (protected by a classpath check)
  • Get an honest estimation of the time to restore the application, and display that value in Spring Boot instead of the startup time when the application is restored.

If that's considered as a good fit by the Boot team, I can comment here as soon as it is available via the CRaC dependency.

@sdeleuze
Copy link
Contributor Author

sdeleuze commented Sep 8, 2023

This API, documented here, is now available as part of org.crac:crac:1.4.0 which Spring Framework now leverages so Spring Boot should be able to leverage it.

We can potentially add related checks in https://github.com/spring-projects/spring-checkpoint-restore-smoke-tests once there is a branch of Boot that leverages it.

@wilkinsona wilkinsona self-assigned this Sep 15, 2023
@wilkinsona wilkinsona removed the status: pending-design-work Needs design work before any code can be developed label Oct 19, 2023
@wilkinsona wilkinsona modified the milestones: 3.x, 3.2.x, 3.2.0-RC1 Oct 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
theme: checkpoint-restore Issues related to coordinated restore at checkpoint type: enhancement A general enhancement
Projects
None yet
Development

No branches or pull requests

4 participants