fix: ensure node state persisted before shutdown #743

claude · 2026-02-02T13:30:45Z

Issue: Setting state to Stopped when JNI operations may still be running

When a timeout-induced CancellationException occurs, this code forces the state to NodeLifecycleState.Stopped. However, the native node.syncWallets() and node.stop() JNI calls may still be executing on the ServiceQueue.LDK single-threaded dispatcher. JNI calls are not interruptible by coroutine cancellation - they will continue running until completion.
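The non-interruptibility described above can be reproduced in isolation. This is a minimal sketch, not the project's code: the single-thread executor stands in for ServiceQueue.LDK, the latch-blocked body stands in for a native call such as node.stop(), and all names are hypothetical. It assumes the queue's work is awaited through a cancellable await on a separate scope; how ServiceQueue.LDK.background actually waits is not shown in this PR.

```kotlin
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.asCoroutineDispatcher
import kotlinx.coroutines.async
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.withTimeoutOrNull
import java.util.concurrent.CountDownLatch
import java.util.concurrent.Executors
import java.util.concurrent.atomic.AtomicBoolean

// Returns (timedOut, stillRunningAfterTimeout, finishedAfterRelease).
fun timeoutLeavesBlockingCallRunning(): Triple<Boolean, Boolean, Boolean> {
    // Hypothetical stand-in for the single-threaded LDK queue.
    val executor = Executors.newSingleThreadExecutor()
    val queueScope = CoroutineScope(executor.asCoroutineDispatcher())
    val release = CountDownLatch(1)
    val finished = AtomicBoolean(false)
    try {
        return runBlocking {
            // Hypothetical stand-in for a blocking native call:
            // it blocks its thread and contains no suspension point.
            val nativeCall = queueScope.async {
                release.await()
                finished.set(true)
            }
            // The timeout cancels only the wait, not the blocked thread.
            val result = withTimeoutOrNull(50L) { nativeCall.await() }
            val timedOut = result == null
            val stillRunning = !finished.get()
            release.countDown() // let the "native" call finish
            nativeCall.join()
            Triple(timedOut, stillRunning, finished.get())
        }
    } finally {
        executor.shutdown()
    }
}
```

Under these assumptions the caller observes a timeout while the blocking body is still in flight, which is the window in which a forced Stopped state would be wrong.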

Consequences:

1. State machine corruption: The lifecycle state says Stopped, but the node is still actively performing I/O, persisting state, etc.

2. Mutex release allows concurrent operations: When return@withLock executes, the lifecycleMutex is released. If start() is called next (e.g., on app relaunch), it will:

   - Acquire the mutex
   - See state Stopped
   - Proceed with startup
   - Find lightningService.node is non-null (since line 259 in LightningService.kt hasn't executed yet)
   - Skip setup() and dispatch node.start() to the LDK queue

   The still-running stop() block then eventually completes and sets the service's node reference to null.

   Result: Repo thinks node is Running, but service's node reference is null.

3. Recovery mechanism doesn't help: The stuck-Stopping recovery at lines 282-286 is never triggered in the timeout scenario because state was already forced to Stopped.

Possible solutions:

1. Leave state as Stopping instead of forcing to Stopped - this allows the stuck-state recovery mechanism to handle it on next start()

2. Add a flag in LightningService to track whether native operations are truly complete, independent of coroutine cancellation

3. Rethink the timeout approach - consider whether the timeout should apply to the entire operation or just serve as a best-effort mechanism with acknowledgment that native cleanup continues in background
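A rough sketch of how the first two options could combine, assuming a small state holder that is separate from the coroutine machinery. StopTracker, NodeState, and every method name here are hypothetical and do not exist in the PR; the point is only that a timeout leaves the state at Stopping and that start is gated on the native call having actually returned.

```kotlin
import java.util.concurrent.atomic.AtomicBoolean

enum class NodeState { Stopped, Running, Stopping }

// Hypothetical holder; names are illustrative, not taken from the PR.
class StopTracker {
    @Volatile
    var state: NodeState = NodeState.Running
        private set

    // True while a native stop is still executing on the queue.
    private val nativeStopInFlight = AtomicBoolean(false)

    fun beginStop() {
        state = NodeState.Stopping
        nativeStopInFlight.set(true)
    }

    // Called when the coroutine-side wait times out.
    // Intentionally leaves the state at Stopping instead of forcing Stopped.
    fun onStopTimedOut() {
    }

    // Called from the queue once the native call has really returned.
    fun onNativeStopFinished() {
        nativeStopInFlight.set(false)
        state = NodeState.Stopped
    }

    // start() would consult this before reusing or recreating the node.
    fun canStart(): Boolean =
        state == NodeState.Stopped && !nativeStopInFlight.get()
}
```

With this shape, a timeout followed by an immediate start() is refused until onNativeStopFinished() has run, so the stale-node race described above cannot be entered through the Stopped state.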

claude · 2026-02-02T13:30:18Z

Issue: Catching CancellationException without rethrowing breaks coroutine cancellation contract

This code catches CancellationException via runCatching and wraps it in Result.failure() without rethrowing. This violates Kotlin's coroutine cancellation contract, which requires CancellationException to always be rethrown for proper structured concurrency.

The same file demonstrates the correct pattern in multiple places:

Line 223: the comment "// Cancellation is expected during pull-to-refresh, rethrow per Kotlin best practices", followed by if (it is CancellationException) throw it

Line 861: Same pattern

Impact: While this happens to work for the primary onDestroy call site (since withContext(bgDispatcher) may detect parent cancellation independently), stop() is also called from:

wipeStorage() (line 543)

restartWithElectrumServer() (line 562)

restartWithRgsServer() (line 588)

restartWithPreviousConfig() (line 609)

restartNode() (line 1076)

If any of these call sites' coroutines are cancelled while stop() is running, the CancellationException will be silently swallowed, and subsequent code will execute when it should not.
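The effect on a call site can be shown with a small standalone sketch. The two stop variants below are hypothetical stand-ins (a delay takes the place of the real shutdown work), and the helper reports whether code placed after stop() still ran once the calling coroutine was cancelled.

```kotlin
import kotlinx.coroutines.CancellationException
import kotlinx.coroutines.CoroutineStart
import kotlinx.coroutines.cancelAndJoin
import kotlinx.coroutines.delay
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking

// Hypothetical stop() that swallows cancellation inside runCatching.
suspend fun stopSwallowing(): Result<Unit> = runCatching { delay(1_000) }

// Hypothetical stop() that rethrows cancellation, as the other call sites in the file do.
suspend fun stopRethrowing(): Result<Unit> =
    runCatching { delay(1_000) }
        .onFailure { if (it is CancellationException) throw it }

// Returns true if the statement after stop() executed despite cancellation.
fun runsAfterCancelledStop(stop: suspend () -> Result<Unit>): Boolean = runBlocking {
    var ranAfterStop = false
    // UNDISPATCHED runs the body immediately up to its first suspension point,
    // so the job is guaranteed to be inside stop() when it is cancelled.
    val job = launch(start = CoroutineStart.UNDISPATCHED) {
        stop()
        ranAfterStop = true // stands in for "subsequent code" at a call site
    }
    job.cancelAndJoin()
    ranAfterStop
}
```

Passing ::stopSwallowing lets the line after stop() run in a cancelled coroutine, while ::stopRethrowing ends the coroutine at the cancellation point.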

Suggested fix: Rethrow the CancellationException after performing the state recovery:

Suggested change

-    // On cancellation (e.g., timeout), ensure state is recoverable
-    if (it is CancellationException) {
-        Logger.warn("Node stop cancelled, forcing Stopped state for recovery", context = TAG)
-        _lightningState.update { LightningState(nodeLifecycleState = NodeLifecycleState.Stopped) }
-        return@withLock Result.failure(it)
+    // On cancellation (e.g., timeout), ensure state is recoverable
+    if (it is CancellationException) {
+        Logger.warn("Node stop cancelled, forcing Stopped state for recovery", context = TAG)
+        _lightningState.update { LightningState(nodeLifecycleState = NodeLifecycleState.Stopped) }
+        throw it // Rethrow to properly propagate cancellation
+    }

@@ -15,6 +15,8 @@ import kotlinx.coroutines.CoroutineScope
     import kotlinx.coroutines.SupervisorJob
     import kotlinx.coroutines.cancel
     import kotlinx.coroutines.launch
+    import kotlinx.coroutines.runBlocking
+    import kotlinx.coroutines.withTimeoutOrNull
     import org.lightningdevkit.ldknode.Event
     import to.bitkit.App
     import to.bitkit.R
@@ -144,21 +146,34 @@ class LightningNodeService : Service() {
         }
         override fun onDestroy() {
-            Logger.debug("onDestroy", context = TAG)
-            serviceScope.launch {
-                lightningRepo.stop()
-                serviceScope.cancel()
+            Logger.debug("onDestroy started", context = TAG)
+            runBlocking {
+                withTimeoutOrNull(NODE_STOP_TIMEOUT_MS) {
+                    lightningRepo.stop()
+                }.let { result ->
+                    if (result == null || result.isFailure) {
+                        Logger.warn("Node stop timed out or failed during onDestroy", context = TAG)
+                    }
+                }
             }
+            serviceScope.cancel()
+            Logger.debug("onDestroy completed", context = TAG)
             super.onDestroy()
         }
         @RequiresApi(Build.VERSION_CODES.VANILLA_ICE_CREAM)
         override fun onTimeout(startId: Int, fgsType: Int) {
             Logger.warn("Foreground service timeout reached", context = TAG)
-            serviceScope.launch {
-                lightningRepo.stop()
-                stopSelf()
+            runBlocking {
+                withTimeoutOrNull(FORCE_STOP_TIMEOUT_MS) {
+                    lightningRepo.stop()
+                }.let { result ->
+                    if (result == null || result.isFailure) {
+                        Logger.warn("Node stop timed out or failed during onTimeout", context = TAG)
+                    }
+                }
             }
+            stopSelf()
             super.onTimeout(startId, fgsType)
         }
@@ -168,5 +183,7 @@ class LightningNodeService : Service() {
             const val CHANNEL_ID_NODE = "bitkit_notification_channel_node"
             const val TAG = "LightningNodeService"
             const val ACTION_STOP_SERVICE_AND_APP = "to.bitkit.androidServices.action.STOP_SERVICE_AND_APP"
+            private const val NODE_STOP_TIMEOUT_MS = 5_000L
+            private const val FORCE_STOP_TIMEOUT_MS = 2_000L
         }
     }

@@ -246,6 +246,12 @@ class LightningService @Inject constructor(
                 return
             }
+            runCatching {
+                Logger.debug("Performing final sync before shutdown…", context = TAG)
+                ServiceQueue.LDK.background { node.syncWallets() }
+                Logger.debug("Final sync completed", context = TAG)
+            }.onFailure { Logger.warn("Final sync failed, proceeding with shutdown", it, context = TAG) }
             Logger.debug("Stopping node…", context = TAG)
             ServiceQueue.LDK.background {
                 runCatching { node.stop() }