Description
Bug report
Expected behavior and actual behavior
Expected: After a workflow completes successfully, the Nextflow JVM should exit cleanly, allowing the Azure Batch head task to terminate and release the node.
Actual: The JVM hangs indefinitely after workflow completion. The TowerClient.onFlowComplete() method blocks the main thread on an HTTP PUT call (sendHttpMessage(urlTraceComplete, ...)) to the Tower API that never receives a response. The JVM has been stuck for 35+ hours with 0% CPU. The workflow shows as COMPLETE on Seqera Platform but the Azure Batch task remains in active/running state forever.
Steps to reproduce the problem
Run any pipeline via Seqera Platform on Azure Batch. The issue is timing-dependent — it occurs when the Tower API connection becomes stale during the shutdown HTTP call. Reproduction requires a network condition where the TCP connection is established but the response is never delivered.
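The network condition described above (connection established, response never delivered) can be approximated locally. The following is a minimal sketch, not a reproduction of the Tower setup: it assumes a loopback listener that never calls accept(), relying on the kernel completing the TCP handshake from the listen backlog. The class name, the request path, and the timeout values are hypothetical.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: a local stand-in for an endpoint that accepts a TCP
// connection but never answers.
class BlackHoleEndpoint {

    /** Connects, sends a request, and reports what happens while waiting for a reply. */
    public static String probe(int readTimeoutMillis) throws IOException {
        // The listener never calls accept(); the kernel still completes the
        // handshake from its backlog, so the client side sees an established connection.
        try (ServerSocket server = new ServerSocket(0, 50, InetAddress.getLoopbackAddress());
             Socket client = new Socket()) {
            client.connect(new InetSocketAddress(server.getInetAddress(), server.getLocalPort()), 5_000);
            // Without a read timeout, the read() below would block indefinitely.
            client.setSoTimeout(readTimeoutMillis);

            OutputStream out = client.getOutputStream();
            // Hypothetical path, for illustration only.
            out.write("PUT /hypothetical/complete HTTP/1.1\r\nHost: localhost\r\nContent-Length: 0\r\n\r\n"
                    .getBytes(StandardCharsets.US_ASCII));
            out.flush();

            try {
                int firstByte = client.getInputStream().read();
                return firstByte < 0 ? "closed" : "data";
            } catch (SocketTimeoutException e) {
                return "connected, no response";
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(probe(1_000));
    }
}
```

The connect and the write both succeed; only the read stalls, which matches the ESTAB state with empty queues seen on the stuck node.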
Program output
Last lines of the Nextflow log. The log goes silent after TimelineObserver; the next observer (TowerClient) never produces output:
Mar-03 00:33:40.068 [main] DEBUG nextflow.trace.WorkflowStatsObserver - Workflow completed
Mar-03 00:33:40.068 [main] DEBUG nextflow.trace.TimelineObserver - Workflow completed -- rendering execution timeline
<EOF — no further output, no "Session destroyed", no System.exit()>
Thread state from the stuck node (via /proc, 35+ hours after completion):
TID=4474 name=java state=S (sleeping) ← main thread blocked
TID=4821 name=tower-logs-chec state=S (sleeping) ← next observer never called
TID=4645 name=HttpClient-1-Se state=S (sleeping) ← Java HttpClient threads alive
(Tower-thread ABSENT — sender exited cleanly, sender.join() is not the issue)
Network state from the stuck node:
ESTAB 0 0 10.0.0.4:59344 → 13.41.18.99:443 (pid=4474) ← Tower API
No TCP keepalive timers, no retransmission timers — stale connection sitting idle for 35+ hours. Tower API health check responds 200 in 0.25s — the issue is specific to this stale connection.
Environment
- Nextflow version: 25.10.4 (build 11173)
- Java version: Amazon Corretto 21.0.10+7-LTS
- Operating system: Linux 6.8.0-1044-azure (Azure Batch)
- nf-tower plugin: 1.17.5
- nf-azure plugin: 1.20.2
Additional context
Root cause analysis:
TowerClient.onFlowComplete() calls sendHttpMessage(urlTraceComplete, req, 'PUT') to report workflow completion to the Tower API. The underlying HxClient is built with connectTimeout(60s) but no read timeout — Java HttpClient's default is infinite. When the Tower API accepts the TCP connection but never sends a response (stale/dead connection), the HTTP read blocks the main thread forever, System.exit() is never reached, and the JVM hangs indefinitely.
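The difference between a connect timeout and a per-request timeout can be illustrated with plain java.net.http. This is a sketch under stated assumptions: how HxClient exposes these settings is not shown in this report, and the class name, path, and loopback listener are hypothetical. The listener never answers, so the connect succeeds and only the request-level timeout ends the call.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpTimeoutException;
import java.time.Duration;

// Hypothetical demo class, not part of Nextflow or nf-tower.
class RequestTimeoutDemo {

    /** Returns true if the PUT ended with HttpTimeoutException instead of blocking. */
    public static boolean timesOut(Duration requestTimeout) throws IOException, InterruptedException {
        // Listener that never calls accept(): connections are established by the
        // kernel backlog, but no response is ever written.
        try (ServerSocket server = new ServerSocket(0, 50, InetAddress.getLoopbackAddress())) {
            HttpClient client = HttpClient.newBuilder()
                    .version(HttpClient.Version.HTTP_1_1)
                    .connectTimeout(Duration.ofSeconds(60)) // bounds connection setup only
                    .build();

            HttpRequest request = HttpRequest
                    .newBuilder(URI.create("http://127.0.0.1:" + server.getLocalPort() + "/hypothetical/complete"))
                    .timeout(requestTimeout)                // bounds the wait for the response
                    .PUT(HttpRequest.BodyPublishers.ofString("{}"))
                    .build();

            try {
                client.send(request, HttpResponse.BodyHandlers.ofString());
                return false; // a response arrived (not expected with this listener)
            } catch (HttpTimeoutException e) {
                return true;  // the request timeout fired
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("timed out: " + timesOut(Duration.ofSeconds(2)));
    }
}
```

If the .timeout(...) line is removed, the same send() call has no deadline and would block for as long as the connection stays open.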
The shutdown observer chain is sequential with no timeout (notifyEvent() catches exceptions but cannot handle infinite blocking), so one stuck observer prevents all subsequent cleanup.
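One way to keep a single stuck observer from blocking the rest of the chain is to give each callback its own deadline. The sketch below illustrates the idea only; it is not how Nextflow's notifyEvent() is implemented, and BoundedNotifier, notifyInOrder, and the observer names are hypothetical.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, for illustration only.
class BoundedNotifier {

    /**
     * Runs each callback in iteration order, waiting at most perObserver for each.
     * Returns the names of observers that exceeded the deadline.
     */
    public static List<String> notifyInOrder(Map<String, Runnable> observers, Duration perObserver) {
        // Daemon threads: a callback that never returns cannot keep the JVM alive.
        ExecutorService pool = Executors.newCachedThreadPool(r -> {
            Thread t = new Thread(r, "observer-notify");
            t.setDaemon(true);
            return t;
        });
        List<String> timedOut = new ArrayList<>();
        try {
            for (Map.Entry<String, Runnable> entry : observers.entrySet()) {
                Future<?> pending = pool.submit(entry.getValue());
                try {
                    pending.get(perObserver.toMillis(), TimeUnit.MILLISECONDS);
                } catch (TimeoutException e) {
                    // Interrupt the callback; this may not unblock every kind of I/O,
                    // which is why the worker threads are daemons.
                    pending.cancel(true);
                    timedOut.add(entry.getKey());
                } catch (ExecutionException e) {
                    // The observer threw: record it and continue with the next one.
                    System.err.println(entry.getKey() + " failed: " + e.getCause());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        } finally {
            pool.shutdownNow();
        }
        return timedOut;
    }
}
```

Passing a LinkedHashMap preserves the sequential order of the chain. The trade-off is that a timed-out observer may still be running in the background while later observers execute, so callbacks must tolerate that overlap.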
Suggested fixes:
- Add a read/response timeout to HxClient (e.g. .timeout(Duration.ofSeconds(300)))
- Add a bounded timeout to the shutdown HTTP call in onFlowComplete() so a stale connection can't block JVM exit
- Consider enabling TCP keepalive to detect dead connections at the transport level
- Close HxClient on shutdown (currently never closed)
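As a rough illustration of the second suggestion, a caller-side deadline can wrap the final request regardless of how the client itself is configured. This is a sketch using plain java.net.http, not the nf-tower code; BoundedShutdownCall, reportCompletion, and its parameters are hypothetical names.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, for illustration only.
class BoundedShutdownCall {

    /**
     * Sends a final PUT and waits at most `deadline` for the whole exchange.
     * Returns true only if a 2xx response arrived in time.
     */
    public static boolean reportCompletion(HttpClient client, URI endpoint, String json, Duration deadline) {
        HttpRequest request = HttpRequest.newBuilder(endpoint)
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(json))
                .build();

        // sendAsync returns immediately; the calling thread only waits in get(...).
        CompletableFuture<HttpResponse<String>> pending =
                client.sendAsync(request, HttpResponse.BodyHandlers.ofString());
        try {
            HttpResponse<String> response = pending.get(deadline.toMillis(), TimeUnit.MILLISECONDS);
            return response.statusCode() / 100 == 2;
        } catch (TimeoutException e) {
            pending.cancel(true); // give up on the exchange so shutdown can proceed
            return false;
        } catch (ExecutionException e) {
            return false;         // connection or protocol failure
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A failed or timed-out report is returned as false rather than thrown, so the caller can log it and continue to exit. On Java 21, java.net.http.HttpClient also implements AutoCloseable and offers shutdownNow(), which may be relevant to the last suggestion if HxClient wraps such a client.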