nf-tower plugin blocks JVM exit after workflow completion — missing HTTP read timeout #6885

@adamrtalbot

Description

Bug report

Expected behavior and actual behavior

Expected: After a workflow completes successfully, the Nextflow JVM should exit cleanly, allowing the Azure Batch head task to terminate and release the node.

Actual: The JVM hangs indefinitely after workflow completion. The TowerClient.onFlowComplete() method blocks the main thread on an HTTP PUT call (sendHttpMessage(urlTraceComplete, ...)) to the Tower API that never receives a response. The JVM has been stuck for 35+ hours with 0% CPU. The workflow shows as COMPLETE on Seqera Platform but the Azure Batch task remains in active/running state forever.

Steps to reproduce the problem

Run any pipeline via Seqera Platform on Azure Batch. The issue is timing-dependent — it occurs when the Tower API connection becomes stale during the shutdown HTTP call. Reproduction requires a network condition where the TCP connection is established but the response is never delivered.

Program output

Last lines of Nextflow log — log goes silent after TimelineObserver, the next observer (TowerClient) never produces output:

Mar-03 00:33:40.068 [main] DEBUG nextflow.trace.WorkflowStatsObserver - Workflow completed
Mar-03 00:33:40.068 [main] DEBUG nextflow.trace.TimelineObserver - Workflow completed -- rendering execution timeline
<EOF — no further output, no "Session destroyed", no System.exit()>

Thread state from the stuck node (via /proc, 35+ hours after completion):

TID=4474   name=java               state=S (sleeping)    ← main thread blocked
TID=4821   name=tower-logs-chec    state=S (sleeping)    ← next observer never called
TID=4645   name=HttpClient-1-Se    state=S (sleeping)    ← Java HttpClient threads alive
(Tower-thread ABSENT — sender exited cleanly, sender.join() is not the issue)

Network state from the stuck node:

ESTAB 0 0  10.0.0.4:59344 → 13.41.18.99:443  (pid=4474)   ← Tower API

No TCP keepalive timers, no retransmission timers — stale connection sitting idle for 35+ hours. Tower API health check responds 200 in 0.25s — the issue is specific to this stale connection.

Environment

  • Nextflow version: 25.10.4 (build 11173)
  • Java version: Amazon Corretto 21.0.10+7-LTS
  • Operating system: Linux 6.8.0-1044-azure (Azure Batch)
  • nf-tower plugin: 1.17.5
  • nf-azure plugin: 1.20.2

Additional context

Root cause analysis:

TowerClient.onFlowComplete() calls sendHttpMessage(urlTraceComplete, req, 'PUT') to report workflow completion to the Tower API. The underlying HxClient is built with connectTimeout(60s) but no read timeout. With Java's HttpClient, a request that has no timeout set waits indefinitely for the response. When the Tower API accepts the TCP connection but never sends a response (stale/dead connection), the HTTP read blocks the main thread forever, System.exit() is never reached, and the JVM hangs.
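To illustrate the difference between a connect timeout and a response timeout, here is a minimal sketch using plain java.net.http rather than HxClient itself. The class name ReadTimeoutSketch, the helper putTimesOut, the local stalled server, and the /trace/complete path are all hypothetical and only there for the demonstration. The server accepts the TCP connection and then stays silent, which mimics the stale connection described above. Because the request carries its own timeout, the PUT fails with HttpTimeoutException instead of blocking forever.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpTimeoutException;
import java.time.Duration;

class ReadTimeoutSketch {

    /**
     * Sends a PUT to a local server that accepts the connection but never replies.
     * Returns true if the request was abandoned with HttpTimeoutException.
     */
    public static boolean putTimesOut(Duration requestTimeout) {
        try (ServerSocket stalled = new ServerSocket(0)) {
            // Accept one connection and hold it open without writing a response.
            Thread acceptor = new Thread(() -> {
                try (Socket s = stalled.accept()) {
                    Thread.sleep(30_000);
                } catch (Exception ignored) {
                    // nothing to do, this thread only simulates a silent peer
                }
            });
            acceptor.setDaemon(true);
            acceptor.start();

            // connectTimeout only bounds connection establishment.
            HttpClient client = HttpClient.newBuilder()
                    .connectTimeout(Duration.ofSeconds(60))
                    .build();

            URI uri = URI.create("http://127.0.0.1:" + stalled.getLocalPort() + "/trace/complete");
            HttpRequest request = HttpRequest.newBuilder(uri)
                    .timeout(requestTimeout) // bounds the wait for the response
                    .PUT(HttpRequest.BodyPublishers.ofString("{}"))
                    .build();
            try {
                client.send(request, HttpResponse.BodyHandlers.ofString());
                return false; // a response arrived, not expected in this setup
            } catch (HttpTimeoutException e) {
                return true;  // the request gave up instead of blocking forever
            }
        } catch (IOException | InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("timed out: " + putTimesOut(Duration.ofSeconds(2)));
    }
}
```

Removing the .timeout(...) line from the sketch reproduces the hang: the connection is established well within the 60 s connect timeout, and send() then waits for a response that never comes.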

The shutdown observer chain is sequential with no timeout (notifyEvent() catches exceptions but cannot handle infinite blocking), so one stuck observer prevents all subsequent cleanup.
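One possible way to keep a single stuck observer from blocking the rest of the chain is to run each completion hook on a worker thread and wait for it with a deadline. The sketch below is not Nextflow's notifyEvent(). The Observer interface, the notifyCompletion helper, and the observer names are hypothetical stand-ins chosen for illustration.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class BoundedShutdownSketch {

    /** Hypothetical stand-in for an observer's completion hook. */
    public interface Observer {
        void onFlowComplete() throws Exception;
    }

    /**
     * Calls each observer in insertion order, waiting at most `limit` for each one.
     * Returns the names of the observers that finished within the limit.
     */
    public static List<String> notifyCompletion(Map<String, Observer> observers, Duration limit) {
        // Daemon threads, so a hook that ignores interruption cannot keep the JVM alive.
        ExecutorService pool = Executors.newCachedThreadPool(r -> {
            Thread t = new Thread(r, "shutdown-observer");
            t.setDaemon(true);
            return t;
        });
        List<String> completed = new ArrayList<>();
        try {
            for (Map.Entry<String, Observer> entry : observers.entrySet()) {
                Observer observer = entry.getValue();
                Future<?> result = pool.submit(() -> {
                    observer.onFlowComplete();
                    return null;
                });
                try {
                    result.get(limit.toMillis(), TimeUnit.MILLISECONDS);
                    completed.add(entry.getKey());
                } catch (TimeoutException e) {
                    result.cancel(true); // interrupt the stuck hook and move on
                } catch (ExecutionException e) {
                    // the hook threw, continue with the next observer
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        } finally {
            pool.shutdownNow();
        }
        return completed;
    }

    public static void main(String[] args) {
        Map<String, Observer> chain = new LinkedHashMap<>();
        chain.put("timeline", () -> { });
        chain.put("tower", () -> Thread.sleep(60_000)); // simulates the stalled HTTP call
        chain.put("cleanup", () -> { });
        System.out.println(notifyCompletion(chain, Duration.ofMillis(500)));
    }
}
```

With the simulated stall, the main method prints [timeline, cleanup]: the stuck hook is abandoned after 500 ms and the remaining observer still runs. Cancelling only interrupts the worker thread, so a blocking call that ignores interruption keeps running in the background. That is why the sketch uses daemon threads, and why a per-request HTTP timeout is still worth having.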

Suggested fixes:

  1. Add a read/response timeout to requests sent through HxClient (e.g. .timeout(Duration.ofSeconds(300)), which in java.net.http is set per request on HttpRequest.Builder)
  2. Add a bounded timeout to the shutdown HTTP call in onFlowComplete() so a stale connection can't block JVM exit
  3. Consider enabling TCP keepalive to detect dead connections at the transport level
  4. Close HxClient on shutdown (currently never closed)
