ch4/ofi: tolerate transient CQ errors during MPI_Finalize #7763

Draft
Growl1234 wants to merge 2 commits into pmodels:main from Growl1234:main


@Growl1234
Contributor

Pull Request Description

When MPI_Finalize tears down OFI resources, fi_cq_read may return transient I/O errors (e.g. -FI_EIO) due to race conditions with endpoint/connection shutdown. Currently handle_cq_error treats ALL unexpected CQ errors as fatal, causing an abort even though all user communication has already completed successfully.

This patch:

  1. Adds an is_finalizing flag to the OFI global state.
  2. Downgrades fatal CQ errors to non-fatal when finalizing.
  3. Tolerates progress errors in the finalize drain/flush loops.

This should resolve #7760.
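
The three changes listed above can be pictured with a small stand-alone sketch. It is not the patch itself: `ofi_global_sketch`, `classify_cq_error`, `drain`, and `scripted_poll` are hypothetical names, plain `-EIO` stands in for `-FI_EIO`, and treating every CQ error seen during finalize as transient is an assumption made only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these names come from MPICH or libfabric. */
enum severity { SEV_NONE, SEV_RECOVERABLE, SEV_FATAL };

/* Change 1: a flag in the global state, set once finalize starts teardown. */
struct ofi_global_sketch {
    bool is_finalizing;
};

static struct ofi_global_sketch g = { false };

/* Change 2: classify a CQ return code. Negative values mimic the
 * -FI_Exxx convention. Assumption: while finalizing, every error is
 * treated as transient rather than fatal. */
static enum severity classify_cq_error(int rc)
{
    if (rc >= 0)
        return SEV_NONE;
    return g.is_finalizing ? SEV_RECOVERABLE : SEV_FATAL;
}

/* A scripted completion source so the sketch can run without a NIC:
 * it replays the given return codes, then reports 0 (queue drained). */
struct script { const int *codes; int len; int pos; };

static int scripted_poll(void *ctx)
{
    struct script *s = ctx;
    return s->pos < s->len ? s->codes[s->pos++] : 0;
}

/* Change 3: a bounded drain loop that keeps polling past recoverable
 * errors. Returns 0 once drained, the first fatal error code, or
 * -EAGAIN if the poll budget runs out. */
static int drain(int (*poll_fn)(void *), void *ctx, int max_polls)
{
    assert(poll_fn != NULL && max_polls > 0);
    for (int i = 0; i < max_polls; i++) {
        int rc = poll_fn(ctx);
        if (rc == 0)
            return 0;                   /* nothing left to drain */
        if (classify_cq_error(rc) == SEV_FATAL)
            return rc;                  /* normal operation: propagate */
        /* rc > 0 (progress made) or recoverable error: keep polling */
    }
    return -EAGAIN;
}
```

With `g.is_finalizing` set, `drain` keeps polling past a `-EIO`; with the flag clear, the same return code is handed back to the caller as a fatal error.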

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

MPIR_ERR_SETFATALANDJUMP2(mpi_errno, MPI_ERR_OTHER, "**ofid_poll",
                          "**ofid_poll %s %s",
                          MPIDI_OFI_DEFAULT_NIC_NAME, fi_strerror(errno));
}
Contributor

The abort behavior is controlled by the error handler at the top of the call stack, in this case MPI_Finalize. You can intercept the error return at any layer of the call stack before it reaches MPI_Finalize, as your subsequent changes do. There is no need for changes here.
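
One reading of this suggestion, as a stand-alone sketch: the low-level call keeps reporting every error unchanged, and only the finalize-layer caller decides to swallow it. All names here (`progress_once`, `regular_wait`, `finalize_flush`) are hypothetical, and an array of simulated return codes replaces the real completion queue.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical low-level progress call. It knows nothing about finalize
 * and always reports a failure to its caller unchanged. */
static int progress_once(int simulated_cq_rc)
{
    return simulated_cq_rc < 0 ? simulated_cq_rc : 0;
}

/* Hypothetical regular path: the first error propagates up the stack,
 * where the top-level error handler decides whether to abort. */
static int regular_wait(const int *rcs, int n)
{
    for (int i = 0; i < n; i++) {
        int rc = progress_once(rcs[i]);
        if (rc != 0)
            return rc;
    }
    return 0;
}

/* Hypothetical finalize-layer flush: it intercepts the error return
 * here, counts it, and continues, so nothing reaches the top-level
 * handler. */
static int finalize_flush(const int *rcs, int n, int *ignored)
{
    assert(ignored != NULL);
    *ignored = 0;
    for (int i = 0; i < n; i++) {
        if (progress_once(rcs[i]) != 0)
            (*ignored)++;   /* swallowed during teardown */
    }
    return 0;
}
```

In this arrangement the error-setting code at the bottom of the stack stays as it is; the tolerance lives entirely in the caller that knows finalize is in progress.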

Growl1234 marked this pull request as draft on April 10, 2026, 05:17

Development

Successfully merging this pull request may close these issues.

Failure in finalize

2 participants