jet-tong
Contributor

@jet-tong jet-tong commented Oct 3, 2025

Description

  • Add multiprocessing_context=mp.get_context() to the DataLoader calls in the multiprocessing e2e tests, forcing DataLoader to use the same context as the main process (i.e. spawn instead of fork on Darwin).
  • This prevents S3 client fork handlers from corrupting AWS CRT DNS resolver threads on macOS GitHub runners.
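
A minimal sketch of the pattern in the bullets above. The helper name `dataloader_kwargs` is hypothetical and not taken from this PR; only `num_workers` and `multiprocessing_context` are real DataLoader parameters. It assumes the standard-library `multiprocessing` module as `mp` (the tests may import `torch.multiprocessing` instead) so that it runs without PyTorch installed.

```python
import multiprocessing as mp


def dataloader_kwargs(num_workers: int) -> dict:
    """Hypothetical helper: build keyword arguments for torch.utils.data.DataLoader."""
    kwargs = {"num_workers": num_workers}
    if num_workers > 0:
        # Pass the main process's context explicitly so workers start the same way.
        # DataLoader only accepts multiprocessing_context when num_workers > 0.
        kwargs["multiprocessing_context"] = mp.get_context()
    return kwargs


if __name__ == "__main__":
    kwargs = dataloader_kwargs(num_workers=2)
    # Shows which start method the workers would inherit on this platform.
    print(kwargs["multiprocessing_context"].get_start_method())
    # Intended use (requires PyTorch): DataLoader(dataset, **kwargs)
```

Passing the context object rather than the string "spawn" keeps the tests tied to whatever start method the main process has configured.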

Alternatively, we could change the behaviour of S3Client's fork handlers specifically on the Darwin platform.

Additional context

Error Context

Tests were failing with AWS_IO_DNS_INVALID_NAME errors on macos-14 (ARM64) GitHub Actions runners during our Build Wheels workflow, particularly in test_distributed_training.py. The errors were intermittent: across 5 macos-14 runners, each running 40+ tests, typically 1-2 failed. We could get past these failures by retrying the Build Wheels workflow 1-2 times per run.

The error re-appeared in PR #374. The stack traces showed popen_fork.py being used, which was suspicious because we explicitly set torch.multiprocessing.set_start_method('spawn') and had disabled the fork / forkserver methods on Darwin.

Root Cause

PyTorch DataLoader source code (torch/utils/data/dataloader.py) showed that when multiprocessing_context=None (the default), DataLoader ignores the global torch.multiprocessing.set_start_method() setting and uses its own default context. This means DataLoader was using fork internally, triggering the S3 client's fork handlers.

When DataLoader forks workers, the S3 client's fork handlers (registered via os.register_at_fork() as a workaround for CRT multithreading issues in PR #320) call join_all_managed_threads() to clean up AWS CRT background threads. This cleanup corrupts active DNS resolver threads and causes the AWS_IO_DNS_INVALID_NAME errors.
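
To illustrate the mechanism only, here is a sketch of how handlers registered with os.register_at_fork() run around an os.fork() call. The handler names and the `calls` list are hypothetical stand-ins, not the S3 client's actual handlers, and the sketch assumes a POSIX platform where os.fork() is available.

```python
import os

calls = []  # records which handlers ran, in order


def before_fork():
    # Stand-in for cleanup such as joining background threads before the fork.
    calls.append("before")


def after_in_parent():
    # Stand-in for restoring state in the parent once the fork has completed.
    calls.append("after_in_parent")


def fork_once() -> int:
    """Fork a child that exits immediately; return the child's wait status."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit at once, skipping normal interpreter cleanup
    _, status = os.waitpid(pid, 0)
    return status


if hasattr(os, "register_at_fork"):
    # Handlers fire on every later os.fork() in this process, including forks
    # made by libraries on the caller's behalf.
    os.register_at_fork(before=before_fork, after_in_parent=after_in_parent)
```

Calling fork_once() appends "before" and then "after_in_parent" to `calls` in the parent, which shows that any fork in the process triggers the registered cleanup, whether or not the caller intended it.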

  • I have updated the CHANGELOG or README if appropriate

Related items

Testing

Build Wheels workflow: https://github.com/awslabs/s3-connector-for-pytorch/actions/runs/18225016406


By submitting this pull request, I confirm that my contribution is made under the terms of BSD 3-Clause License and I agree to the terms of the LICENSE.

@jet-tong jet-tong temporarily deployed to integration-tests October 3, 2025 14:24 — with GitHub Actions Inactive
@jet-tong
Contributor Author

jet-tong commented Oct 3, 2025

The subprocess.Popen call in test_logging.py could also cause the error.

We could also consider changing the behaviour of S3Client's fork handlers specifically on the Darwin platform.

@jet-tong jet-tong force-pushed the fix/test-distributed-training-AWS_IO_DNS_INVALID_NAME-error branch from 0ff3ee2 to c20d719 Compare October 6, 2025 11:04
@jet-tong jet-tong temporarily deployed to integration-tests October 6, 2025 11:04 — with GitHub Actions Inactive
@jet-tong jet-tong force-pushed the fix/test-distributed-training-AWS_IO_DNS_INVALID_NAME-error branch from c20d719 to be567ed Compare October 7, 2025 11:34
@jet-tong jet-tong force-pushed the fix/test-distributed-training-AWS_IO_DNS_INVALID_NAME-error branch from be567ed to bba17a2 Compare October 7, 2025 13:20
@jet-tong jet-tong temporarily deployed to integration-tests October 7, 2025 13:20 — with GitHub Actions Inactive
@jet-tong jet-tong force-pushed the fix/test-distributed-training-AWS_IO_DNS_INVALID_NAME-error branch from bba17a2 to 833d973 Compare October 9, 2025 10:44
@jet-tong jet-tong temporarily deployed to integration-tests October 9, 2025 10:44 — with GitHub Actions Inactive
@jet-tong jet-tong temporarily deployed to integration-tests October 15, 2025 14:25 — with GitHub Actions Inactive
Add multiprocessing_context=mp.get_context() to DataLoader calls to ensure
the spawn method is used instead of fork on Darwin, preventing S3 client fork
handlers from corrupting AWS CRT DNS resolver threads on macOS GitHub runners.
Use subprocess.run() instead of Popen() to mitigate race conditions in S3 client fork handlers.

A stronger fix might be needed if this still causes race conditions, as run() still starts the child process with fork.
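
For reference, a generic sketch of the subprocess.run() pattern. The actual call in test_logging.py is not shown in this conversation, so the helper name `run_child` and its arguments are assumptions for illustration only.

```python
import subprocess
import sys


def run_child(code: str) -> str:
    """Run a short Python snippet in a child process and return its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
        timeout=60,  # avoid hanging the test run if the child stalls
    )
    return result.stdout


if __name__ == "__main__":
    print(run_child("print('ok')"))
```

Unlike a bare Popen(), run() waits for the child to exit before returning, so the parent does not continue working while the child is still starting up.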
@jet-tong jet-tong force-pushed the fix/test-distributed-training-AWS_IO_DNS_INVALID_NAME-error branch from 833d973 to 9447e8a Compare October 15, 2025 14:28
@jet-tong jet-tong temporarily deployed to integration-tests October 15, 2025 14:28 — with GitHub Actions Inactive