fix(test): fix AWS_IO_DNS_INVALID_NAME in multiprocess tests #376
Draft: jet-tong wants to merge 2 commits into main from fix/test-distributed-training-AWS_IO_DNS_INVALID_NAME-error
Add multiprocessing_context=mp.get_context() to DataLoader calls so that spawn is used instead of fork on Darwin, preventing the S3 client's fork handlers from corrupting AWS CRT DNS resolver threads on macOS GitHub runners.
Use subprocess.run() instead of Popen() to mitigate race conditions in the S3 client's fork handlers. A stronger fix might be needed if race conditions persist, since run() still forks the parent process.
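A minimal sketch of the subprocess.run() pattern mentioned in the commit message above, assuming the tests launch a child Python process. The helper name run_child and its arguments are hypothetical and not taken from the PR's test code. The point it illustrates is only that run() blocks until the child has exited, whereas a bare Popen() returns immediately.

```python
import subprocess
import sys


def run_child(code: str, timeout: float = 60.0) -> subprocess.CompletedProcess:
    """Run a Python snippet in a child process and wait for it to finish.

    Hypothetical helper for illustration only.
    """
    return subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,  # collect stdout/stderr instead of inheriting them
        text=True,            # decode output as str
        timeout=timeout,      # fail instead of hanging forever
        check=True,           # raise CalledProcessError on a non-zero exit code
    )


if __name__ == "__main__":
    result = run_child("print('worker done')")
    print(result.stdout.strip())
```

Because run() only returns after the child exits, the parent does not continue working while the child is still starting up, which may narrow the window for the race described above but does not remove the fork itself.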
Description
Add multiprocessing_context=mp.get_context() to DataLoader calls to force DataLoader to use the same context as the main process (i.e. spawn instead of fork on Darwin) in the multiprocessing e2e tests.
Alternatively, we could change the behaviour of S3Client's fork handlers specifically on the Darwin platform.
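The change described above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's actual diff: it uses the standard-library multiprocessing module as mp (the PR's mp may be torch.multiprocessing), and the helper dataloader_kwargs is a hypothetical name. The DataLoader call itself is shown only in a comment so the sketch runs without PyTorch installed.

```python
import multiprocessing as mp  # assumption: the PR's `mp` may be torch.multiprocessing


def dataloader_kwargs(num_workers: int = 2) -> dict:
    """Build DataLoader keyword arguments that pin the worker start method.

    Hypothetical helper for illustration only.
    """
    return {
        "num_workers": num_workers,
        # get_context() with no argument returns the process-wide context,
        # so workers follow whatever set_start_method() selected.
        "multiprocessing_context": mp.get_context(),
    }


if __name__ == "__main__":
    # Select spawn globally, as the tests do via set_start_method('spawn').
    mp.set_start_method("spawn", force=True)
    kwargs = dataloader_kwargs()
    print(kwargs["multiprocessing_context"].get_start_method())
    # With PyTorch installed, the kwargs would be passed as:
    #   from torch.utils.data import DataLoader
    #   loader = DataLoader(dataset, batch_size=4, **kwargs)
```

Passing the context explicitly means the worker start method no longer depends on how DataLoader resolves its default.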
Additional context
Error Context
Tests were failing intermittently with AWS_IO_DNS_INVALID_NAME errors on macos-14 (ARM64) GitHub Actions runners during our Build Wheels workflow, particularly in test_distributed_training.py. Across 5 macos-14 runners, each running 40+ tests, there were typically 1-2 failures. We could work around these failures by retrying the Build Wheels workflow 1-2 times per run.
The error re-appeared in PR #374. The stack traces showed popen_fork.py being used, which was suspicious since we were explicitly setting torch.multiprocessing.set_start_method('spawn') and had disabled the fork / forkserver methods on Darwin.
Root Cause
The PyTorch DataLoader source code (torch/utils/data/dataloader.py) showed that when multiprocessing_context=None (the default), DataLoader ignores the global torch.multiprocessing.set_start_method() setting and uses its own default context. This means DataLoader was using fork internally, triggering the S3 client's fork handlers.
When DataLoader forks workers, the S3 client's fork handlers (registered via os.register_at_fork() as a workaround for CRT multithreading issues in PR #320) call join_all_managed_threads() to clean up AWS CRT background threads, but this corrupts active DNS resolver threads and causes AWS_IO_DNS_INVALID_NAME errors.
Related items
Testing
Build Wheels workflow: https://github.com/awslabs/s3-connector-for-pytorch/actions/runs/18225016406
By submitting this pull request, I confirm that my contribution is made under the terms of BSD 3-Clause License and I agree to the terms of the LICENSE.