aravneel update lcs - openzfs mount and user add permissions #899
base: main
Conversation
```shell
retry_with_backoff $MAX_ATTEMPTS $INITIAL_BACKOFF "ansible localhost -b -m ansible.posix.mount -a \"path=$OPENZFS_MOUNT_POINT src=$FSX_OPENZFS_DNS_NAME:/fsx fstype=nfs opts=nfsvers=$NFS_VERSION,_netdev,nconnect=16,x-systemd.automount,x-systemd.requires=network-online.target dump=0 passno=0 state=mounted\""
echo "[STEP] Triggering automount..."
ls -la "$OPENZFS_MOUNT_POINT" >/dev/null 2>&1 || true
```
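`retry_with_backoff` is a helper defined elsewhere in the script and not shown in this hunk. For readers without the full file, a minimal sketch of what such a helper typically looks like (the name matches the call above, but the body and exact behavior here are assumed, not taken from the PR):

```shell
# Hypothetical sketch of a retry helper with exponential backoff.
# Usage: retry_with_backoff MAX_ATTEMPTS INITIAL_BACKOFF "command string"
retry_with_backoff() {
    local max_attempts=$1
    local backoff=$2
    local cmd=$3
    local attempt=1
    while true; do
        # Run the command; success ends the loop immediately.
        if eval "$cmd"; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "[ERROR] Command failed after $max_attempts attempts: $cmd" >&2
            return 1
        fi
        echo "[WARN] Attempt $attempt failed; retrying in ${backoff}s..." >&2
        sleep "$backoff"
        backoff=$((backoff * 2))   # double the delay each round
        attempt=$((attempt + 1))
    done
}

# Example usage with a command that succeeds immediately:
retry_with_backoff 3 1 "true" && echo "[INFO] Succeeded"
```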
Is this a potential silent error due to `(... || true)`? Do we need to retry? Can we at least log if there is an issue?
Example:

```shell
echo "[STEP] Triggering automount..."
if ! ls -la "$OPENZFS_MOUNT_POINT" >/dev/null 2>&1; then
    echo "[WARN] Could not list mount directory contents"
    # log the issue
fi
```
@PremiumSpider This does qualify as a silent error due to `(... || true)`, but we don't need to log anything here because this step only triggers the automount. The `ls` command accesses the path, which mounts the FS. We have explicit verification in the next step with touch and delete. I would leave it as is, in my opinion.
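For readers following along, the touch-and-delete verification referred to above could look roughly like the sketch below. This is not the actual script from the PR; `MOUNT_POINT` falls back to a temp directory here so the snippet runs standalone, whereas the real script targets `$OPENZFS_MOUNT_POINT`.

```shell
# Sketch of a touch-and-delete mount verification (assumed, not from the PR).
# A temp directory stands in for the NFS mount so this can run anywhere.
MOUNT_POINT="${OPENZFS_MOUNT_POINT:-$(mktemp -d)}"
TEST_FILE="$MOUNT_POINT/.mount_check_$$"

# If we can create and remove a file, the mount is writable and responsive.
if touch "$TEST_FILE" && rm -f "$TEST_FILE"; then
    echo "[INFO] Write/delete verification passed on $MOUNT_POINT"
else
    echo "[ERROR] Write/delete verification failed on $MOUNT_POINT" >&2
    exit 1
fi
```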
```shell
local delay=5
echo "[INFO] Ensuring $OPENZFS_MOUNT_POINT directory exists..."
ansible localhost -b -m ansible.builtin.file -a "path=$OPENZFS_MOUNT_POINT state=directory" || true
```
Suppose we have 2 nodes; if they are both creating a global static test_file, do we need to worry about race conditions?

T1: Instance A creates test_file
T2: Instance B creates test_file (same name)
T3: Instance A deletes test_file
T4: Instance B tries to delete test_file → FAILS (already gone)
T5: Instance B thinks the mount is broken and retries the entire process

Would something like the below help?

```shell
TEST_FILE="$OPENZFS_MOUNT_POINT/test_file_${HOSTNAME}_$$"
```
That is a good catch. I can implement this here. Thanks for catching this.
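A sketch of the per-node unique test file approach suggested above (hostname plus PID), again with a temp directory standing in for the real mount point so the snippet is self-contained:

```shell
# A per-node unique test file avoids the cross-node race on a shared mount:
# $HOSTNAME distinguishes nodes, $$ (the shell PID) distinguishes runs.
MOUNT_POINT="${OPENZFS_MOUNT_POINT:-$(mktemp -d)}"
TEST_FILE="$MOUNT_POINT/test_file_${HOSTNAME}_$$"

touch "$TEST_FILE"
if [ -f "$TEST_FILE" ]; then
    # Only this node ever created this exact name, so the delete cannot
    # race with another instance's cleanup.
    rm -f "$TEST_FILE"
    echo "[INFO] Unique test file created and deleted successfully"
else
    echo "[ERROR] Failed to create $TEST_FILE" >&2
    exit 1
fi
```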
@PremiumSpider This has been implemented now.
Tested this as well and looks good.
Thank you for testing on your end as well.
```diff
@@ -1,53 +1,113 @@
#!/bin/bash
```
Would it help if we started adding header comments such as:

```shell
# SSH Key Setup Script for SageMaker HyperPod
#
# Purpose:
#   - Configures SSH keys for passwordless communication between cluster nodes
#   - Supports multiple users via shared_users.txt CSV file (username,uid,home)
#   - Creates symlinks between FSx Lustre and OpenZFS storage for unified SSH access
#
# Storage Strategy:
#   - Primary SSH keys stored on FSx Lustre (/fsx/username/.ssh)
#   - OpenZFS home directories (/home/username/.ssh) symlinked to FSx Lustre
#   - Ensures SSH keys persist and are accessible from both storage systems
#
# Multi-User Support:
#   - Always configures default 'ubuntu' user first
#   - Processes additional users from ../shared_users.txt if present
#   - Validates user existence before SSH setup
#   - Generates RSA 4096-bit keypairs with proper permissions (600/644/700)
#
# Cluster Behavior:
#   - Safe to run on multiple nodes simultaneously (|| true for key generation)
#   - Reuses existing keys if present, generates new ones if missing
#   - Adds public keys to authorized_keys for passwordless SSH

set -exuo pipefail
```

etc.
Agreed, but I am going to keep this for a separate PR to improve code documentation.
```shell
echo -e "\n3. Change the Linux shell profile."
echo -e "   It should have '/bin/bash -c 'export HOME=/fsx/\$(whoami) && cd \${HOME} && exec /bin/bash' in its first and only line"
if [[ "$ENABLE_FSX_OPENZFS" == "true" ]]; then
```
Hi, I might be missing context, but is this change mainly for when users log into HOME on any of the nodes?
Correct. When the cluster is deployed only with FSx Lustre, there won't be a /home dir, and HOME_DIR is set to /fsx/USER. When FSx for OpenZFS is used as the /home dir, we have to call this out explicitly so users can set the correct preference in SSM to log in successfully as USER.
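For reference, the SSM preference being discussed is the Linux shell profile in the Session Manager preferences. Based on the script output quoted above, it would contain this single line (reproduced from the snippet, not independently verified against the PR):

```
/bin/bash -c 'export HOME=/fsx/$(whoami) && cd ${HOME} && exec /bin/bash'
```

This makes every Session Manager login land in the user's FSx-backed home directory with HOME pointing there.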
Common tests that can be added to the pipeline include:
These are the main tests that need to be performed for the above PR. I have tested these on my end. They can be added to your LCS test pipeline as well.
PremiumSpider left a comment
The HyperPod team is mainly concerned with changes under LifecycleScripts, but I've gone ahead and reviewed the other changes as well. I don't see any high-risk changes.
Issue: #896 #897
Description of changes:
This PR addresses the following issues:
The changes have been tested on a cluster with OpenZFS for /home and without OpenZFS (with only FSxL).
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.