
Conversation

@aravneelaws
Contributor

Issues: #896, #897

Description of changes:
This PR addresses the following issues:

  1. OpenZFS mounting issue on compute nodes
  2. .ssh permissions issue
  3. Adding users during cluster creation with the correct permissions
  4. easy-ssh script updated with a -u flag for logging in as a specific (non-ubuntu) user (usage sketch below)
  5. automate-cluster-creation script print statement updated to call out the SSM preference when an OpenZFS /home directory exists
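
For illustration, a hedged usage sketch of the new -u flag; the positional cluster-name argument shown here is an assumption, only the -u flag itself comes from this PR.

# Hypothetical usage of the updated easy-ssh script; the cluster-name
# argument is assumed, only the -u flag is named in this PR.

# Connect as the default ubuntu user (previous behavior)
./easy-ssh.sh my-hyperpod-cluster

# Connect as a specific non-ubuntu user created during cluster creation
./easy-ssh.sh -u alice my-hyperpod-cluster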

The changes have been tested on a cluster with OpenZFS for /home and on a cluster without OpenZFS (FSx for Lustre only).

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@aravneelaws changed the title from "Aneelaka update lcs" to "aravneel update lcs - openzfs mount and user add permissions" on Nov 13, 2025

retry_with_backoff $MAX_ATTEMPTS $INITIAL_BACKOFF "ansible localhost -b -m ansible.posix.mount -a \"path=$OPENZFS_MOUNT_POINT src=$FSX_OPENZFS_DNS_NAME:/fsx fstype=nfs opts=nfsvers=$NFS_VERSION,_netdev,nconnect=16,x-systemd.automount,x-systemd.requires=network-online.target dump=0 passno=0 state=mounted\""
echo "[STEP] Triggering automount..."
ls -la "$OPENZFS_MOUNT_POINT" >/dev/null 2>&1 || true


Is this a potential silent error due to the (... || true)?
Do we need to retry? Can we at least log if there is an issue?

Example

echo "[STEP] Triggering automount..."
if ! ls -la "$OPENZFS_MOUNT_POINT" >/dev/null 2>&1; then
    echo "[WARN] Could not list mount directory contents"
    # log the issue
fi

Contributor Author


@PremiumSpider This does qualify as a silent error due to (... || true), but we don't need to log anything here because this is just triggering the automount. The ls command is accessing the path, which mounts the FS. We have explicit verification in the next step with touch and delete. I would leave it as it is, in my opinion.
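
For context, a minimal sketch of the two steps being described; variable and file names here are assumptions, not the exact script contents.

# The ls only pokes the x-systemd.automount unit; the touch/delete that
# follows is the explicit verification.
ls -la "$OPENZFS_MOUNT_POINT" >/dev/null 2>&1 || true   # trigger automount
touch "$OPENZFS_MOUNT_POINT/.mount_check" \
    && rm -f "$OPENZFS_MOUNT_POINT/.mount_check" \
    && echo "[OK] $OPENZFS_MOUNT_POINT is mounted and writable"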

local delay=5

echo "[INFO] Ensuring $OPENZFS_MOUNT_POINT directory exists..."
ansible localhost -b -m ansible.builtin.file -a "path=$OPENZFS_MOUNT_POINT state=directory" || true


Suppose we have 2 nodes; if they are both creating a global static test_file, do we need to worry about race conditions?

T1: Instance A creates test_file
T2: Instance B creates test_file (same name)
T3: Instance A deletes test_file  
T4: Instance B tries to delete test_file → FAILS (already gone)
T5: Instance B thinks mount is broken, retries entire process

Would something like the following help?

TEST_FILE="$OPENZFS_MOUNT_POINT/test_file_${HOSTNAME}_$$"
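
For example, a sketch of how the uniquely named test file could slot into the verification/retry step; the retry structure and variable names are assumptions.

TEST_FILE="$OPENZFS_MOUNT_POINT/test_file_${HOSTNAME}_$$"   # unique per node and process
for attempt in $(seq 1 "$MAX_ATTEMPTS"); do
    if touch "$TEST_FILE" && rm -f "$TEST_FILE"; then
        echo "[OK] OpenZFS mount verified (attempt $attempt)"
        break
    fi
    echo "[WARN] Mount verification failed (attempt $attempt), retrying in ${delay}s..."
    sleep "$delay"
done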

Contributor Author


That is a good catch. I can implement this here. Thanks for catching this.

Contributor Author


@PremiumSpider This has been implemented now.

Contributor Author


Tested this as well and looks good.


Thank you for testing on your end as well.

@@ -1,53 +1,113 @@
#!/bin/bash


Would it help if we started adding header comments, such as:

# SSH Key Setup Script for SageMaker HyperPod
# 
# Purpose: 
#   - Configures SSH keys for passwordless communication between cluster nodes
#   - Supports multiple users via shared_users.txt CSV file (username,uid,home)
#   - Creates symlinks between FSx Lustre and OpenZFS storage for unified SSH access
#
# Storage Strategy:
#   - Primary SSH keys stored on FSx Lustre (/fsx/username/.ssh)
#   - OpenZFS home directories (/home/username/.ssh) symlinked to FSx Lustre
#   - Ensures SSH keys persist and are accessible from both storage systems
#
# Multi-User Support:
#   - Always configures default 'ubuntu' user first
#   - Processes additional users from ../shared_users.txt if present
#   - Validates user existence before SSH setup
#   - Generates RSA 4096-bit keypairs with proper permissions (600/644/700)
#
# Cluster Behavior:
#   - Safe to run on multiple nodes simultaneously (|| true for key generation)
#   - Reuses existing keys if present, generates new ones if missing
#   - Adds public keys to authorized_keys for passwordless SSH
#

set -exuo pipefail

etc

Contributor Author


Agreed, but I am going to keep this for a separate PR to improve code documentation.


echo -e "\n3. Change the Linux shell profile."
echo -e " It should have '/bin/bash -c 'export HOME=/fsx/\$(whoami) && cd \${HOME} && exec /bin/bash' in its first and only line"
if [[ "$ENABLE_FSX_OPENZFS" == "true" ]]; then

@PremiumSpider (Nov 14, 2025)


Hi, I might be missing context, but is this change mainly for when users log into HOME on any of the nodes?

Contributor Author


Correct. When the cluster is deployed with only FSx for Lustre, there won't be a /home dir, and HOME_DIR is set to /fsx/USER. When FSx for OpenZFS is used as the /home dir, we have to call this out explicitly so users can set the correct preference in SSM to successfully log in as a given USER.
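
For illustration, a sketch of the kind of conditional guidance described here; the /home path in the OpenZFS branch is an assumption rather than the script's exact output.

if [[ "$ENABLE_FSX_OPENZFS" == "true" ]]; then
    # OpenZFS serves /home, so the SSM shell profile points HOME there (assumed path)
    echo "Set the SSM shell profile to: /bin/bash -c 'export HOME=/home/\$(whoami) && cd \${HOME} && exec /bin/bash'"
else
    # FSx Lustre only: HOME lives under /fsx/USER
    echo "Set the SSM shell profile to: /bin/bash -c 'export HOME=/fsx/\$(whoami) && cd \${HOME} && exec /bin/bash'"
fi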

@aravneelaws
Contributor Author

Common tests that can be added to the pipeline include:

  1. Checking that filesystem mounting was successful - /fsx/USER and /home/USER (for OpenZFS) - on all nodes (login/controller/compute).
  2. Checking that the .ssh directory exists with the correct user permissions.
  3. Checking that the key pair and authorized_keys exist within the .ssh directory.
  4. Checking that all of this functionality works for all USERs (in addition to the default ubuntu user).

These are the main tests that need to be performed for the above PR. I have tested these on my end; they can be added to your LCS test pipeline as well. A minimal sketch of such checks is shown below.
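
The sketch assumes the shared_users.txt CSV format (username,uid,home) and the /fsx-based home layout discussed above; the actual pipeline wiring is omitted.

# Run on each node (login/controller/compute); paths and the user list
# source are assumptions based on this PR's description.
for user in ubuntu $(cut -d, -f1 ../shared_users.txt 2>/dev/null); do
    # 1. Filesystem mounts
    [[ -d "/fsx/$user" ]] || echo "[FAIL] /fsx/$user missing"
    # 2. .ssh directory ownership and permissions (700)
    [[ "$(stat -c '%U %a' "/fsx/$user/.ssh" 2>/dev/null)" == "$user 700" ]] \
        || echo "[FAIL] /fsx/$user/.ssh ownership/permissions incorrect"
    # 3. Key pair and authorized_keys present
    for f in id_rsa id_rsa.pub authorized_keys; do
        [[ -f "/fsx/$user/.ssh/$f" ]] || echo "[FAIL] /fsx/$user/.ssh/$f missing"
    done
done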


@PremiumSpider left a comment


The HyperPod team is mainly concerned with changes under LifecycleScripts, but I've gone ahead and reviewed the other changes as well. I don't see any high-risk changes.
