Built with NixOS and Kubernetes.
When adding a new node, you need to create a bootable USB, which means building an ISO file. If you're building on an Apple Silicon Mac (or another non-x86_64 architecture), see this post. Otherwise, follow the steps below.
```bash
cd iso
nix build .#nixosConfigurations.exampleIso.config.system.build.isoImage
```
The resulting ISO will be in the `result` directory. Burn that ISO to a USB drive, then boot the new node from it.
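One way to write the image on Linux is with `dd`; the device path below is a placeholder, so double-check it (e.g. with `lsblk`) before running, since this overwrites the target drive:

```bash
# /dev/sdX is a placeholder for your USB device; the exact ISO filename under
# result/iso/ depends on the build, so the glob picks it up.
sudo dd if=result/iso/*.iso of=/dev/sdX bs=4M status=progress conv=fsync
```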
Node secrets are managed with sops-nix.
On first boot, each node generates a key located at /etc/ssh/ssh_host_ed25519_key.pub. After the install, the key is converted to age and printed in the terminal. Copy this, and add it to the .sops.yaml file.
To create your own key for local development, generate an ssh key, and convert it to age:
```bash
nix-shell -p ssh-to-age --run 'cat /YOUR/KEY/PATH.pub | ssh-to-age'
```
Then add the output to the .sops.yaml file.
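If you don't already have a key pair to convert, a minimal sketch of creating one first (the path and comment are just examples):

```bash
# Generate a dedicated ed25519 key pair for sops; adjust the path as needed.
ssh-keygen -t ed25519 -f ~/.ssh/sops-dev -C "sops dev key"
```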
To create a keys.txt for local secrets management, run the following command:
```bash
nix run nixpkgs#ssh-to-age -- -private-key -i ~/YOUR/KEY/PATH > keys.txt
```
This is needed if you're updating the secrets locally.
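With keys.txt in place, editing the encrypted file locally looks something like this (sops decrypts it into your $EDITOR and re-encrypts on save):

```bash
# Open secrets/secrets.yaml for editing using the local age key.
nix-shell -p sops --run "SOPS_AGE_KEY_FILE=./keys.txt sops secrets/secrets.yaml"
```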
To update the new keys across all nodes, run the following command:
```bash
nix-shell -p sops --run "SOPS_AGE_KEY_FILE=./keys.txt sops updatekeys secrets/secrets.yaml"
```
Then commit the changes to the .sops.yaml file and the nodes will be updated on their next rebuild.
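In practice that usually means committing both the updated .sops.yaml and the re-encrypted secrets file, for example:

```bash
git add .sops.yaml secrets/secrets.yaml
git commit -m "Add age key for new node"
```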
Make sure you have Nix installed locally. Then:
- Add the new node and its IP in `flake.nix`.
- Boot the node from the ISO created in Build NixOS ISO. Ensure that the node is reachable at 192.168.100.199. If you get permission errors, you may have to add your key to the ISO config file.
- Execute the following command on your local machine:

  ```bash
  SSH_PRIVATE_KEY="$(cat ./nixos_cluster)"$'\n' nix run github:nix-community/nixos-anywhere --extra-experimental-features "nix-command flakes" -- --flake '.#cluster-node-NUMBER' root@192.168.100.199
  ```

- Once the node boots, SSH into the node and run the following command:

  ```bash
  nix-shell -p ssh-to-age --run 'cat /etc/ssh/ssh_host_ed25519_key.pub | ssh-to-age'
  ```

  Copy the output age key to the .sops.yaml file and regenerate secrets (see Secrets Management), then update the node.
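If you'd rather not log in interactively, the same step can be run from your workstation over SSH; the `cluster` user and node hostname below follow the conventions used elsewhere in this README and may need adjusting:

```bash
# Print the new node's age public key without an interactive session.
ssh cluster@cluster-node-NUMBER \
  "nix-shell -p ssh-to-age --run 'cat /etc/ssh/ssh_host_ed25519_key.pub | ssh-to-age'"
```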
If you have the repository cloned on a node (e.g. you're working on changes without committing), run the following to update from the local source:
```bash
sudo nixos-rebuild switch --flake '.#cluster-node-NUMBER'
```
Then to update each node in the cluster:
```bash
sudo nixos-rebuild switch --flake '.#cluster-node-NUMBER' --use-remote-sudo --target-host cluster@cluster-node-NUMBER
```
This will also update secrets on each node.
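To roll that out to every node in one pass, a simple loop over the node names works; the node numbers below are placeholders for whatever is defined in flake.nix:

```bash
# Rebuild each node remotely; adjust the list to match the nodes in flake.nix.
for n in 1 2 3; do
  sudo nixos-rebuild switch --flake ".#cluster-node-$n" \
    --use-remote-sudo --target-host "cluster@cluster-node-$n"
done
```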
To pull new changes from the repository without cloning it onto the node, just run:
```bash
sudo nixos-rebuild switch --flake github:NelsonDane/clarkson-nixos-cluster#cluster-node-NUMBER
```
All nodes can SSH into each other using the included ssh_config. There is a key defined in .sops.yaml that is available at /run/secrets/cluster_talk.
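So hopping from one node to another looks roughly like this (the node name is a placeholder, and passing the key explicitly is only needed if the ssh_config doesn't already do it):

```bash
# SSH from one node to another using the shared cluster key.
ssh -i /run/secrets/cluster_talk cluster@cluster-node-2
```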
If you don't want to manually update each node, they pull and apply new changes from this repository every day at 3:30am.
A GitHub Action runs this automatically every day at 3am.
For convenience, the following aliases are available when ssh'd into a node:
- `c` -> `clear`
- `k` -> `kubectl`
- `h` -> `helm`
- `hf` -> `helmfile`

For distributed storage, we use Longhorn. To install Longhorn, run the following command:
```bash
cd helm
hf apply
```
To see the GUI, go to http://192.168.100.61 in your browser.
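Once applied, a quick way to confirm the install is to check that the Longhorn pods come up; longhorn-system is Longhorn's usual default namespace and is an assumption here:

```bash
k get pods -n longhorn-system
```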
To get MetalLB working (if it's not), run the following command:
```bash
cd helm/kustomize
kubectl apply -k .
```
To see IPs:
```bash
k get svc -A
```
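If a LoadBalancer service stays in Pending, it can help to confirm the MetalLB pods and address pool exist; the metallb-system namespace is MetalLB's usual default and is an assumption here:

```bash
k get pods -n metallb-system
k get ipaddresspools.metallb.io -A
```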
Slurm is configured using the Slurm Helm Chart. To pull the submodules for Slurm, run the following command:
```bash
git submodule update --init --recursive
```
Then to install Slurm, run:
```bash
cd helm/slurm-k8s-cluster
h install slurm slurm-cluster-chart
```
And then to apply changes after the initial install, run:
```bash
h upgrade slurm slurm-cluster-chart
```
The Slurm GUI is available at https://192.168.100.82.
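A quick sanity check after installing or upgrading is to make sure the Slurm pods are running; exact pod names depend on the chart:

```bash
k get pods -A | grep -i slurm
```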
To add a new app or service, find a helm chart and add it to helm/helmfile.yaml. Then run:
```bash
cd helm
hf apply
```
And it will be installed on the cluster.
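If you want to preview what a change will do before applying it, helmfile's diff subcommand is useful (it depends on the helm-diff plugin being installed):

```bash
cd helm
hf diff
```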