aws-samples
diff --git a/README.md b/README.md
Lines changed: 45 additions & 69 deletions
diff --git a/docs/Master.png b/docs/HeadNode.png
rename from docs/Master.png
rename to docs/HeadNode.png
diff --git a/docs/Login1.png b/docs/Login1.png
342 KB
diff --git a/docs/Login2.png b/docs/Login2.png
2.58 MB
diff --git a/parallelcluster-setup/install-monitoring.sh b/parallelcluster-setup/install-monitoring.sh
Lines changed: 1 addition & 1 deletion
diff --git a/post-install.sh b/post-install.sh
Lines changed: 7 additions & 5 deletions
@@ -3,9 +3,9 @@
 This is a sample solution based on Grafana for monitoring various components of an HPC cluster built with AWS ParallelCluster.
 There are 7 dashboards that can be used as they are or customized as you need.
 * [ParallelCluster Summary](https://github.com/aws-samples/aws-parallelcluster-monitoring/blob/main/grafana/dashboards/ParallelCluster.json) - this is the main dashboard that shows general monitoring info and metrics for the whole cluster. It includes Slurm metrics and Storage performance metrics.
-* [Master Node Details](https://github.com/aws-samples/aws-parallelcluster-monitoring/blob/main/grafana/dashboards/master-node-details.json) - this dashboard shows detailed metric for the Master node, including CPU, Memory, Network and Storage usage.
+* [HeadNode Details](https://github.com/aws-samples/aws-parallelcluster-monitoring/blob/main/grafana/dashboards/master-node-details.json) - this dashboard shows detailed metrics for the HeadNode, including CPU, Memory, Network and Storage usage.
 * [Compute Node List](https://github.com/aws-samples/aws-parallelcluster-monitoring/blob/main/grafana/dashboards/compute-node-list.json) - this dashboard shows the list of available compute nodes. Each entry is a link to a more detailed page.
-* [Compute Node Details](https://github.com/aws-samples/aws-parallelcluster-monitoring/blob/main/grafana/dashboards/compute-node-details.json) - similarly to the master node details this dashboard show the same metric for the compute nodes.
+* [Compute Node Details](https://github.com/aws-samples/aws-parallelcluster-monitoring/blob/main/grafana/dashboards/compute-node-details.json) - similar to the HeadNode details, this dashboard shows the same metrics for the compute nodes.
 * [GPU Nodes Details](https://github.com/aws-samples/aws-parallelcluster-monitoring/blob/main/grafana/dashboards/gpu.json) - This dashboard shows GPU-related metrics collected using the nvidia-dcgm container.
 * [Cluster Logs](https://github.com/aws-samples/aws-parallelcluster-monitoring/blob/main/grafana/dashboards/logs.json) - This dashboard shows all the logs of your HPC Cluster. The logs are pushed by AWS ParallelCluster to Amazon CloudWatch Logs and then displayed here.
 * [Cluster Costs](https://github.com/aws-samples/aws-parallelcluster-monitoring/blob/main/grafana/dashboards/costs.json) (beta / in development) - This dashboard shows the costs associated with the AWS services used by your cluster. It includes: [EC2](https://aws.amazon.com/ec2/pricing/), [EBS](https://aws.amazon.com/ebs/pricing/), [FSx](https://aws.amazon.com/fsx/lustre/pricing/), [S3](https://aws.amazon.com/s3/pricing/), [EFS](https://aws.amazon.com/efs/pricing/).
@@ -15,38 +15,26 @@ Create a cluster using [AWS ParallelCluster](https://www.hpcworkshops.com/03-hpc
 
 ### PC 3.X
 
-Update your cluster's config by adding the following snippet in the `HeadNode` section:
+Update your cluster's config by adding the following snippet in the `HeadNode` and `Scheduling` sections:
 
 ```yaml
 CustomActions:
   OnNodeConfigured:
     Script: https://raw.githubusercontent.com/aws-samples/aws-parallelcluster-monitoring/main/post-install.sh
     Args:
-      - https://github.com/aws-samples/aws-parallelcluster-monitoring/tarball/main
-      - aws-parallelcluster-monitoring
-      - install-monitoring.sh
+      - v0.9
 Iam:
   AdditionalIamPolicies:
     - Policy: arn:aws:iam::aws:policy/CloudWatchFullAccess
     - Policy: arn:aws:iam::aws:policy/AWSPriceListServiceFullAccess
     - Policy: arn:aws:iam::aws:policy/AmazonSSMFullAccess
     - Policy: arn:aws:iam::aws:policy/AWSCloudFormationReadOnlyAccess
 Tags:
-  - Key: Grafana
-    Value: true
+  - Key: 'Grafana'
+    Value: 'true'
 ```
 
-### PC 2.X
-
-```ini
-[cluster yourcluster]
-...
-post_install = https://raw.githubusercontent.com/aws-samples/aws-parallelcluster-monitoring/main/post-install.sh
-post_install_args = https://github.com/aws-samples/aws-parallelcluster-monitoring/tarball/main,aws-parallelcluster-monitoring,install-monitoring.sh
-additional_iam_policies = arn:aws:iam::aws:policy/CloudWatchFullAccess,arn:aws:iam::aws:policy/AWSPriceListServiceFullAccess,arn:aws:iam::aws:policy/AmazonSSMFullAccess,arn:aws:iam::aws:policy/AWSCloudFormationReadOnlyAccess
-tags = {"Grafana" : "true"}
-...
-```
+See the complete example config: [pcluster.yaml](parallelcluster-setup/pcluster.yaml).
 
 ## AWS ParallelCluster
 **AWS ParallelCluster** is an AWS supported Open Source cluster management tool that makes it easy for you to deploy and
@@ -72,76 +60,64 @@ Note: *while almost all components are under the Apache2 license, only **[Promet
 
 ## Example Dashboards
 
+#### Cluster Overview
+
 ![ParallelCluster](docs/ParallelCluster.png?raw=true "AWS ParallelCluster")
 
-![Master](docs/Master.png?raw=true "Master Node")
+#### HeadNode Dashboard
+
+![Head Node](docs/HeadNode.png?raw=true "Head Node")
+
+#### ComputeNodes Dashboard
 
 ![Compute Node List](docs/List.png?raw=true "Compute Node List")
 
+#### Logs
+
 ![Logs](docs/Logs.png?raw=true "AWS ParallelCluster Logs")
 
+#### Cluster Cost
+
 ![Costs](docs/Costs.png?raw=true "Best - AWS ParallelCluster Costs")
 
 
-## How to install it
+## Quickstart
 
-You can simply use the post-install script that you can find [here](https://github.com/aws-samples/aws-parallelcluster-monitoring/blob/main/post-install.sh) as it is, or customize it as you need. For instance, you might want to change your [Grafana password](https://github.com/aws-samples/aws-parallelcluster-monitoring/blob/main/docker-compose/docker-compose.master.yml#L43) to something more secure and meaningful for you, or you might want to customize some dashboards by adding additional components to monitor.
+1. Create a Security Group that allows you to access the `HeadNode` on ports 80 and 443. The following example opens the security group to `0.0.0.0/0`; we strongly advise restricting it to your own network range. More information on how to create security groups can be found [here](https://docs.aws.amazon.com/cli/latest/userguide/cli-services-ec2-sg.html#creating-a-security-group).
 
 ```bash
-#Load AWS Parallelcluster environment variables
-. /etc/parallelcluster/cfnconfig
-
-#get GitHub repo to clone and the installation script
-monitoring_url=$(echo ${cfn_postinstall_args}| cut -d ',' -f 1 )
-monitoring_dir_name=$(echo ${cfn_postinstall_args}| cut -d ',' -f 2 )
-monitoring_tarball="${monitoring_dir_name}.tar.gz"
-setup_command=$(echo ${cfn_postinstall_args}| cut -d ',' -f 3 )
-monitoring_home="/home/${cfn_cluster_user}/${monitoring_dir_name}"
-
-case ${cfn_node_type} in
-    MasterServer)
-        wget ${monitoring_url} -O ${monitoring_tarball}
-        mkdir -p ${monitoring_home}
-        tar xvf ${monitoring_tarball} -C ${monitoring_home} --strip-components 1
-    ;;
-    ComputeFleet)
-    
-    ;;
-esac
-
-#Execute the monitoring installation script
-bash -x "${monitoring_home}/parallelcluster-setup/${setup_command}" >/tmp/monitoring-setup.log 2>&1
-exit $?
-``` 
-The proposed post-install script will take care of installing and configuring everything for you through the [install-monitoring.sh](https://github.com/aws-samples/aws-parallelcluster-monitoring/blob/main/parallelcluster-setup/install-monitoring.sh) script. Though, few additional parameters are needed in the AWS ParallelCluster config file: the post_install_args, additional IAM policies, security group, and a tag. You can find an AWS ParallelCluster template [here](https://github.com/aws-samples/aws-parallelcluster-monitoring/blob/main/parallelcluster-setup/pcluster-template.config). Please note that, at the moment, the installation script has only been tested using [Amazon Linux 2](https://aws.amazon.com/amazon-linux-2/).
-
-```ini
-base_os = alinux2
-
-post_install = s3://<my-bucket-name>/post-install.sh
-
-post_install_args = https://github.com/aws-samples/aws-parallelcluster-monitoring/tarball/main,aws-parallelcluster-monitoring,install-monitoring.sh
-
-additional_iam_policies = arn:aws:iam::aws:policy/CloudWatchFullAccess,arn:aws:iam::aws:policy/AWSPriceListServiceFullAccess,arn:aws:iam::aws:policy/AmazonSSMFullAccess,arn:aws:iam::aws:policy/AWSCloudFormationReadOnlyAccess
-
-tags = {“Grafana” : “true”}
+read -p "Please enter the vpc id of your cluster: " vpc_id
+echo -e "creating a security group with $vpc_id..."
+security_group=$(aws ec2 create-security-group --group-name grafana-sg --description "Open HTTP/HTTPS ports" --vpc-id "${vpc_id}" --query GroupId --output text)
+aws ec2 authorize-security-group-ingress --group-id ${security_group} --protocol tcp --port 443 --cidr 0.0.0.0/0
+aws ec2 authorize-security-group-ingress --group-id ${security_group} --protocol tcp --port 80 --cidr 0.0.0.0/0
 ```
 
-Make sure that port `80` and port `443` of your master node are accessible from the internet (or form your network). You can achieve this by creating the appropriate security group via AWS Web-Console or via [CLI](https://docs.aws.amazon.com/cli/index.html), see an example below:
+2. Create a cluster that uses the post-install script [post-install.sh](https://github.com/aws-samples/aws-parallelcluster-monitoring/blob/main/post-install.sh), attaches the Security Group you created above to the HeadNode via [AdditionalSecurityGroups](https://docs.aws.amazon.com/parallelcluster/latest/ug/Scheduling-v3.html#yaml-Scheduling-SlurmQueues-Networking-AdditionalSecurityGroups), and adds a few IAM policies. You can find a complete AWS ParallelCluster template [here](parallelcluster-setup/pcluster.yaml). Please note that, at the moment, the installation script has only been tested on [Amazon Linux 2](https://aws.amazon.com/amazon-linux-2/).
 
-```bash
-aws ec2 create-security-group --group-name my-grafana-sg --description "Open HTTP/HTTPS ports" —vpc-id vpc-1a2b3c4d
-aws ec2 authorize-security-group-ingress --group-id sg-12345 --protocol tcp --port 443 —cidr 0.0.0.0/0
-aws ec2 authorize-security-group-ingress --group-id sg-12345 --protocol tcp --port 80 —cidr 0.0.0.0/0
+```yaml
+CustomActions:
+  OnNodeConfigured:
+    Script: https://raw.githubusercontent.com/aws-samples/aws-parallelcluster-monitoring/main/post-install.sh
+    Args:
+      - v0.9
+Iam:
+  AdditionalIamPolicies:
+    - Policy: arn:aws:iam::aws:policy/CloudWatchFullAccess
+    - Policy: arn:aws:iam::aws:policy/AWSPriceListServiceFullAccess
+    - Policy: arn:aws:iam::aws:policy/AmazonSSMFullAccess
+    - Policy: arn:aws:iam::aws:policy/AWSCloudFormationReadOnlyAccess
+Tags:
+  - Key: 'Grafana'
+    Value: 'true'
 ```
 
-More information on how to create your security groups [here](https://docs.aws.amazon.com/cli/latest/userguide/cli-services-ec2-sg.html#creating-a-security-group).  
-Finally, set the additional_sg parameter in the `[VPC]` section of your ParallelCluster config file.  
-After your cluster is created, you can just open a web-browser and connect to `https://your_public_ip` or `http://your_public_ip` (all `http` connections will be automatically redirected to `https`), a landing page will be presented to you with links to the Prometheus database service and the Grafana dashboards.
+3. Connect to `https://headnode_public_ip` or `http://headnode_public_ip` (all `http` connections will be automatically redirected to `https`) and authenticate with the default [Grafana password](https://github.com/aws-samples/aws-parallelcluster-monitoring/blob/main/docker-compose/docker-compose.master.yml#L43). A landing page will be presented to you with links to the Prometheus database service and the Grafana dashboards.
 
+![Login Screen](docs/Login1.png?raw=true "Login Screen")
+![Login Screen](docs/Login2.png?raw=true "Login Screen")
 
-Note: *Because of the higher volume of network traffic due to the compute nodes continuously pushing metrics to the master node,
-in case you expect to run a large scale cluster (hundreds of instances), we would recommend to use an instance type slightly bigger than what you planned for your master node.*
+Note: *Because the compute nodes continuously push metrics to the HeadNode, network traffic on it is higher than usual. If you expect to run a large-scale cluster (hundreds of instances), we recommend choosing a slightly larger instance type for the HeadNode than you had planned.*
 
 ## Security
 
 
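The Quickstart in the README diff opens ports 80 and 443 to `0.0.0.0/0` while advising a tighter range. A minimal sketch of the restricted variant follows, assuming the AWS CLI is installed and configured. The helper names `to_cidr` and `create_grafana_sg` are hypothetical and not part of this repository. `--query GroupId` is used so that only the group ID is captured, even if the CLI response carries additional fields.

```bash
#!/usr/bin/env bash
# Sketch only: to_cidr and create_grafana_sg are illustrative helper names.

# Turn a bare IPv4 address into a single-host CIDR block.
# This is a loose sanity check (digits and exactly three dots), not full validation.
to_cidr() {
    case "$1" in
        ''|*[!0-9.]*|*.*.*.*.*)
            echo "not an IPv4 address: $1" >&2
            return 1
            ;;
        *.*.*.*)
            printf '%s/32\n' "$1"
            ;;
        *)
            echo "not an IPv4 address: $1" >&2
            return 1
            ;;
    esac
}

# Create the security group and open ports 80 and 443 to a single CIDR only.
# Prints the new group ID on success.
create_grafana_sg() {
    local vpc_id="$1" cidr="$2" sg_id port
    sg_id=$(aws ec2 create-security-group \
        --group-name grafana-sg \
        --description "Open HTTP/HTTPS ports" \
        --vpc-id "$vpc_id" \
        --query GroupId --output text) || return 1
    for port in 80 443; do
        aws ec2 authorize-security-group-ingress \
            --group-id "$sg_id" --protocol tcp \
            --port "$port" --cidr "$cidr" >/dev/null || return 1
    done
    echo "$sg_id"
}
```

For example, `create_grafana_sg vpc-1a2b3c4d "$(to_cidr 203.0.113.7)"` would limit access to one workstation; both the VPC ID and the address here are placeholders.
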
@@ -18,7 +18,7 @@ usermod -a -G docker $cfn_cluster_user
 curl -L "https://github.com/docker/compose/releases/download/1.27.4/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
 chmod +x /usr/local/bin/docker-compose
 
-monitoring_dir_name=${cfn_postinstall_args[1]}
+monitoring_dir_name=aws-parallelcluster-monitoring
 monitoring_home="/home/${cfn_cluster_user}/${monitoring_dir_name}"
 
 echo "$> variable monitoring_dir_name -> ${monitoring_dir_name}"
 
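With this change the directory name is fixed and the `post-install.sh` hunk in this diff takes a single optional version argument, defaulting to `v0.9`, from which it builds the release tarball URL. The mapping can be sketched as a small function; `monitoring_url_for` is a hypothetical name used only for illustration, and `v1.0` in the usage note is a made-up tag.

```bash
#!/usr/bin/env bash
# Sketch only: monitoring_url_for is an illustrative name, not a function in the repo.

# Build the GitHub tag-archive URL for a given release tag.
# With no argument it falls back to v0.9, mirroring version=${1:-v0.9} in post-install.sh.
monitoring_url_for() {
    local version="${1:-v0.9}"
    printf 'https://github.com/aws-samples/aws-parallelcluster-monitoring/archive/refs/tags/%s.tar.gz\n' "$version"
}
```

Calling `monitoring_url_for` with no argument yields the `v0.9` archive, while `monitoring_url_for v1.0` would point at a different tag. Pinning a tag in `Args` keeps cluster creation reproducible, whereas the previous `tarball/main` URL tracked whatever was on the main branch.
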
@@ -4,16 +4,18 @@
 # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
 # SPDX-License-Identifier: MIT-0
 #
-#
+# Usage: ./post-install.sh [version]
 
 #Load AWS Parallelcluster environment variables
 . /etc/parallelcluster/cfnconfig
 
-#get GitHub repo to clone and the installation script
-monitoring_url=${cfn_postinstall_args[0]}
-monitoring_dir_name=${cfn_postinstall_args[1]}
+version=${1:-v0.9}
+monitoring_dir_name=aws-parallelcluster-monitoring
 monitoring_tarball="${monitoring_dir_name}.tar.gz"
-setup_command=${cfn_postinstall_args[2]}
+
+#get GitHub repo to clone and the installation script
+monitoring_url=https://github.com/aws-samples/aws-parallelcluster-monitoring/archive/refs/tags/${version}.tar.gz
+setup_command=install-monitoring.sh
 monitoring_home="/home/${cfn_cluster_user}/${monitoring_dir_name}"
 
 case ${cfn_node_type} in