
Commit 2e8e23f

Author: Sean Smith
Merge pull request #28 from aws-samples/simplify-deployment
Simplify Deployment args
2 parents 09b9052 + 0f274b9 commit 2e8e23f

6 files changed

+53
-75
lines changed

README.md

Lines changed: 45 additions & 69 deletions
Original file line numberDiff line numberDiff line change
@@ -3,9 +3,9 @@
33
This is a sample solution based on Grafana for monitoring various components of an HPC cluster built with AWS ParallelCluster.
44
There are 6 dashboards that can be used as they are or customized as you need.
55
* [ParallelCluster Summary](https://github.com/aws-samples/aws-parallelcluster-monitoring/blob/main/grafana/dashboards/ParallelCluster.json) - this is the main dashboard that shows general monitoring info and metrics for the whole cluster. It includes Slurm metrics and Storage performance metrics.
6-
* [Master Node Details](https://github.com/aws-samples/aws-parallelcluster-monitoring/blob/main/grafana/dashboards/master-node-details.json) - this dashboard shows detailed metric for the Master node, including CPU, Memory, Network and Storage usage.
6+
* [HeadNode Details](https://github.com/aws-samples/aws-parallelcluster-monitoring/blob/main/grafana/dashboards/master-node-details.json) - this dashboard shows detailed metric for the HeadNode, including CPU, Memory, Network and Storage usage.
77
* [Compute Node List](https://github.com/aws-samples/aws-parallelcluster-monitoring/blob/main/grafana/dashboards/compute-node-list.json) - this dashboard shows the list of the available compute nodes. Each entry is a link to a more detailed page.
8-
* [Compute Node Details](https://github.com/aws-samples/aws-parallelcluster-monitoring/blob/main/grafana/dashboards/compute-node-details.json) - similarly to the master node details this dashboard show the same metric for the compute nodes.
8+
* [Compute Node Details](https://github.com/aws-samples/aws-parallelcluster-monitoring/blob/main/grafana/dashboards/compute-node-details.json) - similarly to the HeadNode details this dashboard show the same metric for the compute nodes.
99
* [GPU Nodes Details](https://github.com/aws-samples/aws-parallelcluster-monitoring/blob/main/grafana/dashboards/gpu.json) - This dashboard shows GPU-related metrics collected using the nvidia-dcgm container.
1010
* [Cluster Logs](https://github.com/aws-samples/aws-parallelcluster-monitoring/blob/main/grafana/dashboards/logs.json) - This dashboard shows all the logs of your HPC Cluster. The logs are pushed by AWS ParallelCluster to Amazon CloudWatch Logs and finally reported here.
1111
* [Cluster Costs](https://github.com/aws-samples/aws-parallelcluster-monitoring/blob/main/grafana/dashboards/costs.json) (beta / in development) - This dashboard shows the cost associated with the AWS services utilized by your cluster. It includes: [EC2](https://aws.amazon.com/ec2/pricing/), [EBS](https://aws.amazon.com/ebs/pricing/), [FSx](https://aws.amazon.com/fsx/lustre/pricing/), [S3](https://aws.amazon.com/s3/pricing/), [EFS](https://aws.amazon.com/efs/pricing/).
@@ -15,38 +15,26 @@ Create a cluster using [AWS ParallelCluster](https://www.hpcworkshops.com/03-hpc
1515

1616
### PC 3.X
1717

18-
Update your cluster's config by adding the following snippet in the `HeadNode` section:
18+
Update your cluster's config by adding the following snippet in the `HeadNode` and `Scheduling` sections:
1919

2020
```yaml
2121
CustomActions:
2222
OnNodeConfigured:
2323
Script: https://raw.githubusercontent.com/aws-samples/aws-parallelcluster-monitoring/main/post-install.sh
2424
Args:
25-
- https://github.com/aws-samples/aws-parallelcluster-monitoring/tarball/main
26-
- aws-parallelcluster-monitoring
27-
- install-monitoring.sh
25+
- v0.9
2826
Iam:
2927
AdditionalIamPolicies:
3028
- Policy: arn:aws:iam::aws:policy/CloudWatchFullAccess
3129
- Policy: arn:aws:iam::aws:policy/AWSPriceListServiceFullAccess
3230
- Policy: arn:aws:iam::aws:policy/AmazonSSMFullAccess
3331
- Policy: arn:aws:iam::aws:policy/AWSCloudFormationReadOnlyAccess
3432
Tags:
35-
- Key: Grafana
36-
Value: true
33+
- Key: 'Grafana'
34+
Value: 'true'
3735
```
3836
39-
### PC 2.X
40-
41-
```ini
42-
[cluster yourcluster]
43-
...
44-
post_install = https://raw.githubusercontent.com/aws-samples/aws-parallelcluster-monitoring/main/post-install.sh
45-
post_install_args = https://github.com/aws-samples/aws-parallelcluster-monitoring/tarball/main,aws-parallelcluster-monitoring,install-monitoring.sh
46-
additional_iam_policies = arn:aws:iam::aws:policy/CloudWatchFullAccess,arn:aws:iam::aws:policy/AWSPriceListServiceFullAccess,arn:aws:iam::aws:policy/AmazonSSMFullAccess,arn:aws:iam::aws:policy/AWSCloudFormationReadOnlyAccess
47-
tags = {"Grafana" : "true"}
48-
...
49-
```
37+
See the complete example config: [pcluster.yaml](parallelcluster-setup/pcluster.yaml).
5038
5139
## AWS ParallelCluster
5240
**AWS ParallelCluster** is an AWS supported Open Source cluster management tool that makes it easy for you to deploy and
@@ -72,76 +60,64 @@ Note: *while almost all components are under the Apache2 license, only **[Promet
7260
7361
## Example Dashboards
7462
63+
#### Cluster Overview
64+
7565
![ParallelCluster](docs/ParallelCluster.png?raw=true "AWS ParallelCluster")
7666
77-
![Master](docs/Master.png?raw=true "Master Node")
67+
#### HeadNode Dashboard
68+
69+
![Head Node](docs/HeadNode.png?raw=true "Head Node")
70+
71+
#### ComputeNodes Dashboard
7872
7973
![Compute Node List](docs/List.png?raw=true "Compute Node List")
8074
75+
#### Logs
76+
8177
![Logs](docs/Logs.png?raw=true "AWS ParallelCluster Logs")
8278
79+
#### Cluster Cost
80+
8381
![Costs](docs/Costs.png?raw=true "Best - AWS ParallelCluster Costs")
8482
8583
86-
## How to install it
84+
## Quickstart
8785
88-
You can simply use the post-install script that you can find [here](https://github.com/aws-samples/aws-parallelcluster-monitoring/blob/main/post-install.sh) as it is, or customize it as you need. For instance, you might want to change your [Grafana password](https://github.com/aws-samples/aws-parallelcluster-monitoring/blob/main/docker-compose/docker-compose.master.yml#L43) to something more secure and meaningful for you, or you might want to customize some dashboards by adding additional components to monitor.
86+
1. Create a Security Group that allows you to access the `HeadNode` on ports 80 and 443. The following example opens the security group to `0.0.0.0/0`; however, we highly advise restricting this further. More information on how to create your security groups can be found [here](https://docs.aws.amazon.com/cli/latest/userguide/cli-services-ec2-sg.html#creating-a-security-group).
8987

9088
```bash
91-
#Load AWS Parallelcluster environment variables
92-
. /etc/parallelcluster/cfnconfig
93-
94-
#get GitHub repo to clone and the installation script
95-
monitoring_url=$(echo ${cfn_postinstall_args}| cut -d ',' -f 1 )
96-
monitoring_dir_name=$(echo ${cfn_postinstall_args}| cut -d ',' -f 2 )
97-
monitoring_tarball="${monitoring_dir_name}.tar.gz"
98-
setup_command=$(echo ${cfn_postinstall_args}| cut -d ',' -f 3 )
99-
monitoring_home="/home/${cfn_cluster_user}/${monitoring_dir_name}"
100-
101-
case ${cfn_node_type} in
102-
MasterServer)
103-
wget ${monitoring_url} -O ${monitoring_tarball}
104-
mkdir -p ${monitoring_home}
105-
tar xvf ${monitoring_tarball} -C ${monitoring_home} --strip-components 1
106-
;;
107-
ComputeFleet)
108-
109-
;;
110-
esac
111-
112-
#Execute the monitoring installation script
113-
bash -x "${monitoring_home}/parallelcluster-setup/${setup_command}" >/tmp/monitoring-setup.log 2>&1
114-
exit $?
115-
```
116-
The proposed post-install script will take care of installing and configuring everything for you through the [install-monitoring.sh](https://github.com/aws-samples/aws-parallelcluster-monitoring/blob/main/parallelcluster-setup/install-monitoring.sh) script. Though, few additional parameters are needed in the AWS ParallelCluster config file: the post_install_args, additional IAM policies, security group, and a tag. You can find an AWS ParallelCluster template [here](https://github.com/aws-samples/aws-parallelcluster-monitoring/blob/main/parallelcluster-setup/pcluster-template.config). Please note that, at the moment, the installation script has only been tested using [Amazon Linux 2](https://aws.amazon.com/amazon-linux-2/).
117-
118-
```ini
119-
base_os = alinux2
120-
121-
post_install = s3://<my-bucket-name>/post-install.sh
122-
123-
post_install_args = https://github.com/aws-samples/aws-parallelcluster-monitoring/tarball/main,aws-parallelcluster-monitoring,install-monitoring.sh
124-
125-
additional_iam_policies = arn:aws:iam::aws:policy/CloudWatchFullAccess,arn:aws:iam::aws:policy/AWSPriceListServiceFullAccess,arn:aws:iam::aws:policy/AmazonSSMFullAccess,arn:aws:iam::aws:policy/AWSCloudFormationReadOnlyAccess
126-
127-
tags = {“Grafana” : “true”}
89+
read -p "Please enter the vpc id of your cluster: " vpc_id
90+
echo -e "creating a security group with $vpc_id..."
91+
security_group=$(aws ec2 create-security-group --group-name grafana-sg --description "Open HTTP/HTTPS ports" --vpc-id ${vpc_id} --output text)
92+
aws ec2 authorize-security-group-ingress --group-id ${security_group} --protocol tcp --port 443 --cidr 0.0.0.0/0
93+
aws ec2 authorize-security-group-ingress --group-id ${security_group} --protocol tcp --port 80 --cidr 0.0.0.0/0
12894
```
12995

130-
Make sure that port `80` and port `443` of your master node are accessible from the internet (or form your network). You can achieve this by creating the appropriate security group via AWS Web-Console or via [CLI](https://docs.aws.amazon.com/cli/index.html), see an example below:
96+
2. Create a cluster with the post install script [post-install.sh](https://github.com/aws-samples/aws-parallelcluster-monitoring/blob/main/post-install.sh), the Security Group you created above as [AdditionalSecurityGroup](https://docs.aws.amazon.com/parallelcluster/latest/ug/Scheduling-v3.html#yaml-Scheduling-SlurmQueues-Networking-AdditionalSecurityGroups) on the HeadNode, and a few additional IAM Policies. You can find a complete AWS ParallelCluster template [here](parallelcluster-setup/pcluster.yaml). Please note that, at the moment, the installation script has only been tested using [Amazon Linux 2](https://aws.amazon.com/amazon-linux-2/).
13197

132-
```bash
133-
aws ec2 create-security-group --group-name my-grafana-sg --description "Open HTTP/HTTPS ports" —vpc-id vpc-1a2b3c4d
134-
aws ec2 authorize-security-group-ingress --group-id sg-12345 --protocol tcp --port 443 —cidr 0.0.0.0/0
135-
aws ec2 authorize-security-group-ingress --group-id sg-12345 --protocol tcp --port 80 —cidr 0.0.0.0/0
98+
```yaml
99+
CustomActions:
100+
OnNodeConfigured:
101+
Script: https://raw.githubusercontent.com/aws-samples/aws-parallelcluster-monitoring/main/post-install.sh
102+
Args:
103+
- v0.9
104+
Iam:
105+
AdditionalIamPolicies:
106+
- Policy: arn:aws:iam::aws:policy/CloudWatchFullAccess
107+
- Policy: arn:aws:iam::aws:policy/AWSPriceListServiceFullAccess
108+
- Policy: arn:aws:iam::aws:policy/AmazonSSMFullAccess
109+
- Policy: arn:aws:iam::aws:policy/AWSCloudFormationReadOnlyAccess
110+
Tags:
111+
- Key: 'Grafana'
112+
Value: 'true'
136113
```
137114

138-
More information on how to create your security groups [here](https://docs.aws.amazon.com/cli/latest/userguide/cli-services-ec2-sg.html#creating-a-security-group).
139-
Finally, set the additional_sg parameter in the `[VPC]` section of your ParallelCluster config file.
140-
After your cluster is created, you can just open a web-browser and connect to `https://your_public_ip` or `http://your_public_ip` (all `http` connections will be automatically redirected to `https`), a landing page will be presented to you with links to the Prometheus database service and the Grafana dashboards.
115+
3. Connect to `https://headnode_public_ip` or `http://headnode_public_ip` (all `http` connections will be automatically redirected to `https`) and authenticate with the default [Grafana password](https://github.com/aws-samples/aws-parallelcluster-monitoring/blob/main/docker-compose/docker-compose.master.yml#L43). A landing page will be presented to you with links to the Prometheus database service and the Grafana dashboards.
141116

117+
![Login Screen](docs/Login1.png?raw=true "Login Screen")
118+
![Login Screen](docs/Login2.png?raw=true "Login Screen")
142119

143-
Note: *Because of the higher volume of network traffic due to the compute nodes continuously pushing metrics to the master node,
144-
in case you expect to run a large scale cluster (hundreds of instances), we would recommend to use an instance type slightly bigger than what you planned for your master node.*
120+
Note: *Because the compute nodes continuously push metrics to the HeadNode, network traffic is higher than usual. If you expect to run a large-scale cluster (hundreds of instances), we recommend using an instance type slightly bigger than what you planned for your HeadNode.*
145121

146122
## Security
147123

File renamed without changes.

docs/Login1.png

342 KB

docs/Login2.png

2.58 MB

parallelcluster-setup/install-monitoring.sh

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@ usermod -a -G docker $cfn_cluster_user
1818
curl -L "https://github.com/docker/compose/releases/download/1.27.4/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
1919
chmod +x /usr/local/bin/docker-compose
2020

21-
monitoring_dir_name=${cfn_postinstall_args[1]}
21+
monitoring_dir_name=aws-parallelcluster-monitoring
2222
monitoring_home="/home/${cfn_cluster_user}/${monitoring_dir_name}"
2323

2424
echo "$> variable monitoring_dir_name -> ${monitoring_dir_name}"

post-install.sh

Lines changed: 7 additions & 5 deletions
@@ -4,16 +4,18 @@
44
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
55
# SPDX-License-Identifier: MIT-0
66
#
7-
#
7+
# Usage: ./post-install [version]
88

99
#Load AWS Parallelcluster environment variables
1010
. /etc/parallelcluster/cfnconfig
1111

12-
#get GitHub repo to clone and the installation script
13-
monitoring_url=${cfn_postinstall_args[0]}
14-
monitoring_dir_name=${cfn_postinstall_args[1]}
12+
version=${1:-v0.9}
13+
monitoring_dir_name=aws-parallelcluster-monitoring
1514
monitoring_tarball="${monitoring_dir_name}.tar.gz"
16-
setup_command=${cfn_postinstall_args[2]}
15+
16+
#get GitHub repo to clone and the installation script
17+
monitoring_url=https://github.com/aws-samples/aws-parallelcluster-monitoring/archive/refs/tags/${version}.tar.gz
18+
setup_command=install-monitoring.sh
1719
monitoring_home="/home/${cfn_cluster_user}/${monitoring_dir_name}"
1820

1921
case ${cfn_node_type} in
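
The hunk above replaces three positional arguments with a single optional version tag. As a rough sketch of that behaviour in isolation, the hypothetical function `monitoring_url_for` below (a name introduced here, not present in the script) applies the same `${1:-v0.9}` default and builds the same release-tarball URL. The `v1.0` tag in the second call is illustrative only and may not exist.

```shell
# Hypothetical wrapper around the URL logic shown in the diff:
# use the first argument as the release tag, defaulting to v0.9.
monitoring_url_for() {
  mu_version="${1:-v0.9}"
  printf 'https://github.com/aws-samples/aws-parallelcluster-monitoring/archive/refs/tags/%s.tar.gz\n' "$mu_version"
}

monitoring_url_for        # no argument: falls back to the v0.9 tag
monitoring_url_for v1.0   # illustrative tag; substitute a real release
```

Pinning to a tag rather than the `main` tarball means a cluster config keeps installing the same monitoring release until the `Args` value is changed.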
