
Commit 62315a6 (parent 43e5560)

Add steps to run benchmark (#11)

1 file changed: benchmark/README.md (+102, -0)

## Benchmark

If you would like to run the machine learning inference benchmark to evaluate performance when running multiple pods on one GPU, you can follow the steps below to get the benchmark results:

1. Set environment variables with your AWS account ID and default region:

```bash
export ACCOUNT_ID=123456789012
export AWS_DEFAULT_REGION=us-west-2
```
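
If you prefer not to hard-code the account ID, you can usually derive it from your current credentials; a minimal sketch, assuming the AWS CLI is already configured:

```bash
# Look up the account ID for the currently configured credentials
export ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
export AWS_DEFAULT_REGION=us-west-2
echo "Using account ${ACCOUNT_ID} in region ${AWS_DEFAULT_REGION}"
```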

2. Build the tensorflow-benchmark image:

```bash
cat <<EOF | docker build -t ${ACCOUNT_ID}.dkr.ecr.${AWS_DEFAULT_REGION}.amazonaws.com/tensorflow-benchmark:v1.15.2 -
FROM alpine as intermediate
LABEL stage=intermediate
RUN apk update && \
    apk add --update git && \
    git clone https://github.com/tensorflow/benchmarks.git && \
    cd benchmarks && \
    git checkout cnn_tf_v1.15_compatible

# Choose the base image for our final image
FROM tensorflow/tensorflow:1.15.2-gpu
RUN mkdir /opt/benchmarks
COPY --from=intermediate /benchmarks /opt/benchmarks
EOF
```
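
Optionally, you can confirm the image exists locally before pushing; a quick check (the repository name here simply mirrors the tag used above):

```bash
# List local images that match the repository we just tagged
docker images "${ACCOUNT_ID}.dkr.ecr.${AWS_DEFAULT_REGION}.amazonaws.com/tensorflow-benchmark"
```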

3. Create an ECR repository, log in to ECR, and push the image:

```bash
aws ecr create-repository --repository-name tensorflow-benchmark
$(aws ecr get-login --no-include-email)
docker push ${ACCOUNT_ID}.dkr.ecr.${AWS_DEFAULT_REGION}.amazonaws.com/tensorflow-benchmark:v1.15.2
```
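
Note that `aws ecr get-login` is an AWS CLI v1 command and was removed in AWS CLI v2. If you are on v2, the equivalent login looks roughly like this:

```bash
# AWS CLI v2: obtain an auth token and pipe it to docker login
aws ecr get-login-password --region ${AWS_DEFAULT_REGION} \
  | docker login --username AWS --password-stdin ${ACCOUNT_ID}.dkr.ecr.${AWS_DEFAULT_REGION}.amazonaws.com
```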

4. Run the TensorFlow benchmark jobs in parallel. You may change the model name to resnet101, inception3, or vgg16 (see the example after the manifest); refer to [benchmarks](https://github.com/tensorflow/benchmarks) for more information about the parameters.

```bash
cat <<EOF | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: tf-benchmark
spec:
  completions: 4
  parallelism: 4
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      hostIPC: true
      containers:
        - name: tf-benchmark
          image: ${ACCOUNT_ID}.dkr.ecr.${AWS_DEFAULT_REGION}.amazonaws.com/tensorflow-benchmark:v1.15.2
          args:
            - "python3"
            - "/opt/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py"
            - "--data_name=imagenet"
            - "--model=resnet50"
            - "--num_batches=100"
            - "--batch_size=4"
            - "--num_gpus=1"
            - "--gpu_memory_frac_for_testing=0.2"
          resources:
            limits:
              k8s.amazonaws.com/vgpu: 2
          volumeMounts:
            - name: nvidia-mps
              mountPath: /tmp/nvidia-mps
      volumes:
        - name: nvidia-mps
          hostPath:
            path: /tmp/nvidia-mps
EOF
```
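
For example, to benchmark a different model you only need to change the `--model` argument in the manifest above; a sketch of the modified container args (vgg16 chosen arbitrarily from the models listed):

```yaml
          args:
            - "python3"
            - "/opt/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py"
            - "--data_name=imagenet"
            - "--model=vgg16"   # swap resnet50 for resnet101, inception3, or vgg16
            - "--num_batches=100"
            - "--batch_size=4"
            - "--num_gpus=1"
            - "--gpu_memory_frac_for_testing=0.2"
```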

5. Wait for the jobs to complete:

```bash
kubectl wait --for=condition=complete --timeout=30m job/tf-benchmark > /dev/null
```
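
While `kubectl wait` blocks, you can optionally watch pod progress from another terminal; a minimal sketch using the same job label:

```bash
# Follow pod status for the benchmark job (interrupt with Ctrl-C when done)
kubectl get pods -l job-name=tf-benchmark --watch
```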

6. Get the result of each job:

```bash
for podName in $(kubectl get pods -l job-name=tf-benchmark --no-headers -o custom-columns=":metadata.name")
do
  score=$(kubectl logs $podName | grep 'total images/sec: ' | sed -E 's/total\ images\/sec\:\ (.*)/\1/g')
  echo $score
done
```
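
If you also want a single aggregate figure for the GPU, one option is to sum the per-pod values; a sketch built on the same extraction (the awk post-processing is an assumption, not part of the original steps):

```bash
# Collect each pod's images/sec and report the total and mean across pods
kubectl get pods -l job-name=tf-benchmark --no-headers -o custom-columns=":metadata.name" \
  | while read podName; do
      kubectl logs ${podName} | grep 'total images/sec: ' | sed -E 's/total images\/sec: (.*)/\1/g'
    done \
  | awk '{ sum += $1; n += 1 } END { if (n > 0) printf "total: %.2f images/sec, mean: %.2f images/sec over %d pods\n", sum, sum / n, n }'
```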

7. You may get a benchmark result like the following; each value represents the images processed per second by one job:

```bash
15.47
15.46
16.00
15.58
```
