@@ -73,6 +73,8 @@ mistralai/Mistral-7B-v0.1
mistralai/Mistral-7B-Instruct-v0.1
mistralai/Mistral-7B-Instruct-v0.2
meta-llama/Meta-Llama-3-8B
+ meta-llama/Meta-Llama-3.1-8B
+ meta-llama/Meta-Llama-3.1-70B
meta-llama/Meta-Llama-3.1-405B
```
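The entries in the list above are Hugging Face Hub repo IDs. As a rough, non-authoritative sketch of what those IDs refer to, one of the newly added checkpoints could be fetched with the generic `huggingface_hub` API (this is an assumption about the workflow, not this repo's own download/convert scripts, and access to the gated meta-llama repos must already be granted, e.g. after `huggingface-cli login`):

```python
# Minimal sketch: fetch one of the newly listed checkpoints from the Hugging Face Hub.
# Assumes `huggingface_hub` is installed and the gated meta-llama repo is accessible.
# The repo's own prepare/convert scripts remain the canonical path; this only
# illustrates what the repo ID strings above point at.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="meta-llama/Meta-Llama-3.1-8B",
    allow_patterns=["*.json", "*.safetensors", "tokenizer*"],  # skip optional extras
)
print("checkpoint downloaded to", local_dir)
```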
@@ -93,8 +95,10 @@ Benchmarks run on an 8xA100-80GB, power limited to 330W with a hybrid cube mesh
| Llama-2-70B | Base | OOM ||
| | 8-bit | 19.13 | 1322.58 |
| | 4-bit (G=32) | 25.25 | 1097.66 |
- | Llama-3-8B | Base | 94.25 | 1411.95 |
- | | 8-bit | 139.55 | 1047.23 |
+ | Llama-3.1-8B | Base | 93.89 | 1410.76 |
+ | | 8-bit | 137.64 | 1030.89 |
+ | Llama-3.1-70B | Base | OOM ||
+ | | 8-bit | 18.04 | 1253.78 |
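As a quick consistency check on the new Llama-3.1-8B rows, and assuming the bandwidth column is simply tokens/s multiplied by the weight bytes streamed per decoded token (a common way such decode benchmarks are reported, not something stated in this diff), the numbers line up with roughly 15 GB of bf16 weights and half that for int8:

```python
# Rough sanity check of the Llama-3.1-8B rows above, assuming
# "Memory Bandwidth" ~= tokens/s * bytes of weights streamed per decoded token.
# Weight footprints are assumptions (~7.5B non-embedding params; bf16 = 2 bytes, int8 = 1 byte).
weights_gb = {"Base (bf16)": 15.0, "8-bit": 7.5}   # assumed effective weight footprint, GB
tok_per_s  = {"Base (bf16)": 93.89, "8-bit": 137.64}  # from the table above

for mode in weights_gb:
    est_bw = tok_per_s[mode] * weights_gb[mode]
    print(f"{mode}: ~{est_bw:.0f} GB/s")  # ~1408 and ~1032 GB/s, close to the reported values
```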
### Speculative Sampling
[Verifier: Llama-70B (int4), Draft: Llama-7B (int4)](./scripts/speculate_70B_int4.sh): 48.4 tok/s
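The draft/verifier split above works by letting the small draft model propose several tokens which the large verifier then checks in a single forward pass. Below is a minimal greedy-acceptance sketch of that loop; it is illustrative only (full speculative-sampling implementations accept or reject against the verifier's probabilities rather than exact greedy matches), and `draft_next` / `verify_next` are hypothetical toy callables, not functions from this repo:

```python
# Minimal greedy-acceptance sketch of speculative decoding with hypothetical toy "models":
# draft_next returns the next token for a sequence; verify_next returns, for every
# position of the sequence, the verifier's choice for the following token.
from typing import Callable, List

def speculative_decode(
    draft_next: Callable[[List[int]], int],
    verify_next: Callable[[List[int]], List[int]],
    prompt: List[int],
    k: int = 4,                # draft tokens proposed per step
    max_new_tokens: int = 32,
) -> List[int]:
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new_tokens:
        # 1) Draft proposes k tokens autoregressively (cheap model, k forward passes).
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(seq + proposal))
        # 2) Verifier scores prompt + proposal in one pass; verified[i] is its
        #    prediction for the token following position i.
        verified = verify_next(seq + proposal)
        # 3) Accept the longest prefix of the proposal the verifier agrees with...
        n_accept = 0
        for i, tok in enumerate(proposal):
            if verified[len(seq) - 1 + i] == tok:
                n_accept += 1
            else:
                break
        seq += proposal[:n_accept]
        # ...then take one "free" token from the verifier itself.
        seq.append(verified[len(seq) - 1])
    return seq

# Toy usage: a "verifier" that counts up by 1 and a draft that is right about half the time.
if __name__ == "__main__":
    verify = lambda x: [t + 1 for t in x]
    draft = lambda x: x[-1] + (1 if len(x) % 2 else 2)
    print(speculative_decode(draft, verify, prompt=[0], max_new_tokens=8))
```

Because only draft tokens that match the verifier are kept, the output is identical to what the verifier alone would produce under greedy decoding; the speedup comes from verifying several tokens per large-model forward pass.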
@@ -110,17 +114,23 @@ Benchmarks run on an 8xA100-80GB, power limited to 330W with a hybrid cube mesh
| | 2 | 21.32 | 1481.87 |
| | 4 | 38.01 | 1340.76 |
| | 8 | 62.50 | 1135.29 |
- | Llama-3-8B | 1 | 94.19 | 1411.76 |
- | | 2 | 150.48 | 1208.80 |
- | | 4 | 219.77 | 991.63 |
- | | 8 | 274.65 | 768.55 |
+ | Llama-3.1-8B | 1 | 93.83 | 1408.37 |
+ | | 2 | 149.10 | 1197.32 |
+ | | 4 | 217.21 | 986.32 |
+ | | 8 | 276.01 | 772.60 |
+ | Llama-3.1-70B | 1 | OOM | |
+ | | 2 | 16.03 | 1130.81 |
+ | | 4 | 37.45 | 1360.53 |
+ | | 8 | 58.78 | 1129.61 |
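The single-GPU OOM for Llama-3.1-70B is simple arithmetic: roughly 70B parameters in bf16 is about 140 GB of weights, which cannot fit in one 80 GB A100, while with two or more tensor-parallel ranks each GPU only holds a shard. The following is a minimal single-process sketch of the idea behind tensor parallelism for one linear layer, simulated with plain torch on CPU (not this repo's distributed implementation): the weight is split along the input dimension and the partial outputs are summed, which is what the cross-GPU all-reduce does.

```python
# Minimal sketch of "row-parallel" tensor parallelism for a single linear layer,
# simulated in one process. Across real GPUs, each rank would hold one shard of W
# and the final sum would be a torch.distributed all_reduce.
import torch

torch.manual_seed(0)
x = torch.randn(2, 8)        # batch of activations, hidden dim 8
W = torch.randn(8, 16)       # full weight of a linear layer (in_features x out_features)

y_full = x @ W               # what a single GPU would compute

world_size = 4               # pretend we have 4 tensor-parallel ranks
x_shards = x.chunk(world_size, dim=1)   # each rank gets a slice of the input features
W_shards = W.chunk(world_size, dim=0)   # ...and the matching rows of the weight

# Each rank computes a partial result; summing them (the all-reduce) recovers y_full.
y_tp = sum(xs @ ws for xs, ws in zip(x_shards, W_shards))

print(torch.allclose(y_full, y_tp, atol=1e-5))  # True: per-rank weight memory is ~1/4 of W
```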
### Tensor Parallelism + Quantization
| Model | Technique | Tokens/Second | Memory Bandwidth (GB/s) |
| -------- | ------- | ------ | ------ |
| Llama-2-70B | Base | 62.50 | 1135.29 |
| | 8-bit | 80.44 | 752.04 |
| | 4-bit (G=32) | 90.77 | 548.10 |
+ | Llama-3.1-70B | Base | 58.78 | 1129.61 |
+ | | 8-bit | 75.58 | 726.57 |
| Llama-3.1-405B | 8-bit | 15.60 | 815.87 |
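For context on the "8-bit" and "4-bit (G=32)" rows in these tables: they refer to weight-only quantization, where G=32 indicates that a separate scale is used for each group of 32 consecutive weights. A minimal sketch of that group-wise absmax scheme, illustrative only and not this repo's quantize implementation, looks like:

```python
# Minimal sketch of group-wise weight-only quantization, as suggested by "4-bit (G=32)":
# each group of `groupsize` consecutive weights shares one scale (absmax / qmax).
# Illustrative only; not the repo's quantization code.
import torch

def quantize_groupwise(w: torch.Tensor, n_bits: int = 4, groupsize: int = 32):
    qmax = 2 ** (n_bits - 1) - 1                       # e.g. 7 for symmetric int4
    out_features, in_features = w.shape
    wg = w.reshape(out_features, in_features // groupsize, groupsize)
    scales = wg.abs().amax(dim=-1, keepdim=True) / qmax
    q = torch.clamp(torch.round(wg / scales), -qmax - 1, qmax)
    return q.to(torch.int8), scales                    # int8 storage for the int4 codes

def dequantize_groupwise(q: torch.Tensor, scales: torch.Tensor, shape):
    return (q.float() * scales).reshape(shape)

w = torch.randn(64, 128)
q, s = quantize_groupwise(w, n_bits=4, groupsize=32)
w_hat = dequantize_groupwise(q, s, w.shape)
print((w - w_hat).abs().max())  # small reconstruction error; weights take ~4x less memory than bf16
```

Smaller weights mean fewer bytes streamed per token, which is why the quantized rows trade lower achieved bandwidth for higher tokens/s in the tables above.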
### AMD