<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<link rel="stylesheet" href="../common-revealjs/css/reveal.css">
<link rel="stylesheet" href="../common-revealjs/css/theme/white.css">
<link rel="stylesheet" href="../common-revealjs/css/custom.css">
<script>
// This is needed when printing the slides to pdf
var link = document.createElement( 'link' );
link.rel = 'stylesheet';
link.type = 'text/css';
link.href = window.location.search.match( /print-pdf/gi ) ? '../common-revealjs/css/print/pdf.css' : '../common-revealjs/css/print/paper.css';
document.getElementsByTagName( 'head' )[0].appendChild( link );
</script>
<script>
// This is used to display the static images on each slide,
// See global-images in this html file and custom.css
(function() {
if(window.addEventListener) {
window.addEventListener('load', () => {
let slides = document.getElementsByClassName("slide-background");
if (slides.length === 0) {
slides = document.getElementsByClassName("pdf-page")
}
// Insert global images on each slide
for(let i = 0, max = slides.length; i < max; i++) {
let cln = document.getElementById("global-images").cloneNode(true);
cln.removeAttribute("id");
slides[i].appendChild(cln);
}
// Remove top level global images
let elem = document.getElementById("global-images");
elem.parentElement.removeChild(elem);
}, false);
}
})();
</script>
</head>
<body>
<div class="reveal">
<div class="slides">
<div id="global-images" class="global-images">
<img src="../common-revealjs/images/sycl_academy.png" />
<img src="../common-revealjs/images/sycl_logo.png" />
<img src="../common-revealjs/images/trademarks.png" />
<img src="../common-revealjs/images/codeplay.png" />
</div>
<!--Slide 1-->
<section class="hbox">
<div class="hbox" data-markdown>
## Further Optimizations
</div>
</section>
<!--Slide 2-->
<section class="hbox" data-markdown>
## Learning Objectives
* Learn about compute and memory bound algorithms
* Learn about optimizing for occupancy
* Learn about optimizing for throughput
</section>
<!--Slide 3-->
<section>
<div class="hbox" data-markdown>
#### Compute-bound vs memory-bound
</div>
<div class="container" data-markdown>
* When a particular algorithm is applied to a processor such as a GPU it will generally be either compute-bound or memory-bound.
* If an algorithm is compute-bound then the limiting factor is occupancy: how fully it utilizes the available hardware resources.
* If an algorithm is memory-bound then the limiting factor is throughput: how quickly data moves through the memory hierarchy and how well memory latency is hidden.
</div>
</section>
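<section>
<div class="hbox" data-markdown>
#### Compute-bound vs memory-bound
</div>
<div class="container" data-markdown>
<script type="text/template">
One common rule of thumb for telling the two cases apart is to compare the operations a kernel performs per byte it moves with the ratio of the device's peak compute rate to its peak memory bandwidth. This is a minimal host-side sketch under that assumption; `classify` and all of its parameters are hypothetical names, and the inputs are estimates supplied by the caller.

```cpp
#include <cassert>

enum class Bound { Compute, Memory };

// Classify a kernel from rough estimates. All inputs are assumptions:
// operations and bytes per work-item come from reading the kernel,
// the peak figures from the vendor's documentation.
Bound classify(double flops_per_item, double bytes_per_item,
               double peak_gflops, double peak_gbytes_per_s) {
  assert(bytes_per_item > 0.0 && peak_gbytes_per_s > 0.0);
  // Operations performed for every byte the kernel moves.
  const double intensity = flops_per_item / bytes_per_item;
  // Operations the device can perform in the time it moves one byte.
  const double balance = peak_gflops / peak_gbytes_per_s;
  return intensity >= balance ? Bound::Compute : Bound::Memory;
}
```

For example, a vector add doing 1 operation per 12 bytes on a made-up device with 1000 GFLOP/s and 100 GB/s would be classified as memory-bound.
</script>
</div>
</section>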
<!--Slide 4-->
<section>
<div class="hbox" data-markdown>
#### Optimizing for occupancy
</div>
<div class="container" data-markdown>
* Occupancy is a metric used to measure how much of the computing power of a processor such as a GPU is being used.
* Optimizing for occupancy means getting the most out of the resources of the GPU.
* For memory-bound algorithms you don't necessarily need 100% occupancy; you simply need enough to hide the latency of global memory access.
* For compute-bound algorithms occupancy becomes more crucial, though even then it's not always possible to achieve 100% occupancy.
</div>
</section>
<!--Slide 5-->
<section>
<div class="hbox" data-markdown>
#### Limiting factors
</div>
<div class="container" data-markdown>
There are a number of factors which can limit occupancy for a given kernel function and the data you pass to it.
</div>
<div class="container" data-markdown>
* Number of work-items required.
* Amount of private memory (registers) required by each work-item.
* Number of work-items per work-group.
* Amount of local memory required by each work-group.
</div>
</section>
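<section>
<div class="hbox" data-markdown>
#### Limiting factors
</div>
<div class="container" data-markdown>
<script type="text/template">
The per-work-item and per-work-group limits can be combined into a rough occupancy estimate: whichever resource runs out first decides how many work-groups fit on a compute unit at once. This is a minimal sketch, not a vendor formula; the struct fields and function names are hypothetical, real hardware allocates resources in coarser units, and the total number of work-items in the launch is not modelled.

```cpp
#include <algorithm>
#include <cassert>

// Per-compute-unit limits. These fields are hypothetical; real values
// come from device queries and vendor documentation.
struct ComputeUnitLimits {
  int max_work_items;      // resident work-items per compute unit
  int max_work_groups;     // resident work-groups per compute unit
  int registers;           // register file size, in registers
  int local_memory_bytes;  // local memory size, in bytes
};

// Resources one work-group of the kernel asks for.
struct KernelUsage {
  int work_group_size;        // work-items per work-group
  int registers_per_item;     // private memory per work-item
  int local_bytes_per_group;  // local memory per work-group
};

// Work-groups that fit on one compute unit at the same time:
// the tightest of the limits wins.
int resident_work_groups(const ComputeUnitLimits& cu, const KernelUsage& k) {
  assert(k.work_group_size > 0 && k.registers_per_item > 0);
  int groups = cu.max_work_groups;
  groups = std::min(groups, cu.max_work_items / k.work_group_size);
  groups = std::min(groups,
                    cu.registers / (k.registers_per_item * k.work_group_size));
  if (k.local_bytes_per_group > 0) {
    groups = std::min(groups, cu.local_memory_bytes / k.local_bytes_per_group);
  }
  return groups;
}

// Occupancy as the fraction of work-item slots that are filled.
double occupancy(const ComputeUnitLimits& cu, const KernelUsage& k) {
  return static_cast<double>(resident_work_groups(cu, k) * k.work_group_size) /
         cu.max_work_items;
}
```

With made-up limits of 2048 work-items, 32 work-groups, 65536 registers and 48 KiB of local memory, a kernel using 256 work-items per group, 32 registers per work-item and 16 KiB of local memory per group is limited by local memory to 3 resident work-groups.
</script>
</div>
</section>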
<!--Slide 6-->
<section>
<div class="hbox" data-markdown>
#### Choosing an optimal work-group size
</div>
<div class="container" data-markdown>
An important optimization for occupancy is to choose an optimal work-group size.
</div>
</section>
<!--Slide 7-->
<section>
<div class="hbox" data-markdown>
#### Local memory image convolution performance
</div>
<div class="container" data-markdown>

</div>
</section>
<!--Slide 8-->
<section>
<div class="hbox" data-markdown>
#### Choosing an optimal work-group size
</div>
<div class="container" data-markdown>
* It must not exceed the maximum work-group size for the device you are targeting.
* It must be large enough that you are utilizing the hardware concurrency (warp or wavefront).
* It should be a power of 2 to allow multiple work-groups to fit into a compute unit.
* It's best to experiment with different sizes and benchmark the performance of each.
</div>
</section>
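<section>
<div class="hbox" data-markdown>
#### Choosing an optimal work-group size
</div>
<div class="container" data-markdown>
<script type="text/template">
These rules narrow the search to a short list of sizes to benchmark. This is a minimal host-side sketch: `candidate_sizes` and `round_up` are hypothetical helpers, and the sub-group width is assumed to be known. In SYCL the maximum can be queried with `get_info<info::device::max_work_group_size>()` on the device.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Power-of-two work-group sizes worth benchmarking: at least one
// sub-group (warp or wavefront) wide, no larger than the device maximum.
std::vector<std::size_t> candidate_sizes(std::size_t sub_group_size,
                                         std::size_t max_work_group_size) {
  assert(sub_group_size > 0);
  std::vector<std::size_t> sizes;
  std::size_t size = 1;
  while (size < sub_group_size) size *= 2;  // first power of two that fits
  for (; size <= max_work_group_size; size *= 2) sizes.push_back(size);
  return sizes;
}

// Round the problem size up to a whole number of work-groups; the kernel
// then needs a bounds check for the padding work-items.
std::size_t round_up(std::size_t items, std::size_t work_group_size) {
  assert(work_group_size > 0);
  return (items + work_group_size - 1) / work_group_size * work_group_size;
}
```

For a sub-group width of 32 and a maximum of 256 this yields 32, 64, 128 and 256; a problem of 1000 items with a work-group size of 64 is padded to 1024.
</script>
</div>
</section>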
<!--Slide 9-->
<section>
<div class="hbox" data-markdown>
#### Varying work-group size image convolution performance
</div>
<div class="container" data-markdown>

</div>
</section>
<!--Slide 10-->
<section>
<div class="hbox" data-markdown>
#### Reusing data
</div>
<div class="container" data-markdown>
Another important optimization for occupancy is to re-use data as much as possible.
</div>
<div class="container" data-markdown>
* Larger work-group sizes often mean that more work-items can share partial results, so fewer memory operations are required overall.
* However, you can hit limits on the number of work-items or the amount of local memory available.
* If you hit these limits, you can perform the work in batches.
</div>
</section>
<!--Slide 11-->
<section>
<div class="hbox" data-markdown>
#### Work batching
</div>
<div class="container">
<div class="col" data-markdown>

</div>
<div class="col" data-markdown>
* If you hit occupancy limitations then each compute unit must process multiple rounds of work-groups and work-items.
* This often means reading and writing the same data again.
* Batching work so that each work-item processes several elements that share neighboring data allows you to make further use of local memory and registers.
</div>
</div>
</section>
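<section>
<div class="hbox" data-markdown>
#### Work batching
</div>
<div class="container" data-markdown>
<script type="text/template">
The saving from batching can be seen in a host-side model of a single work-item. This is a sketch under assumptions: the 1D box filter, the function `filter_batch` and its parameters are hypothetical and stand in for the image convolution; the local `window` plays the role of registers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One work-item computes `batch` neighboring outputs of a 1D box filter
// with the given radius. The input window is read once into private
// storage and reused for every output in the batch.
// Returns the number of input reads so the saving can be measured.
std::size_t filter_batch(const std::vector<float>& in, std::vector<float>& out,
                         std::size_t first, std::size_t batch,
                         std::size_t radius) {
  assert(first >= radius && first + batch + radius <= in.size());
  assert(out.size() >= first + batch);
  std::vector<float> window(in.data() + first - radius,
                            in.data() + first + batch + radius);
  for (std::size_t i = 0; i < batch; ++i) {
    float sum = 0.0f;
    for (std::size_t j = 0; j <= 2 * radius; ++j) sum += window[i + j];
    out[first + i] = sum;
  }
  return window.size();  // batch + 2 * radius reads
}
```

With a radius of 1, four outputs computed one per call read 4 x 3 = 12 inputs, while one call with a batch of 4 reads only 6.
</script>
</div>
</section>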
<!--Slide 12-->
<section>
<div class="hbox" data-markdown>
#### Optimizing for throughput
</div>
<div class="container" data-markdown>
* Throughput is a metric used to measure how much data is being passed through the GPU.
* Optimizing for throughput means getting the most out of the memory bandwidth at the various levels of the memory hierarchy.
* The key is to hide the latency of the memory transfers and avoid GPU resources being idle.
* This is most important for memory-bound algorithms.
</div>
</section>
<!--Slide 13-->
<section>
<div class="hbox" data-markdown>
#### Optimizing for throughput
</div>
<div class="container" data-markdown>
We have already seen a number of techniques for optimizing throughput.
</div>
<div class="container" data-markdown>
* Coalescing global memory access.
* Using local memory.
* Explicit vectorization.
</div>
</section>
<!--Slide 14-->
<section>
<div class="hbox" data-markdown>
#### Optimizing for throughput
</div>
<div class="container" data-markdown>
There are a number of other optimizations to consider; some depend on the application or the target device.
</div>
<div class="container" data-markdown>
* Double buffering.
* Using constant memory.
* Using texture memory.
</div>
</section>
<!--Slide 15-->
<section>
<div class="hbox" data-markdown>
#### Double buffering
</div>
<div class="container" data-markdown>
* With particularly large data you can hit global memory
limitations.
* When this happens you need to pipeline the work,
execute the kernel function itself multiple times,
each time operating on a tile of the input data.
* But copying to global memory can be very expensive.
* To maintain performance you need to hide the latency
of moving the data to the device.
</div>
</section>
<!--Slide 16-->
<section>
<div class="hbox" data-markdown>
#### Double buffering
</div>
<div class="container" data-markdown>

</div>
<div class="container" data-markdown>
* So in this case you have multiple invocations of the
kernel function.
* And between each invocation the data of the next tile
must be moved to the GPU.
</div>
</section>
<!--Slide 17-->
<section>
<div class="hbox" data-markdown>
#### Double buffering
</div>
<div class="container" data-markdown>

</div>
<div class="container" data-markdown>
* The cost of moving data to the GPU is very high,
so this causes significant gaps where the GPU is
idle.
</div>
</section>
<!--Slide 18-->
<section>
<div class="hbox" data-markdown>
#### Double buffering
</div>
<div class="container" data-markdown>

</div>
<div class="container" data-markdown>
* An option for optimizing this is to double buffer the data being moved to and accessed on the GPU.
</div>
</section>
<!--Slide 19-->
<section>
<div class="hbox" data-markdown>
#### Double buffering
</div>
<div class="container" data-markdown>

</div>
<div class="container" data-markdown>
* This allows overlapping of compute and data movement.
* This means that while one kernel invocation computes
a tile of data, the data for the next tile is already
being moved.
</div>
</section>
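<section>
<div class="hbox" data-markdown>
#### Double buffering
</div>
<div class="container" data-markdown>
<script type="text/template">
The benefit of overlapping can be estimated with a simple timeline model. This is a sketch, assuming every tile takes the same time to copy and to compute and that the device can copy and compute fully in parallel; the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Total time when every tile is copied and then computed, one after another.
double serial_time(int tiles, double copy, double compute) {
  assert(tiles > 0);
  return tiles * (copy + compute);
}

// Total time with two buffers: the copy of tile i + 1 runs while tile i
// is being computed, so each overlapped step costs only the slower of the two.
double double_buffered_time(int tiles, double copy, double compute) {
  assert(tiles > 0);
  return copy + (tiles - 1) * std::max(copy, compute) + compute;
}

// Buffer that a tile lives in: tiles alternate between buffer 0 and 1.
int buffer_for_tile(int tile) { return tile % 2; }
```

For 4 tiles with a copy time of 3 and a compute time of 5 (arbitrary units), the serial schedule takes 32 and the double-buffered schedule takes 23; only the first copy and the last compute are not overlapped.
</script>
</div>
</section>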
<!--Slide 20-->
<section>
<div class="hbox" data-markdown>
#### Use constant memory
</div>
<div class="container" data-markdown>

</div>
<div class="container" data-markdown>
* Some SYCL devices provide a benefit from using constant memory.
* This is a dedicated region of global memory that is read-only, and therefore can be faster to access.
* Not all devices will benefit from using constant memory.
* There is generally a much lower limit to what you can allocate in constant memory.
* To use constant memory simply create an `accessor` with the `access::target::constant_buffer` access target.
</div>
</section>
<!--Slide 21-->
<section>
<div class="hbox" data-markdown>
#### Use texture memory
</div>
<div class="container" data-markdown>

</div>
<div class="container" data-markdown>
* Some SYCL devices provide a benefit from using texture memory.
* This is the texture memory used by the render pipeline.
* This can be more efficient when accessing data in a pixel format.
* To use texture memory use the `image` class.
</div>
</section>
<!--Slide 22-->
<section>
<div class="hbox" data-markdown>
#### Further tips
</div>
<div class="container" data-markdown>
* Profile your kernel functions.
* Follow vendor optimization guides.
* Tune your algorithms.
</div>
</section>
<!--Slide 23-->
<section>
<div class="hbox" data-markdown>
## Questions
</div>
</section>
<!--Slide 24-->
<section>
<div class="hbox" data-markdown>
#### Exercise
</div>
<div class="container" data-markdown>
Code_Exercises/Work_Group_Sizes/source
</div>
<div class="container" data-markdown>
Try out different work-group sizes and measure the performance.
</div>
<div class="container" data-markdown>
Try out some of the other optimization techniques we've looked at here.
</div>
</section>
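<section>
<div class="hbox" data-markdown>
#### Exercise
</div>
<div class="container" data-markdown>
<script type="text/template">
A small harness can make the measurements repeatable. This is a minimal sketch in standard C++, not part of the exercise source; `time_once` and `fastest_size` are hypothetical helpers, and the `measure` callback is supplied by the caller.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <limits>
#include <vector>

// Wall-clock seconds taken by one call to `run`.
double time_once(const std::function<void()>& run) {
  const auto start = std::chrono::steady_clock::now();
  run();
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(stop - start).count();
}

// Try each work-group size and return the one with the lowest cost.
// `measure` returns a cost for a given size, for example the result of
// time_once around a kernel launch.
std::size_t fastest_size(const std::vector<std::size_t>& sizes,
                         const std::function<double(std::size_t)>& measure) {
  assert(!sizes.empty());
  std::size_t best = sizes.front();
  double best_cost = std::numeric_limits<double>::infinity();
  for (std::size_t size : sizes) {
    const double cost = measure(size);
    if (cost < best_cost) {
      best_cost = cost;
      best = size;
    }
  }
  return best;
}
```

When timing a SYCL kernel, the callable should submit the kernel and wait for it to finish; otherwise only the submission is timed. Running each size several times and discarding the first run helps to exclude one-off start-up costs.
</script>
</div>
</section>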
</div>
</div>
<script src="../common-revealjs/js/reveal.js"></script>
<script src="../common-revealjs/plugin/markdown/marked.js"></script>
<script src="../common-revealjs/plugin/markdown/markdown.js"></script>
<script src="../common-revealjs/plugin/notes/notes.js"></script>
<script>
Reveal.initialize({mouseWheel: true, defaultNotes: true});
Reveal.configure({ slideNumber: true });
</script>
</body>
</html>