Commit e4cd57f

Refactoring Docs/ folder to be compatible with the upcoming website. (m4rs-mt#849)

1 parent aa771e1 · commit e4cd57f

24 files changed: +603 −379 lines changed

.gitignore (+3)

```diff
@@ -246,6 +246,9 @@ ModelManifest.xml
 # Rider
 .idea/
 
+# macOS
+.DS_Store
+
 # Ignore specific template outputs
 Src/ILGPU/AtomicFunctions.cs
 Src/ILGPU/Backends/PTX/PTXIntrinsics.Generated.cs
```
@@ -1,27 +1,30 @@

# What is ILGPU

ILGPU provides an interface for programming GPUs that uses a sane programming language, C#.
ILGPU takes your normal C# code (perhaps with a few small changes) and transforms it into either
OpenCL or PTX (think CUDA assembly). This combines all the power, flexibility, and performance of
CUDA / OpenCL with the ease of use of C#.

# Setting up ILGPU.

This tutorial is a little different now because we are going to be looking at ILGPU 1.0.0.

ILGPU should work on any 64-bit platform that .NET supports. I have even used it on the inexpensive NVIDIA Jetson Nano
with pretty decent CUDA performance.

Technically ILGPU supports F#, but I don't use F# enough to really tutorialize it. I will be sticking to C# in these
tutorials.

### High level setup steps.

If enough people care I can record a short video of this process, but I expect this will be enough for most programmers.

1. Install the most recent [.NET SDK](https://dotnet.microsoft.com/download/visual-studio-sdks) for your chosen
   platform.
2. Create a new C# project.
   ![dotnet new console](Images/newProject.png?raw=true)
3. Add the ILGPU package.
   ![dotnet add package ILGPU](Images/beta.png?raw=true)
4. ??????
5. Profit
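For a pure CLI workflow, steps 1–3 above boil down to a couple of commands (a sketch; `MyILGPUApp` is a hypothetical project name, and the image captions above show the same two `dotnet` commands):

```shell
# Scaffold a console app and add the ILGPU package (project name is illustrative).
mkdir MyILGPUApp
cd MyILGPUApp
dotnet new console
dotnet add package ILGPU
dotnet run
```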

@@ -30,5 +33,5 @@

If you would like more info about GPGPU I would recommend the following resources.

* [The Cuda docs](https://developer.nvidia.com/about-cuda) / [OpenCL docs](https://www.khronos.org/opencl/)
* [An Introduction to CUDA Programming - 5min](https://www.youtube.com/watch?v=kIyCq6awClM)
* [Introduction to GPU Architecture and Programming Models - 2h 14min](https://www.youtube.com/watch?v=uvVy3CqpVbM)

Docs/Primer_01.md → Docs/01_Primers/02_A-GPU-Is-Not-A-CPU.md (+28 −18)
@@ -1,4 +1,5 @@

# Primer 01: Code

This page will provide a quick rundown of the basics of how kernels (think GPU programs) run.
If you are already familiar with CUDA or OpenCL programs you can probably skip this.
@@ -10,32 +11,34 @@ To steal a quote from a very good [talk](https://www.youtube.com/watch?v=uvVy3CqpVbM)

>
> 2. Data Locality
>
> 3. Threading

## A GPU is not a CPU

If you will allow a little bit of **massive oversimplification**, this is pretty easy to understand.

### How does a CPU work?

A traditional processor has a very simple cycle: fetch, decode, execute.

It grabs an instruction from memory (the fetch), figures out how to perform said instruction (the decode),
and does the instruction (the execute). This cycle then repeats for all the instructions in your algorithm.
Executing this linear stream of instructions is fine for most programs because CPUs are super fast, and most
algorithms are serial.

What happens when you have an algorithm that can be processed in parallel? A CPU has multiple cores, each
doing its own fetch, decode, execute. You can spread the algorithm across all the cores on the CPU, but
in the end each core will still be running a stream of instructions, likely the *same* stream of instructions,
but with *different* data.

GPUs and CPUs both try to exploit this fact, but use two very different methods.

##### CPU | SIMD: Single Instruction Multiple Data.

CPUs have a trick for parallel programs called SIMD. These are a set of instructions
that allow you to have one instruction do operations on multiple pieces of data at once.

Let's say a CPU has an add instruction:
> ADD RegA RegB

Which would perform
@@ -46,9 +49,9 @@ The SIMD version would be:

Which would perform
> RegA = RegE + RegA
>
> RegB = RegF + RegB
>
> RegC = RegG + RegC
>
> RegD = RegH + RegD
@@ -59,28 +62,31 @@ A clever programmer can take these instructions and get a 3x-8x performance improvement
in very math heavy scenarios.
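In C#, this kind of packed add is exposed directly through `System.Numerics.Vector<T>`, which the JIT lowers to SIMD instructions where the hardware supports them. A small illustrative aside (not part of the original tutorial):

```c#
using System;
using System.Linq;
using System.Numerics;

public static class SimdAdd
{
    public static void Main()
    {
        int lanes = Vector<int>.Count;                   // e.g. 8 ints per vector on an AVX2 machine
        int[] a = Enumerable.Range(0, lanes).ToArray();
        int[] b = Enumerable.Repeat(100, lanes).ToArray();

        // One vector add replaces `lanes` scalar ADD instructions,
        // just like the RegA..RegD example above.
        Vector<int> sum = new Vector<int>(a) + new Vector<int>(b);

        for (int i = 0; i < lanes; i++)
            Console.WriteLine($"{a[i]} + {b[i]} = {sum[i]}");
    }
}
```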
##### GPU | SIMT: Single Instruction Multiple Threads.

GPUs have SIMT. SIMT is the same idea as SIMD, but instead of just doing the math instructions
in parallel, why not do **all** the instructions in parallel?

The GPU assumes all the instructions you are going to fetch and decode for 32 threads are
the same, so it does 1 fetch and decode to set up 32 execute steps, then it does all 32 execute
steps at once. This allows you to get 32-way multithreading per single core, if and only
if all 32 threads want to do the same instruction.

### Kernels

With this knowledge we can now talk about kernels. Kernels are just GPU programs, but because
a GPU program is not a single thread but many, it works a little differently.

When I was first learning about kernels I had an observation that made kernels kinda *click*
in my head.

Kernels and Parallel.For have the same usage pattern.

If you don't know about Parallel.For, it is a function that provides a really easy way to run
code on every core of the CPU. All you do is pass in a start index, an end index, and a function
that takes an index. Then the function is called from some thread with an index. There are no guarantees
about what core an index is run on, or what order the threads are run, but you get a **very** simple
interface for running parallel functions.
```c#
using System;
using System.Threading.Tasks;
// … (body of the Parallel.For example elided in this diff view) …
    }
}
```
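Since the diff view truncates the tutorial's own Parallel.For listing, here is a minimal program in the same spirit (my sketch with illustrative names, not the original code):

```c#
using System;
using System.Threading.Tasks;

public static class ParallelForSketch
{
    public static void Main()
    {
        int[] output = new int[1000];

        // Start index, end index (exclusive), and a function taking an index.
        // Each index is invoked from some worker thread, in no guaranteed order.
        Parallel.For(0, output.Length, i =>
        {
            output[i] = i * 2;
        });

        Console.WriteLine(output[500]); // 1000
    }
}
```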
Running the same program as a kernel is **very** similar:
```c#
using ILGPU;
using ILGPU.Runtime;
// … (body of the kernel example elided in this diff view) …
    }
}
```

You do not need to understand what is going on in the kernel example to see that the Parallel.For code uses the same
API. The major differences are due to how memory is handled.

Parallel.For and Kernels both have the same potential for race conditions, and for each you must take care to prevent
them.
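For comparison, an ILGPU version of the same doubling loop might look roughly like this, following the patterns in the ILGPU samples (a sketch against the ILGPU 1.x API as I understand it, not the tutorial's elided listing; verify the names against the package):

```c#
using ILGPU;
using ILGPU.Runtime;

public static class KernelSketch
{
    // Kernels are static methods taking a thread index plus data views.
    static void DoubleKernel(Index1D i, ArrayView<int> data) => data[i] = i * 2;

    public static void Main()
    {
        using Context context = Context.CreateDefault();
        using Accelerator accelerator =
            context.GetPreferredDevice(preferCPU: false).CreateAccelerator(context);

        using var buffer = accelerator.Allocate1D<int>(1000);
        var kernel = accelerator.LoadAutoGroupedStreamKernel<Index1D, ArrayView<int>>(DoubleKernel);

        kernel((int)buffer.Length, buffer.View);   // one GPU thread per index
        accelerator.Synchronize();

        int[] host = buffer.GetAsArray1D();        // copy results back to the CPU
    }
}
```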
@@ -1,57 +1,63 @@

# Primer 02: Memory

The following is my understanding of the performance quirks with GPUs due to memory, cache, and coalescent memory
access.
Just like with Primer 01, if you have a decent understanding of CUDA or OpenCL you can skip this.

Ok, buckle up.

## Memory and bandwidth and threads. Oh my!

### Computers need memory, and memory is slow<sup>0</sup>. (Like, really slow)

Back in the day (I assume, the first computer I remember using had DDR-200) computer memory
was FAST. Most of the time the limiting factor was the CPU, though correctly timing video output was also
a driving force. As an example, the C64 ran the memory at 2x the CPU frequency so the VIC-II
graphics chip could share the CPU memory by stealing half the cycles. In the almost 40 years since the C64, humanity
has gotten much better at making silicon and precious metals do our bidding. Feeding
data into the CPU from memory has become the slow part. Memory is slow.

Why is memory slow? To be honest, it seems to me that it's caused by two things:

1. Physics<br/>
Programmers like to think of computers as an abstract thing, a platonic ideal.
But here in the real world there are no spherical cows, no free lunch. Memory values are ACTUAL
ELECTRONS traveling through silicon and precious metals.

In general, the farther the ACTUAL ELECTRONS are from the thing doing the math, the slower they are
to access.

2. We ~~need~~ want a lot of memory.<br/>
We can make memory that is almost as fast as our processors, but it must literally be made directly into the
processor cores in silicon.
Not only is this very expensive, but the more memory in silicon, the less room for processor stuff.

### How do processors deal with slow memory?

This leads to an optimization problem. Modern processor designers use a complex system of tiered
memory consisting of several layers of small, fast, on-die memory and large, slow, distant, off-die memory.

A processor can also perform a few tricks to help us deal with the fact that memory is slow.
One example is prefetching. If a program uses memory at location X it probably will use the
memory at location X+1, therefore the processor *prefetches* a whole chunk of memory and puts it in
the cache, closer to the processor. This way if you do need the memory at X+1 it is already in cache.
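You can see prefetching and locality at work from plain C# with a small benchmark comparing a linear scan against a strided scan of the same array (an illustrative sketch, not part of the original doc; absolute timings depend entirely on your machine):

```c#
using System;
using System.Diagnostics;

public static class LocalityDemo
{
    public static void Main()
    {
        const int n = 1 << 24;                 // 16M ints = 64 MB, larger than any cache
        int[] data = new int[n];
        long sum = 0;

        // Linear scan: the prefetcher sees the pattern and hides memory latency.
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < n; i++) sum += data[i];
        Console.WriteLine($"linear:  {sw.ElapsedMilliseconds} ms");

        // Strided scan: same total work, but each step jumps a cache line,
        // so the prefetched neighbors are wasted. Typically noticeably slower.
        const int stride = 16;                 // 16 ints * 4 bytes = one 64-byte cache line
        sw.Restart();
        for (int s = 0; s < stride; s++)
            for (int i = s; i < n; i += stride) sum += data[i];
        Console.WriteLine($"strided: {sw.ElapsedMilliseconds} ms");

        Console.WriteLine(sum);                // keep the loops from being optimized away
    }
}
```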

I am getting off topic. For a more detailed explanation, see this thing I found on
[google](https://formulusblack.com/blog/compute-performance-distance-of-data-as-a-measure-of-latency/).

# What does this mean for ILGPU?

#### GPUs have memory, and memory is slow.

GPUs on paper have TONS of memory bandwidth; my GPU has around 10x the memory bandwidth my CPU does. Right? Yeah...

###### Kinda

If we go back into spherical cow territory and ignore a ton of important details, we can illustrate an
important quirk in GPU design that directly impacts performance.

My CPU, a Ryzen 5 3600 with dual channel DDR4, gets around 34 GB/s of memory bandwidth. The GDDR6 in my GPU, an RTX
2060, gets around 336 GB/s of memory bandwidth.

But let's compare bandwidth per thread.

@@ -60,25 +66,34 @@ CPU: Ryzen 5 3600 34 GB/s / 12 threads = 2.83 GB/s per thread

GPU: RTX 2060 336 GB/s / (30 SMs * 512 threads<sup>1</sup>) = 0.0218 GB/s or just *22.4 MB/s per thread*
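As a sanity check on the arithmetic above (a sketch; it uses the doc's own estimate of 30 SMs × 512 threads and 1 GB = 1024 MB):

```c#
using System;

public static class BandwidthPerThread
{
    public static void Main()
    {
        double cpuGBs = 34.0;                            // dual channel DDR4
        int cpuThreads = 12;                             // Ryzen 5 3600
        Console.WriteLine(cpuGBs / cpuThreads);          // ≈ 2.83 GB/s per thread

        double gpuGBs = 336.0;                           // RTX 2060 GDDR6
        int gpuThreads = 30 * 512;                       // 30 SMs * 512 threads = 15360
        double perThreadGBs = gpuGBs / gpuThreads;       // ≈ 0.0219 GB/s
        Console.WriteLine(perThreadGBs * 1024);          // ≈ 22.4 MB/s per thread
    }
}
```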
#### So what?

In the end computers need memory because programs need memory. There are a few things I think about as I program that I
think help:

1. If your code scans through memory linearly the GPU can optimize it by prefetching the data. This leads to the
"struct of arrays" approach, more on that in the structs tutorial.
2. GPUs take prefetching to the next level by having coalescent memory access, which I need a more in-depth explanation
of, but basically if threads are accessing memory in a linear way that the GPU can detect, it can send one memory
access for the whole chunk of threads.

Again, this all boils down to the very simple notion that memory is slow, and it gets slower the farther it gets from
the processor.

> <sup>0</sup>
> This is obviously a complex topic. In general, modern memory bandwidth has a speed and a latency problem. They
> are different, but in subtle ways. If you are interested in this I would do some more research, I am just
> some random dude on the internet.

> <sup>1</sup>
> I thought this would be simple, but after double checking, I found that the question "How many threads can a GPU run
> at once?" is a hard question and also the wrong question to answer. According to the CUDA manual, at maximum an SM
> (Streaming Multiprocessor) can have 16 warps executing simultaneously and 32 threads per warp, so it can issue at
> minimum 512 memory accesses per cycle. You may have more warps scheduled due to memory / instruction latency but a
> minimum estimate will do. This still provides a good illustration for how little memory bandwidth you have per
> thread. We will get into more detail in a grouping tutorial.
