Commit e4cd57f

Refactoring Docs/ folder to be compatible with the upcoming website. (m4rs-mt#849)

1 parent aa771e1 · commit e4cd57f

24 files changed: +603 −379 lines changed

.gitignore (+3)

```diff
@@ -246,6 +246,9 @@ ModelManifest.xml
 # Rider
 .idea/
 
+# macOS
+.DS_Store
+
 # Ignore specific template outputs
 Src/ILGPU/AtomicFunctions.cs
 Src/ILGPU/Backends/PTX/PTXIntrinsics.Generated.cs
```
@@ -1,27 +1,30 @@

# What is ILGPU

ILGPU provides an interface for programming GPUs that uses a sane programming language, C#.
ILGPU takes your normal C# code (perhaps with a few small changes) and transforms it into either
OpenCL or PTX (think CUDA assembly). This combines all the power, flexibility, and performance of
CUDA / OpenCL with the ease of use of C#.

# Setting up ILGPU.

This tutorial is a little different now because we are going to be looking at ILGPU 1.0.0.

ILGPU should work on any 64-bit platform that .NET supports. I have even used it on the inexpensive NVIDIA Jetson Nano
with pretty decent CUDA performance.

Technically ILGPU supports F#, but I don't use F# enough to really tutorialize it. I will be sticking to C# in these
tutorials.

### High level setup steps.

If enough people care I can record a short video of this process, but I expect this will be enough for most programmers.

1. Install the most recent [.NET SDK](https://dotnet.microsoft.com/download/visual-studio-sdks) for your chosen
   platform.
2. Create a new C# project.
   ![dotnet new console](Images/newProject.png?raw=true)
3. Add the ILGPU package.
   ![dotnet add package ILGPU](Images/beta.png?raw=true)
4. ??????
5. Profit
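For a pure CLI workflow, steps 1–3 above boil down to a couple of commands (a sketch; `MyILGPUApp` is a hypothetical project name, and the image captions above show the same two `dotnet` commands):

```shell
# Scaffold a console app and add the ILGPU package (project name is illustrative).
mkdir MyILGPUApp
cd MyILGPUApp
dotnet new console
dotnet add package ILGPU
dotnet run
```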

@@ -30,5 +33,5 @@

If you would like more info about GPGPU I would recommend the following resources.

* [The Cuda docs](https://developer.nvidia.com/about-cuda) / [OpenCL docs](https://www.khronos.org/opencl/)
* [An Introduction to CUDA Programming - 5min](https://www.youtube.com/watch?v=kIyCq6awClM)
* [Introduction to GPU Architecture and Programming Models - 2h 14min](https://www.youtube.com/watch?v=uvVy3CqpVbM)

Docs/Primer_01.md → Docs/01_Primers/02_A-GPU-Is-Not-A-CPU.md (+28 −18)
@@ -1,4 +1,5 @@

# Primer 01: Code

This page will provide a quick rundown of the basics of how kernels (think GPU programs) run.
If you are already familiar with CUDA or OpenCL programs you can probably skip this.
@@ -10,32 +11,34 @@ To steal a quote from a very good [talk](https://www.youtube.com/watch?v=uvVy3CqpVbM)

>
> 2. Data Locality
>
> 3. Threading

## A GPU is not a CPU

If you will allow a little bit of **massive oversimplification**, this is pretty easy to understand.

### How does a CPU work?

A traditional processor has a very simple cycle: fetch, decode, execute.

It grabs an instruction from memory (the fetch), figures out how to perform said instruction (the decode),
and does the instruction (the execute). This cycle then repeats for all the instructions in your algorithm.
Executing this linear stream of instructions is fine for most programs because CPUs are super fast, and most
algorithms are serial.

What happens when you have an algorithm that can be processed in parallel? A CPU has multiple cores, each
doing its own fetch, decode, execute. You can spread the algorithm across all the cores on the CPU, but
in the end each core will still be running a stream of instructions, likely the *same* stream of instructions,
but with *different* data.

GPUs and CPUs both try to exploit this fact, but use two very different methods.

##### CPU | SIMD: Single Instruction Multiple Data.

CPUs have a trick for parallel programs called SIMD. These are a set of instructions
that allow you to have one instruction do operations on multiple pieces of data at once.

Let's say a CPU has an add instruction:
> ADD RegA RegB

Which would perform
@@ -46,9 +49,9 @@ The SIMD version would be:

Which would perform
> RegA = RegE + RegA
>
> RegB = RegF + RegB
>
> RegC = RegG + RegC
>
> RegD = RegH + RegD
@@ -59,28 +62,31 @@ A clever programmer can take these instructions and get a 3x-8x performance improvement
in very math heavy scenarios.
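In C#, this kind of packed add is exposed directly through `System.Numerics.Vector<T>`, which the JIT lowers to SIMD instructions where the hardware supports them. A small illustrative aside (not part of the original tutorial):

```c#
using System;
using System.Linq;
using System.Numerics;

public static class SimdAdd
{
    public static void Main()
    {
        int lanes = Vector<int>.Count;                   // e.g. 8 ints per vector on an AVX2 machine
        int[] a = Enumerable.Range(0, lanes).ToArray();
        int[] b = Enumerable.Repeat(100, lanes).ToArray();

        // One vector add replaces `lanes` scalar ADD instructions,
        // just like the RegA..RegD example above.
        Vector<int> sum = new Vector<int>(a) + new Vector<int>(b);

        for (int i = 0; i < lanes; i++)
            Console.WriteLine($"{a[i]} + {b[i]} = {sum[i]}");
    }
}
```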
##### GPU | SIMT: Single Instruction Multiple Threads.

GPUs have SIMT. SIMT is the same idea as SIMD, but instead of just doing the math instructions
in parallel, why not do **all** the instructions in parallel?

The GPU assumes all the instructions you are going to fetch and decode for 32 threads are
the same, so it does 1 fetch and decode to set up 32 execute steps, then it does all 32 execute
steps at once. This allows you to get 32-way multithreading per single core, if and only
if all 32 threads want to do the same instruction.

### Kernels

With this knowledge we can now talk about kernels. Kernels are just GPU programs, but because
a GPU program is not a single thread but many, it works a little differently.

When I was first learning about kernels I had an observation that made kernels kinda *click*
in my head.

Kernels and Parallel.For have the same usage pattern.

If you don't know about Parallel.For, it is a function that provides a really easy way to run
code on every core of the CPU. All you do is pass in a start index, an end index, and a function
that takes an index. Then the function is called from some thread with an index. There are no guarantees
about what core an index is run on, or what order the threads are run, but you get a **very** simple
interface for running parallel functions.
```c#
using System;
using System.Threading.Tasks;
// … (body of the Parallel.For example elided in this diff view) …
    }
}
```
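Since the diff view truncates the tutorial's own Parallel.For listing, here is a minimal program in the same spirit (my sketch with illustrative names, not the original code):

```c#
using System;
using System.Threading.Tasks;

public static class ParallelForSketch
{
    public static void Main()
    {
        int[] output = new int[1000];

        // Start index, end index (exclusive), and a function taking an index.
        // Each index is invoked from some worker thread, in no guaranteed order.
        Parallel.For(0, output.Length, i =>
        {
            output[i] = i * 2;
        });

        Console.WriteLine(output[500]); // 1000
    }
}
```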
Running the same program as a kernel is **very** similar:
```c#
using ILGPU;
using ILGPU.Runtime;
// … (body of the kernel example elided in this diff view) …
    }
}
```

You do not need to understand what is going on in the kernel example to see that the Parallel.For code uses the same
API. The major differences are due to how memory is handled.

Parallel.For and Kernels both have the same potential for race conditions, and for each you must take care to prevent
them.
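For comparison, an ILGPU version of the same doubling loop might look roughly like this, following the patterns in the ILGPU samples (a sketch against the ILGPU 1.x API as I understand it, not the tutorial's elided listing; verify the names against the package):

```c#
using ILGPU;
using ILGPU.Runtime;

public static class KernelSketch
{
    // Kernels are static methods taking a thread index plus data views.
    static void DoubleKernel(Index1D i, ArrayView<int> data) => data[i] = i * 2;

    public static void Main()
    {
        using Context context = Context.CreateDefault();
        using Accelerator accelerator =
            context.GetPreferredDevice(preferCPU: false).CreateAccelerator(context);

        using var buffer = accelerator.Allocate1D<int>(1000);
        var kernel = accelerator.LoadAutoGroupedStreamKernel<Index1D, ArrayView<int>>(DoubleKernel);

        kernel((int)buffer.Length, buffer.View);   // one GPU thread per index
        accelerator.Synchronize();

        int[] host = buffer.GetAsArray1D();        // copy results back to the CPU
    }
}
```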
@@ -1,57 +1,63 @@

# Primer 02: Memory

The following is my understanding of the performance quirks with GPUs due to memory, cache, and coalescent memory
access.
Just like with Primer 01, if you have a decent understanding of CUDA or OpenCL you can skip this.

Ok, buckle up.

## Memory and bandwidth and threads. Oh my!

### Computers need memory, and memory is slow<sup>0</sup>. (Like, really slow)

Back in the day (I assume, the first computer I remember using had DDR-200) computer memory
was FAST. Most of the time the limiting factor was the CPU, though correctly timing video output was also
a driving force. As an example, the C64 ran the memory at 2x the CPU frequency so the VIC-II
graphics chip could share the CPU memory by stealing half the cycles. In the almost 40 years since the C64, humanity
has gotten much better at making silicon and precious metals do our bidding. Feeding
data into the CPU from memory has become the slow part. Memory is slow.

Why is memory slow? To be honest, it seems to me that it's caused by two things:

1. Physics<br/>
Programmers like to think of computers as an abstract thing, a platonic ideal.
But here in the real world there are no spherical cows, no free lunch. Memory values are ACTUAL
ELECTRONS traveling through silicon and precious metals.

In general, the farther the ACTUAL ELECTRONS are from the thing doing the math, the slower they are
to access.

2. We ~~need~~ want a lot of memory.<br/>
We can make memory that is almost as fast as our processors, but it must literally be made directly into the
processor cores in silicon.
Not only is this very expensive, but the more memory in silicon, the less room for processor stuff.

### How do processors deal with slow memory?

This leads to an optimization problem. Modern processor designers use a complex system of tiered
memory consisting of several layers of small, fast, on-die memory and large, slow, distant, off-die memory.

A processor can also perform a few tricks to help us deal with the fact that memory is slow.
One example is prefetching. If a program uses memory at location X it probably will use the
memory at location X+1, therefore the processor *prefetches* a whole chunk of memory and puts it in
the cache, closer to the processor. This way if you do need the memory at X+1 it is already in cache.
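You can see prefetching and locality at work from plain C# with a small benchmark comparing a linear scan against a strided scan of the same array (an illustrative sketch, not part of the original doc; absolute timings depend entirely on your machine):

```c#
using System;
using System.Diagnostics;

public static class LocalityDemo
{
    public static void Main()
    {
        const int n = 1 << 24;                 // 16M ints = 64 MB, larger than any cache
        int[] data = new int[n];
        long sum = 0;

        // Linear scan: the prefetcher sees the pattern and hides memory latency.
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < n; i++) sum += data[i];
        Console.WriteLine($"linear:  {sw.ElapsedMilliseconds} ms");

        // Strided scan: same total work, but each step jumps a cache line,
        // so the prefetched neighbors are wasted. Typically noticeably slower.
        const int stride = 16;                 // 16 ints * 4 bytes = one 64-byte cache line
        sw.Restart();
        for (int s = 0; s < stride; s++)
            for (int i = s; i < n; i += stride) sum += data[i];
        Console.WriteLine($"strided: {sw.ElapsedMilliseconds} ms");

        Console.WriteLine(sum);                // keep the loops from being optimized away
    }
}
```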

I am getting off topic. For a more detailed explanation, see this thing I found on
[google](https://formulusblack.com/blog/compute-performance-distance-of-data-as-a-measure-of-latency/).

# What does this mean for ILGPU?

#### GPUs have memory, and memory is slow.

GPUs on paper have TONS of memory bandwidth; my GPU has around 10x the memory bandwidth my CPU does. Right? Yeah...

###### Kinda

If we go back into spherical cow territory and ignore a ton of important details, we can illustrate an
important quirk in GPU design that directly impacts performance.

My CPU, a Ryzen 5 3600 with dual channel DDR4, gets around 34 GB/s of memory bandwidth. The GDDR6 in my GPU, an RTX
2060, gets around 336 GB/s of memory bandwidth.

But let's compare bandwidth per thread.

@@ -60,25 +66,34 @@ CPU: Ryzen 5 3600 34 GB/s / 12 threads = 2.83 GB/s per thread

GPU: RTX 2060 336 GB/s / (30 SMs * 512 threads<sup>1</sup>) = 0.0218 GB/s or just *22.4 MB/s per thread*
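As a sanity check on the arithmetic above (a sketch; it uses the doc's own estimate of 30 SMs × 512 threads and 1 GB = 1024 MB):

```c#
using System;

public static class BandwidthPerThread
{
    public static void Main()
    {
        double cpuGBs = 34.0;                            // dual channel DDR4
        int cpuThreads = 12;                             // Ryzen 5 3600
        Console.WriteLine(cpuGBs / cpuThreads);          // ≈ 2.83 GB/s per thread

        double gpuGBs = 336.0;                           // RTX 2060 GDDR6
        int gpuThreads = 30 * 512;                       // 30 SMs * 512 threads = 15360
        double perThreadGBs = gpuGBs / gpuThreads;       // ≈ 0.0219 GB/s
        Console.WriteLine(perThreadGBs * 1024);          // ≈ 22.4 MB/s per thread
    }
}
```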
#### So what?

In the end computers need memory because programs need memory. There are a few things I think about as I program that I
think help:

1. If your code scans through memory linearly the GPU can optimize it by prefetching the data. This leads to the
"struct of arrays" approach, more on that in the structs tutorial.
2. GPUs take prefetching to the next level by having coalescent memory access, which I need a more in-depth explanation
of, but basically if threads are accessing memory in a linear way that the GPU can detect, it can send one memory
access for the whole chunk of threads.

Again, this all boils down to the very simple notion that memory is slow, and it gets slower the farther it gets from
the processor.

> <sup>0</sup>
> This is obviously a complex topic. In general, modern memory bandwidth has a speed and a latency problem. They
> are different, but in subtle ways. If you are interested in this I would do some more research, I am just
> some random dude on the internet.

> <sup>1</sup>
> I thought this would be simple, but after double checking, I found that the question "How many threads can a GPU run
> at once?" is a hard question and also the wrong question to answer. According to the CUDA manual, at maximum an SM
> (Streaming Multiprocessor) can have 16 warps executing simultaneously and 32 threads per warp, so it can issue at
> minimum 512 memory accesses per cycle. You may have more warps scheduled due to memory / instruction latency but a
> minimum estimate will do. This still provides a good illustration for how little memory bandwidth you have per
> thread. We will get into more detail in a grouping tutorial.
