
Commit a57b32a: "update readme" (1 parent: 122720e)

File tree: 6 files changed, +65 −32 lines

README.md

Lines changed: 58 additions & 23 deletions
@@ -12,7 +12,7 @@ CUDA Path Tracer
 
 ## Introduction
 
-This is our third project of CIS 565 Fall 2022. In this project, our goal is to implement a GPU-accelerated ray tracer with CUDA.
+This is our third project of CIS 565 Fall 2022. In this project, we implemented a GPU-based path tracer with stream compaction.
 
 ## Representative Outcome
 
@@ -43,23 +43,31 @@ This is our third project of CIS 565 Fall 2022. In this project, our goal is to
 
 #### Direct Lighting with Multiple Importance Sampling
 
-#### Importance Sampled Skybox (Environment Map)
 
-Tired of "virtual artificial" light sources? Let's introduce some real-world li
+#### Importance Sampled HDR Environment Map (Skybox)
+
+Tired of "virtual artificial" light sources? Let's introduce some real-world lighting.
 
 #### Physically-Based Materials
 
+##### Lambertian Diffuse
+
+##### Metallic Workflow: Expressive and Artist-Friendly
+
 #### Normal Map & PBR Texture
 
-#### Physically-Based Camera: Depth of Field & Custom Bokeh Shape
+#### Physically-Based Camera: Depth of Field, Custom Bokeh Shape & Panorama
 
 This is my favorite part of the project.
 
 | No DOF | DOF |
 | --------------------------- | ----------------------- |
 | ![](./img/aperture_off.jpg) | ![](./img/aperture.jpg) |
 
-This idea can even be extended by stochastically sampling a masked aperture image instead of the whole aperture disk.
+This idea can even be extended by stochastically sampling a mask image instead of the whole disk area of the aperture:
 
 <div align="center">
 <img src="./scenes/texture/star3.jpg" width="15%"/>
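The masked-aperture sampling the added text describes can be sketched with simple rejection sampling on the host. This is a minimal illustration, not the project's actual code: the `Mask` type, the 0/255 texel convention, and the rejection loop are all assumptions.

```cpp
#include <algorithm>
#include <cstdint>
#include <random>
#include <utility>

// A tiny stand-in for an aperture mask image (e.g. star3.jpg):
// texels > 127 count as "open" parts of the aperture.
struct Mask {
    int w, h;
    const uint8_t* data;
    bool open(double u, double v) const {
        int x = std::min(int(u * w), w - 1);
        int y = std::min(int(v * h), h - 1);
        return data[y * w + x] > 127;
    }
};

// Rejection-sample a lens-space point (u, v) in [0,1)^2 so that
// rays only originate from the open texels of the mask.
std::pair<double, double> sampleAperture(const Mask& m, std::mt19937& rng) {
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    while (true) {
        double u = uni(rng), v = uni(rng);
        if (m.open(u, v)) return {u, v};
    }
}
```

Scaling the accepted point by the aperture radius and offsetting the camera ray origin then produces the custom bokeh shape.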
@@ -70,69 +78,96 @@ This idea can even be extended by stochastically sampling a masked aperture imag
 | ------------------------------ | ----------------------------- |
 | ![](./img/aperture_custom.jpg) | ![](./img/aperture_heart.jpg) |
 
-#### Efficiently Sampling: Sobol Low Discrepancy Sequence
+This creates some very interesting results.
+
+#### Efficient Sampling: Sobol Quasi-Monte Carlo Sequence
+
+In path tracing, or any other Monte Carlo-based light transport algorithm, apart from improving performance on the programming side, we can also improve it mathematically. Quasi-Monte Carlo sequences are quasi-random sequences widely used in Monte Carlo simulation, and they are mathematically proven to be more efficient than pseudorandom sequences (like what `thrust::default_random_engine` generates).
+
+Theoretically, to maximize the benefit of the Sobol sequence, we would need to generate a unique sequence for every pixel during each sampling iteration, in real time. This is not trivial, not to mention that computing each number requires up to 32 bit-loops. A better choice is to precompute one pixel's sequence, then apply some sort of perturbation to produce different sequences for the other pixels.
+
+Here is the result I got from testing the untextured [PBR texture scene](#representative-outcome). With the same number of samples per pixel, path tracing with the Sobol sequence produces much less noise (lower variance).
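The xor-scrambling idea can be sketched as below, restricted to the first Sobol dimension (which reduces to the van der Corput bit-reversal sequence). Direction numbers for higher dimensions and the project's actual perturbation scheme are omitted; this is an illustration only.

```cpp
#include <cstdint>

// First Sobol dimension: XOR together direction numbers v_k = 2^(31-k)
// for every set bit of the sample index. This is the "up to 32 bit-loops"
// cost mentioned above.
uint32_t sobolDim0(uint32_t index) {
    uint32_t result = 0;
    for (uint32_t bit = 0; index; index >>= 1, ++bit) {
        if (index & 1) result ^= 0x80000000u >> bit;
    }
    return result;
}

// Xor scrambling: every pixel XORs the same precomputed sequence with its
// own random mask, giving decorrelated per-pixel sequences while each one
// keeps its low discrepancy.
float scrambledSobol(uint32_t index, uint32_t pixelSeed) {
    return (sobolDim0(index) ^ pixelSeed) * 0x1p-32f; // map to [0, 1)
}
```

With `pixelSeed = 0` the first samples are 0, 0.5, 0.25, 0.75, …, which fill the unit interval far more evenly than pseudorandom draws.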
 
 | Pseudorandom Sequence | Xor-Scrambled Sobol Sequence |
 | ---------------------------- | ---------------------------- |
 | ![](./img/sampler_indep.jpg) | ![](./img/sampler_sobol.jpg) |
 
 
 #### Post Processing
 
 ##### Gamma Correction
 
+Implementing gamma correction is trivial, but it is necessary if we want our final image to be displayed correctly on a monitor.
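The gamma step can be sketched as below, assuming a plain power-law encode with gamma 2.2 (the project may use the exact sRGB transfer function instead):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Encode one linear-radiance channel to an 8-bit display value:
// clamp to [0,1], apply the 1/2.2 power, and quantize with rounding.
uint8_t encodePixel(float linear) {
    float g = std::pow(std::clamp(linear, 0.0f, 1.0f), 1.0f / 2.2f);
    return static_cast<uint8_t>(g * 255.0f + 0.5f);
}
```

Without this step, mid-gray radiance values appear far too dark on a standard display.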
 
----
+##### Tone Mapping
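The commit does not say which operator this section uses, so as an illustration only, here is the classic Reinhard operator, one common choice: it compresses unbounded HDR radiance into [0, 1) before gamma encoding.

```cpp
// Reinhard tone mapping (assumed operator, not necessarily the one used
// in this project): x / (1 + x) maps [0, inf) into [0, 1), so very bright
// values saturate smoothly instead of clipping.
float reinhard(float hdr) {
    return hdr / (1.0f + hdr);
}
```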
 
 
 
 ### Performance
 
-#### Fast Intersection: Stackless SAH-Constructed BVH
+#### Fast Intersection: Stackless SAH-Based Bounding Volume Hierarchy
 
-For ray-scene intersection, I did two levels of optimization.
+Ray-scene intersection is probably the most time-consuming part of path tracing.
 
-First, I wrote a SAH-based BVH. SAH, the Surface Area Heuristic is a method to decide how to split a set of bounding volumes
+I did two levels of optimization.
 
-The second level of optimization
+##### Better Tree Structure: Surface Area Heuristic
 
-#### Single-Kernel Path Tracing
+First, I implemented a SAH-based BVH. SAH, the Surface Area Heuristic, is a method to determine how to split a set of bounding volumes into subsets when constructing a BVH, such that the resulting tree structure is close to optimal.
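The heuristic can be sketched as follows (the cost constants and `AABB` layout are illustrative assumptions): among all candidate splits, the one with the lowest estimated cost is chosen.

```cpp
// Axis-aligned bounding box with the surface area needed by SAH.
struct AABB {
    float min[3], max[3];
    float surfaceArea() const {
        float dx = max[0] - min[0], dy = max[1] - min[1], dz = max[2] - min[2];
        return 2.0f * (dx * dy + dy * dz + dz * dx);
    }
};

// SAH cost of splitting a parent node into (left, right): one traversal
// step plus, for each child, the probability a ray hits it (surface area
// ratio) times the cost of testing its primitives.
float sahCost(const AABB& parent, const AABB& left, int numLeft,
              const AABB& right, int numRight,
              float traversalCost = 1.0f, float intersectCost = 1.0f) {
    float invParent = 1.0f / parent.surfaceArea();
    return traversalCost
         + left.surfaceArea()  * invParent * numLeft  * intersectCost
         + right.surfaceArea() * invParent * numRight * intersectCost;
}
```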
+
+##### Faster Tree Traversal on GPU: Multiple-Threaded BVH
 
-There is a paper . It had an interesting opinion: instead of
+The second level of optimization is done on the GPU. A BVH is a tree after all, so we still have to traverse it during ray-scene intersection, even on the GPU.
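The stackless, threaded-traversal idea can be sketched on the host as below. The node layout and names are assumptions for illustration, not the project's actual code: each node stores where to go on a bounds hit and on a miss, so traversal is a plain loop with no per-thread stack.

```cpp
#include <vector>

// A threaded ("multiple-threaded") BVH node: instead of child pointers,
// it stores the next node index for the hit and miss cases.
struct ThreadedNode {
    int primId;   // >= 0 for a leaf, -1 for an interior node
    int hitNext;  // next node if the ray hits this node's bounds
    int missNext; // next node if it misses (-1 terminates traversal)
};

// Stackless traversal: follow hit/miss links until -1. The bounds test
// and primitive intersection are passed in as callables so the sketch
// stays self-contained.
template <class IntersectBounds, class IntersectPrim>
void traverse(const std::vector<ThreadedNode>& nodes,
              IntersectBounds hitBounds, IntersectPrim hitPrim) {
    int cur = 0;
    while (cur != -1) {
        const ThreadedNode& n = nodes[cur];
        if (hitBounds(cur)) {
            if (n.primId >= 0) hitPrim(n.primId);
            cur = n.hitNext;
        } else {
            cur = n.missNext;
        }
    }
}
```

Because every thread runs the same tight loop over a flat node array, this layout avoids the divergent stack traffic a recursive traversal would cause on the GPU.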
 
 ### Other
 
-#### Streamed Path Tracing Using Stream Compaction
+#### Single-Kernel Path Tracing
+
+To figure out how much stream compaction can possibly improve a GPU path tracer's performance, we need a baseline to compare against. Instead of toggling the streamed path tracer's kernels to disable stream compaction, we can write a separate kernel that performs the entire ray tracing process: shooting rays, finding intersections, shading surfaces, and sampling new rays, all in a single kernel.
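For contrast, the compaction step of the streamed variant can be sketched on the host (the GPU version uses thrust; `PathSegment` here is a stand-in type): after each bounce, terminated paths are removed so later kernels launch over live rays only.

```cpp
#include <algorithm>
#include <vector>

// Minimal stand-in for a path state; the real struct also carries the
// ray, throughput, pixel index, etc.
struct PathSegment {
    int remainingBounces;
};

// Partition live paths to the front and return their count; the dead
// tail is where accumulated radiance can be gathered from.
int compactPaths(std::vector<PathSegment>& paths) {
    auto mid = std::partition(paths.begin(), paths.end(),
        [](const PathSegment& p) { return p.remainingBounces > 0; });
    return static_cast<int>(mid - paths.begin());
}
```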
 
 #### First Ray Caching (G-Buffer)
 
-Since I implemented anti-aliasing and physically based camera at the very beginning, when I noticed that there is still a requirement in the basic part, I found it
+In real-time rendering, a technique called deferred shading stores the scene's geometry information in texture buffers (the G-Buffer) at the beginning of a render pass, so that shading can later be done without processing the geometry again. It turns out we can do something similar in offline rendering.
 
 ## Performance Analysis
 
-### How Much GPU Improves Path Tracing Efficiency
+### Why Is My Multi-Kernel Streamed Path Tracer Not Always Faster Than the Single-Kernel One?
 
-I'm able and confident to answer this question because I have one CPU path tracer from undergrad.
+What surprised me was that it wasn't as efficient as expected. In some scenes, it was even slower than the single-kernel path tracer.
 
-### Why My Multi-Kernel Streamed Path Tracer Not Faster Than Single-Kernel?
+In general, it's a tradeoff between thread concurrency and time spent accessing global memory.
 
-To know how streaming the rays can improve path tracing efficiency, I additionally implemented a single-kernel version of this path tracer.
+There is a paper stressing this point, from which I also got the idea of additionally implementing a single-kernel tracer:
 
-What got me surprised it wasn't efficient as expected. In some scenes, it was even worse.
+- [Progressive Light Transport Simulation on the GPU: Survey and Improvements](https://cgg.mff.cuni.cz/~jaroslav/papers/2014-gpult/2014-gpult-paper.pdf)
 
-Using NSight Compute, I inspected
 
-In general, it's a tradeoff between thread concurrency and time spent accessing global memory.
 
 ### Material Sorting: Why Slower
 
+After implementing material sorting, I found it actually made rendering slower, and not by a little but quite significantly. With NSight Compute, I inspected how much time each kernel took before and after enabling material sorting.
+
+As the figure below shows, sorting materials does improve memory coalescing for intersection, sampling, and stream compaction (I grouped sampling and lighting together because I did direct lighting). However, the benefit is nowhere near enough to offset the additional cost introduced by sorting: as the test result shows, sorting makes up more than a third of the total ray tracing time.
+
+![](./img/sorted_no_sorted_camera.png)
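The sorting being measured can be sketched as follows (the key layout is an assumption; on the GPU this would be a radix sort such as `thrust::sort_by_key`): path indices are reordered by material id so that threads in a warp shade the same BSDF together.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Return path indices ordered by material id, so that shading kernels
// process runs of identical materials (better coherence within a warp).
std::vector<int> sortByMaterial(const std::vector<int>& materialIds) {
    std::vector<int> order(materialIds.size());
    std::iota(order.begin(), order.end(), 0);
    std::stable_sort(order.begin(), order.end(),
        [&](int a, int b) { return materialIds[a] < materialIds[b]; });
    return order;
}
```

Even in this sketch the sort is O(n log n) over every live path each bounce, which hints at why its cost can dominate when shading itself is cheap.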
 Or, another possibility is that BSDF sampling and evaluation is not as time-consuming as expected, and the bottleneck still lies in the traversal of the acceleration structure.
 
 Therefore, in my opinion, material sorting is best applied when:
 
 - There are many different materials in the scene
 - Primitives sharing the same material are randomly distributed in many small clusters over the scene space. The clusters' sizes in solid angle are typically less than what a GPU warp can cover
 
+### How Much Does the GPU Improve Path Tracing Efficiency Compared to the CPU?
+
 ### Image Texture vs. Procedural Texture

img/random.jpg (24.5 KB)

img/sobol.jpg (27.5 KB)

img/sorted_no_sorted_camera.png (27.2 KB)

src/common.h

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 #pragma once
 #include <iostream>
 
-#define SAMPLER_USE_SOBOL true
+#define SAMPLER_USE_SOBOL false
 
 #define SCENE_LIGHT_SINGLE_SIDED true
 

src/pathtrace.cu

Lines changed: 6 additions & 8 deletions
@@ -597,10 +597,8 @@ struct RemoveInvalidPaths {
 	}
 };
 
-/**
- * Wrapper for the __global__ call that sets up the kernel calls and does a ton
- * of memory management
- */
+static
 void pathTrace(uchar4* pbo, int frame, int iter) {
 	const Camera& cam = hstScene->camera;
 	const int pixelCount = cam.resolution.x * cam.resolution.y;
@@ -611,16 +609,16 @@ void pathTrace(uchar4* pbo, int frame, int iter) {
 		(cam.resolution.x + blockSize2D.x - 1) / blockSize2D.x,
 		(cam.resolution.y + blockSize2D.y - 1) / blockSize2D.y);
 
-	generateRayFromCamera<<<blocksPerGrid2D, blockSize2D>>>(hstScene->devScene, cam, iter, Settings::traceDepth, devPaths);
-	checkCUDAError("PT::generateRayFromCamera");
-	cudaDeviceSynchronize();
-
 	int depth = 0;
 	int numPaths = pixelCount;
 
 	auto devTerminatedThr = devTerminatedPathsThr;
 
 	if (Settings::tracer == Tracer::Streamed) {
+		generateRayFromCamera<<<blocksPerGrid2D, blockSize2D>>>(hstScene->devScene, cam, iter, Settings::traceDepth, devPaths);
+		checkCUDAError("PT::generateRayFromCamera");
+		cudaDeviceSynchronize();
+
 		bool iterationComplete = false;
 		while (!iterationComplete) {
 			// clean shading chunks
