
Commit a57b32a: "update readme" (1 parent: 122720e)

File tree: 6 files changed, +65 −32 lines

README.md

Lines changed: 58 additions & 23 deletions
@@ -12,7 +12,7 @@ CUDA Path Tracer
 
 ## Introduction
 
-This is our third project of CIS 565 Fall 2022. In this project, our goal is to implement a GPU-accelerated ray tracer with CUDA.
+This is our third project of CIS 565 Fall 2022. In this project, we implemented a GPU-based path tracer with stream compaction.
 
 ## Representative Outcome
 
@@ -43,23 +43,31 @@ This is our third project of CIS 565 Fall 2022. In this project, our goal is to
 
 #### Direct Lighting with Multiple Importance Sampling
 
-#### Importance Sampled Skybox (Environment Map)
 
-Tired of "virtual artificial" light sources? Let's introduce some real-world li
+#### Importance Sampled HDR Environment Map (Skybox)
+
+Tired of "virtual artificial" light sources? Let's introduce some real-world lighting.
 
 #### Physically-Based Materials
 
+##### Lambertian Diffuse
+
+##### Metallic Workflow: Expressive and Artist-Friendly
+
 #### Normal Map & PBR Texture
 
-#### Physically-Based Camera: Depth of Field & Custom Bokeh Shape
+#### Physically-Based Camera: Depth of Field, Custom Bokeh Shape & Panorama
 
 This is my favorite part of the project.
 
 | No DOF | DOF |
 | --------------------------- | ----------------------- |
 | ![](./img/aperture_off.jpg) | ![](./img/aperture.jpg) |
 
-This idea can even be extended by stochastically sampling a masked aperture image instead of the whole aperture disk.
+This idea can even be extended by stochastically sampling a mask image instead of the whole disk area of the aperture:
 
 <div align="center">
 <img src="./scenes/texture/star3.jpg" width="15%"/>
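The masked-aperture sampling the added text describes can be sketched with simple rejection sampling on the host. This is a minimal illustration, not the project's actual code: the `Mask` type, the 0/255 texel convention, and the rejection loop are all assumptions.

```cpp
#include <algorithm>
#include <cstdint>
#include <random>
#include <utility>

// A tiny stand-in for an aperture mask image (e.g. star3.jpg):
// texels > 127 count as "open" parts of the aperture.
struct Mask {
    int w, h;
    const uint8_t* data;
    bool open(double u, double v) const {
        int x = std::min(int(u * w), w - 1);
        int y = std::min(int(v * h), h - 1);
        return data[y * w + x] > 127;
    }
};

// Rejection-sample a lens-space point (u, v) in [0,1)^2 so that
// rays only originate from the open texels of the mask.
std::pair<double, double> sampleAperture(const Mask& m, std::mt19937& rng) {
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    while (true) {
        double u = uni(rng), v = uni(rng);
        if (m.open(u, v)) return {u, v};
    }
}
```

Scaling the accepted point by the aperture radius and offsetting the camera ray origin then produces the custom bokeh shape.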
@@ -70,69 +78,96 @@ This idea can even be extended by stochastically sampling a masked aperture imag
 | ------------------------------ | ----------------------------- |
 | ![](./img/aperture_custom.jpg) | ![](./img/aperture_heart.jpg) |
 
-#### Efficiently Sampling: Sobol Low Discrepancy Sequence
+This creates some very interesting results.
+
+#### Efficient Sampling: Sobol Quasi-Monte Carlo Sequence
+
+In path tracing, or any other Monte Carlo-based light transport algorithm, apart from improving performance on the programming side, we can also improve it mathematically. Quasi-Monte Carlo sequences are quasi-random sequences widely used in Monte Carlo simulation, and they are mathematically proven to be more efficient than pseudorandom sequences (like what `thrust::default_random_engine` generates).
+
+Theoretically, to maximize the benefit of the Sobol sequence, we would need to generate a unique sequence for every pixel during each sampling iteration, in real time. This is not trivial, not to mention that computing each number requires up to 32 bit-loops. A better choice is to precompute one pixel's sequence, then apply some sort of perturbation to produce different sequences for the other pixels.
+
+Here is the result I got from testing the untextured [PBR texture scene](#representative-outcome). With the same number of samples per pixel, path tracing with the Sobol sequence produces much less noise (lower variance).
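The xor-scrambling idea can be sketched as below, restricted to the first Sobol dimension (which reduces to the van der Corput bit-reversal sequence). Direction numbers for higher dimensions and the project's actual perturbation scheme are omitted; this is an illustration only.

```cpp
#include <cstdint>

// First Sobol dimension: XOR together direction numbers v_k = 2^(31-k)
// for every set bit of the sample index. This is the "up to 32 bit-loops"
// cost mentioned above.
uint32_t sobolDim0(uint32_t index) {
    uint32_t result = 0;
    for (uint32_t bit = 0; index; index >>= 1, ++bit) {
        if (index & 1) result ^= 0x80000000u >> bit;
    }
    return result;
}

// Xor scrambling: every pixel XORs the same precomputed sequence with its
// own random mask, giving decorrelated per-pixel sequences while each one
// keeps its low discrepancy.
float scrambledSobol(uint32_t index, uint32_t pixelSeed) {
    return (sobolDim0(index) ^ pixelSeed) * 0x1p-32f; // map to [0, 1)
}
```

With `pixelSeed = 0` the first samples are 0, 0.5, 0.25, 0.75, …, which fill the unit interval far more evenly than pseudorandom draws.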
 
 | Pseudorandom Sequence | Xor-Scrambled Sobol Sequence |
 | ---------------------------- | ---------------------------- |
 | ![](./img/sampler_indep.jpg) | ![](./img/sampler_sobol.jpg) |
 
 
 #### Post Processing
 
 ##### Gamma Correction
 
+Implementing gamma correction is trivial, but it is necessary if we want our final image to be displayed correctly on a monitor.
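The gamma step can be sketched as below, assuming a plain power-law encode with gamma 2.2 (the project may use the exact sRGB transfer function instead):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Encode one linear-radiance channel to an 8-bit display value:
// clamp to [0,1], apply the 1/2.2 power, and quantize with rounding.
uint8_t encodePixel(float linear) {
    float g = std::pow(std::clamp(linear, 0.0f, 1.0f), 1.0f / 2.2f);
    return static_cast<uint8_t>(g * 255.0f + 0.5f);
}
```

Without this step, mid-gray radiance values appear far too dark on a standard display.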
 
----
+##### Tone Mapping
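The commit does not say which operator this section uses, so as an illustration only, here is the classic Reinhard operator, one common choice: it compresses unbounded HDR radiance into [0, 1) before gamma encoding.

```cpp
// Reinhard tone mapping (assumed operator, not necessarily the one used
// in this project): x / (1 + x) maps [0, inf) into [0, 1), so very bright
// values saturate smoothly instead of clipping.
float reinhard(float hdr) {
    return hdr / (1.0f + hdr);
}
```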
 
 
 
 ### Performance
 
-#### Fast Intersection: Stackless SAH-Constructed BVH
+#### Fast Intersection: Stackless SAH-Based Bounding Volume Hierarchy
 
-For ray-scene intersection, I did two levels of optimization.
+Ray-scene intersection is probably the most time-consuming part of path tracing.
 
-First, I wrote a SAH-based BVH. SAH, the Surface Area Heuristic is a method to decide how to split a set of bounding volumes
+I did two levels of optimization.
 
-The second level of optimization
+##### Better Tree Structure: Surface Area Heuristic
 
-#### Single-Kernel Path Tracing
+First, I implemented a SAH-based BVH. SAH, the Surface Area Heuristic, is a method to determine how to split a set of bounding volumes into subsets when constructing a BVH, such that the resulting tree structure is close to optimal.
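The heuristic can be sketched as follows (the cost constants and `AABB` layout are illustrative assumptions): among all candidate splits, the one with the lowest estimated cost is chosen.

```cpp
// Axis-aligned bounding box with the surface area needed by SAH.
struct AABB {
    float min[3], max[3];
    float surfaceArea() const {
        float dx = max[0] - min[0], dy = max[1] - min[1], dz = max[2] - min[2];
        return 2.0f * (dx * dy + dy * dz + dz * dx);
    }
};

// SAH cost of splitting a parent node into (left, right): one traversal
// step plus, for each child, the probability a ray hits it (surface area
// ratio) times the cost of testing its primitives.
float sahCost(const AABB& parent, const AABB& left, int numLeft,
              const AABB& right, int numRight,
              float traversalCost = 1.0f, float intersectCost = 1.0f) {
    float invParent = 1.0f / parent.surfaceArea();
    return traversalCost
         + left.surfaceArea()  * invParent * numLeft  * intersectCost
         + right.surfaceArea() * invParent * numRight * intersectCost;
}
```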
+
+##### Faster Tree Traversal on GPU: Multiple-Threaded BVH
 
-There is a paper . It had an interesting opinion: instead of
+The second level of optimization is done on the GPU. A BVH is a tree after all, so we still have to traverse it during ray-scene intersection, even on the GPU.
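The stackless, threaded-traversal idea can be sketched on the host as below. The node layout and names are assumptions for illustration, not the project's actual code: each node stores where to go on a bounds hit and on a miss, so traversal is a plain loop with no per-thread stack.

```cpp
#include <vector>

// A threaded ("multiple-threaded") BVH node: instead of child pointers,
// it stores the next node index for the hit and miss cases.
struct ThreadedNode {
    int primId;   // >= 0 for a leaf, -1 for an interior node
    int hitNext;  // next node if the ray hits this node's bounds
    int missNext; // next node if it misses (-1 terminates traversal)
};

// Stackless traversal: follow hit/miss links until -1. The bounds test
// and primitive intersection are passed in as callables so the sketch
// stays self-contained.
template <class IntersectBounds, class IntersectPrim>
void traverse(const std::vector<ThreadedNode>& nodes,
              IntersectBounds hitBounds, IntersectPrim hitPrim) {
    int cur = 0;
    while (cur != -1) {
        const ThreadedNode& n = nodes[cur];
        if (hitBounds(cur)) {
            if (n.primId >= 0) hitPrim(n.primId);
            cur = n.hitNext;
        } else {
            cur = n.missNext;
        }
    }
}
```

Because every thread runs the same tight loop over a flat node array, this layout avoids the divergent stack traffic a recursive traversal would cause on the GPU.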
 
 ### Other
 
-#### Streamed Path Tracing Using Stream Compaction
+#### Single-Kernel Path Tracing
+
+To figure out how much stream compaction can possibly improve a GPU path tracer's performance, we need a baseline to compare against. Instead of toggling the streamed path tracer's kernels to disable stream compaction, we can write a separate kernel that performs the entire ray tracing process: shooting rays, finding intersections, shading surfaces, and sampling new rays, all in a single kernel.
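For contrast, the compaction step of the streamed variant can be sketched on the host (the GPU version uses thrust; `PathSegment` here is a stand-in type): after each bounce, terminated paths are removed so later kernels launch over live rays only.

```cpp
#include <algorithm>
#include <vector>

// Minimal stand-in for a path state; the real struct also carries the
// ray, throughput, pixel index, etc.
struct PathSegment {
    int remainingBounces;
};

// Partition live paths to the front and return their count; the dead
// tail is where accumulated radiance can be gathered from.
int compactPaths(std::vector<PathSegment>& paths) {
    auto mid = std::partition(paths.begin(), paths.end(),
        [](const PathSegment& p) { return p.remainingBounces > 0; });
    return static_cast<int>(mid - paths.begin());
}
```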
 
 #### First Ray Caching (G-Buffer)
 
-Since I implemented anti-aliasing and physically based camera at the very beginning, when I noticed that there is still a requirement in the basic part, I found it
+In real-time rendering, a technique called deferred shading stores the scene's geometry information in texture buffers (the G-Buffer) at the beginning of a render pass, so that shading can later be done without processing the geometry again. It turns out we can do something similar in offline rendering.
 
 ## Performance Analysis
 
-### How Much GPU Improves Path Tracing Efficiency
+### Why Is My Multi-Kernel Streamed Path Tracer Not Always Faster Than the Single-Kernel One?
 
-I'm able and confident to answer this question because I have one CPU path tracer from undergrad.
+What surprised me was that it wasn't as efficient as expected. In some scenes, it was even slower than the single-kernel path tracer.
 
-### Why My Multi-Kernel Streamed Path Tracer Not Faster Than Single-Kernel?
+In general, it's a tradeoff between thread concurrency and time spent accessing global memory.
 
-To know how streaming the rays can improve path tracing efficiency, I additionally implemented a single-kernel version of this path tracer.
+There is a paper stressing this point, from which I also got the idea of additionally implementing a single-kernel tracer:
 
-What got me surprised it wasn't efficient as expected. In some scenes, it was even worse.
+- [Progressive Light Transport Simulation on the GPU: Survey and Improvements](https://cgg.mff.cuni.cz/~jaroslav/papers/2014-gpult/2014-gpult-paper.pdf)
 
-Using NSight Compute, I inspected
 
-In general, it's a tradeoff between thread concurrency and time spent accessing global memory.
 
 ### Material Sorting: Why Slower
 
+After implementing material sorting, I found it actually made rendering slower, and not by a little but quite significantly. With NSight Compute, I inspected how much time each kernel took before and after enabling material sorting.
+
+As the figure below shows, sorting materials does improve memory coalescing for intersection, sampling, and stream compaction (I grouped sampling and lighting together because I did direct lighting). However, the benefit is nowhere near enough to offset the additional cost introduced by sorting: as the test result shows, sorting makes up more than a third of the total ray tracing time.
+
+![](./img/sorted_no_sorted_camera.png)
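The sorting being measured can be sketched as follows (the key layout is an assumption; on the GPU this would be a radix sort such as `thrust::sort_by_key`): path indices are reordered by material id so that threads in a warp shade the same BSDF together.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Return path indices ordered by material id, so that shading kernels
// process runs of identical materials (better coherence within a warp).
std::vector<int> sortByMaterial(const std::vector<int>& materialIds) {
    std::vector<int> order(materialIds.size());
    std::iota(order.begin(), order.end(), 0);
    std::stable_sort(order.begin(), order.end(),
        [&](int a, int b) { return materialIds[a] < materialIds[b]; });
    return order;
}
```

Even in this sketch the sort is O(n log n) over every live path each bounce, which hints at why its cost can dominate when shading itself is cheap.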
 Or, another possibility is that BSDF sampling and evaluation is not as time-consuming as expected, and the bottleneck still lies in the traversal of the acceleration structure.
 
 Therefore, in my opinion, material sorting is best applied when:
 
 - There are many different materials in the scene
 - Primitives sharing the same material are randomly distributed in many small clusters over the scene space. The clusters' sizes in solid angle are typically less than what a GPU warp can cover
 
+### How Much Does the GPU Improve Path Tracing Efficiency Compared to the CPU?
+
 ### Image Texture vs. Procedural Texture

img/random.jpg (24.5 KB)

img/sobol.jpg (27.5 KB)

img/sorted_no_sorted_camera.png (27.2 KB)

src/common.h

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 #pragma once
 #include <iostream>
 
-#define SAMPLER_USE_SOBOL true
+#define SAMPLER_USE_SOBOL false
 
 #define SCENE_LIGHT_SINGLE_SIDED true
 

src/pathtrace.cu

Lines changed: 6 additions & 8 deletions
@@ -597,10 +597,8 @@ struct RemoveInvalidPaths {
 	}
 };
 
-/**
- * Wrapper for the __global__ call that sets up the kernel calls and does a ton
- * of memory management
- */
+static
 void pathTrace(uchar4* pbo, int frame, int iter) {
 	const Camera& cam = hstScene->camera;
 	const int pixelCount = cam.resolution.x * cam.resolution.y;
@@ -611,16 +609,16 @@ void pathTrace(uchar4* pbo, int frame, int iter) {
 		(cam.resolution.x + blockSize2D.x - 1) / blockSize2D.x,
 		(cam.resolution.y + blockSize2D.y - 1) / blockSize2D.y);
 
-	generateRayFromCamera<<<blocksPerGrid2D, blockSize2D>>>(hstScene->devScene, cam, iter, Settings::traceDepth, devPaths);
-	checkCUDAError("PT::generateRayFromCamera");
-	cudaDeviceSynchronize();
-
 	int depth = 0;
 	int numPaths = pixelCount;
 
 	auto devTerminatedThr = devTerminatedPathsThr;
 
 	if (Settings::tracer == Tracer::Streamed) {
+		generateRayFromCamera<<<blocksPerGrid2D, blockSize2D>>>(hstScene->devScene, cam, iter, Settings::traceDepth, devPaths);
+		checkCUDAError("PT::generateRayFromCamera");
+		cudaDeviceSynchronize();
+
 		bool iterationComplete = false;
 		while (!iterationComplete) {
 			// clean shading chunks
