RichieSam's Adventures in Code-ville: March 2015

Friday, March 13, 2015

Shooting Objects from Across the Way

This is the third post in a series documenting my adventures in creating a GPU path tracer from scratch
1. Tracing Light in a Virtual World
2. Creating Randomness and Accumulating Change

If you recall, the very high level algorithm for path tracing is:

From each pixel, fire a ray out into the scene from the eye.
If the ray hits an object

Use the material properties to accumulate attenuation
Bounce the ray in a new direction
GOTO 2

If the ray hits a light

Multiply the light and the attenuation
Add the result to the accumulation buffer
GOTO 5

If the ray doesn't hit anything

GOTO 5

GOTO 1 until sufficiently converged

The first post in the series covered step 1 and the second post covered step 3b. This post is going to cover how to test if a ray hits an object in the scene (aka, 2 and 3).

Ray - Sphere Intersection

Foreword: Scratchapixel has an excellent lesson covering ray-object intersections. Much of what I say below is based on their lessons. Much props to the authors.

A sphere can be represented mathematically with vector representing the location of the sphere's center, and its radius. (You can save a multiplication in a later calculation if you store the radius squared instead)

struct Sphere {
    float3 Center;
    float RadiusSquared;
};

A ray can be represented as a vector representing its origin and a unit vector representing its direction:

struct Ray {
    float3 Origin;
    float3 Direction;
};

Let's look at a picture of the most common type of ray - sphere intersection:

\[\begin{align}
\overrightarrow{R_{o}} &= ray \: origin \\
\overrightarrow{R_{d}} &= ray \: direction \\
\overrightarrow{S_{c}} &= sphere \: center \\
S_{r} &= sphere \: radius \\
\overrightarrow{P_{0}} &= first \: intersection \: point \\
\overrightarrow{P_{1}} &= second \: intersection \: point \\
t_{0} &= distance \: from \: \overrightarrow{R_{o}} \: to \: \overrightarrow{P_{0}} \\
t_{1} &= distance \: from \: \overrightarrow{R_{o}} \: to \: \overrightarrow{P_{1}} \\
\end{align}\]

We would like to find $\overrightarrow{P_{0}}$ and $\overrightarrow{P_{1}}$. Mathematically, they are defined as:
\[\begin{align}
\overrightarrow{P_{0}} &= \overrightarrow{R_{o}} + t_{0} \overrightarrow{R_{d}} \\
\overrightarrow{P_{1}} &= \overrightarrow{R_{o}} + t_{1} \overrightarrow{R_{d}} \\
\end{align}\]

We already know $\overrightarrow{R_{o}}$ and $\overrightarrow{R_{d}}$, so we just need to find $t_{0}$ and $t_{1}$. In order to do so, let's define a few new variables.

And then define $t_{0}$ and $t_{1}$ in terms of $t_{ca}$ and $t_{hc}$:
\[\begin{align}
t_{0} &= t_{ca} - t_{hc} \\
t_{1} &= t_{ca} + t_{hc} \\
\end{align}\]

Now we can begin solving for our unknown variables. To start, let's look at the right triangle formed by $t_{ca}$, $\overrightarrow{L}$, and $d$.

We can solve for $t_{ca}$ by using the definition of cosine and the dot product:
\[\cos\left ( \theta \right ) = \frac{adjacent}{hypotenuse} = \frac{t_{ca}}{\left \| \overrightarrow{L} \right \|}\]
\[\overrightarrow{m} \cdot \overrightarrow{n} = \left \| m \right \| \left \| n \right \| \cos \left ( \theta \right )\]

\[\begin{split} ie: \end{split} \qquad \qquad
\begin{split}
\overrightarrow{R_{d}} \cdot \overrightarrow{L} &= \left \| \overrightarrow{R_{d}} \right \| \left \| \overrightarrow{L} \right \| \cos \left ( \theta \right ) \\
&= \frac{\left \| \overrightarrow{L} \right \| t_{ca}}{\left \| \overrightarrow{L} \right \|} \\
&= t_{ca}
\end{split}\]

$\overrightarrow{R_{d}}$ is a unit vector. Therefore, $\left \| \overrightarrow{R_{d}} \right \| = 1$ and cancels out. Then, if we replace $\cos \left ( \theta \right )$ with its definition, we can cancel $\left \| \overrightarrow{L} \right \|$ from top and bottom. Thus, we're left with just $t_{ca}$.

Using the Pythagorean Theorem and a trick with the dot product, we can solve for d:
\[\overrightarrow{m} \cdot \overrightarrow{m} \equiv \left \| \overrightarrow{m} \right \| ^{2}\]

\[\begin{align} t_{ca} \: ^{2} + d^{2} &= \left \| \overrightarrow{L} \right \| ^{2}\\
t_{ca} \: ^{2} + d^{2} &= \overrightarrow{L} \cdot \overrightarrow{L} \\
d^{2} &= \overrightarrow{L} \cdot \overrightarrow{L} - t_{ca} \: ^{2}
\end{align}\]

To solve for $t_{hc}$, let's look at the triangle formed by $S_{r}$, $t_{hc}$, and $d$.

Using the Pythagorean Theorem again, we can solve for $t_{hc}$:
\[\begin{align}
t_{hc} \: ^{2} + d^{2} &= S_{r} \: ^{2} \\
t_{hc} &= \sqrt{S_{r} \: ^{2} - d^{2}}
\end{align}\]

Whew! We made it! Using the above equations, we can find the two intersection points. Let's look at some code that implements the above and walk through some special cases / early outs.

/**
 * Test for the intersection of a ray with a sphere
 *
 * @param ray           The ray to test
 * @param sphere        The sphere to test
 * @param normal_out    Filled with normal of the surface at the intersection point. Not changed if no intersection.
 * @return              The distance from the ray origin to the nearest intersection. -1.0f if no intersection
 */
__device__ float TestRaySphereIntersection(Scene::Ray &ray, Scene::Sphere &sphere, float3 &normal_out) {
    float3 L = sphere.Center - ray.Origin;
    float t_ca = dot(L, ray.Direction);
 
    // Ray points away from the sphere
    if (t_ca < 0) {
        return -1.0f;
    }
 
    float d_squared = dot(L, L) - t_ca * t_ca;
 
    // Ray misses the sphere
    if (d_squared > sphere.RadiusSquared) {
        return -1.0f;
    }
 
    float t_hc = sqrt(sphere.RadiusSquared - d_squared);
 
    float t_0 = t_ca - t_hc;
    float t_1 = t_ca + t_hc;
 
    float nearestIntersection;
    float normalDirection;
    if (t_0 > 0 && t_1 > 0) {
        // Two intersections
        // Return the nearest of the two
        nearestIntersection = min(t_0, t_1);
 
        normalDirection = 1.0f;
    } else {
        // Ray starts inside the sphere
 
        // Return the far side of the sphere
        nearestIntersection = max(firstIntersection, secondIntersection);
 
        // We reverse the direction of the normal, since we are inside the sphere
        normalDirection = -1.0f;
    }
 
    normal_out = normalize(((ray.Origin + (ray.Direction * nearestIntersection)) - sphere.Center) * normalDirection);
 
    return nearestIntersection;
}

The algorithm can be summarized as follows:

Calculate $t_{ca}$
If $t_{ca}$ is negative, $\overrightarrow{R_{d}}$ is pointing away from the sphere. Thus, there can not be an intersection
Calculate $d^{2}$
If $d^{2}$ is greater than $S_{r} \: ^{2}$, the ray misses the sphere.
Calculate $t_{hc}$
Calculate $t_{0}$ and $t_{1}$.
If $t_{0}$ and $t_{1}$ are both positive, the ray starts outside the sphere and intersects it twice. Choose the closest of the two intersections
If either $t_{0}$ or $t_{1}$ is negative, the ray starts inside the sphere and intersects it on the way out. Choose the positive intersection.

They both can't both be negative, since that would mean the ray is pointing away from the sphere, and we already checked for that.
See following picture.

Ray - Plane Intersection

A plane can be represented using any point on the plane, and a unit normal vector from the plane.

struct Plane {
    float3 Point;
    float3 Normal;
};

\[\begin{align}
\overrightarrow{P_{p}} &= plane \: point \\
\overrightarrow{P_{n}} &= plane \: normal
\end{align}\]

Let's look at the intersection from the top.

We would like to find $d$. We can start by taking the dot product between $\overrightarrow{R_{d}}$ and $\overrightarrow{P_{n}}$. We get $\cos \left ( \theta_{1} \right )$, since both $\overrightarrow{R_{d}}$ and $\overrightarrow{P_{n}}$ are unit vectors.
\[\begin{align}
\overrightarrow{R_{d}} \cdot \overrightarrow{P_{n}} &= \left \| \overrightarrow{R_{d}} \right \| \left \| \overrightarrow{P_{n}} \right \| \cos \left ( \theta_{1} \right ) \\
&= \cos \left ( \theta_{1} \right )
\end{align}\]

Let's look at the triangle formed by $\overrightarrow{P_{n}}$ and $\overrightarrow{L}$, where $\overrightarrow{L}$ is the vector between $\overrightarrow{P_{p}}$ and $\overrightarrow{R_{o}}$.

If we take the dot product between $\overrightarrow{P_{n}}$ and $\overrightarrow{L}$, we get $a$.

\[\begin{align}
\overrightarrow{P_{n}} \cdot \overrightarrow{L} &= \left \| \overrightarrow{P_{n}} \right \| \left \| \overrightarrow{L} \right \| \cos \left ( \theta_{2} \right ) \\
&= \frac{\left \| \overrightarrow{L} \right \| a}{\left \| \overrightarrow{L} \right \|} \\
&= a
\end{align}\]

Finally, let's look at the two triangles formed between $\overrightarrow{P_{n}}$, $\overrightarrow{R_{d}}$, and the plane itself.

They are similar triangles, so we can solve for $d$ using the Law of Similar Triangles and the definition of cosine:
\[\cos \left ( \theta_{1} \right ) = \frac{b}{c}\]

\[\begin{align}
\frac{a + b}{c + d} &= \frac{b}{c} \\
c \left ( a + b \right ) &= b \left ( c + d \right ) \\
ac + bc &= bc + bd \\
ac &= bd \\
d &= \frac{ac}{b} \\
d &= \frac{a}{\cos \left ( \theta_{1} \right )}
\end{align}\]

Let's look at the code implementation of the above and walk through it.

/**
 * Test for the intersection of a ray with a plane   
 *
 * @param ray           The ray to test
 * @param plane         The plane to test
 * @param normal_out    Filled with normal of the surface at the intersection point. Not changed if no intersection.
 * @return              The distance from the ray origin to the nearest intersection. -1.0f if no intersection
 */
__device__ float TestRayPlaneIntersection(Scene::Ray &ray, Scene::Plane &plane, float3 &normal_out) {
    float cos_theta1 = dot(plane.Normal, ray.Direction);
 
    // If cos_theta1 is greater than -epison,
    // the ray is perpendicular to the plane or points away from the plane normal
    if (cos_theta1 > -1.0e-6f) {
        return -1.0f;
    }
 
    normal_out = plane.Normal;
    float a = dot(plane.Normal, plane.Point - ray.Origin);
    return a / cos_theta1;
}

The algorithm can be summarized as follows:

Calculate $\cos \left ( \theta_{1} \right )$
If $\cos \left ( \theta_{1} \right )$ is greater than or equal to zero, the ray is perpendicular to the plane, or faces away

In the code, we use a small episilon to account for floating point innacuracies

Calculate a
Return $\frac{a}{\cos \left ( \theta_{1} \right )}$

Testing Them Out

In order to test the code out, I added on to the eye ray code we created in the first post. After creating the ray, I tried to intersect it with an object in the scene. If it hit, I did some basic Lambertian shading with a hard-coded directional light. If it didn't, I colored the pixel black. Here's the results:

Very nice! If I do say so myself. I don't have a plane in the scene because I didn't implement the plane intersection code until I had skipped ahead to the path tracing itself. (I got too excited. Ha ha!) So you'll have to take my word for it that the code does work. Below is the kernel code used to create the image above:

__global__ void PathTraceKernel(unsigned char *textureData, uint width, uint height, size_t pitch, DeviceCamera *g_camera, Scene::Sphere *g_spheres, uint numSpheres, uint hashedFrameNumber) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
 
    // Create a local copy of the camera
    DeviceCamera camera = *g_camera;
 
    // Global threadId
    int threadId = (blockIdx.x + blockIdx.y * gridDim.x) * (blockDim.x * blockDim.y) + (threadIdx.y * blockDim.x) + threadIdx.x;
 
    // Create random number generator
    curandState randState;
    curand_init(hashedFrameNumber + threadId, 0, 0, &randState);
 
    // Calculate the first ray for this pixel
    Scene::Ray ray = {camera.Origin, CalculateRayDirectionFromPixel(x, y, width, height, camera, &randState)};
 
    // Generate a uniform random number
    //float randNum = curand_uniform(&randState);
 
    // Try to intersect with the spheres;
    float closestIntersection = FLT_MAX;
    float3 normal;
    for (uint i = 0; i < numSpheres; ++i) {
        float3 newNormal;
        float intersection = TestRaySphereIntersection(ray, g_spheres[i], newNormal);
        if (intersection > 0.0f && intersection < closestIntersection) {
            closestIntersection = intersection;
            normal = newNormal;
        }
    }
 
    float3 pixelColor;
    if (closestIntersection < FLT_MAX) {
        float attentuation = max(dot(normal, make_float3(0.70710678118f, 0.70710678118f, -0.70710678118f)), 0.0f);
        pixelColor = make_float3(0.846, 0.933, 0.949) * attentuation + make_float3(0.15f, 0.15f, 0.15f);
    } else {
        pixelColor = make_float3(0.0f, 0.0f, 0.0f);
    }
 
    if (x < width && y < height) {
        // Get a pointer to the pixel at (x,y)
        float *pixel = (float *)(textureData + y * pitch) + 4 /*RGBA*/ * x;
 
        // Write out pixel data
        pixel[0] += pixelColor.x;
        pixel[1] += pixelColor.y;
        pixel[2] += pixelColor.z;
        // Ignore alpha, since it's hardcoded to 1.0f in the display
        // We have to use a RGBA format since CUDA-DirectX interop doesn't support R32G32B32_FLOAT
    }

Conclusion

There we go! We now have everything we need to do path tracing. The next post will be just that: creating a basic path tracing kernel. I've already implemented it in code, so it's just a matter of how quickly I can write up the post. (teaser picture). Keep an eye out!

The code for everything in this post is on GitHub. It's open source under the Apache license, so feel free to use it in your own projects. The path tracing code is already pushed to the repo, if you're feeling antsy, you can give that a look.

As always, feel free to ask questions, make comments, and if you find an error, please let me know.

Happy coding!
-RichieSams

Friday, March 6, 2015

Creating Randomness and Acummulating Change

This is the second post in a series documenting my adventures in creating a GPU path tracer from scratch. If you missed it, the first post is here.

Path tracing uses Monte Carlo Integration to estimate the Global Illumination in the scene. In our case, Monte Carlo Integration boils down to taking a large number of random samples of the scene, and averaging them together. Random is the key word here. If we don't randomly sample, the resulting image will have artifacts in the form of patterns, banding, etc.

Creating Randomness from the Non-Random

So how do we create random numbers? This is a really old topic that has been extensively researched, so rather than reiterate it here, I'll just point you to Google. The point we do care about, though, is where we create the random numbers. As far as I can see, our options are as follows:

Generate a large number of random numbers on the CPU using classic psuedo-random number generators (PRNG), and transfer them to the GPU to be consumed as needed.
Create a random number generator on the GPU, and access it from each thread
Create a random number generator per thread on the GPU

While option 1 looks simple and straightforward, path tracing will use a large number of random numbers, and each number consumed has to be transferred across the bus from the CPU to the GPU. This is going to be SLOW.

Ok, with the CPU out of the picture, we need to find a PRNG for the GPU. Luckily, CUDA comes with a library that does just that: curand. But, how many PRNGs should we have, and where should they live?

To better understand the problem, let's briefly go over how PRNGs work. PRNGs use math to simulate random sequences. In order to get different numbers, they need to access and store internal state, aka data. The size of this state depends on the PRNG algorithm.

If we choose option 2, every single thread will want access to a single generator. There will be massive contention, and random number generation will degrade to a serial operation. Again, since path tracing requires a large number of random numbers, this option will be slow.

So option 3 it is! It turns out that the size of the state for the default curand generator isn't that big, so storing state per thread isn't too bad.

Implementing curand in the Kernel

In order to create a generator in the kernel, you have to create a curandstate object and then call curand_init on it, passing in a seed, a sequence number, and an offset.

curandState randState;
curand_init(seed, sequenceNum, offset, &randState);

Then, you can generate random numbers using:

uint64 randInteger = curand(&randState);
float randNormalizedFloat = curand_uniform(&randState);

Two states with different seeds will create a different sequence of random numbers. Two states with the same seed will create the same sequence of random numbers.

Two states with the same seed, but different sequenceNum will use the same sequence, but be offset to different blocks of the sequence. (Specifically, in increments of 2⁶⁷) Why would you want this? According to the documentation, "Sequences generated with the same seed and different sequence numbers will not have statistically correlated values."

Offset just manually skips ahead n in the sequence.

In the curand documentation, the creators mention that curand_init() can be relatively slow, so if you're launching the same kernel multiple times, it's usually better to keep one curandstate per thread, but to store it in global memory between kernel launches. ie:

__global__ void setupRandStates(curandState *state) {
    int id = threadIdx.x + blockIdx.x * blockDim.x;
 
    // Each thread gets same seed, a different sequence number, no offset
    curand_init(1234, id, 0, &state[id]);
}
 
__global__ void useRandStates(curandState *state) {
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    
    // Copy state to local memory for efficiency 
    curandState localState = state[id];
 
    // Use localState to generate numbers....
 
 
    // Copy state back to global memory 
    state[id] = localState;
}
 
int main() {
    int numThreads = 64;
    int numBlocks = 64;
 
    // Allocate space on the device to store the random states
    curandState *d_randStates;
    cudaMalloc(&d_randStates, numBlocks * numThreads * sizeof(curandState));
 
    // Setup and use the randStates
    setupRandStates<<<numBlocks, numThreads>>>(d_randStates);
 
    for (uint i = 0; i < NUM_ITERATIONS; ++i) {
        useRandStates<<<numBlocks, numThreads>>>(d_randStates);
    }
 
    return 0;
}

This would be really nice, but there is one problem: storing all the states:

cudaMalloc(&d_randStates, numBlocks * numThreads * sizeof(curandState));

In our case, we'll be launching a thread for every pixel on the screen. aka, millions of threads. While curandState isn't that large, storing millions of them is not feasible. So what can we do instead? It turns out that curand_init() is only slow if you use sequenceNum and offset. This is quite intuitive, since using those requires the generator to skip ahead a large amount. So if we keep both sequenceNum and offset equal to zero, curand_init() is quite fast.

In order to give each thread different random numbers we give them unique seeds. A simple method I came up with is to hash the frameNumber and then add the id of thread.

uint32 WangHash(uint32 a) {
    a = (a ^ 61) ^ (a >> 16);
    a = a + (a << 3);
    a = a ^ (a >> 4);
    a = a * 0x27d4eb2d;
    a = a ^ (a >> 15);
    return a;
}
 
__global__ void generateRandNumbers(uint hashedFrameNumber) {
    // Global threadId
    int threadId = (blockIdx.x + blockIdx.y * gridDim.x) * (blockDim.x * blockDim.y) + (threadIdx.y * blockDim.x) + threadIdx.x;
 
    // Create random number generator
    curandState randState;
    curand_init(hashedFrameNumber + threadId, 0, 0, &randState);
 
    // Use randState to generate numbers...
}
 
int main() {
    uint frameNumber = 0;
    uint width = 1024;
    uint height = 256;
 
    dim3 Db = dim3(16, 16);   // block dimensions are fixed to be 256 threads
    dim3 Dg = dim3((width + Db.x - 1) / Db.x, (height + Db.y - 1) / Db.y);
 
    for (uint i = 0; i < NUM_ITERATIONS; ++i) {
        generateRandNumbers<<<Dg, Db >>>(WangHash(frameNumber++));
    }
 
    return 0;
}

Yay! Now we can generate lots of random numbers! If we generate a random number on each thread and output it as a greyscale color, we can make some nice white noise.

Accumulating and Averaging Colors

The last part of Monte Carlo Integration is the averaging of all the samples taken. The simplest solution is to just accumulate the colors from each frame, adding one frame to the next. Then, at the end, we divide the color at each pixel by the number of frames. I integrated this into my code by passing the frame number into the pixel shader that draws the texture to the screen.

cbuffer constants {
    float gInverseNumPasses;
};
 
 
Texture2D<float3> gHDRInput : register(t0);
 
float4 CopyCudaOutputToBackbufferPS(CalculatedTrianglePixelIn input) : SV_TARGET {
    return float4(gHDRInput[input.positionClip.xy] * gInverseNumPasses, 1.0f);
}

This way, I can see the accumulated output as it's being generated. If we create a simple kernel that outputs either pure red, green, or blue, depending on a random number, we can test if the accumulation buffer is working.

__global__ void AccumulationBufferTest(unsigned char *textureData, uint width, uint height, size_t pitch, DeviceCamera camera, uint hashedFrameNumber) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
 
    if (x >= width || y >= height) {
        return;
    }
 
    // Global threadId
    int threadId = (blockIdx.x + blockIdx.y * gridDim.x) * (blockDim.x * blockDim.y) + (threadIdx.y * blockDim.x) + threadIdx.x;
 
    // Create random number generator
    curandState randState;
    curand_init(hashedFrameNumber + threadId, 0, 0, &randState);
 
    // Generate a uniform random number
    float randNum = curand_uniform(&randState);
 
    // Get a pointer to the pixel at (x,y)
    float *pixel = (float *)(textureData + y * pitch) + 4 /*RGBA*/ * x;
 
    if (x < width && y < height) {
        // Write out pixel data
        if (randNum < 0.33f) {
            pixel[0] += 1.0f;
            pixel[1] += 0.0f;
            pixel[2] += 0.0f;
            pixel[3] = 1.0f;
        } else if (randNum < 0.66f) {
            pixel[0] += 0.0f;
            pixel[1] += 1.0f;
            pixel[2] += 0.0f;
            pixel[3] = 1.0f;
        } else {
            pixel[0] += 0.0f;
            pixel[1] += 0.0f;
            pixel[2] += 1.0f;
            pixel[3] = 1.0f;
        }
    }
}

After the first 60 frames, the image is quite noisy:

However, if we let it sit for a bit, the image converges to the a gray (0.33, 0.33, 0.33) as expected:

Conclusion

Well, there we go! We can generate random numbers and average the results from several frames. The next post will cover ray-object intersections and maybe start in on path tracing itself. Stay tuned!

The code for everything in this post is on GitHub. It's open source under Apache license, so feel free to use it in your own projects.

As always, feel free to ask questions, make comments, and if you find an error, please let me know.

Happy coding!
-RichieSams

Thursday, March 5, 2015

Tracing Light in a Virtual World

Boy it sure has been a while since I've written a blog post! Apologies for that.

I'm still working on The Halfling Project, with the newest project being Physically Based Rendering. I tried to implement environment maps in order to have specular reflections, but I got frustrated. So, I decided put The Halfling Project aside for a bit and try something new, specifically, Path Tracing.

I had a basic idea of how path tracing worked:

Fire a crap-load of rays around the scene from the eye.
At each bounce, use the material properties to accumulate attenuation
If the ray hits a light, add the light to the pixel, taking into account the attenuation

Sweet. Now give me 2 hours and I'll have the next Arnold Renderer, right?!? HA!

Ok. For real though. Googling pulled up lots of graphics papers and a few example projects. However, most, if not all of the graphics papers were advanced topics that expanded on path tracing, rather than explaining basic path tracing. (ie. Metropolis Light Transport, Half Vector Space Light Transport, Manifold Exploration Metropolis Light Transport, etc.) While they are really interesting (and some have example code), I felt they were way too much to try all at once. So, I was left with the examples I was able to find.

Example Projects

The first is smallpt, or more specifically, the expanded code/presentation done by Dr. David Cline at Oklahoma State University. Smallpt is true to its name, ie small. Therefore, as a learning tool, the reduced form is not very readable. However, Dr. Cline took the original code and expanded it out into a more readable form and created an excellent presentation going over each portion of the code.

The next example project I found was Minilight. It has excellent documentation and port to many different languages. It also has the algorithm overview in various levels of detail, which is really nice.

At this point, I realized that I had two choices. I could either implement the path tracer on the CPU (as in smallpt or Minilight), or on the GPU. Path tracing on the GPU is a bit of a recent advance, but it is possible and can work very well. So, being a bit of a masochist, and enjoying GPU programming thus far, I chose the GPU path.

The last extremely useful example project I found was a class project done by two students at The University of Pennsylvania, (Peter Kutz and Yining Karl Li), for a computer graphics class they took. The really great part, though, is that they kept a blog and documented their progress and the hurdles they had to overcome. This is extremely useful, because I can see some of the decisions they made as they added on features. It also allowed me to create a series of milestones in creating a project using their progress as a model.

For example:

Be able to cast a ray from the eye through a pixel and display a color representation of the rays.
Implement random number generation
Implement an accumulation buffer
Implement ray intersection tests
Implement basic path tracing with fixed number of bounces.
Implement Russian Routlette termination
Implement ray/thread compaction
Implement Specular / BRDFs
Etc.

Choosing a GPGPU API

With a basic plan set out, my last choice was in the GPGPU programming API. My choices were:

DirectX / OpenGL Compute Shaders
CUDA
OpenCL

I did quite a bit of searching around, but I wasn't really able to find a clear winner. A few people believe that CUDA is a bit faster. However, a lot of the posts are old-ish, so I don't really know how they stack up against newer versions of hlsl/glsl Compute Shaders. I ended up choosing CUDA, but Compute Shaders or OpenCL could probably perform just as well. I chose CUDA mostly to learn something new. Also, many existing GPU path tracing examples happen to be in CUDA, so it easier to compare their code with mine if I choose CUDA.

Off to the Realm of CUDA

First programs call for a "Hello World!" But how to do a hello world in a massively parallel environment? I mean, I guess we could write out "Hello World!" in every thread, but that's kind of boring, right? So let's store each letter used in one array, the offset to the letters in another array, and then calculate "Hello World!" in the threads. Now that's more like it!

We only store the necessary letters. "Hello World!" is stored as indices into the character array.

char h_inputChars[10] = "Helo Wrd!";
uint h_indexes[13] = {0, 1, 2, 2, 3, 4, 5, 3, 6, 2, 7, 8, 9};

Then the kernel calculates the output string using the thread index:

__global__ void helloWorldKernel(char *inputChars, uint *indices, char *output) {
    uint index = blockIdx.x * blockDim.x + threadIdx.x;
 
    output[index] = inputChars[indices[index]];
}

Yea, yea. Super inefficient. But hey, at least we're doing something.

Ok, so now we know how to create the basic CUDA boilerplate code and do some calculations. The next step is figuring out how to talk between CUDA and DirectX so we can display something on the screen.

Bringing DirectX to the Party

Thankfully, the folks over at nVidia have made a library to do just that. They also have quite a few example projects to help explain the API. So, using one of the example projects as a guide and taking some boilerplate rendering code from The Halfling Project, I was able to create some cool pulsing plaid patterns:

Untz! Untz! Untz! Untz! Huh? Oh.. this isn't the crazy new club?

Casting the First Rays

Ok, so now we can do some computation in CUDA, save it in an array, and DirectX will render the array as colors. On to rays!

In path tracing, the first rays we shoot out are the ones that go from the eye through each pixel on the virtual screen.

In order to create the ray, we need to know the distances a and b in world units. Therefore, we need to convert pixel units into world units. To do this we need a define a camera. Let's define an example camera as follows:

\[origin = \begin{bmatrix} 0 & 0 & 0 \end{bmatrix}\]
\[coordinateSystem = \begin{bmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1 \end{bmatrix}\]
\[fov_{x} = 90^{\circ}\]
\[fov_{y} = \frac{fov_{x}}{aspectRatio}\]
\[nearClipPlaneDist = 1\]

The field of view, or fov is an indirect way of specifying the ratio of pixel units to view units. Specifically, it is the viewing angle that is seen by the camera.

The higher the angle, the more of the scene is seen. But remember, changing the fov does not change the size of the screen, it merely squishes more or less of the scene into the same number of pixels.

Let's look at the triangle formed by fov_x and the x-axis:

We can use the definition of tangent to calculate the screenWidth in view units
\[\tan \left (\theta \right) = \frac{opposite}{adjacent}\]
\[screenWidth_{view}= 2 \: \cdot \: nearClipPlaneDist \: \cdot \: \tan \left (\frac{fov_{x}}{2} \right)\]

Using that, we can calculate the view units of the pixel.
\[x_{homogenous}= 2 \: \cdot \: \frac{x}{width} \: - \: 1\]
\[x_{view} = nearClipPlaneDist \: \cdot \: x_{homogenous} \: \cdot \: \tan \left (\frac{fov_{x}}{2} \right)\]

The last thing to do to get the ray is to transform from view space to world space. This boils down to a simple matrix transform. We negate y_view because pixel coordinates go from the top left of the screen to the bottom right, but homogeneous coordinates go from (-1, -1) at the bottom left to (1, 1) at the top right.
\[ray_{world}= \begin{bmatrix} x_{view} & -y_{view} & nearClipPlaneDist \end{bmatrix}\begin{bmatrix} & & \\ & cameraCoordinateSystem & \\ & & \end{bmatrix}\]

The code for the whole process is below. (In the code, I assume the nearClipPlaneDist is 1, so it cancels out)

__device__ float3 CalculateRayDirectionFromPixel(uint x, uint y, uint width, uint height, DeviceCamera &camera) {
    float3 viewVector = make_float3((((x + 0.5f) / width) * 2.0f - 1.0f) * camera.tanFovDiv2_X,
                                    -(((y + 0.5f) / height) * 2.0f - 1.0f) * camera.tanFovDiv2_Y,
                                    1.0f);
 
    // Matrix multiply
    return normalize(make_float3(dot(viewVector, camera.x),
                                 dot(viewVector, camera.y),
                                 dot(viewVector, camera.z)));
}

If we normalize the ray directions to renderable color ranges (add 1.0f and divide by 2.0f) and render out the resulting rays, we get a pretty gradient that varies from corner to corner.

Well, as this post is getting pretty long, I feel like this is a nice stopping point. The next post will be a short one on generating random numbers on the GPU and implementing an accumulation buffer. I've already implemented the code, so the post should be pretty soon.

After that, I will start implementing the object intersection algorithms, then the path tracing itself! Expect them in the coming week or so!

The code for everything in this post is on GitHub. It's open source under Apache license, so feel free to use it in your own projects.

As always, feel free to ask questions, make comments, and if you find an error, please let me know.

Happy coding!
RichieSams