Create graphics with less power and less memory.
For the past three decades, there have been two main schools of thought dominating the center stage of 3D graphics: rasterization and ray tracing. If you really don’t care about the finer details and you want a five-second explanation of ray tracing, then all you need to know is that ray tracing will turn something like this:
Since the 1990s, we’ve seen an exponential increase in rasterized GPU performance. However, each advance in visual quality has made content generation more complicated and costly. Every effect (shadow, reflection, highlight, or subtle change in illumination) must be anticipated by the designer or game engine, and special-cased. This work often results in massive game title budgets, slower time to market, and ultimately less content on the platform.
Ray tracing takes a different approach from rasterized rendering by closely modeling the physics of light; it provides the only way to achieve photorealism for interactive content at an affordable price. Even though you might not have heard of ray tracing before, there’s a high probability you’ve already seen what it can do: ray tracing is the primary technique used for special effects rendering in movies. In addition, the techniques used to produce realistic-looking games also rely on offline ray tracing in a process called pre-baking.
Finally, ray tracing has been used to seamlessly blend captured real life with computer-generated images (e.g., VR/AR applications) because it’s better at simulating optical effects like reflection, refraction, scattering, and dispersion than traditional rendering methods like scanline.
The main reason ray tracing hasn’t been embraced at a larger scale is that, traditionally, any real-time ray-tracing application needed a GPU beast running an OpenCL-based algorithm to produce images similar to the one above. Since most people don’t have a water-cooled rig of 300 W GPUs lying around, ray tracing never really took off. The problem with using GPU compute for ray tracing is its inefficiency in handling the complex issue of tracking coherency between rays. You see, GPU compute is extremely good at data-parallel operations that obey a certain (and predictable) structure. Image processing is a perfect application for GPU compute: you apply the same filter repeatedly on every pixel in a rinse-and-repeat fashion.
Imagination’s PowerVR Wizard, for example, abandons this flawed computation-intensive approach and implements a solution that’s modeled after how ray tracing actually works. PowerVR Ray Tracing GPUs start with a pixel. For each step, the engine tracks intersections of a ray with a set of relevant primitives of the scene and performs geometry processing using a shader, exactly as in rasterization. Behind the scenes, the RTU (ray tracing unit) assembles the triangles into a searchable database representing the 3D world.
When a shader is run for each pixel, rays are cast into the 3D world. This is when the RTU steps in again to search the database to determine which triangle is the closest intersection with the ray. If a ray intersects a triangle, the shader can contribute color into the pixel, or cast additional rays; these secondary rays can in turn cause the execution of additional shaders.
All of this wizardry (pardon the pun) is implemented directly in hardware, which makes the whole GPU very fast at handling specialized ray tracing tasks and conserves power, making it 100x more energy efficient than competing solutions. This is the key advantage of using an RTU alongside a traditional GPU for ray tracing. Using specialized hardware to track secondary rays and their coherency yields an impressive speedup of this complicated computation, which in turn delivers a higher degree of visual realism in real time.
Implementing fast, ray traced soft shadows in a game engine
Not only does ray tracing create more accurate shadows that are free from the artifacts of shadow maps, but ray-traced shadows are also up to twice as efficient; they can be generated at comparable or better quality in half the GPU cycles and with half the memory traffic (more on this later).
Cascaded shadow maps in traditional rasterized graphics
Let’s go through the process of implementing an efficient technique for soft shadows. First, let’s review cascaded shadow maps, the current state-of-the-art technique for generating shadows in rasterized graphics. The idea behind cascaded shadow maps is to take the view frustum, divide it into a number of regions based on distance from the viewpoint, and render a shadow map for each region. This offers variable resolution for shadow maps: objects that are closer to the camera get higher resolution, while objects that fall off into the distance get lower resolution per unit area.
In the diagram below, we can see an example of several objects in a scene. Each shadow map is rendered, one after another, and each covers an increasingly larger portion of the scene. Since all of the shadow maps have the same resolution, the density of shadow map pixels goes down as we move away from the viewpoint.
Finally, when rendering the objects again in the final scene using the camera’s perspective, we select the appropriate shadow maps based on each object’s distance from the viewpoint, and interpolate between those shadow maps to determine if the final pixel is lit or in shadow.
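The region split and the per-object lookup described above can be sketched in a few lines of Python. This is only an illustration: the article does not say how its regions were chosen, so the blend of uniform and logarithmic spacing used here (a common choice for cascaded shadow maps) is an assumption, as are the names `cascade_splits`, `select_cascade`, and the `blend` parameter. Interpolation between neighboring maps is left out for brevity.

```python
def cascade_splits(near, far, num_cascades, blend=0.5):
    """Return the far distance of each cascade, nearest cascade first.

    blend = 0 gives uniform spacing, blend = 1 gives logarithmic spacing.
    The blended scheme is an assumption, not taken from the article.
    """
    splits = []
    for i in range(1, num_cascades + 1):
        fraction = i / num_cascades
        uniform = near + (far - near) * fraction
        logarithmic = near * (far / near) ** fraction
        splits.append(blend * logarithmic + (1.0 - blend) * uniform)
    return splits


def select_cascade(view_distance, splits):
    """Return the index of the first cascade whose range covers the distance."""
    for index, split_far in enumerate(splits):
        if view_distance <= split_far:
            return index
    return len(splits) - 1  # beyond the last split: reuse the coarsest map


# Hypothetical camera range of 1 to 100 units split into four cascades.
splits = cascade_splits(1.0, 100.0, 4)
```

Because every cascade has the same texture resolution but the later ones span a wider depth range, the lookup above is what trades shadow detail for distance.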
All of this complexity serves one purpose: to reduce the occurrence of resolution artifacts caused by a shadow map being too coarse. This works because an object further away from the viewpoint will occupy less space on the screen and therefore less shadow detail is needed. And it works nicely, although the cost in GPU cycles and memory traffic is significant.
Enter ray traced shadows
Ray-traced shadows fundamentally operate in screen space. For that reason, there’s no need to align a shadow map’s resolution to the screen resolution; the resolution problem just doesn’t exist.
The basic ray traced shadow algorithm works like this: for every point on a visible surface, we shoot one ray directly toward the light. If the ray reaches the light, then that surface is lit and we use a traditional lighting routine. If the ray hits anything before it reaches the light, then that ray is discarded because that surface is in a shadow. This technique produces crisp, hard shadows like the ones we might see on a perfectly cloudless day.
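As a rough illustration of that hard-shadow test, the following Python sketch casts one ray from a surface point toward a point light and checks it against a list of spheres. Spheres stand in for the scene only to keep the example short; a real renderer would query its acceleration structure for triangles instead. The names `ray_hits_sphere` and `is_lit` and the small self-intersection offset are assumptions made for this example.

```python
import math


def ray_hits_sphere(origin, direction, center, radius, max_t):
    """True if origin + t * direction hits the sphere for some 1e-4 < t < max_t.

    `direction` must be unit length.
    """
    oc = [o - c for o, c in zip(origin, center)]
    b = sum(d * v for d, v in zip(direction, oc))
    c = sum(v * v for v in oc) - radius * radius
    discriminant = b * b - c
    if discriminant < 0.0:
        return False  # the ray's line misses the sphere entirely
    root = math.sqrt(discriminant)
    # Either intersection counts, as long as it lies between surface and light.
    return any(1e-4 < t < max_t for t in (-b - root, -b + root))


def is_lit(point, light_position, spheres):
    """Cast one shadow ray toward the light; lit only if nothing blocks it."""
    to_light = [l - p for l, p in zip(light_position, point)]
    distance = math.sqrt(sum(v * v for v in to_light))
    direction = [v / distance for v in to_light]
    return not any(
        ray_hits_sphere(point, direction, center, radius, distance)
        for center, radius in spheres
    )


# A unit sphere halfway between the ground and a light 10 units up.
scene = [((0.0, 5.0, 0.0), 1.0)]
light = (0.0, 10.0, 0.0)
```

With this scene, a ground point directly under the sphere is reported as shadowed, while a point five units to the side is lit.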
However, most shadows in the real world have a gradual transition between lighter and darker areas. This soft edge is called a penumbra. Penumbras are caused by different factors related to the physics of light; even though most games model light sources as a dimensionless point source, in reality light sources have a surface. This surface area is what causes the shadow softness. Within the penumbra region, part of the light is blocked by an occluding object, but the remaining light has a clear path. This is why you see areas that are neither fully lit nor fully in shadow.
The diagram below shows how we can calculate the size of a penumbra based on three variables: the size of the light source (R), the distance to the light source (L), and the distance between the occluder and the surface on which the shadow is cast (O). Moving the occluder closer to the surface shrinks the penumbra.
Based on these variables, we derive a simple formula for calculating the size of the penumbra:
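One plausible reading of that relationship, derived from similar triangles, can be sketched in Python. It assumes L is measured from the shadowed surface to the light, so the occluder sits L - O away from the light; if L is instead measured from the occluder, the denominator would simply be L. The function and parameter names are illustrative and not taken from the original implementation.

```python
def penumbra_width(light_size, light_distance, occluder_distance):
    """Estimate the penumbra width on the receiving surface.

    light_size        -- R, the extent of the light source
    light_distance    -- L, assumed here to be surface-to-light distance
    occluder_distance -- O, distance from the occluder to the surface
    """
    if not 0.0 <= occluder_distance < light_distance:
        raise ValueError("occluder must lie between the surface and the light")
    # Similar triangles: the light's extent R, seen past the occluder edge,
    # is scaled by the ratio of the two distances on either side of that edge.
    return light_size * occluder_distance / (light_distance - occluder_distance)
```

For example, `penumbra_width(2.0, 10.0, 5.0)` gives 2.0, and the result falls to 0 as the occluder touches the surface, matching the behavior described above.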
Using this straightforward relationship, we can formulate an algorithm to render accurate soft shadows using only one ray per pixel. We start with the hard shadows algorithm above; but when a ray intersects an object, we record the distance from the surface to that object in a screen-space buffer.
This algorithm can be extended to support semi-transparent surfaces. For example, when we intersect a surface, we can also record whether it’s transparent; if the surface is transparent, we choose to continue the ray through the surface, noting its alpha value in a separate density buffer.
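The sketch below shows one way such a shadow ray might fill the two buffers for a single pixel. The article does not state how alpha values from several transparent surfaces are combined, so multiplying the transmitted light by (1 - alpha) at each hit is an assumption, as is the `trace_shadow_ray` name and its list-of-hits input.

```python
def trace_shadow_ray(hits):
    """Walk the surfaces a shadow ray crosses on its way to the light.

    hits -- list of (distance, alpha) tuples sorted nearest first, where
            alpha = 1.0 is fully opaque. Returns (density, nearest_distance);
            nearest_distance is None when the ray reaches the light unblocked.
    """
    transmitted = 1.0
    nearest_distance = None
    for distance, alpha in hits:
        if nearest_distance is None:
            nearest_distance = distance  # goes to the distance-to-occluder buffer
        transmitted *= 1.0 - alpha       # assumed multiplicative attenuation
        if alpha >= 1.0:
            break                        # opaque surface: stop the ray here
    return 1.0 - transmitted, nearest_distance
```

An opaque hit yields a density of 1, no hit yields 0, and two half-transparent surfaces in a row yield 0.75 under this assumption.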
This method has several advantages over cascaded shadow maps or other common techniques:
- There are no shadow map resolution issues since it’s all based in screen space
- There are no banding, noise, or buzzing effects due to sampling errors
- There are no biasing problems (sometimes called Peter-Panning) since you’re shooting rays directly off geometry and therefore getting perfect contact between the shadow and the casting object
Below, we show an example of the buffers that are generated by the ray tracing pass (we’ve uploaded the original frame). First, we have the ray-tracing density buffer. Most of the objects in the scene are opaque, therefore we have a shadow density of 1. However, the fence region contains multiple pixels that have values between 0 and 1.
Next up is the distance to occluder buffer. As we get further away from the occluding objects, the value of the red component increases, representing a greater distance between the shadow pixel and the occluder.
Finally, we run a filter pass to calculate the shadow value for each pixel using these two buffers. To do this, we determine the size of the penumbra affecting each pixel, use that penumbra to choose a blur kernel radius, and then blur the screen-space shadow density buffer accordingly. For a pixel that has a populated value in the distance to occluder buffer, calculating the penumbra is easy. Since we have the distance to the occluder, we just need to project that value from world space into screen space, and use the projected penumbra to select a blur kernel radius.
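The projection step might look like the following Python sketch, which converts a world-space penumbra width into a blur radius in pixels for a perspective camera. It assumes a symmetric vertical field of view and ignores the orientation of the surface relative to the camera; `blur_radius_pixels`, its parameters, and the `max_radius` clamp are hypothetical names chosen for this example.

```python
import math


def blur_radius_pixels(penumbra_world, view_depth, fov_y_radians,
                       screen_height, max_radius=32):
    """Project a world-space penumbra width to a screen-space blur radius."""
    # World-space height that fills the screen at this depth.
    visible_height = 2.0 * view_depth * math.tan(fov_y_radians / 2.0)
    pixels_per_unit = screen_height / visible_height
    # The kernel radius covers half of the projected penumbra width.
    radius = 0.5 * penumbra_world * pixels_per_unit
    return min(max_radius, max(0, round(radius)))
```

The same penumbra therefore gets a smaller kernel the farther the pixel is from the camera, and a pixel with no penumbra gets a radius of 0 (no blur).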
When the pixel is lit, we need to work a little harder. We use a cross search algorithm to locate another pixel with a ray that contributes to the penumbra. If we find any pixels on the X or Y axis that are in shadow (i.e., have valid distance values), we’ll pick the maximum distance to occluder value and use it to calculate the penumbra region for this pixel; we then adjust for the distance of the located pixel and calculate the penumbra size.
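A minimal version of that cross search is sketched below in Python. It walks outward along the X and Y axes from a lit pixel and keeps the shadowed pixel with the largest distance-to-occluder value, together with how many pixels away it was found. Representing lit pixels as `None` and limiting the walk with a `search_radius` are assumptions made for this example.

```python
def cross_search(distance_buffer, x, y, search_radius):
    """Search the row and column through (x, y) for shadowed pixels.

    distance_buffer -- 2D list indexed [row][column]; None marks a lit pixel.
    Returns (max_distance, offset_in_pixels), or None if nothing was found.
    """
    height = len(distance_buffer)
    width = len(distance_buffer[0])
    best = None
    for step in range(1, search_radius + 1):
        for dx, dy in ((step, 0), (-step, 0), (0, step), (0, -step)):
            px, py = x + dx, y + dy
            if not (0 <= px < width and 0 <= py < height):
                continue  # off-screen sample
            distance = distance_buffer[py][px]
            if distance is not None and (best is None or distance > best[0]):
                best = (distance, step)
    return best


# 5x5 buffer with two shadowed pixels on the axes through the center.
buffer = [[None] * 5 for _ in range(5)]
buffer[2][4] = 3.0  # two pixels to the right of the center
buffer[0][2] = 5.0  # two pixels above the center
```

The returned offset is what lets the caller adjust for how far away the located pixel is before sizing the penumbra.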
From here on, the algorithm is the same: we take the size of the penumbra from world space and project it into screen space. Then we figure out the region of pixels that penumbra covers, and finally perform a blur. In the case where no shadowed pixel is found, we assume our pixel is fully lit. Below is a diagram representing our final filter kernel.
We cover the penumbra region with a box filter and sample it while staying aware of discontinuous surfaces. This is where depth rejection comes to our aid; to calculate the depth rejection values, we use local differencing to determine the delta between the current pixel and certain values on the X and Y axis. The result tells us how far we need to step back as we travel in screen space. As we sample our kernel, we expand the depth threshold based on how far away we are from the center pixel.
In the example above, we’ve rejected all the samples marked in red because the corresponding area belongs to the fence and we’re interested in sampling a spot on the ground. After the blurring pass, the resulting buffer represents an accurate estimate of the shadow density across the screen.
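The depth-aware box filter can be approximated with the Python sketch below. It estimates the local depth slope with a one-pixel difference on each axis and widens the rejection threshold in proportion to the sample's offset from the center. The exact differencing and threshold rule used in the original implementation are not given, so this particular formula, the `base_threshold` value, and the function name are assumptions.

```python
def depth_aware_box_blur(density, depth, x, y, radius, base_threshold=0.01):
    """Average shadow density around (x, y), skipping samples on other surfaces.

    density, depth -- 2D lists indexed [row][column] with identical dimensions.
    """
    height, width = len(density), len(density[0])
    center_depth = depth[y][x]
    # Local differencing: one-pixel depth delta along X and along Y.
    x_neighbor = x + 1 if x + 1 < width else x - 1
    y_neighbor = y + 1 if y + 1 < height else y - 1
    slope_x = abs(depth[y][x_neighbor] - center_depth)
    slope_y = abs(depth[y_neighbor][x] - center_depth)

    total, count = 0.0, 0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            px, py = x + dx, y + dy
            if not (0 <= px < width and 0 <= py < height):
                continue
            # The allowed depth difference grows with distance from the center.
            threshold = base_threshold + slope_x * abs(dx) + slope_y * abs(dy)
            if abs(depth[py][px] - center_depth) > threshold:
                continue  # rejected: sample belongs to a different surface
            total += density[py][px]
            count += 1
    return total / count  # the center pixel always passes, so count >= 1


# Flat ground (depth 1.0) with a nearer "fence" in the last column (depth 0.2).
depth = [[1.0, 1.0, 1.0, 1.0, 0.2] for _ in range(3)]
density = [[1.0, 0.0, 0.0, 0.0, 1.0] for _ in range(3)]
```

For the center pixel of this small buffer with a radius of 2, the fence column is rejected and only the ground samples are averaged, giving 0.25 instead of the 0.4 a plain box filter would produce.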
Complete results: cascaded shadow maps vs. ray traced shadows
The images below compare an implementation of four-slice cascaded shadow maps at 2K resolution versus ray-traced shadows. In the ray-traced case, we retain the shadow definition and accuracy where the distance between the shadow casting object and the shadow receiver is small. By contrast, cascaded shadow maps often overblur, ruining shadow detail.
At full resolution, the images reveal the severe loss of image quality that occurs with cascaded shadow maps.
In the second and third examples, we’ve removed the textures so we can highlight the shadowing.
Optimizing the ray tracing algorithms
The diagram below describes the initial implementation of the ray tracing hybrid model. The first optimization we can make is to cast fewer rays. We can use the information provided by dot(N, L) to establish whether a surface faces away from the light. If the dot(N, L) result is less than or equal to 0, we don’t need to cast any rays because we can assume the pixel is shadowed by virtue of facing away from the light.
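That early-out amounts to a single dot product per pixel, as in this short Python sketch. The function name `needs_shadow_ray` is illustrative; N is the surface normal and L the direction from the surface toward the light.

```python
def needs_shadow_ray(normal, to_light):
    """Return True only when the surface faces the light (dot(N, L) > 0)."""
    n_dot_l = sum(n * l for n, l in zip(normal, to_light))
    return n_dot_l > 0.0
```

Pixels for which this returns False can be written straight into the buffers as shadowed, without touching the ray tracing hardware.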
Looking at the rendering pipeline, further optimizations can be made. The diagram below shows the standard deferred rendering approach. This approach involves many read and write operations and costs bandwidth (and therefore power).
The first optimization we’ve made is to reduce the amount of data in each buffer by using data types that don’t have any more bits than the bare minimum needed. For example, we can pack our distance density buffer into only 8 bits by normalizing the distance value between 0 and 1 since it doesn’t require very high precision. The next step is to collapse passes. If we use the framebuffer fetch extension, we can collapse the ray tracing and G-Buffer into one pass, saving all of the bandwidth of reading the G-Buffer from the ray emission pass.
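The 8-bit packing mentioned above can be illustrated with the following Python sketch. The normalization range (`max_distance`) is not specified in the article, so it appears here as a hypothetical parameter; distances beyond it are clamped.

```python
def pack_distance(distance, max_distance):
    """Normalize a distance to [0, 1] and quantize it to one byte (0..255)."""
    normalized = min(max(distance / max_distance, 0.0), 1.0)
    return round(normalized * 255)


def unpack_distance(value, max_distance):
    """Recover an approximate distance from the stored byte."""
    return value / 255 * max_distance
```

With a range of 50 units, the round trip loses at most about 0.1 units, which is acceptable when the value is only used to pick a blur radius.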
Memory bandwidth usage analysis
Before we look at the final numbers, let’s spend some time looking at memory traffic. Bandwidth is the amount of data that’s accessed from external memory; memory traffic consumes bandwidth. Every time a developer codes a texture fetch, the shading cluster (USC) in a PowerVR Rogue GPU will look for it inside the cache memory. If the texture isn’t stored locally in cache, the USC will access DRAM to get the value. For every access to external memory, the chip will incur significant latency and the device will consume more power. When optimizing a mobile application, the developer’s goal is to always minimize accesses to memory.
By using specialized instruments to look at bandwidth use, we can compare cascaded shadow maps with ray-traced soft shadows on a PowerVR Wizard GPU. In total, the cascaded shadow maps implementation generates about 233 MB of memory traffic, while the same scene rendered with ray-traced soft shadows requires only 164 MB. For ray tracing, there’s an initial one-time setup cost of 61 MB due to the acceleration structure that must be built for the scene.
This structure can be reused from frame to frame, so it isn’t part of the totals for a single frame. We’ve also measured the G-Buffer independently to see how much of the total cost results from this pass. Subtracting the G-Buffer value from the total memory traffic value shows that shadowing using cascaded maps requires 136 MB while ray tracing requires only 67 MB, a reduction of roughly 50 percent in memory traffic.
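The arithmetic behind these figures can be checked with a few lines of Python. The G-Buffer cost is not quoted directly, but both pairs of numbers imply the same value, which is a useful consistency check. The variable names are illustrative.

```python
# Per-frame memory traffic quoted above, in MB.
csm_total_mb = 233      # cascaded shadow maps, whole frame
rt_total_mb = 164       # ray-traced soft shadows, whole frame
csm_shadow_mb = 136     # cascaded shadow maps, G-Buffer excluded
rt_shadow_mb = 67       # ray-traced soft shadows, G-Buffer excluded

# The G-Buffer cost implied by each pair of figures.
gbuffer_from_csm_mb = csm_total_mb - csm_shadow_mb
gbuffer_from_rt_mb = rt_total_mb - rt_shadow_mb

# Fraction of shadowing traffic saved by switching to ray tracing.
shadow_reduction = 1.0 - rt_shadow_mb / csm_shadow_mb
```

Both subtractions give a G-Buffer cost of 97 MB, and the shadowing traffic drops by about 51 percent (67 MB versus 136 MB). The one-time 61 MB acceleration structure build is excluded from these per-frame figures.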
We notice similar effects in other views of the scene depending on how many rays we can reject, and how much filtering we have to perform. Overall, we get an average 50 percent reduction in memory traffic using ray-traced shadows.
Looking at total cycle counts, the picture is even better; we see an impressive speed boost from the ray-traced shadows. Because the different rendering passes are pipelined in both apps (i.e., the ray-traced shadows app and the cascaded shadow maps app), we can’t separate how many clocks are used for which pass. This is because portions of the GPU are busy executing work for multiple passes at the same time. However, the switch to ray-traced shadows resulted in a doubling of the performance for the entire frame.
Even though rasterized content today already looks pretty impressive, there are still big challenges to close the gap between the photorealism that ray tracing offers and what we see on current-generation mobile and desktop devices.