Streaming is available in most browsers,
and in the WWDC app.
Metal 2 on A11 - Imageblock Sample Coverage Control
Imageblock sample coverage control provides access to multisample tracking data within a tile shader, enabling development of custom MSAA resolve algorithms and more. Understand how the A11 GPU tracks unique samples, then explore an example that optimizes rendering of dense geometry through surface aggregation.
Building upon Imageblocks and tile shading, Imageblock Sample Coverage Control gives you new opportunities to optimize your multisampled render passes and is designed for the enhanced multisampling hardware in the A11's GPU. Imageblock Sample Coverage is the third in a series of presentations focused on new Metal 2 features on A11.
Before we dive into the enhanced multisampling features in Metal 2, let's have a quick refresher on Multisample Antialiasing. Multisample Antialiasing, or MSAA, is a technique used to improve the appearance of primitive edges by representing each pixel with multiple depth and color samples. First let's take a look at how a GPU renders a triangle without multisampling. Here's a four by four grid of pixels representing your color attachment.
The GPU's rasterizer samples the center of a pixel to determine if it is covered by a primitive. Let's bring in a triangle.
All the pixel centers covered by the primitive are colored red. If each pixel only has a single sample position, a pixel is classified as covered or not covered based only on the pixel center. You can see in the resulting image the classic symptom of edge aliasing: the staircase artifacts also known as jaggies. Let's see the same triangle rendered with multisample antialiasing. For this example, each pixel has four evenly distributed sampling positions. With four samples per pixel, the GPU's rasterizer can determine finer grained coverage for the primitive. When a multisampled attachment is resolved, the GPU will average the values of the samples to determine the final color of each pixel. This results in a smoother edge and reduces the appearance of the staircase effect. In traditional multisampling implementations, the tradeoffs for the improved edge appearance are the computational overhead of blending each sample, the larger memory footprint of multisampled attachments textures, and the higher memory bandwidth to store and resolve multiple samples per pixel. Apple's A-Series GPUs have a very efficient MSAA implementation that directly addresses these tradeoffs. The hardware tracks whether each pixel contains a primitive edge so that your blending executes per sample only for the pixels that have difference sample values. With Metal on A-series GPUs, you can eliminate the extra memory storage requirements by using memoryless render targets for the multisampled attachments. The full sample data only exists temporarily in tile memory.
By using Metal's multisample resolve store action, you avoid incurring any additional system memory bandwidth by directly resolving from tile memory to the resolve attachment. In addition, Metal 2 introduced Programmable Sample Positions to allow you to choose the sample locations and control your sampling pattern. With Metal 2 on A11, we made multisampling even more efficient at blending. While current A-series GPUs already track whether edges intersect each pixel, the A11 GPU extends this tracking to an even finer granularity by tracking the number of unique samples within each pixel. Even without using any other Metal 2 features, your existing multisampled applications will blend more efficiently with A11's GPU's enhanced multisampling hardware.
Leveraging the flexibility of other Metal 2 features such as Imageblocks and tile shading, Imageblock Sample Coverage Control gives you access to each pixel's sample coverage tracking data for even more control of your multisampled render passes. With Imageblock Sample Coverage Control, your tile pipelines can resolve sample data at any time in a render pass using your own custom resolve algorithms.
To understand the additional flexibility provided by Imageblock Sample Coverage Control, you first need to understand how edge tracking works in the A-Series GPUs. Current A-series GPUs rasterize your scene in tiles, and each tile contains metadata that tracks whether a pixel contains a primitive edge.
In this image, the red pixels contains primitive edges, and the white pixels do not. The number of edge-containing pixels increases relative to the density of your scene's geometry. When a pixel contains a primitive edge, that means it has multiple unique sample values. For this reason, the blend equation executes per sample for edge-containing pixels. Pixels that do not contain an edge need to blend only once.
The A11 GPU tracks edges with a finer granularity. Pixels that contain edges often only have a few unique samples. The A11 GPU's enhanced multisampling tracks and blends only the unique samples. In the Metal Shading Language, we give a name to these unique samples: colors.
So let's take a look at how the A11 GPU improves multisampled blending performance. For pixels that contain a primitive edge, the A11 GPU will track the number of unique colors in that pixel. As primitives intersect or completely cover a pixel, the number of colors contained in that pixel will grow and shrink. The A11 GPU tracks these transitions automatically. In the diagram on the right, the pixel contains four samples, but only two colors. When the A11 GPU needs to blend the next primitive with this pixel, there are only two unique samples to blend. And for advanced rendering algorithms that use programmable blending, the savings will be significant.
So let's take a fragment through a few of these color tracking transitions. Initially, each fragment contains a single color, representing all samples. This would be your clear color when a render pass begins. If a primitive edge cuts through a pixel, the A11 GPU will create a new unique color and assign the covered samples to the new color. The two samples covered by this green triangle are assigned to the new color one. The samples that are not covered are still assigned to the color zero. Let's say the next primitive that intersects this pixel is a red translucent triangle. The red triangle covers three samples. Current A-Series GPUs would blend each of the three covered samples. The A11 GPU, blends only twice because two of the covered samples share the same color index. In this case, color one is a blend of green and red, and the GPU creates a new color at index two since that is a new, unique color. The number of unique colors in a pixel can grow when a primitive cuts through a pixel. But there are times when the hardware will reduce the number of unique colors. Here, an opaque, non-blending triangle entirely covers the pixel. All four samples are completely replaced by blue, so the A11 GPU will merge the three colors back to one, since all the blue samples can be represented by a single color again. The enhanced mutlisampling hardware in the A11 GPU is so powerful that we extended the Metal shading language to give you explicit control over sample coverage with Imageblock Sample Coverage Control. With this new feature, tile pipelines have the capability to resolve sample data in place in the middle of a render pass by changing the color coverage of the pixel. And since you write the kernel in Metal Shading Language, that means you can write your own custom resolve filters. Let's go through a simple example. First, we have a kernel that has an imageblock argument. Next, we query the number of colors at a given coordinate in the imageblock. For a render pass with four samples per pixel, the value returned can be one, two, three, or four, depending on how many unique colors are at that pixel. A multisampled imageblock can return an imageblock data for each sample or color. In this example, we will get the imageblock data for color c. Since this example is looping over the number of unique colors, we also need to consider the number of samples that are covered by each color. We do this by getting the coverage mask for this color index and calling pop count to get the number of set bits in the mask. Next, we finish resolving our color by dividing by the number of samples per pixel and writing the resolved value back to the imageblock with a full sample mask. By writing a single value with a full sample mask, the A11 GPU will merge all the sample data back to a single color. Now this is an example of a basic resolve, but since it is a tile pipeline, you can write a kernel to resolve your sample data in a way that best fits your application. So you just saw an example of writing a custom resolve filter. Let's discuss another reason to use tile shading to resolve sample data. Now, some applications render complex scenes with lots of opaque geometry and lots of translucent geometry like particles. While the A11 GPU will do its best to blend only the unique colors for each pixel, if you know that your scene has a lot of blended geometry, with large amount of overdraw, you may want to resolve your sample data with a tile pipeline before the heavy blending phase. With Imageblock Sample Coverage Control, you can resolve the sample data with a tile pipeline after rendering your opaque geometry to ensure that all pixels contain a single unique color prior to blending. Let's look at a more advanced example of using a tile pipeline to change the coverage of the sample data. Since the tile pipeline can be implemented with the compute function, you can do much more than simply average values together. Our Surface Aggregation sample app starts with a multisampled single-pass deferred shading algorithm and uses a tile-based kernel dispatch to reduce the number of shaded samples in the deferred pass. The goal of this algorithm is to shade fewer samples in the expensive deferred pass while retaining the edge-smoothing benefits of multisample antialiasing. We won't dive into all the details of the algorithm, so be sure to download and explore our Surface Aggregation sample app. But now let's visualize how this technique reduces the cost of shading. The two images visualize the pixels containing more than one sample per pixel in the g-buffer. The image on the left shows the g-buffer before merging surfaces, and the image on the right shows the g-buffer after merging surfaces. The surface aggregation kernel is able to reduce the number of g-buffer samples that need to be shaded. As you can see on the right image, the only pixels containing multiple unique samples are on true creases and depth boundaries. Before Metal 2 on the A11 GPU, this algorithm would require a separate render pass for each phase of the algorithm, incurring multiple round trips to system memory. But with Imageblock Sample Coverage Control, all three phases of the algorithm can be merged into one render pass, saving your app a ton of memory bandwidth. As you can see in the diagram, all three phases operate on the imageblock keeping all the working data inside tile memory. First, you will render the g-buffer to the imageblock in tile memory. And next, you will dispatch the surface aggregation tile pipeline to reduce the number of g-buffer samples into fewer aggregate g-buffer samples. Finally, the deferred shading pass will only shade each aggregate sample.
If you're interested in learning more about this technique, please visit the link at the end of this presentation to download the sample app.
To recap, we first talked about the hardware enhancements made to multisampling in the A11 GPU. The A11 GPU tracks the number of unique samples in every pixel to reduce the cost of blending. This optimization applies to both API blending as well as programmable blending. We then discussed the enhancements in Metal 2 for the A11 GPU that expose this powerful hardware feature to kernels used in tile shading. With Imageblock Sample Coverage Control, you can write your own custom resolve kernels and dispatch them at any time in a render pass to implement powerful new optimizations. Together, the enhanced multisampling in the A11 GPU and the new shading language features in Metal 2 enable new techniques to keep your data on-chip longer. You can use this feature to implement algorithms like Surface Aggregation in a single render pass. For more information about Metal 2 and links to the sample code, please visit the developer website at developer.apple.com/metal. Thank you for watching.
Looking for something specific? Enter a topic above and jump straight to the good stuff.
An error occurred when submitting your query. Please check your Internet connection and try again.