Take advantage of the Metal Counters API for GPU profiling in macOS Big Sur and iOS 14. This API provides access at runtime to low-level GPU profiling information, which was previously available only through offline tools in Xcode and Instruments. Metal Counters accelerate the optimization process by giving you access to important GPU information, helping you fine-tune your app's performance to create faster and more fluid apps and gaming experiences. Learn to collect and parse these low-level GPU timestamps and use the in-depth information to help with performance tuning in Metal.
Hi, I'm Lionel Lemarié from the GPU software team. In this session, we're going to use the Metal Counter API to get precise GPU timings at runtime.
We will cover a few topics. We'll start with a quick intro of the Metal Counter API.
Then we'll take a brief look at the main features of a typical live profiling HUD.
We'll use the API step-by-step to collect the profiling information.
And we'll conclude by looking at how this data fits into the HUD.
So let's start with a quick intro of the API.
The Metal Counter API is new in iOS 14. It was available in macOS Catalina and has been extended in macOS Big Sur. On iOS and macOS with Apple Silicon, it gives you access to stage boundary timings. That is, precise start and end times for vertex, fragment and compute passes. On Intel and AMD GPUs, you can get draw boundary timings, precise GPU timestamps even within individual passes.
Next, let's do a quick recap of the main features of a live profiling HUD.
You'd use a live HUD to track your app's performance at runtime. It can help you find problem areas that need to be investigated offline in Xcode or Instruments.
You can also use it to tune your resolution and quality settings per device, for instance.
As an example, a typical live HUD may look something like this.
It shows frame times as a scrolling line chart to help you catch frame hitches, stats about your memory usage, resolution and more, and a timeline of CPU events. And today, we'll add GPU events to the timeline.
Typically, for your CPU markers, you would instrument your most important functions using mach_absolute_time to get the start and end timestamps.
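As a rough sketch of such a CPU marker in Swift: the `CPUMarker` type and the `measure` and `nanoseconds(fromTicks:)` helpers are made-up names for illustration, and only `mach_absolute_time` and `mach_timebase_info` come from the system. It assumes an Apple platform.

```swift
import Foundation

// Hypothetical marker type for the HUD timeline; not part of any Apple API.
struct CPUMarker {
    let name: String
    let start: UInt64   // mach_absolute_time ticks
    let end: UInt64     // mach_absolute_time ticks
}

// Converts mach ticks to nanoseconds using the timebase reported by the system.
// Intended for durations; very large absolute values could overflow the multiply.
func nanoseconds(fromTicks ticks: UInt64) -> UInt64 {
    var info = mach_timebase_info_data_t()
    mach_timebase_info(&info)
    return ticks * UInt64(info.numer) / UInt64(info.denom)
}

// Runs `work` and records the timestamps taken right before and right after it.
func measure(_ name: String, _ work: () -> Void) -> CPUMarker {
    let start = mach_absolute_time()
    work()
    let end = mach_absolute_time()
    return CPUMarker(name: name, start: start, end: end)
}
```

You could wrap the code between creating and committing a command buffer in `measure`, then convert `end - start` with `nanoseconds(fromTicks:)` for display.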
A good start with CPU markers is to put them around your command buffer work: a start marker when you create it and an end marker when you commit it. That gives you an overview of the CPU costs of your rendering.
Now we want to add equivalent GPU markers. You may be familiar with the GPU events on the Metal System Trace timeline, as we're seeing here. On iOS in this example, it shows the different stages of the tile-based deferred rendering architecture. Here we see the vertex and fragment processing work.
To do this, the GPU firmware logs the start and end of the vertex stage.
Then it logs the start and end of the fragment stage.
And the Metal System Trace displays the stages on the timeline. For immediate mode GPUs, you can log events for groups of draw calls. For example, you would log the start of rendering object one before all of its draw calls and then log the start of rendering object two. And finally, the end of object two. You then have a timeline that shows precisely how long the GPU spends rendering each object.
Now let's use the Metal Counter API to achieve that.
We'll start by checking which counter sampling modes are available. We need to know whether the GPU should record at the stage boundary, for vertex, fragment or compute stages, or if it should record at the draw boundary, as we've just seen.
We simply use the supportsCounterSampling API to check if the current device supports stage boundary, for a TBDR GPU, or draw boundary, for an immediate mode GPU.
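A minimal sketch of that check in Swift. The `SamplingMode` enum and the `samplingMode(for:)` function are hypothetical names for illustration; `supportsCounterSampling` and the sampling point cases come from Metal and need macOS 11 or iOS 14.

```swift
import Metal

// Hypothetical summary of what the device can do; not a Metal type.
enum SamplingMode {
    case stageBoundary   // typical for TBDR GPUs
    case drawBoundary    // typical for immediate mode GPUs
    case unsupported
}

// Prefers stage boundary sampling and falls back to draw boundary sampling.
func samplingMode(for device: MTLDevice) -> SamplingMode {
    if device.supportsCounterSampling(.atStageBoundary) {
        return .stageBoundary
    }
    if device.supportsCounterSampling(.atDrawBoundary) {
        return .drawBoundary
    }
    return .unsupported
}
```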
Next we check which counter sets are available on the device. Counter sets include timestamps, stage utilization and pipeline statistics. For our GPU markers, we need the counter set that collects timestamps. So we enumerate all available counter sets on the device and choose the one for timestamps.
Once you have the right counter set, inspect it to ensure that it does have timestamps, as some devices may not support them.
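The lookup and the inspection could be sketched like this. The function name `timestampCounterSet(on:)` is an assumption for illustration; the `counterSets` property and the `MTLCommonCounterSet` and `MTLCommonCounter` constants are Metal's.

```swift
import Metal

// Returns the timestamp counter set if the device exposes one that really
// contains a timestamp counter, or nil otherwise.
func timestampCounterSet(on device: MTLDevice) -> MTLCounterSet? {
    // counterSets is nil on devices without counter support.
    guard let sets = device.counterSets else { return nil }

    // Pick the set whose name matches the common timestamp set.
    guard let set = sets.first(where: {
        $0.name == MTLCommonCounterSet.timestamp.rawValue
    }) else { return nil }

    // Inspect the set: some devices may not include the timestamp counter.
    let hasTimestamp = set.counters.contains {
        $0.name == MTLCommonCounter.timestamp.rawValue
    }
    return hasTimestamp ? set : nil
}
```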
We're done with the initial setup. Now let's see what's needed at runtime during each frame. There are just four easy steps. First we'll create a sample buffer with a size, storage mode and the counter set we just looked up. Then we'll add the sample buffer to the pass descriptor. Note that it means that you need at least one buffer per pass. Next, if we're using sampling at draw boundary, we'll add sampling commands at important points. Finally, in the completion handler, we'll resolve the counters. And we'll talk about aligning CPU and GPU timestamps if needed. Let's check out each step in detail.
First, we create a sample buffer using a descriptor. We specify the maximum number of samples it can hold, so it has the right size. We'll use six samples here, but you'll typically use more than that.
Then we set the storage mode. Shared mode is great here. It's not a lot of data, and it makes accessing the counters extra easy.
We specify the counter set to use, the one for timestamps. Finally, we create the sample buffer. So far, so good. Now we have a buffer for six samples.
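Put together, creating the sample buffer might look like the sketch below. The function name, the default of six samples and the label string are illustrative assumptions; the descriptor properties and `makeCounterSampleBuffer(descriptor:)` are Metal's.

```swift
import Metal

// Creates a shared-storage sample buffer for `sampleCount` timestamp samples.
// Returns nil if the device rejects the descriptor.
func makeTimestampSampleBuffer(device: MTLDevice,
                               counterSet: MTLCounterSet,
                               sampleCount: Int = 6) -> MTLCounterSampleBuffer? {
    let descriptor = MTLCounterSampleBufferDescriptor()
    descriptor.sampleCount = sampleCount   // maximum number of samples it can hold
    descriptor.storageMode = .shared       // small data, easy to read from the CPU
    descriptor.counterSet = counterSet     // the timestamp set found earlier
    descriptor.label = "HUD timestamps"    // hypothetical label for debugging
    return try? device.makeCounterSampleBuffer(descriptor: descriptor)
}
```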
As an example, let's use it in a render encoder.
You would do it the same for compute and blit encoders. For this, you use the sample buffer attachment from the render pass descriptor. If we are using stage boundary, this is where we set it up. We specify the start and end of the vertex stage. We're putting them at index zero and one in the sample buffer. That's how the GPU knows where to write each sample and how you know where to retrieve them from. Same for the start and end of the fragment stage.
Finally, we point it to the sample buffer that we just created to store those samples.
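Here is a sketch of that stage boundary setup, assuming the device reported support for stage boundary sampling. The function name is made up for illustration; the attachment properties are Metal's, and the indices follow the layout described above (vertex at zero and one, fragment at two and three).

```swift
import Metal

// Asks the GPU to write stage boundary timestamps for this render pass
// into slots 0...3 of `sampleBuffer`.
func attachStageBoundarySamples(to pass: MTLRenderPassDescriptor,
                                sampleBuffer: MTLCounterSampleBuffer) {
    let attachment: MTLRenderPassSampleBufferAttachmentDescriptor =
        pass.sampleBufferAttachments[0]

    // Start and end of the vertex stage.
    attachment.startOfVertexSampleIndex = 0
    attachment.endOfVertexSampleIndex = 1

    // Start and end of the fragment stage.
    attachment.startOfFragmentSampleIndex = 2
    attachment.endOfFragmentSampleIndex = 3

    // Where the GPU stores those samples.
    attachment.sampleBuffer = sampleBuffer
}
```

Call it before creating the render command encoder from the pass descriptor, since the descriptor is read at encoder creation.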
To sample a draw boundary, you add sample commands at key points of your command stream.
The first obvious placement for these is before and after all draw calls. So after you've created a new encoder, you immediately add a sample command. We'll put it at index four, since we've already reserved the first four slots for stage boundary samples.
After all your draw calls, right before ending your encoder, you add a sample command at index five.
So the GPU will record a timestamp before and after all the work for that encoder. You can add more sample commands between groups of draw calls to mark important milestones. Just make sure your sample buffer is allocated with enough space in advance.
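For draw boundary sampling, the two sample commands could be placed as in this sketch. The function name and the `drawAll` closure are hypothetical stand-ins for your own rendering code, and passing `barrier: true` is an assumption made here so that each sample waits for the preceding work; `sampleCounters(sampleBuffer:sampleIndex:barrier:)` is the Metal call. It is only valid on devices that support draw boundary sampling.

```swift
import Metal

// Brackets all the draw calls of one encoder with two timestamp samples.
func encodeWithDrawBoundarySamples(encoder: MTLRenderCommandEncoder,
                                   sampleBuffer: MTLCounterSampleBuffer,
                                   drawAll: (MTLRenderCommandEncoder) -> Void) {
    // After creating a new render command encoder: sample into slot 4,
    // since slots 0...3 are reserved for stage boundary samples.
    encoder.sampleCounters(sampleBuffer: sampleBuffer, sampleIndex: 4, barrier: true)

    // All draw calls.
    drawAll(encoder)

    // Right before ending the encoder: sample into slot 5.
    encoder.sampleCounters(sampleBuffer: sampleBuffer, sampleIndex: 5, barrier: true)
    encoder.endEncoding()
}
```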
Speaking of which, we've allocated a buffer big enough for both stage boundary and draw boundary sampling. You can easily optimize it by allocating just enough for stage or draw boundary sampling in isolation, since they are mutually exclusive.
Right. The GPU has been instructed to sample timestamp counters at stage or draw boundary. Next, we wait for the rendering to complete, and in the handler, we collect the data.
Remember that we created a sample buffer per encoder. So in the command buffer completion handler, we may need to parse multiple sample buffers.
For each one, we resolve the counters. That translates device-specific data into a unified Metal struct that's super easy to parse. Simply point to it and use the MTLCounterResultTimestamp struct. As we specified that vertexStart should be at index zero, we read it directly from there. Then we'll do the same for all other samples.
Note that some error checking is needed here. It's possible that the GPU failed to fill the sample buffer, so you need to check that the resolve step collected the expected number of samples and that each sample is valid.
The GPU will use a predefined error value if it can't get a specific timestamp.
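The resolve step with its error checks might be sketched as follows. The function name and the choice to represent an invalid sample as `nil` are assumptions for illustration; `resolveCounterRange`, `MTLCounterResultTimestamp` and `MTLCounterErrorValue` are Metal's.

```swift
import Foundation
import Metal

// Resolves the first `count` samples into timestamps.
// Returns nil if resolving failed or came back short; individual samples
// the GPU could not record are returned as nil entries.
func resolveTimestamps(from sampleBuffer: MTLCounterSampleBuffer,
                       count: Int) -> [UInt64?]? {
    guard let data = try? sampleBuffer.resolveCounterRange(0..<count) else {
        return nil
    }

    // Check that the resolve step produced the expected number of samples.
    let stride = MemoryLayout<MTLCounterResultTimestamp>.stride
    guard data.count >= count * stride else { return nil }

    return data.withUnsafeBytes { (raw: UnsafeRawBufferPointer) -> [UInt64?] in
        let results = raw.bindMemory(to: MTLCounterResultTimestamp.self)
        return (0..<count).map { index -> UInt64? in
            let timestamp = results[index].timestamp
            // The predefined error value marks a sample the GPU could not fill.
            return timestamp == MTLCounterErrorValue ? nil : timestamp
        }
    }
}
```

With the layout used in this session, entry zero of the returned array would be the start of the vertex stage.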
On iOS and Apple Silicon devices, the GPU timestamps are aligned to mach_absolute_time so you can directly compare them to CPU timestamps.
On Intel and AMD GPUs, an extra step is needed. They need to be translated from a vendor specific time domain. This is because depending on how busy the GPU is, how much power it consumes and how hot it's running, its clock frequency is constantly adjusted over time, which affects the timestamps.
To address this on immediate mode GPUs, you use the sampleTimestamps API to query matching CPU and GPU timestamps at a given time.
You do it at regular intervals to avoid drifting and keep precise correlation over time. Then you do a simple linear interpolation of the samples collected.
As an example, you may call sampleTimestamps inside the command buffer completion handler so you get one correlation per frame.
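Querying one correlation point could look like this small sketch. The wrapper name `sampleCorrelation(on:)` and its tuple return are assumptions for illustration; `sampleTimestamps(_:gpuTimestamp:)` and `MTLTimestamp` are Metal's.

```swift
import Metal

// Captures a matching pair of CPU and GPU timestamps from the device.
func sampleCorrelation(on device: MTLDevice) -> (cpu: MTLTimestamp, gpu: MTLTimestamp) {
    var cpu: MTLTimestamp = 0
    var gpu: MTLTimestamp = 0
    device.sampleTimestamps(&cpu, gpuTimestamp: &gpu)
    return (cpu, gpu)
}
```

Calling it once per frame from the completion handler and keeping the last two results gives the pair of points needed for interpolation.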
Let's say you query CPU and GPU timestamps at t0. And then at the next frame, you query them at t1.
All GPU counters from the sample buffers can now be scaled and offset back into CPU domain.
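The scale and offset is a plain linear interpolation between the two correlation points. A sketch, where the `TimestampCorrelation` type and the `cpuTime(forGPU:from:to:)` function are hypothetical names and the arithmetic goes through `Double`, which is an assumption that trades a little precision for simplicity:

```swift
// One matching pair of CPU and GPU timestamps (hypothetical type).
struct TimestampCorrelation {
    let cpu: UInt64
    let gpu: UInt64
}

// Maps a GPU timestamp into the CPU time domain using the line through
// the correlation points t0 and t1. Returns nil if the points are unusable.
func cpuTime(forGPU gpu: UInt64,
             from t0: TimestampCorrelation,
             to t1: TimestampCorrelation) -> UInt64? {
    guard t1.gpu > t0.gpu, t1.cpu >= t0.cpu else { return nil }

    let gpuSpan = Double(t1.gpu - t0.gpu)
    let cpuSpan = Double(t1.cpu - t0.cpu)

    // Signed distance of the sample from t0 in GPU ticks.
    let delta = gpu >= t0.gpu ? Double(gpu - t0.gpu) : -Double(t0.gpu - gpu)

    // Scale into CPU ticks, then offset from t0.
    let scaled = (delta * cpuSpan / gpuSpan).rounded()
    if scaled >= 0 {
        return t0.cpu + UInt64(scaled)
    }
    let back = UInt64(-scaled)
    return back <= t0.cpu ? t0.cpu - back : nil
}
```

For example, with t0 at CPU 1000 and GPU 500, and t1 at CPU 2000 and GPU 1000, a GPU timestamp of 750 lies halfway between the two points and maps to CPU 1500.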
And that's all we need. So let's look at how it could all be displayed together.
We're seeing the CPU markers. We captured them with mach_absolute_time. We're seeing vertex, fragment and compute stages all overlapping and aligned with CPU activity. You can even collect the mach_absolute_time inside the presented handler to align all the markers to actual glass-to-glass frames and precisely display all the events within each frame.
Using this HUD, you get a great view of whether you're CPU or GPU bound, your dependencies and sync points, the breakdown of vertex, fragment and compute work and how they affect each other.
All of that, live, right there inside your app.
There are a few things you can watch out for. Don't update the HUD too often. Just like an FPS counter, live data can be hard to read if it's constantly changing. You can collect the timestamps every frame but only update the markers on-screen once per second, for example. It makes it significantly easier to follow.
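One way to do that is to accumulate per-frame values and only hand an average to the display once per interval. A sketch, where `HUDThrottle` and its members are hypothetical names and times are in seconds from whatever clock you already use:

```swift
// Collects a value every frame but only reports an average once per interval.
struct HUDThrottle {
    let interval: Double
    private var windowStart: Double
    private var total = 0.0
    private var count = 0

    init(interval: Double = 1.0, startTime: Double = 0.0) {
        self.interval = interval
        self.windowStart = startTime
    }

    // Records one frame's value. Returns the average for the window when the
    // interval has elapsed, or nil when the display should not change yet.
    mutating func record(_ value: Double, at time: Double) -> Double? {
        total += value
        count += 1
        guard time - windowStart >= interval else { return nil }

        let average = total / Double(count)
        total = 0.0
        count = 0
        windowStart = time
        return average
    }
}
```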
Secondly, GPU activity depends on its clock rate. Seeing a high GPU occupancy does not necessarily mean it is maxed out.
As the system only uses as much power as needed, it will balance power and performance.
As a consequence, you might see the GPU being 80% busy in the HUD. But if it is running at half the max clock rate, then it would actually be running at 40% of the peak performance and have plenty of headroom.
And as always, you should handle errors, but you should also watch out for inconsistencies.
For example, counters can overflow, which would cause a new value to be smaller than the previous one and may trigger a negative duration.
Putting your device to sleep or into hibernation while sampling counters may also cause large outliers. Those are rare events, and you should skip them gracefully to avoid glitches in your display or logs.
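Skipping such samples can be as simple as the check sketched below. The function name and the `maxPlausible` threshold are assumptions for illustration; you would pick a limit that suits your own frame times.

```swift
// Returns the duration between two timestamps, or nil when the pair looks
// inconsistent: an end before its start (for example after an overflow),
// or a span larger than `maxPlausible` (for example after a sleep).
func plausibleDuration(start: UInt64, end: UInt64, maxPlausible: UInt64) -> UInt64? {
    guard end >= start else { return nil }
    let duration = end - start
    return duration <= maxPlausible ? duration : nil
}
```

A marker whose duration comes back as `nil` is simply left out of that frame's timeline.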
And so to recap, we've just gone through the steps to use the Metal Counter API to collect GPU timestamps. To do that, we used the device's supportsCounterSampling method to find out which sampling modes are supported. We enumerated the counter sets to find the set with the GPU timestamps. We created a new sample buffer using its descriptor and used it in the render command encoder. You will want to do the same with blit and compute encoders too. We added specific sample commands before and after all the draws. You can add them between draws, dispatches and blits too to get sub-pass timings.
We resolved the counters into CPU memory. And finally, we realigned them if needed.
And with that, you have all the data needed for a powerful, live GPU profiling HUD to display on top of your app.
And this API gives you access to more than just the GPU timestamps. You can get summarized per stage information, which is easier to process if you are not drawing the events on the timeline. And importantly, you can get some in-depth statistics, such as the number of invocations for vertex and fragment shaders and compute kernels and much more.
There's plenty to explore in the Metal Counter API, and it gives you access to a lot of information to profile your GPU performance at runtime.