Meet the Apple Immersive Video format
Explore the foundations of the Apple Immersive Video format and learn how the system is engineered to preserve fidelity, presence of space, and world scale. This session introduces acuity targets, acquisition requirements, AIV file format structures, and foveation techniques. It also covers the role of AIV-specific metadata, such as ILPD and motion comfort data, in creating and playing back amazing AIV experiences.
This session was originally presented as part of the Meet with Apple activity “Create immersive media experiences for visionOS - Day 2.” Watch the full video for more insights and related sessions.
Good morning. I'm Ryan and I lead the production ecosystem team for Apple Immersive Video, and I am so excited that you guys are all here.
So our team partners with the AIV platform architecture team. These folks develop the feature rich technology that enables Vision Pro to deliver magical experiences.
Referencing these two teams helps us distinguish content production focused elements from developer focused playback elements of AIV.
And today, we're going to talk about what makes Apple Immersive Video different from everything else.
We'll start with fidelity of presence. The AIV ecosystem strives to capture, render, and deliver the world in perfect fidelity. And as a reference point, 20/20 vision is considered the benchmark for what the human eye can see. Remember that number. We'll come back to it a little bit later.
Next is peripheral FOV, or field of view. AIV delivers a field of view between 180 and 230 degrees. There are a few reasons for this.
The first is viewer comfort. Just like when you sit back, relax, and watch a 2D or 3D movie in Vision Pro, turning around to see something behind you doesn't feel very natural. So to complete the experience, AIV uses amazing spatial audio technology. We'll go a little bit deeper into spatial audio in the next section.
The second is suspension of disbelief. Beyond 180 degrees, AIV blends into the viewer's periphery. This helps maintain that profound sense of being there without reminding the viewer they're not actually there.
And third is efficiency. By keeping pixels in the viewer's normal field of view, AIV maintains a higher sense of presence. Said slightly differently, each pixel is maximized for efficient streaming without compromising fidelity.
The next pillar is dynamic bespoke projection. This is a fancy way of saying AIV has no default or standard projection. It's an interesting concept. This eliminates the need for converting clips to lat-long before you start editing, saving production time, render time, and above all, storage space. You don't have extra files. Instead, AIV clips carry the unique lens metadata from the live-action camera or CG camera that was used to create each clip.
And last, and probably the most important pillar, is world scale. This is the innate human ability to sense distance to an object, reinforcing that feeling of being there.
Since Vision Pro projects each pixel as it was captured, AIV is free from warping and stitching artifacts, meaning straight lines are straight and objects have natural roundness and accurate stereoscopic cues.
To do this, Vision Pro needs highly accurate lens calibration data per shot. That data comes from the ILPD file, or Immersive Lens Processing Data file.
The ILPD file is JSON, and it's only about 50 KB, sometimes less. And despite that small size, ILPD carries everything needed to accurately reproject every pixel in an AIV frame.
So let's see how that works.
Each lens of an AIV-enabled camera is individually profiled at the camera manufacturer's facility.
You can think of this as the optical fingerprint of that specific serial number camera.
This profile is then loaded into the camera and is added to every single clip it records.
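As a purely illustrative sketch, here's what reading a small per-lens calibration JSON might look like in Swift. The structure and field names below are hypothetical assumptions for illustration, not the actual ILPD schema.

```swift
import Foundation

// Hypothetical shape of a per-lens calibration payload; field names are
// illustrative only, not the real ILPD schema.
struct LensCalibration: Codable {
    let serialNumber: String              // the specific lens this profile belongs to
    let fieldOfViewDegrees: Double        // somewhere in the 180 to 230 degree range
    let opticalCenter: [Double]           // principal point, in pixels
    let distortionCoefficients: [Double]  // parametric lens model terms
}

struct ILPDDocument: Codable {
    let leftLens: LensCalibration
    let rightLens: LensCalibration
}

// Decode the small (~50 KB) JSON that travels with every clip.
func loadCalibration(from url: URL) throws -> ILPDDocument {
    let data = try Data(contentsOf: url)
    return try JSONDecoder().decode(ILPDDocument.self, from: data)
}
```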
For virtual cameras in DCC tools, there's even a specific ILPD that parametrically defines an AIV projection specific to CG rendering.
So using this ILPD with Apple Immersive-enabled post-production tools, we can eliminate manual lens solving and terabytes of intermediate file generation, and compositing becomes easy: the feet of your CG character hit the ground of your live background. And this makes visual effects workflows that much easier.
So these are the fundamental pillars that make AIV different. But there's some other differences worth calling out.
And first, we'll talk about the capture size difference between AIV and 2D.
Let's start with frame rate. At minimum, AIV is captured and played back at 90 frames per second. And at minimum, AIV is 7200 by 7200 per eye, versus our normal 4K size of 4096 by, say, 2160.
And last AIV is always stereoscopic.
When you add this all up, you get a 44 times increase in the number of pixels required to achieve the benchmarks we just talked about for being there. That would have sounded like a really big number a few years ago, but thanks to ProRes efficiency, Media Extensions for codec developers, and Apple silicon performance, your AIV editing experience should feel like 2D.
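As a quick sanity check on that figure, here's the arithmetic using the numbers above: 7200 by 7200 per eye, two eyes, 90 frames per second, versus a 4096 by 2160 frame at 24 frames per second.

```swift
// Pixel throughput of minimum-spec AIV versus a typical 2D 4K deliverable.
let aivPixelsPerSecond = 7200.0 * 7200.0 * 2 * 90    // two eyes at 90 fps
let flat4KPixelsPerSecond = 4096.0 * 2160.0 * 24     // one view at 24 fps

print(aivPixelsPerSecond / flat4KPixelsPerSecond)    // ≈ 44x more pixels per second
```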
Now, obviously folks don't stream in their production and archive format, so in a moment we'll talk about how to encode and package AIV for delivery.
Having covered what makes Apple Immersive Video different, let's go a little bit deeper into how AIV delivers fidelity of presence. And a hint: it is all about acuity.
Traditionally, we talk about videos in terms of a K value. The problem is, the K value only tells us how many pixels are inside an image container. It tells us nothing about what the viewer is going to see at the end of the day in Vision Pro, so it's not a very useful benchmark for AIV quality.
So how do we define the resolution of an AIV file or format or experience? For that we use the term acuity.
And this term by itself doesn't really help. It's not that useful until we have a scale and a reference point.
And for that, we use the eye chart. Whether it's the classic eye chart or the contemporary eye chart that most people in this room have seen at the doctor's office or the optometrist, the principles have not changed since 1862.
And the important feature of any chart is that the letters, the optotypes, are subdivided into a grid pattern. Used correctly, the eye chart can tell us that someone with 20/20 vision can perceive about 60 pixels per degree. It's an interesting number.
From this, you can start to quantify that feeling of being there by using 20/20 as the benchmark and PPD as the scale.
Fidelity of presence relies on perception of distance: how far into the world can you, can I, can we see? That perception of distance creates an innate sense of connection to the world around us. If you've ever seen a video of someone receiving corrective lenses or LASIK for the first time, and most of us have, you know the power of that connection. And that perception of distance is a core function of acuity, and acuity is measured in PPD, or pixels per degree.
So how does all of this relate to content? We'll keep it simple. If you take a 180-degree lens and put it on a standard 8K camera, you get a 24 PPD image, or about 20/65 vision.
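Here's roughly where that number comes from, assuming the 180-degree image circle spans the short side of an 8K sensor, about 4320 pixels.

```swift
// A 180-degree fisheye whose image circle spans about 4320 sensor pixels.
let pixelsAcrossImageCircle = 4320.0
let lensFieldOfViewDegrees = 180.0

let ppd = pixelsAcrossImageCircle / lensFieldOfViewDegrees
print(ppd)        // 24 pixels per degree
print(60.0 / ppd) // 2.5x short of the 60 PPD (20/20) acuity target
```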
Now take that same camera and put your friend 25 feet in front of it. At 24 PPD, you might be able to tell the difference between your friend's smile and your friend's grin, but at 60 feet, things become a little different.
You can't tell the difference between your friend and someone dressed like your friend; it gets a little blurry. And that definitely falls short of being there. This is why AIV strives for 60 pixels per degree, and why AIV-enabled cameras have a minimum of 40 PPD, ensuring that content is captured as close to 20/20 vision as possible, both for today and for the future.
If you've ever had the pleasure of working in traditional 3D or VR, you know it's complex. The workflows are complex. Cameras are complex, systems are complex. And so we believe that creating Apple Immersive Video shouldn't be tedious, complicated, or take years to learn. This is why AIV is simple and that simplicity becomes its superpower.
So the AIV ecosystem was designed to be simple. I mean really simple. Like all-in-one cameras that operate like a 2D camera and generate a single file, workflows that feel like editing 2D, and an encode and delivery pipeline that feels normal. When something feels simple on the surface, it's usually fairly complex under the hood, and that's why we call Apple Immersive Video a compound format.
AIV abstracts the complexity, starting with managing all the various types of metadata in AIV production, like the lens calibration we talked about a second ago, or per-frame motion data. And we'll definitely come back to that one a little bit later. Or dynamic transitions: video transitions rendered by Vision Pro in real time. And there are a few of these. We do this for encoder efficiency, and also for some really cool features of the format.
Going a little deeper, let's look at the basic structure of the two AIV file types: production and delivery. Developers will recognize this as the QuickTime file format. Creatives, you'll recognize this as the MOV file. Just remember, they are one and the same.
It's worth noting that developers like Blackmagic chose to store AIV data in their own format, like BRAW, and for visual effects workflows, AIV takes advantage of multi-part EXR. So regardless of the file type we use to store everything, a few fundamental things are always the same.
All individual video tracks and audio tracks, along with all the per-frame metadata, and even USDZ background assets, are stored and transported in a single file.
This means everything is always in sync and always travels together.
No sidecar files, no groups of files to keep track of. And this makes AIV feel like working in 2D.
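To illustrate the single-file idea, here's a minimal AVFoundation sketch that lists every track inside one QuickTime movie; on an AIV clip, the individual eye video tracks, audio tracks, and metadata tracks would all appear in this one asset.

```swift
import AVFoundation

// List every track carried inside a single QuickTime (.mov) file.
func describeTracks(of url: URL) async throws {
    let asset = AVURLAsset(url: url)
    for track in try await asset.load(.tracks) {
        print("track \(track.trackID): \(track.mediaType.rawValue)")
    }
}
```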
So, a quick example, a very simple example. You're an editor, you may be working on the timing of a shot, and you're looking at a single lens view. But before you lock your edit, you want to very quickly verify that there's no smudge on either lens. So you quickly click a button, do your checks, and go back.
Under the hood, the NLE is decoding two individual frames from two individual tracks. It's not a side-by-side file.
This is just a simple example of how AIV keeps everything at your fingertips without unnecessary files cluttering your timeline. Traditionally, you'd have to link together a single file, a side-by-side file, and maybe something else, like a proxy, to make this work effectively.
Now that we've talked about the AIV production format, using ProRes, uncompressed, or RAW, let's talk about AIV delivery, and specifically the AIVU file. Developers, once again, keep in mind this is QuickTime under the hood.
I'll come back to media prep and packaging in a moment. But it's important to keep in mind that the AIVU file is intended for sharing and playback on Vision Pro. It's not intended for editing or archive. That being said, because it's QuickTime under the hood, as a developer, you could add editing support for it to your apps.
So let's walk through how the file is built. First, the individual ProRes tracks are transcoded into a lightweight MV-HEVC video track. MV means multi-view, or left and right views, and as the name suggests, it's a variant of the well-understood and highly efficient HEVC codec family.
Next, uncompressed spatial audio tracks and associated metadata are encoded into lightweight APAC audio tracks. We'll hear more about the spatial audio format, ASAF, and the Apple Positional Audio Codec, APAC, in the next session. Both of these assets will come from your NLE, if it's enabled with audio mixing, or the DAW you use to do your final mix.
Next is the presentation track. This track stores all the metadata that signals per-clip dynamic changes: calibration changes, fades, transition effects, the real-time effects in Vision Pro.
You can think of this as a real-time EDL or a real-time FCPXML.
Developers, you'll know this as timed metadata in QuickTime.
And like audio, presentation track data comes from your Apple Immersive-enabled NLE or DCC.
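For developers curious how timed metadata is typically read on Apple platforms, here's a generic AVAssetReader sketch. It shows the general pattern for walking any timed metadata track; the actual AIV presentation-track schema isn't shown or assumed here.

```swift
import AVFoundation

// Generic pattern for walking a timed metadata track with AVAssetReader.
// The per-item payload (what the presentation track actually stores) is
// defined by the AIV format and isn't assumed here.
func readTimedMetadata(from url: URL) async throws {
    let asset = AVURLAsset(url: url)
    guard let metadataTrack = try await asset.loadTracks(withMediaType: .metadata).first else { return }

    let reader = try AVAssetReader(asset: asset)
    let output = AVAssetReaderTrackOutput(track: metadataTrack, outputSettings: nil)
    reader.add(output)

    let adaptor = AVAssetReaderOutputMetadataAdaptor(assetReaderTrackOutput: output)
    guard reader.startReading() else { return }

    while let group = adaptor.nextTimedMetadataGroup() {
        print("t = \(group.timeRange.start.seconds)s, items: \(group.items.count)")
    }
}
```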
And finally, the AIME file. This is where all of the camera calibrations, edge blends, backgrounds, and all the other AIV experience metadata is stored. This too will come from your immersive-enabled NLE or DCC.
Now that we've talked about production and delivery formats, let's talk about best practices and the technology used to prep AIV media for delivery. Here again, the name of the game is Acuity Preservation.
So first, the best practices to preserve acuity. This simply means not doing things that will negatively impact acuity, like converting AIV into other projection types like lat-long, downscaling camera source media files before the final delivery encode (let that happen in the encoder), or shooting with excessively high ISOs in your cameras.
The second best practice is image prep. This includes processes like image noise reduction that help with encoder efficiency, or even AI-assisted edge detail enhancement that can help preserve acuity when streaming at really low data rates.
Now we'll get into the technologies that are specifically built to help preserve acuity for AIV delivery. And the first is foveation.
This is a special imaging process and technique for preserving the most essential data while reducing AIV image sizes for encoding.
Obviously, like we said before, it's impractical to deliver a nearly 11K by 11K image to Vision Pro, so we need to make this fit into a target delivery size of 4320 by 4320 per eye. Again, the name of the game is efficiency, but as you can see, a simple downscale decimates the image to the 24 PPD we talked about earlier, and that's less than ideal.
So to preserve acuity, we want to preserve the most important parts of the image. And for that, we'll employ Apple Immersive Video foveation. In this step, we take advantage of the positive effects of oversampling.
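To see what a plain downscale costs, compare the acuity of the camera master with that of a uniform resize, assuming a 180-degree field of view; foveation instead spends the 4320 by 4320 budget unevenly, so the regions that matter keep more of the original sampling.

```swift
// Acuity before and after a uniform downscale of a 180-degree field of view.
let fieldOfViewDegrees = 180.0

let cameraMasterPPD = 7200.0 / fieldOfViewDegrees    // 40 PPD captured per eye
let naiveDownscalePPD = 4320.0 / fieldOfViewDegrees  // 24 PPD after a simple resize

print(cameraMasterPPD, naiveDownscalePPD)
// Foveation keeps the 4320 x 4320 delivery size but distributes those pixels
// non-uniformly, preserving more of the captured detail where it matters.
```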
And as Tim Dashwood mentioned yesterday, the goal of AIV foveation is not to achieve a specific PPD target. Instead, like the chart shows, it's to help you balance acuity and pixel preservation to meet your creative needs. And because AIV foveation is not one-size-fits-all, you can use AIV's per-clip calibration feature to apply user- or developer-tuned foveation patterns to each clip. That's a superpower.
And that brings us to the most critical AIV technology: specifically tuned MV-HEVC encoders. Using an AIV-tuned MV-HEVC encoder combined with foveation, you can take that 4320 by 4320 ProRes file and encode it at roughly the same data rate as your typical 2D 4K 24-frame HEVC file, without sacrificing that feeling of being there.
As I said earlier, I was going to come back to one of my favorite parts of AIV. So let's talk about motion data for the content creator.
As Eliot mentioned yesterday, planning for motion in AIV is both a responsibility and a tool. AIV motion metadata can be built into every single shot, so you can take advantage of it throughout the decision-making process and in the moments in your scenes that really matter.
In an Apple Immersive enabled NLE, you and your crew can visualize AIV camera motion before you review it in Vision Pro. This is a quick example of what it looks like in Resolve's timeline.
There's also a positive and creative benefit to visualizing motion data in the timeline.
Previously, creators have had to rely on emotional and psychological beats to complete their narrative and story arcs. With AIV and the visualization of motion data, you can also use motion as a story tool: by visualizing physical camera motion in your timeline, creators can craft better emotional beats and arcs before even reviewing in Vision Pro. And not to be left out, CG animated content for AIV follows the exact same principles; motion applies in both mediums.
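As a purely hypothetical sketch of how per-frame motion data could be used in a review tool, the structure, field names, and comfort threshold below are made-up assumptions, not the AIV motion metadata schema.

```swift
import Foundation

// Hypothetical per-frame camera motion sample; field names and the threshold
// below are illustrative assumptions, not the AIV motion metadata schema.
struct CameraMotionSample {
    let frameIndex: Int
    let angularVelocityDegreesPerSecond: Double
}

// Flag frames whose rotation rate exceeds a chosen comfort budget so they can
// be reviewed in the timeline before a Vision Pro review pass.
func framesToReview(_ samples: [CameraMotionSample],
                    comfortLimit: Double = 30.0) -> [Int] {
    samples.filter { $0.angularVelocityDegreesPerSecond > comfortLimit }
           .map(\.frameIndex)
}
```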
Needless to say, this is a lot of information. But to sum it up, delivering that feeling of being there requires a number of things: really high standards, simplicity, best practices, and purposeful technology. But that is only 20% of the story. Audio is the other 80% of the immersive experience. So to talk about Apple's new spatial audio technologies, I'll hand it over to Deep Sen.