Explore 3D body pose and person segmentation in Vision
Discover how to build person-centric features with Vision. Learn how to detect human body poses and measure individual joint locations in 3D space. We'll also show you how to take advantage of person segmentation APIs to distinguish and segment up to four individuals in an image.
To learn more about the latest features in Vision, check out “Detect animal poses in Vision” from WWDC23.
♪ ♪ Andrew: Hi, I'm Andrew Rauh, a software engineer on Vision framework. Today I'll be talking about Human Body Pose, use of Depth in Vision framework, and lifting people from images with instance masks.
Detecting and understanding people has always been a focus for Vision, and for a few years, Vision framework has offered Human Body Pose in 2D. As a refresher, Human Body Pose in 2D returns an observation with normalized pixel coordinates of landmark points defined on a skeleton corresponding to the input image. If you'd like to dive into more specifics, please review the "Detect Body and Hand Pose" session if you haven't already. Vision is expanding support for capturing people in their environment to 3D with a new request named VNDetectHumanBodyPose3DRequest. This request generates an observation that returns a 3D skeleton with 17 joints. Joints may be accessed by joint name or as a collection by providing a joint group name.
Unlike other recognized points returned by Vision, which are normalized to a lower-left origin, the positions of the 3D joints are returned in meters relative to the captured scene in the real world, with an origin at a root joint. This initial revision returns one skeleton for the most prominent person detected in the frame. If you were building a fitness app and ran the request on this image of a workout class in a gym, the observation would correspond to the woman in the front who's closest to the camera.
To better demonstrate the structure of the 3D skeleton with some context, let's break down this yoga pose. Unsurprisingly, the 3D Human Body skeleton starts with a head group, which contains points at the center and top of the head. Next there's the torso group, which contains the left and right shoulder joints, the spine, the root joint at the center of the hip, and the hip joints. Keep in mind that some joints are returned in multiple groups. For arms, there are left and right arm groups, each with a wrist, shoulder, and elbow. Left and right are always relative to the person, not the left or right side of the image. Finally, our skeleton contains a left and right leg group, each with a corresponding hip, knee, and ankle joint.
To use this new request, you follow the same workflow as other requests, so this flow should be familiar to you if you've used Vision in your code before. You'd start by creating an instance of the new VNDetectHumanBodyPose3DRequest, then initialize an image request handler with the asset you want to run your detection on. To run your request, pass your request instance into perform. And if the request is successful, a VNHumanBodyPose3DObservation will be returned without error.
All photos are 2D representations of people in a 3D world. Vision now enables you to retrieve that 3D position from images without ARKit or ARSession.
This is a powerful, lightweight option for understanding a subject in 3D space and unlocks an entirely new range of features in apps. I built a sample app to help understand and visualize this. When I open it up, I can select any image from my photo library.
My coworkers and I were inspired by the calmness of the yoga instructor earlier, so we took a break, went outside, and tried a few poses out ourselves. Now, I'm not as flexible as that teacher, but I did a pretty good job with this pose, and it should look great in 3D.
Let's run the request and bring me back into the third dimension.
The request is successful, and a 3D skeleton is aligned with where I am in the input image. If I rotate the scene, my arms extend out and legs look correct relative to the hip based on how I was standing. This pyramid shape represents where the camera was located when the image was captured. If I tap the Switch Perspective button, the view is now from the camera's position. I'll guide you through the code and concepts you need to know to create awesome experiences using 3D Human Body Pose in your app.
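As a starting point, the request workflow described earlier (create the request, create an image request handler, call perform, read the observation) might look like the sketch below. The helper name `detectBodyPose3D(in:)` and the choice to load the image from a file URL are assumptions for illustration; only the Vision types and calls come from the framework.

```swift
import Foundation
import Vision

/// Hypothetical helper (not a Vision API): runs 3D body pose detection on an
/// image file and returns the observation for the most prominent person,
/// or nil if no person was detected.
@available(iOS 17.0, macOS 14.0, *)
func detectBodyPose3D(in imageURL: URL) throws -> VNHumanBodyPose3DObservation? {
    // 1. Create the request.
    let request = VNDetectHumanBodyPose3DRequest()

    // 2. Create a request handler for the asset to analyze.
    let handler = VNImageRequestHandler(url: imageURL, options: [:])

    // 3. Run the request. This throws if Vision cannot process the image.
    try handler.perform([request])

    // 4. This revision returns at most one skeleton, so take the first result.
    return request.results?.first
}
```

A caller would typically wrap this in `do`/`catch` and treat a `nil` result as "no person found" rather than as an error.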
Building an app begins with using the points returned in the observation. There are two main APIs to retrieve them: recognizedPoint to access a specific joint's position, or recognizedPoints to access a collection of joints with a specified group name.
Besides these core methods, the observation offers some additional helpful information. First, bodyHeight gives an estimated height of your subject in meters. Depending on the available depth metadata, this height will either be a more accurate measured height or a reference height of 1.8 meters. I have a lot more to say about Depth and Vision in a minute. You can determine the technique used to compute height with the heightEstimation property. Next, the camera position is available through cameraOriginMatrix. Since in real life the camera may not be exactly facing your subject, this is useful to get an understanding of where the camera was relative to the person when the frame was captured.
The observation also offers an API to project joint coordinates back to 2D. This is helpful if you want to overlay or align returned points with the input image. And finally, to get an understanding of how a person has moved across two similar images, an API is available to get the position of a given joint relative to the camera.
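The observation APIs listed above can be exercised together in a short sketch. The function names `summarize(_:)` and `heightDescription(_:_:)` are hypothetical helpers, and the choice of the left wrist, the torso group, and the root joint is arbitrary; the properties and methods called on the observation are the ones named in this session.

```swift
import Foundation
import Vision
import simd

/// Hypothetical helper: formats the estimated height together with the
/// technique Vision reports having used to compute it.
@available(iOS 17.0, macOS 14.0, *)
func heightDescription(_ height: Float,
                       _ estimation: VNHumanBodyPose3DObservation.HeightEstimation) -> String {
    let meters = String(format: "%.2f m", height)
    switch estimation {
    case .measured:  return "\(meters) (measured from depth metadata)"
    case .reference: return "\(meters) (reference height)"
    @unknown default: return "\(meters) (unknown technique)"
    }
}

/// Hypothetical helper: prints the main pieces of information an observation offers.
@available(iOS 17.0, macOS 14.0, *)
func summarize(_ observation: VNHumanBodyPose3DObservation) throws {
    // One joint, looked up by joint name.
    let leftWrist = try observation.recognizedPoint(.leftWrist)
    print("Left wrist model position:", leftWrist.position)

    // A collection of joints, looked up by group name.
    let torso = try observation.recognizedPoints(.torso)
    print("Torso group contains \(torso.count) joints")

    // Estimated height in meters and how it was obtained.
    print(heightDescription(observation.bodyHeight, observation.heightEstimation))

    // Where the camera was relative to the person.
    print("Camera origin:", observation.cameraOriginMatrix)

    // Projection of a joint back into the 2D input image (normalized coordinates).
    let rootInImage = try observation.pointInImage(.root)
    print("Root in image:", rootInImage.x, rootInImage.y)

    // Position of a joint relative to the camera instead of the root joint.
    let wristFromCamera = try observation.cameraRelativePosition(.leftWrist)
    print("Left wrist relative to camera:", wristFromCamera)
}
```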
Before I show how to use the 3D Human Body points, I'd like to introduce the new geometry classes in Vision it inherits from. VNPoint3D is the base class that defines the simd_float4x4 matrix for storing 3D position. This representation is consistent with other Apple frameworks like ARKit and contains all available rotation and translation information. Next, there is VNRecognizedPoint3D, which inherits this position but also adds an identifier. This is used to store corresponding information like a joint name. Finally, the focus of today is VNHumanBodyRecognizedPoint3D, which adds a local position and the parent joint.
Let's go into some more specifics around how to work with properties of the point. Using the recognizedPoint API, I retrieved the position for the left wrist. A joint's model position, or a point's position property, is always relative to the skeleton's root joint at the center of the hip. If we bring our focus to column 3 of the position matrix (the last column), there are values for translation. The value for y for the left wrist is 0.9 meters above this figure's hip, which seems right for this pose.
Next, there is the returned point's localPosition property, which is the position relative to a parent joint. So in this case, the left elbow would be the parent joint of the left wrist. The last column here shows the value to be -0.1 meters for the x axis, which seems right as well. Negative or positive values are determined by the point of reference, and in this pose, the wrist is to the left side of the elbow. localPosition is useful if your app is only working with one area of the body. It also simplifies determining the angle between a child and parent joint. I'll show how to calculate this angle in code in a sec.
When working with returned 3D points, there are several concepts that may be helpful when building your app. First, you often need to determine the angle between child and parent joints.
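Before computing angles, it helps to read the translation out of the two matrices discussed above. The sketch below is one way to do that, assuming the usual column-major layout where the last column holds translation; `translation(of:)` and `leftWristOffsets(in:)` are hypothetical helper names.

```swift
import Vision
import simd

/// Hypothetical helper: returns the translation (x, y, z) stored in the
/// last column of a 4x4 transform.
func translation(of transform: simd_float4x4) -> simd_float3 {
    let column = transform.columns.3
    return simd_float3(column.x, column.y, column.z)
}

/// Hypothetical helper: reads both positions of the left wrist.
/// `fromRoot` is relative to the root joint at the center of the hip;
/// `fromParent` is relative to the wrist's parent joint.
@available(iOS 17.0, macOS 14.0, *)
func leftWristOffsets(in observation: VNHumanBodyPose3DObservation) throws
    -> (fromRoot: simd_float3, fromParent: simd_float3, parent: VNHumanBodyPose3DObservation.JointName) {
    let wrist = try observation.recognizedPoint(.leftWrist)
    return (translation(of: wrist.position),
            translation(of: wrist.localPosition),
            wrist.parentJoint)
}
```

With the pose described above, `fromRoot.y` would be about 0.9 and `fromParent.x` about -0.1, both in meters.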
In the method calculateLocalAngleToParent, the position relative to the parent joint is used to find that angle. Rotation for a node consists of rotation with respect to the x, y, and z axes, or pitch, yaw, and roll. For pitch, a rotation of 90 degrees is used to position the SceneKit node geometry from its default orientation facing straight down to one more appropriate for our skeleton. For yaw, we use the arc cosine of the z coordinate divided by the vector length to get the proper angle. And for roll, the angle measurement is obtained with the arc tangent of the y and x coordinates.
Next, your app may need to relate the returned 3D positions with the original image, like in my sample app. In my visualization, I use the pointInImage API for two transformations to my image plane, a scale and a translation. First I need to scale my image plane proportionally to the returned points. I fetch the distance between two known joints, like center shoulder and spine, for both 3D and 2D, relate them proportionally, and scale my image plane by this amount. For the translation component, I use the pointInImage API to fetch the location of the root joint in the 2D image. This method uses that location to determine a shift for the image plane for the x and y axes while also converting between the lower-left origin of the VNPoint coordinate and the rendering environment origin at the center of the image.
Finally, you may want to view the scene from the perspective of the camera or render a point at its location, and you can retrieve this from cameraOriginMatrix. Correct orientation will depend on your rendering environment, but this is how I positioned my nodes with this transform information using the pivot transform, which relates the local coordinate system of this node to the rest of the scene. I also used rotation information in cameraOriginMatrix to correctly rotate my image plane to face the camera with this code using inverse transform.
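The on-screen code is not part of this transcript, so the following is only a sketch of the two calculations just described, written as pure functions over simd values. `localAngles(toParent:)` takes the translation of a joint's localPosition; `rotationOnly(_:)` is one assumed way to discard translation from a transform such as cameraOriginMatrix. Both names, and the guard for a zero-length offset, are additions for illustration.

```swift
import Foundation
import simd

/// Hypothetical helper: pitch, yaw, and roll (in radians) for a child joint,
/// given its translation relative to the parent joint.
func localAngles(toParent offset: simd_float3) -> simd_float3 {
    // Pitch: a fixed 90-degree rotation away from the node's default orientation.
    let pitch = Float.pi / 2

    // Guard added for this sketch: a zero-length offset has no direction.
    let length = simd_length(offset)
    guard length > 0 else { return simd_float3(pitch, 0, 0) }

    // Yaw: arc cosine of the z coordinate divided by the vector length.
    let yaw = acos(offset.z / length)

    // Roll: arc tangent of the y and x coordinates.
    let roll = atan2(offset.y, offset.x)

    return simd_float3(pitch, yaw, roll)
}

/// Hypothetical helper: returns a copy of a transform that keeps its
/// rotation and resets the translation in the last column.
func rotationOnly(_ transform: simd_float4x4) -> simd_float4x4 {
    var rotation = transform
    rotation.columns.3 = simd_float4(0, 0, 0, 1)
    return rotation
}
```

In a SceneKit scene, the result of `rotationOnly(observation.cameraOriginMatrix).inverse` could be assigned to an image plane node's `simdTransform`; whether the inverse is the right orientation depends on the rendering environment.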
Since only rotation information is needed here, the translation information in the last column is ignored. Putting all these pieces together allowed for the scene displayed in my sample app.
Now, I'd like to take a few minutes to discuss some exciting additions involving Depth in Vision. Vision framework now accepts Depth as an input alongside an image or frame buffer. VNImageRequestHandler has added initializer APIs for CVPixelBuffer and CMSampleBuffer that take a new parameter for AVDepthData. Additionally, if your file contains Depth data already, you may use the existing APIs without modification. Vision will fetch Depth from the file automatically for you.
When working with Depth in Apple SDKs, AVDepthData serves as the container class for interfacing with all Depth metadata. Depth metadata captured by camera sensors contains a Depth map represented in either Disparity or Depth format. These formats are interchangeable and can be converted to each other using AVFoundation. Depth metadata also contains camera calibration data, like intrinsics, extrinsics, and lens distortion, needed to reconstruct the 3D scene. If you need to learn more specifics, please review the "Discover advancements in iOS camera capture" session from 2022.
Depth can be obtained through camera capture sessions or from previously captured files. Images captured by the Camera app, like Portrait images in Photos, always store Depth as disparity maps with camera calibration metadata. When capturing Depth in a live capture session, you have the added benefit of specifying the session to use LiDAR if the device supports it. LiDAR is powerful because it allows for accurate scale and measurement of the scene.
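A minimal sketch of passing Depth to Vision from a capture pipeline, assuming you already have a CVPixelBuffer and its matching AVDepthData (for example from an AVCaptureSession delegate callback, which is not shown). The function names are hypothetical, and the default `.up` orientation is an assumption that must be replaced with the real image orientation.

```swift
import AVFoundation
import CoreVideo
import ImageIO
import Vision

/// Hypothetical helper: runs 3D body pose detection on a frame together
/// with its depth metadata.
@available(iOS 17.0, macOS 14.0, *)
func detectBodyPose3D(in pixelBuffer: CVPixelBuffer,
                      depthData: AVDepthData,
                      orientation: CGImagePropertyOrientation = .up) throws -> VNHumanBodyPose3DObservation? {
    let request = VNDetectHumanBodyPose3DRequest()

    // The handler initializer that accepts AVDepthData alongside the frame.
    let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer,
                                        depthData: depthData,
                                        orientation: orientation,
                                        options: [:])
    try handler.perform([request])
    return request.results?.first
}

/// Hypothetical helper: converts a disparity map to 32-bit float depth
/// using AVFoundation. Depth data already in that format is returned as-is.
func asFloat32Depth(_ depthData: AVDepthData) -> AVDepthData {
    guard depthData.depthDataType != kCVPixelFormatType_DepthFloat32 else { return depthData }
    return depthData.converting(toDepthDataType: kCVPixelFormatType_DepthFloat32)
}
```

For files that already contain Depth, such as Portrait images, the URL-based handler needs no extra parameter because Vision fetches the Depth from the file.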
Vision is also introducing APIs to interact with more than one person in an image.
Vision currently offers the ability to separate people from the surrounding scene with the VNGeneratePersonSegmentationRequest. This request returns a single mask containing all the people in the frame. Vision is now letting you be a bit more selective with a new person instance mask request. This new API outputs up to four individual person masks, each with a confidence score. So now you can select and lift your friends separately from an image. If you need to select and lift subjects other than people, you can use the subject lifting API in VisionKit or the foreground instance mask request in Vision framework. Please check out the "Lift subjects from images in your app" session for more information.
Here is some sample code showing how to select a particular instance of a person you want from an image. Currently it's specifying to return all instances, but you could choose instance 1 or 2, depending on what friend you'd like to focus on in the image, or use instance 0 to get the background.
This new request segments up to four people, so if there are more than four people in your image, there are some additional conditions to handle in your code. When scenes contain many people, returned observations may miss people or combine them. Typically, this occurs with people present in the background. If your app has to deal with crowded scenes, there are strategies you can use to build the best experience possible. The face detection API in Vision can be used to count the number of faces in the image, and you can choose to skip images with more than four people or use the existing Person Segmentation request and work with one mask for everyone.
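The sample code shown on screen is not included in this transcript. The sketch below is an assumed equivalent: `liftPeople(in:instances:)` selects person instances from the new request, and `faceCount(in:)` implements the face-counting strategy for crowded scenes. Both function names and the file-URL input are illustrative assumptions.

```swift
import Foundation
import CoreVideo
import Vision

/// Hypothetical helper: returns the image with only the chosen people kept.
/// Pass nil to keep every detected person, or an index set such as
/// IndexSet(integer: 1) for one person. Instance 0 is the background.
@available(iOS 17.0, macOS 14.0, *)
func liftPeople(in imageURL: URL, instances: IndexSet? = nil) throws -> CVPixelBuffer? {
    let request = VNGeneratePersonInstanceMaskRequest()
    let handler = VNImageRequestHandler(url: imageURL, options: [:])
    try handler.perform([request])

    // No observation means no people were segmented.
    guard let observation = request.results?.first else { return nil }

    return try observation.generateMaskedImage(
        ofInstances: instances ?? observation.allInstances,
        from: handler,
        croppedToInstancesExtent: false)
}

/// Hypothetical helper: counts faces so the caller can decide whether the
/// scene has more people than the instance mask request handles.
func faceCount(in imageURL: URL) throws -> Int {
    let request = VNDetectFaceRectanglesRequest()
    let handler = VNImageRequestHandler(url: imageURL, options: [:])
    try handler.perform([request])
    return request.results?.count ?? 0
}
```

A caller could check `faceCount(in:)` first and, when it exceeds four, either skip the image or fall back to VNGeneratePersonSegmentationRequest for a single mask covering everyone.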
To recap, Vision now offers powerful new ways to understand people and their environment with support for depth, 3D Human Body Pose, and person instance masks. But that's not all Vision is releasing this year. You can go beyond people and create amazing experiences with images of furry friends in "Detect animal poses in Vision" session. Thank you, and I can't wait to see what incredible features you build. ♪ ♪