Pose estimation

Pose estimation is the task of using an ML model to estimate the pose of a person from an image or a video by estimating the spatial locations of key body joints (keypoints).

Get started

If you are new to TensorFlow Lite and are working with Android or iOS, explore the following example applications that can help you get started.

Android example iOS example

If you are familiar with the TensorFlow Lite APIs, download the starter MoveNet pose estimation model and supporting files.

Download starter model

If you want to try pose estimation in a web browser, check out the TensorFlow JS Demo.

Model description

How it works

Pose estimation refers to computer vision techniques that detect human figures in images and videos, so that one can determine, for example, where someone’s elbow appears in an image. Note that pose estimation only estimates where key body joints are; it does not recognize who is in an image or video.

The pose estimation model takes a processed camera image as input and outputs information about keypoints. Each detected keypoint is indexed by a part ID and has a confidence score between 0.0 and 1.0. The confidence score indicates the probability that a keypoint exists at that position.

We provide reference implementations of two TensorFlow Lite pose estimation models:

MoveNet: the state-of-the-art pose estimation model, available in two flavors: Lightning and Thunder. See the Performance benchmarks section below for a comparison of the two.
PoseNet: the previous-generation pose estimation model, released in 2017.

The various body joints detected by the pose estimation model are tabulated below:

Id	Part
0	nose
1	leftEye
2	rightEye
3	leftEar
4	rightEar
5	leftShoulder
6	rightShoulder
7	leftElbow
8	rightElbow
9	leftWrist
10	rightWrist
11	leftHip
12	rightHip
13	leftKnee
14	rightKnee
15	leftAnkle
16	rightAnkle
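To make the part IDs and confidence scores concrete, here is a minimal sketch of turning raw model output into named keypoints. It assumes the output has already been reduced to 17 rows of (y, x, score) with coordinates normalized to the range 0.0 to 1.0, which is how MoveNet output is commonly laid out; PoseNet output needs different decoding. The names `Keypoint`, `decode_keypoints`, and the `min_score` threshold of 0.3 are hypothetical and chosen for illustration only.

```python
from typing import List, NamedTuple, Sequence

# Part names in ID order, copied from the table above.
KEYPOINT_NAMES = [
    "nose", "leftEye", "rightEye", "leftEar", "rightEar",
    "leftShoulder", "rightShoulder", "leftElbow", "rightElbow",
    "leftWrist", "rightWrist", "leftHip", "rightHip",
    "leftKnee", "rightKnee", "leftAnkle", "rightAnkle",
]


class Keypoint(NamedTuple):
    part_id: int
    name: str
    x: float      # pixel column in the original image
    y: float      # pixel row in the original image
    score: float  # confidence between 0.0 and 1.0


def decode_keypoints(
    raw: Sequence[Sequence[float]],
    image_width: int,
    image_height: int,
    min_score: float = 0.3,  # illustrative threshold, not an official default
) -> List[Keypoint]:
    """Convert rows of (y, x, score) into named keypoints in pixel units.

    Assumes `raw` has one row per part ID and that y and x are normalized
    to [0.0, 1.0]. Rows with a score below `min_score` are dropped.
    """
    keypoints = []
    for part_id, (y, x, score) in enumerate(raw):
        if score < min_score:
            continue
        keypoints.append(
            Keypoint(part_id, KEYPOINT_NAMES[part_id],
                     x * image_width, y * image_height, score)
        )
    return keypoints


if __name__ == "__main__":
    # Made-up input: a confident nose at the image center, everything else weak.
    rows = [[0.5, 0.5, 0.9]] + [[0.0, 0.0, 0.1]] * 16
    for kp in decode_keypoints(rows, image_width=640, image_height=480):
        print(kp)
```

Filtering on the confidence score is a design choice: a lower threshold keeps more joints at the cost of more spurious detections, so the right value depends on the application.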

An example output is shown below:

Animation showing pose estimation

Performance benchmarks

MoveNet is available in two flavors:

MoveNet.Lightning is smaller and faster but less accurate than the Thunder version. It can run in real time on modern smartphones.
MoveNet.Thunder is more accurate but also larger and slower than Lightning. It is useful for use cases that require higher accuracy.

MoveNet outperforms PoseNet on a variety of datasets, especially on images of fitness activities. Therefore, we recommend using MoveNet over PoseNet.

Performance benchmark numbers are generated with the tool described here. Accuracy (mAP) numbers are measured on a subset of the COCO dataset in which we filter and crop each image to contain only one person.

Model	Size (MB)	mAP	Latency (ms): Pixel 5 - CPU 4 threads	Latency (ms): Pixel 5 - GPU	Latency (ms): Raspberry Pi 4 - CPU 4 threads
MoveNet.Thunder (FP16 quantized)	12.6MB	72.0	155ms	45ms	594ms
MoveNet.Thunder (INT8 quantized)	7.1MB	68.9	100ms	52ms	251ms
MoveNet.Lightning (FP16 quantized)	4.8MB	63.0	60ms	25ms	186ms
MoveNet.Lightning (INT8 quantized)	2.9MB	57.4	52ms	28ms	95ms
PoseNet (MobileNetV1 backbone, FP32)	13.3MB	45.6	80ms	40ms	338ms
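One way to read the table is to fix a latency budget for a target device and pick the most accurate model that fits. The sketch below does this using the numbers from the table above; the function name `pick_model`, the device keys, and the selection rule are illustrative assumptions, not part of any TensorFlow Lite API.

```python
from typing import Optional, Tuple

# Rows copied from the benchmark table:
# (model, size in MB, mAP, Pixel 5 CPU ms, Pixel 5 GPU ms, Raspberry Pi 4 CPU ms)
BENCHMARKS = [
    ("MoveNet.Thunder (FP16 quantized)", 12.6, 72.0, 155, 45, 594),
    ("MoveNet.Thunder (INT8 quantized)", 7.1, 68.9, 100, 52, 251),
    ("MoveNet.Lightning (FP16 quantized)", 4.8, 63.0, 60, 25, 186),
    ("MoveNet.Lightning (INT8 quantized)", 2.9, 57.4, 52, 28, 95),
    ("PoseNet (MobileNetV1 backbone, FP32)", 13.3, 45.6, 80, 40, 338),
]

# Hypothetical device keys mapped to the latency column of each row.
DEVICE_COLUMN = {"pixel5_cpu": 3, "pixel5_gpu": 4, "rpi4_cpu": 5}


def pick_model(device: str, max_latency_ms: float) -> Optional[Tuple]:
    """Return the highest-mAP row whose latency on `device` fits the budget.

    Returns None when no benchmarked model is fast enough.
    """
    col = DEVICE_COLUMN[device]
    candidates = [row for row in BENCHMARKS if row[col] <= max_latency_ms]
    if not candidates:
        return None
    return max(candidates, key=lambda row: row[2])


if __name__ == "__main__":
    # A budget of about 33 ms per frame corresponds to roughly 30 frames per second.
    choice = pick_model("pixel5_gpu", max_latency_ms=33)
    print(choice[0] if choice else "no model fits the budget")
```

The benchmark latencies cover model inference only as reported in the table, so a real application should leave headroom for image preprocessing and rendering.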
