From the blog, 2026-05-28

How AirPods posture tracking actually works.

Your AirPods know when your head is tilted. You’ve seen it: spatial audio recenters when you turn to look at someone, and head-tracked notifications glance off in the right direction. That’s motion sensing, working quietly inside an earbud the size of a peanut. So a fair question, the one we get most often: if AirPods can tell that your head is tilted, can they tell that you’re slouching?

The answer is: only half of it. They can see the angle. They can’t see the context that makes an angle meaningful. A camera, oddly, has the opposite problem. It sees the whole scene, but it can’t pin down exactly where your head is in space without making confident guesses about pixels that aren’t actually a person. Neither sensor, used on its own, can tell the difference between a forward-head slouch and you bending down to pick up a coffee mug. That’s not a marketing problem. It’s a physics problem, and a sensor problem, and the only honest answer is to use both. This piece is about what each sensor measures, what each one misses, and why the fusion of the two is the only way to get a posture estimate that doesn’t lie to you four times an hour.

What CoreMotion gives us

CoreMotion is Apple’s framework for reading motion data from the sensors in Apple devices, and the relevant sensor here is the inertial measurement unit, the IMU, tucked inside every recent pair of AirPods. An IMU pairs two sensors: an accelerometer that measures linear acceleration on three axes, and a gyroscope that measures the rate of rotation around the same three axes. Fused and integrated over time, they give you the orientation of whatever they’re strapped to in three-dimensional space. In an AirPod, that thing is your head.

CoreMotion exposes that orientation to apps as three numbers: pitch, roll, and yaw. Pitch is the up-and-down tilt of your head — the angle of nodding yes. Roll is the side tilt — the angle of cocking your head to listen. Yaw is the left-right rotation — the angle of shaking no. All three are returned in radians, sampled at roughly 25 Hz, with end-to-end latency in the neighborhood of forty milliseconds from when your head moves to when an app on your Mac sees the number change. For posture work, the one that does almost all the heavy lifting is pitch. Forward head posture is, mechanically, an excess of forward pitch held for too long. Slouching is the same shape with a longer tail. Reading on a phone is the same shape, more extreme, less sustained.

One detail that matters: pitch is meaningless without a baseline. Everyone’s neutral head position is slightly different. Some people sit with their chin a touch lifted, some with it slightly tucked, and that resting offset isn’t pathological — it’s just where their skeleton lives. So before Sitful can call anything a forward-head event, it has to know where your zero is. The first time you run the app, there’s a fifteen-second baseline audit: you sit the way you sit when you’re working well, the app averages the pitch signal across that window, and the result becomes your personal neutral. Every later reading is reported as a delta from that number, not an absolute. Calibration is the unglamorous part of any sensor system. It’s also the part that makes the rest of the system honest.
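The calibration step described above can be sketched in a few lines. This is a minimal illustration, not Sitful’s actual code: the names `compute_baseline` and `pitch_delta` are hypothetical, the 25 Hz rate and fifteen-second window are the figures quoted in this post, and a plain mean is assumed as the averaging method.

```python
from statistics import fmean

SAMPLE_RATE_HZ = 25        # approximate IMU rate quoted in this post
CALIBRATION_SECONDS = 15   # length of the baseline audit


def compute_baseline(pitch_samples: list[float]) -> float:
    """Average the pitch readings (radians) from the calibration window."""
    needed = SAMPLE_RATE_HZ * CALIBRATION_SECONDS
    if len(pitch_samples) < needed:
        raise ValueError(f"need at least {needed} samples, got {len(pitch_samples)}")
    # Only the calibration window counts; later samples are ignored here.
    return fmean(pitch_samples[:needed])


def pitch_delta(pitch: float, baseline: float) -> float:
    """Express a raw pitch reading as an offset from the personal neutral."""
    return pitch - baseline


if __name__ == "__main__":
    # Simulated calibration: a user whose resting pitch sits near 0.10 rad.
    window = [0.10] * (SAMPLE_RATE_HZ * CALIBRATION_SECONDS)
    neutral = compute_baseline(window)
    print(round(pitch_delta(0.35, neutral), 3))
```

The sketch makes no assumption about which sign means “forward”; it only shows that every later reading is a difference from the stored neutral rather than an absolute angle.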

What CoreMotion alone misses

The IMU sees the angle of your head. It does not see anything else. Not your shoulders, not your screen, not the room, not whether you’re even sitting at a desk. From the IMU’s perspective the world is a single number that goes up when your chin drops and down when your chin lifts, and it has no way to ask why. That ignorance is the whole problem with a pure-IMU posture system.

Think about everything you do during a work day that involves tilting your head forward and is not slouching. Taking a bite of a sandwich. Reading a paperback in your lap during a long call. Bending to scratch the dog who’s shoved his head onto your keyboard. Looking down to find your phone in your bag. Typing a particularly nasty Slack message at a slight forward lean because you’re concentrating. Looking at a notebook on your desk. Picking a piece of food off your shirt. To an IMU these are indistinguishable from a slouch — the pitch goes forward, it stays forward for a few seconds, the alert fires. The user, who knows perfectly well they were reading and not slouching, learns within a week that the alerts are unreliable, and they start ignoring them. Then they uninstall.

This isn’t a hypothetical. It’s the dominant complaint in every review of every AirPods-only posture app on the App Store: alerts misfire when the user is eating, when they’re reading on a tablet, when they’re writing in a notebook, when they bend down to pick up a pen. The IMU is right about the geometry. The user’s head did tilt forward. The system is just answering the wrong question, because the right question isn’t “is your head tilted” — it’s “are you slouching at your screen.” Those two questions only have the same answer when you happen to be at your screen.

What the camera gives us

A camera, paired with the right model, can answer the context question that the IMU can’t. The model Sitful uses is Google’s MediaPipe Pose Landmarker, a piece of on-device computer vision originally built for Pixel fitness apps and now shipped as a portable library that runs anywhere WebAssembly does, including, in our case, inside a sandboxed worker on your Mac. WebAssembly, or WASM, is a low-level binary format that lets compiled code run safely in a sandbox, inside the browser or outside it, on the same machine you’re reading this on, with no round trip to the cloud.

MediaPipe takes a webcam frame and returns thirty-three labeled points on the body, among them the nose, the inner and outer corners of each eye, both ears, both shoulders, both elbows, both wrists, the hip joints, the knees, and the ankles. Each point comes with a visibility score and an (x, y, z) position in image space. Sitful runs the model at roughly fifteen frames a second, with model latency around twenty-five milliseconds per frame on Apple silicon. That’s slower than the IMU’s 25 Hz, but it’s a different kind of data: it sees the whole upper body rather than a single number.

From those landmarks Sitful derives the signals that actually correlate with posture: shoulder roll, the angle of the line between your two shoulders against the horizontal; shoulder-to-ear distance, which is the most reliable proxy for slouching in body-landmark data; screen distance, estimated from the apparent size of the shoulder span in the frame; and posture-axis alignment, which is the angle of the line from your hip midpoint up through your shoulder midpoint to your head. None of those signals requires the camera to know what your face looks like, and no frame is stored anywhere. The frame comes in, the landmarks come out, and the frame is discarded as soon as the model is done with it. The thirty-three points are processed in memory, the posture score is updated, and the bytes that held the image are overwritten. No frame or landmark is saved to disk. Nothing is sent over the network. The camera light goes off the moment you close the laptop lid.
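To make those four signals concrete, here is a rough sketch of how they could be computed from 2D landmarks. The landmark indices are the standard MediaPipe Pose ones (nose 0, ears 7 and 8, shoulders 11 and 12, hips 23 and 24). Everything else is an assumption for illustration: the function names, the choice to normalize by shoulder span, and the use of the nose as the “head” point are not taken from Sitful’s code.

```python
import math
from typing import NamedTuple


class Point(NamedTuple):
    x: float  # normalized image coordinate, 0 = left edge, 1 = right edge
    y: float  # normalized image coordinate, 0 = top edge, 1 = bottom edge


# Standard MediaPipe Pose landmark indices.
NOSE, L_EAR, R_EAR = 0, 7, 8
L_SHOULDER, R_SHOULDER = 11, 12
L_HIP, R_HIP = 23, 24


def midpoint(a: Point, b: Point) -> Point:
    return Point((a.x + b.x) / 2, (a.y + b.y) / 2)


def shoulder_span(lm) -> float:
    """Apparent shoulder width in the frame; larger means closer to the screen."""
    return math.dist(lm[L_SHOULDER], lm[R_SHOULDER])


def shoulder_line_angle(lm) -> float:
    """Angle of the shoulder line against the horizontal, in degrees."""
    l, r = lm[L_SHOULDER], lm[R_SHOULDER]
    # abs() on dx keeps the result the same for mirrored and unmirrored frames.
    return math.degrees(math.atan2(l.y - r.y, abs(l.x - r.x)))


def shoulder_to_ear(lm) -> float:
    """Mean vertical ear-to-shoulder gap, as a fraction of shoulder span."""
    span = shoulder_span(lm)
    if span == 0:
        raise ValueError("degenerate landmarks: zero shoulder span")
    left = lm[L_SHOULDER].y - lm[L_EAR].y
    right = lm[R_SHOULDER].y - lm[R_EAR].y
    return ((left + right) / 2) / span


def posture_axis_bend(lm) -> float:
    """Bend at the shoulders between hip->shoulder and shoulder->head, in degrees.

    0 means hips, shoulders and head lie on one straight line.
    """
    hip = midpoint(lm[L_HIP], lm[R_HIP])
    sh = midpoint(lm[L_SHOULDER], lm[R_SHOULDER])
    head = lm[NOSE]
    v1 = (sh.x - hip.x, sh.y - hip.y)
    v2 = (head.x - sh.x, head.y - sh.y)
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    return math.degrees(math.atan2(abs(cross), dot))
```

Dividing the ear-to-shoulder gap by the shoulder span is one way to keep that signal stable when you move closer to or farther from the camera; the span itself then doubles as the raw screen-distance proxy. The functions accept anything indexable by landmark number, so a list of thirty-three points or a dict both work.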

What the camera alone misses

The camera’s failure mode is the opposite of the IMU’s — it sees too much, and it sees confidently. MediaPipe is a person detector that has been trained to find people. It is good at this. It is, in fact, slightly too good. Give it a chair with a hoodie draped on the back and it will produce a plausible skeleton. Give it a pile of pillows in roughly the right shape and the same thing happens. We’ve watched it score the upholstered back of an Aeron chair as a 0.7-confidence person, sitting forward, slouching badly, with no actual human anywhere in the frame.

That puts a camera-only system in a bad spot. Lower the confidence threshold and you start nudging an empty chair every twenty minutes. Raise the threshold and you start ignoring real users who happen to be sitting at an angle, who have a window glare across their face, who are partially out of frame because the laptop is shoved to the left of the desk. Both failure modes punish the user. One trains them to think the app is paranoid. The other trains them to think it’s asleep. Neither is the kind of thing you can fix by retraining the model harder — the model is doing what it was asked to do. It’s the architecture, asking a single sensor to answer too many questions, that’s wrong.

Why fusion is the right answer

The two sensors fail in different directions, and that’s the whole reason fusing them works. The IMU is precise about angle and blind to context. The camera is rich on context and unreliable about whether the thing it’s seeing is even a person. Layer them and each one’s weakness is the other one’s strength. Neither sensor needs to be perfect. They just need to be wrong about different things.

The way this plays out in Sitful is a handful of rules, written in plain language, not mathematical sleight of hand. The first: if AirPods report forward pitch beyond your baseline AND the camera shows shoulders rolled in AND your hip-to-shoulder-to-head line is bent forward AND the screen-distance estimate says you’re actually at the laptop — call it slouching. All four conditions, all at once, sustained for more than a few seconds. That’s the high-confidence slouch event and it’s the only one that earns a nudge.

The second rule: if AirPods see a forward tilt but the camera shows your whole torso has translated forward in the frame — you’re leaning out of the chair to reach for the coffee mug, not collapsing into it — throw the event away. The geometry is similar. The intent isn’t. The camera can see the difference because the shoulder landmarks move with the head; the IMU can’t, because from the head’s perspective the world looks the same either way.

The third: if the camera reports a person but no AirPods motion has been seen in thirty minutes — not a single pitch tick, not a single roll — treat the “person” as suspect. People move. They breathe. They scratch their nose. An IMU strapped to a real human will produce micro-motion constantly. An IMU sitting on a desk next to a chair-with-hoodie will produce nothing. Cross-check the camera against the IMU’s liveness signal and the empty-chair false positive disappears almost entirely.

The fourth: if AirPods aren’t in your ears at all, fall back to camera-only mode and raise the confidence threshold a notch. You lose the liveness check. You also lose the angular precision. What you keep is a working system that’s honest about its reduced certainty — and that’s the philosophical point underneath all of this. Fusion isn’t a clever feature. It’s an admission that any single sensor is incomplete, and the only honest posture estimate is the one that triangulates between the things you actually have. Two flawed sensors, treated as a panel rather than oracles, do better than either one pretending to be the truth.
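The four rules above translate fairly directly into a decision function. The sketch below is one possible reading of them, not Sitful’s implementation: the `Frame` fields, the `Verdict` names, and the numeric thresholds (three seconds for “a few seconds”, 0.5 and 0.7 for the two confidence levels) are all assumptions chosen for illustration. Only the thirty-minute liveness window comes from the text.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    SLOUCH = "slouch"                  # high-confidence event, earns a nudge
    IGNORE = "ignore"                  # forward tilt explained by a reach or lean
    SUSPECT_PERSON = "suspect_person"  # camera sees a "person", IMU sees no life
    NO_EVENT = "no_event"


@dataclass
class Frame:
    airpods_connected: bool
    pitch_forward: bool             # IMU: pitch past the personal baseline
    imu_idle_minutes: float         # time since the IMU last reported any motion
    person_confidence: float        # camera: pose detection confidence
    shoulders_rolled_in: bool       # camera-derived
    axis_bent_forward: bool         # camera-derived hip-shoulder-head line
    at_screen: bool                 # camera-derived screen-distance check
    torso_translated_forward: bool  # camera: whole torso moved, not just head
    sustained_seconds: float        # how long the current condition has held


LIVENESS_TIMEOUT_MIN = 30      # from the third rule
MIN_SUSTAINED_S = 3.0          # assumed value for "a few seconds"
BASE_CONFIDENCE = 0.5          # assumed threshold in fused mode
CAMERA_ONLY_CONFIDENCE = 0.7   # assumed "a notch" higher for camera-only mode


def classify(f: Frame) -> Verdict:
    camera_posture = f.shoulders_rolled_in and f.axis_bent_forward and f.at_screen
    sustained = f.sustained_seconds >= MIN_SUSTAINED_S

    # Rule 4: no AirPods, so fall back to the camera with a stricter threshold.
    if not f.airpods_connected:
        if f.person_confidence >= CAMERA_ONLY_CONFIDENCE and camera_posture and sustained:
            return Verdict.SLOUCH
        return Verdict.NO_EVENT

    if f.person_confidence < BASE_CONFIDENCE:
        return Verdict.NO_EVENT

    # Rule 3: a person who has not moved for thirty minutes is probably a chair.
    if f.imu_idle_minutes >= LIVENESS_TIMEOUT_MIN:
        return Verdict.SUSPECT_PERSON

    # Rule 2: head and torso moved forward together, so this is a reach.
    if f.pitch_forward and f.torso_translated_forward:
        return Verdict.IGNORE

    # Rule 1: every condition agrees, and has done so for long enough.
    if f.pitch_forward and camera_posture and sustained:
        return Verdict.SLOUCH

    return Verdict.NO_EVENT
```

The order of the checks matters in this sketch: the liveness test and the reach test both run before the slouch test, so a frame has to survive every reason to be thrown away before it can trigger a nudge.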

Hardware compatibility

Not every pair of AirPods has the IMU we need. The ones that do: AirPods Pro Gen 1 and Gen 2, AirPods Max, AirPods 3rd generation and 4th generation, and the recent Beats models that support head tracking (Beats Studio Pro, Powerbeats Pro 2, and Beats Fit Pro). Those all expose head-tracking data to CoreMotion and Sitful will use them automatically the moment they connect. The ones that don’t: the original AirPods and the AirPods 2nd generation. They’re great earbuds; they just predate the IMU that makes any of this work. If Sitful detects that the connected AirPods don’t expose motion data, it quietly switches to camera-only mode and bumps the confidence threshold to compensate for the missing liveness signal. You don’t need to configure anything. The app figures it out and tells you what mode it’s in.

Why on-device matters

Both of these signals are deeply personal. A pitch trace from your AirPods is a record of every nod, every glance, every time you turned to look at someone over the course of a working day. A stream of MediaPipe landmarks is your body, abstracted into thirty-three points, eight hours a day. Either one, in the wrong hands, is more intimate than most of what your phone collects about you in a year. The only place to honestly process that data is the machine in front of you.

A cloud-based posture tracker would have to send something somewhere. Send the video and you’ve given a third party a camera on your face all day, which is unacceptable even if you trust the third party. Send only the landmarks and you’ve sent a thirty-three-point puppet of yourself that, with a little work, can be reanimated into a recognizable rendering of how you sit and move. There is no clean version of off-device posture tracking. We thought about it for a long time. There just isn’t.

So Sitful runs both pipelines on your Mac. The MediaPipe model runs in a sandboxed WebAssembly worker that has no network access at all — we revoke the permission at the runtime level, not just by policy. The CoreMotion stream is read by a native subscriber and lives in memory for the few milliseconds it takes to update the score. The frames and the motion samples are gone within a single tick of the event loop. The only thing that gets written to disk is the resulting posture score and a breakdown of which components drove it, so you can look at your day later and see what the system saw. You can read the full breakdown of every byte we touch on our privacy page.