# Head-Tracking Library For Immersive Audio

This library handles the processing of head-tracking information, which is
necessary for Immersive Audio functionality. It covers the path from raw sensor
readings to the final pose fed into a virtualizer.

## Basic Usage

The main entry point into this library is the `HeadTrackingProcessor` class.
This class is provided with the following inputs:

- Head pose, relative to some arbitrary world frame.
- Screen pose, relative to some arbitrary world frame.
- Display orientation, defined as the angle between the "physical" screen and
  the "logical" screen.
- Transform between the screen and the sound stage.
- Desired operational mode:
  - Static: only the sound stage pose is taken into account. This will result
    in an experience where the sound stage moves with the listener's head.
  - World-relative: both the head pose and stage pose are taken into account.
    This will result in an experience where the sound stage is perceived to be
    located at a fixed place in the world.
  - Screen-relative: the head pose, screen pose and stage pose are all taken
    into account. This will result in an experience where the sound stage is
    perceived to be located at a fixed place relative to the screen.

Once inputs are provided, the `calculate()` method will make the following
outputs available:

- Stage pose, relative to the head. This aggregates all the inputs mentioned
  above and is ready to be fed into a virtualizer.
- Actual operational mode. May deviate from the desired one in cases where the
  desired mode cannot be calculated (for example, as a result of dropped
  messages from one of the sensors).

A `recenter()` operation is also available, which indicates to the system that
whatever pose the screen and head are currently at should be considered as the
"center" pose, or frame of reference.

## Pose-Related Conventions

### Naming and Composition

When referring to poses in code, it is always good practice to follow a naming
convention that clearly identifies the reference and target frames:

Bad:

```
Pose3f headPose;
```

Good:

```
Pose3f worldToHead;  // "world" is the reference frame,
                     // "head" is the target frame.
```

By following this convention, it is easy to verify correct composition of poses
by making sure adjacent frames are identical:

```
Pose3f aToD = aToB * bToC * cToD;
```

And similarly, inverting the transform simply flips the reference and target:

```
Pose3f aToB = bToA.inverse();
```

### Twist

“Twist” is to pose what velocity is to distance: it is the time-derivative of a
pose, representing the change in pose over a short period of time. Its naming
convention always states one frame, e.g.:

```
Twist3f headTwist;
```

This means that this twist represents the head-at-time-T to head-at-time-T+dt
transform. Twists are not composable in the same way as poses.

### Frames of Interest

The frames of interest in this library are defined as follows:

#### Head

This is the listener’s head. The origin is at the center point between the
ear-drums, the X-axis goes from left ear to right ear, Y-axis goes from the back
of the head towards the face and Z-axis goes from the bottom of the head to the
top.

#### Screen

This is the primary screen that the user will be looking at, which is relevant
for some Immersive Audio use-cases, such as watching a movie. We will follow a
different convention for this frame than what the Sensor framework uses. The
origin is at the center of the screen. X-axis goes from left to right, Z-axis
goes from the screen bottom to the screen top, Y-axis goes “into” the screen
(from the direction of the viewer). The up/down/left/right of the screen are
defined as the logical directions used for display. So when flipping the display
orientation between “landscape” and “portrait”, the frame of reference will
change with respect to the physical screen.

#### Stage

This is the frame of reference used by the virtualizer for positioning sound
objects. It is not associated with any physical frame. In a typical
multi-channel scenario, the listener is at the origin, the X-axis goes from left
to right, Y-axis from back to front and Z-axis from down to up. For example, a
front-right speaker is located at positive X and Y with Z = 0, and a height
speaker will have a positive Z.

#### World

It is sometimes convenient to use an intermediate frame when dealing with
head-to-screen transforms. The “world” frame is an arbitrary frame of reference
in the physical world, relative to which we can measure the head pose and screen
pose. In (very common) cases when we can’t establish such an absolute frame, we
can take each measurement relative to a separate, arbitrary frame and high-pass
the result.

## Processing Description

The following sections describe, block by block, the processing that takes
place from the inputs to the outputs.

### Predictor

The Predictor block gets a pose + twist (pose derivative) and extrapolates to
obtain a predicted head pose (with a given latency).
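
This amounts to first-order extrapolation along the twist. A minimal sketch,
assuming a hypothetical `integrate()` helper that turns a twist and a time
interval into a small pose delta:

```
// Predict where the head will be `latency` seconds from now by composing
// the current pose with the integrated twist (a head-at-T to
// head-at-T+latency transform).
Pose3f predictedWorldToHead =
        worldToHead * integrate(headTwist, latency);  // hypothetical helper
```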

### Drift / Bias Compensator

The Drift / Bias Compensator blocks serve two purposes:

- Compensate for floating reference axes by applying a high-pass filter, which
  slowly pulls the pose toward identity.
- Establish the reference frame for the poses by having the ability to set the
  current pose as the reference for future poses (recentering). Effectively,
  this is resetting the filter state to identity.
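
A minimal sketch of the high-pass idea, assuming `Pose3f` exposes Eigen-style
`translation()` and `rotation()` accessors (an assumption about the type's
interface):

```
#include <Eigen/Geometry>

// Each update, pull the filtered pose slightly toward identity. `alpha` is a
// decay factor just below 1, e.g. derived as exp(-dt / timeConstant).
Pose3f highPass(const Pose3f& pose, float alpha) {
    return Pose3f(pose.translation() * alpha,
                  Eigen::Quaternionf::Identity().slerp(alpha, pose.rotation()));
}
```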

### Orientation Compensation

The Orientation Compensation block applies the display orientation to the screen
pose to obtain the pose of the “logical screen” frame, in which the Z-axis
points in the direction of the logical screen’s “up” rather than the physical
one.
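
Under the screen conventions above, the display orientation is a rotation about
the screen’s Y-axis (the axis pointing “into” the screen), so the compensation
is a simple composition. A sketch, assuming `Pose3f` can be constructed from a
translation and a rotation, with the sign of the angle depending on the chosen
convention:

```
#include <Eigen/Geometry>

// physicalToLogical rotates the screen frame about its own Y-axis by the
// display orientation angle.
Pose3f physicalToLogical(
        Eigen::Vector3f::Zero(),
        Eigen::Quaternionf(Eigen::AngleAxisf(displayOrientationRadians,
                                             Eigen::Vector3f::UnitY())));
Pose3f worldToLogicalScreen = worldToScreen * physicalToLogical;
```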

### Screen-Relative Pose

The Screen-Relative Pose block is provided with a head pose and a screen pose
and estimates the pose of the head relative to the screen. Optionally, this
module may indicate that the user is likely not in front of the screen via the
“valid” output.
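
In terms of the composition convention above, the core of this block (omitting
the freshness and validity handling) is just:

```
// Adjacent "world" frames cancel, leaving a screen-to-head pose.
Pose3f screenToHead = worldToScreen.inverse() * worldToHead;
```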

### Mode Selector

The Mode Selector block aggregates the various sources of pose information into
a head-to-stage pose that is going to feed the virtualizer. It is controlled by
the “desired mode” signal that indicates whether the preference is for static,
world-relative or screen-relative operation.

The actual mode may diverge from the desired mode. It is determined as follows
(see the sketch after this list):
168
169- If the desired mode is static, the actual mode is static.
170- If the desired mode is world-relative:
171 - If head poses are fresh, the actual mode is world-relative.
172 - Otherwise the actual mode is static.
173- If the desired mode is screen-relative:
174 - If head and screen poses are fresh and the ‘valid’ signal is asserted, the
175 actual mode is screen-relative.
176 - Otherwise, apply the same rules as the desired mode being world-relative.
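
A minimal sketch of these rules, assuming a `HeadTrackingMode` enum named after
the three modes:

```
HeadTrackingMode selectActualMode(HeadTrackingMode desired, bool headPoseFresh,
                                  bool screenPoseFresh, bool screenRelativeValid) {
    // Screen-relative requires fresh head and screen poses plus a valid
    // screen-relative estimate.
    if (desired == HeadTrackingMode::SCREEN_RELATIVE && headPoseFresh &&
        screenPoseFresh && screenRelativeValid) {
        return HeadTrackingMode::SCREEN_RELATIVE;
    }
    // World-relative (or a demoted screen-relative) requires a fresh head pose.
    if (desired != HeadTrackingMode::STATIC && headPoseFresh) {
        return HeadTrackingMode::WORLD_RELATIVE;
    }
    // Fall back to static.
    return HeadTrackingMode::STATIC;
}
```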

### Rate Limiter

A Rate Limiter block is applied to the final output to smooth out any abrupt
transitions caused by any of the following events:

- Mode switch.
- Display orientation switch.
- Recenter operation.
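
One way to realize this is to clamp the per-frame step between consecutive
output poses. A sketch, assuming Eigen-style `Pose3f` components and
caller-supplied speed limits:

```
#include <Eigen/Geometry>

// Move from `previous` toward `target`, but never faster than the given
// translational (m/s) and rotational (rad/s) limits allow within `dt`.
Pose3f rateLimit(const Pose3f& previous, const Pose3f& target, float dt,
                 float maxTranslationalSpeed, float maxRotationalSpeed) {
    Pose3f delta = previous.inverse() * target;

    // Clamp the translation step.
    Eigen::Vector3f t = delta.translation();
    const float maxT = maxTranslationalSpeed * dt;
    if (t.norm() > maxT) {
        t *= maxT / t.norm();
    }

    // Clamp the rotation step.
    Eigen::AngleAxisf r(delta.rotation());
    const float maxR = maxRotationalSpeed * dt;
    if (r.angle() > maxR) {
        r.angle() = maxR;
    }

    return previous * Pose3f(t, Eigen::Quaternionf(r));
}
```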
186