# Head-Tracking Library For Immersive Audio

This library handles the processing of head-tracking information, necessary for
Immersive Audio functionality. It covers the path from raw sensor readings to
the final pose fed into a virtualizer.

## Basic Usage

The main entry point into this library is the `HeadTrackingProcessor` class.
This class is provided with the following inputs:

- Head pose, relative to some arbitrary world frame.
- Screen pose, relative to some arbitrary world frame.
- Display orientation, defined as the angle between the "physical" screen and
  the "logical" screen.
- Transform between the screen and the sound stage.
- Desired operational mode:
  - Static: only the sound stage pose is taken into account. This will result
    in an experience where the sound stage moves with the listener's head.
  - World-relative: both the head pose and stage pose are taken into account.
    This will result in an experience where the sound stage is perceived to be
    located at a fixed place in the world.
  - Screen-relative: the head pose, screen pose and stage pose are all taken
    into account. This will result in an experience where the sound stage is
    perceived to be located at a fixed place relative to the screen.

Once inputs are provided, the `calculate()` method will make the following
outputs available:

- Stage pose, relative to the head. This aggregates all the inputs mentioned
  above and is ready to be fed into a virtualizer.
- Actual operational mode. May deviate from the desired one in cases where the
  desired mode cannot be calculated (for example, as a result of dropped
  messages from one of the sensors).

A `recenter()` operation is also available, which indicates to the system that
whatever pose the screen and head are currently at should be considered as the
"center" pose, or frame of reference.
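The sketch below shows how these inputs and outputs might be wired together for
a single processing cycle. The setter/getter names and the mode enum are
illustrative assumptions derived from the inputs and outputs listed above, not
the library's exact API; consult the headers for the real signatures.

```
// Hypothetical usage sketch; method and enum names are assumptions based on
// the inputs/outputs described above, not the exact API.
void processOneCycle(HeadTrackingProcessor& processor, int64_t timestamp,
                     const Pose3f& worldToHead, const Twist3f& headTwist,
                     const Pose3f& worldToScreen, const Pose3f& screenToStage) {
    processor.setDesiredMode(HeadTrackingMode::SCREEN_RELATIVE);
    processor.setWorldToHeadPose(timestamp, worldToHead, headTwist);
    processor.setWorldToScreenPose(timestamp, worldToScreen);
    processor.setScreenToStagePose(screenToStage);

    processor.calculate(timestamp);

    Pose3f headToStage = processor.getHeadToStagePose();      // feed this to the virtualizer
    HeadTrackingMode actualMode = processor.getActualMode();  // may differ from the desired mode
    // processor.recenter() would mark the current head/screen poses as the new center.
}
```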
## Pose-Related Conventions

### Naming and Composition

When referring to poses in code, it is good practice to follow a naming
convention that clearly states the reference and target frames:

Bad:

```
Pose3f headPose;
```

Good:

```
Pose3f worldToHead; // “world” is the reference frame,
                    // “head” is the target frame.
```

By following this convention, it is easy to compose poses correctly, by making
sure adjacent frames are identical:

```
Pose3f aToD = aToB * bToC * cToD;
```

Similarly, inverting a transform simply flips the reference and target:

```
Pose3f aToB = bToA.inverse();
```

### Twist

“Twist” is to pose what velocity is to distance: it is the time-derivative of a
pose, representing the change in pose over a short period of time. Its naming
convention always states one frame, e.g.:

```
Twist3f headTwist;
```

This means that the twist represents the head-at-time-T to head-at-time-T+dt
transform. Twists are not composable in the same way as poses.

### Frames of Interest

The frames of interest in this library are defined as follows:

#### Head

This is the listener’s head. The origin is at the center point between the
ear-drums, the X-axis goes from left ear to right ear, the Y-axis goes from the
back of the head towards the face and the Z-axis goes from the bottom of the
head to the top.

#### Screen

This is the primary screen that the user will be looking at, which is relevant
for some Immersive Audio use-cases, such as watching a movie. We will follow a
different convention for this frame than what the Sensor framework uses. The
origin is at the center of the screen. The X-axis goes from left to right, the
Z-axis goes from the screen bottom to the screen top, and the Y-axis goes “into”
the screen (from the direction of the viewer). The up/down/left/right of the
screen are defined as the logical directions used for display. So when flipping
the display orientation between “landscape” and “portrait”, the frame of
reference will change with respect to the physical screen.

#### Stage

This is the frame of reference used by the virtualizer for positioning sound
objects. It is not associated with any physical frame. In a typical
multi-channel scenario, the listener is at the origin, the X-axis goes from left
to right, the Y-axis from back to front and the Z-axis from down to up. For
example, a front-right speaker is located at positive X and Y with Z = 0, and a
height speaker will have a positive Z.

#### World

It is sometimes convenient to use an intermediate frame when dealing with
head-to-screen transforms. The “world” frame is an arbitrary frame of reference
in the physical world, relative to which we can measure the head pose and screen
pose. In the (very common) cases when we can’t establish such an absolute frame,
we can take each measurement relative to a separate, arbitrary frame and
high-pass the result.

## Processing Description

The diagram above illustrates the processing that takes place from the inputs to
the outputs.

### Predictor

The Predictor block receives a pose and a twist (pose derivative) and
extrapolates to obtain a predicted head pose (with a given latency).

### Drift / Bias Compensator

The Drift / Bias Compensator blocks serve two purposes:

- Compensate for floating reference axes by applying a high-pass filter, which
  slowly pulls the pose toward identity.
- Establish the reference frame for the poses by having the ability to set the
  current pose as the reference for future poses (recentering). Effectively,
  this resets the filter state to identity.

### Orientation Compensation

The Orientation Compensation block applies the display orientation to the screen
pose to obtain the pose of the “logical screen” frame, in which the Y-axis is
pointing in the direction of the logical screen “up” rather than the physical
one.

### Screen-Relative Pose

The Screen-Relative Pose block is provided with a head pose and a screen pose
and estimates the pose of the head relative to the screen. Optionally, this
module may indicate that the user is likely not in front of the screen via the
“valid” output.
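To make the naming convention from earlier concrete, here is a minimal sketch of
the kind of composition such a block performs, assuming both input poses are
expressed against the same world frame (the function and variable names are
illustrative, not the block's actual interface):

```
// Illustrative sketch only: estimate the head pose relative to the screen by
// composing two poses measured against the same world frame.
// screenToHead = screenToWorld * worldToHead = worldToScreen.inverse() * worldToHead
Pose3f estimateScreenToHead(const Pose3f& worldToScreen, const Pose3f& worldToHead) {
    return worldToScreen.inverse() * worldToHead;
}
```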
### Mode Selector

The Mode Selector block aggregates the various sources of pose information into
a head-to-stage pose that feeds the virtualizer. It is controlled by the
“desired mode” signal, which indicates whether the preference is for static,
world-relative or screen-relative mode.

The actual mode may diverge from the desired mode. It is determined as follows:

- If the desired mode is static, the actual mode is static.
- If the desired mode is world-relative:
  - If head poses are fresh, the actual mode is world-relative.
  - Otherwise, the actual mode is static.
- If the desired mode is screen-relative:
  - If head and screen poses are fresh and the ‘valid’ signal is asserted, the
    actual mode is screen-relative.
  - Otherwise, the same rules apply as if the desired mode were world-relative.

### Rate Limiter

A Rate Limiter block is applied to the final output to smooth out abrupt
transitions caused by any of the following events:

- Mode switch.
- Display orientation switch.
- Recenter operation.
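As a rough illustration of what such smoothing can look like, the sketch below
limits how far the output pose may move toward its target in a single processing
tick. It is a generic rate-limiting scheme written against Eigen types with
hypothetical names and parameters, not a description of the library's actual
implementation.

```
// Hypothetical sketch of pose rate limiting, not the library's actual code.
// Each tick, move the output at most maxTranslationPerTick toward the target
// position and rotate it at most maxRotationPerTick (radians) toward the
// target orientation.
#include <Eigen/Geometry>

struct SimplePose {
    Eigen::Vector3f translation = Eigen::Vector3f::Zero();
    Eigen::Quaternionf rotation = Eigen::Quaternionf::Identity();
};

SimplePose rateLimitStep(const SimplePose& current, const SimplePose& target,
                         float maxTranslationPerTick, float maxRotationPerTick) {
    SimplePose next;

    // Translation: clamp the step length along the straight line to the target.
    Eigen::Vector3f delta = target.translation - current.translation;
    float distance = delta.norm();
    next.translation = (distance <= maxTranslationPerTick)
            ? target.translation
            : current.translation + delta * (maxTranslationPerTick / distance);

    // Rotation: clamp the step angle along the shortest arc to the target.
    float angle = current.rotation.angularDistance(target.rotation);
    float fraction = (angle <= maxRotationPerTick) ? 1.f : maxRotationPerTick / angle;
    next.rotation = current.rotation.slerp(fraction, target.rotation);

    return next;
}
```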