In general, we employ 4 coordinate frames throughout the repo: spherical (panoramic), room Cartesian, world-normalized Cartesian, and world-metric Cartesian.
We start from ZInD's ground truth annotations, which are provided in a left-handed global coordinate frame, and convert them to a right-handed global coordinate frame. The left-handed frame can be interpreted as looking at a home's floorplan from a vantage point underneath the home, whereas the right-handed frame corresponds to a bird's eye view (from above the home).
We'll now show an example for a specific home from ZInD (Building 0000, Floor 01). We can see that a transformation (R,t) followed by a reflection over the y-axis is equivalent to a transformation by (R^T,-t) with no reflection.
In the top-right, we can see a panorama (Pano 34), in which the door and window are on the left wall when facing the garage door. We plot a line segment from each panorama's location along the +y axis of its local coordinate frame (e.g. pointing towards the garage door, which appears in the center column of the panorama).
Consider the red inset circles on the left and right. We can see that the rotation angle must be negated (equivalent to transposing the rotation matrix) in order for the panorama orientation to stay correct (facing towards a window).
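The rotation part of this observation can be checked numerically. The following is a minimal sketch, assuming 2D rotations parameterized by a counter-clockwise angle and a reflection over the y-axis that negates the x-coordinate; `rot2d` is a hypothetical helper written for illustration, not a function from this repo.

```python
import numpy as np


def rot2d(theta_deg: float) -> np.ndarray:
    """Return a 2x2 counter-clockwise rotation matrix (hypothetical helper)."""
    theta = np.deg2rad(theta_deg)
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])


# Reflection over the y-axis: negates the x-coordinate of a point.
F = np.diag([-1.0, 1.0])

R = rot2d(30.0)

# The same rotation, expressed in the mirrored coordinate frame.
R_reflected = F @ R @ F

# Mirroring negates the rotation angle, i.e. it transposes the matrix.
print(np.allclose(R_reflected, R.T))
print(np.allclose(R_reflected, rot2d(-30.0)))
```

Both checks print `True`: conjugating a 2D rotation by the reflection flips the sign of its angle, which is why the panorama orientation only stays correct if the angle is negated.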
We use the notation wTi = (wRi, wti) to represent the 2d or 3d pose of the i'th camera in the world frame w. Here, wRi is a rotation matrix, and wti is a translation vector. This transformation transports points p between coordinate frames such that p_w = wTi * p_i.
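As an illustration of this notation, the sketch below applies a 2D pose to points given in the camera frame. It assumes the usual rigid-body convention p_w = wRi * p_i + wti; the function name `apply_pose` is hypothetical and chosen only for this example.

```python
import numpy as np


def apply_pose(wRi: np.ndarray, wti: np.ndarray, p_i: np.ndarray) -> np.ndarray:
    """Map (N, d) points from camera frame i to world frame w.

    Assumed convention: p_w = wRi @ p_i + wti, applied row-wise.
    """
    return p_i @ wRi.T + wti


# Example pose: rotate by +90 degrees, then translate by (2, 0).
wRi = np.array([[0.0, -1.0], [1.0, 0.0]])
wti = np.array([2.0, 0.0])

p_i = np.array([[1.0, 0.0], [0.0, 1.0]])
p_w = apply_pose(wRi, wti, p_i)
print(p_w)  # rows: (2, 1) and (1, 0)
```

Reading the subscripts right to left (i, then w) gives the direction of the mapping: points enter in frame i and leave in frame w.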
We use the notation wSi = (wRi, wti, s) to represent a Similarity(2) or Similarity(3) transformation, such that p_w = wSi * p_i.
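The text above does not spell out the order in which rotation, translation, and scale are applied. The sketch below assumes one common convention, in which the point is rotated and translated first and the scale s is applied last, so that p_w = s * (wRi * p_i + wti). Both this convention and the name `apply_similarity` are assumptions made for illustration; check the repo's own Sim(2) and Sim(3) classes for the ordering actually used.

```python
import numpy as np


def apply_similarity(
    wRi: np.ndarray, wti: np.ndarray, s: float, p_i: np.ndarray
) -> np.ndarray:
    """Map (N, d) points from frame i to frame w under a similarity transform.

    Assumed convention: p_w = s * (wRi @ p_i + wti), with scale applied last.
    """
    return s * (p_i @ wRi.T + wti)


# Example: identity rotation, translation (1, 0), scale 2.
wRi = np.eye(2)
wti = np.array([1.0, 0.0])
s = 2.0

p_i = np.array([[1.0, 1.0]])
p_w = apply_similarity(wRi, wti, s, p_i)
print(p_w)  # (1, 1) -> (2, 1) after translation -> (4, 2) after scaling
```

With s = 1 this reduces to the rigid pose wTi, which is one way to sanity-check an implementation.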