Camera intrinsics and extrinsics
Camera Pose
Very good reference: https://ksimek.github.io/2012/08/22/extrinsic/
Good demo of look-at camera: https://learnwebgl.brown37.net/07_cameras/camera_lookat/camera_lookat.html
Homogeneous Coordinates
Projection from 3D to 2D
We usually first transform the 3D points from the world coordinate system to the camera coordinate system with the camera extrinsic (the inverse of the camera pose), then project them onto the 2D image plane with the camera intrinsic matrix:
3D Point in the world coordinate system: \([x_w, y_w,z_w]^T\) (relative to a defined origin position.)
3D Point in the camera coordinate system: \([x_c, y_c, z_c]^T\) (relative to the camera center position.)
2D Point (Pixel) in the image plane: \([u,v]^T\) (range in \([0, W]\times[0, H]\), with \(u\) along the image width and \(v\) along the image height)
Camera Intrinsic (determined only by the camera itself): \(\mathbf K \in \mathbb R^{3 \times 4}\).
Camera Extrinsic (describes the transformation from world to camera, i.e., the inverse of the camera pose): \(\begin{bmatrix}\mathbf R& \mathbf T \\ 0& 1 \end{bmatrix} \in \mathbb R ^ {4 \times 4}\).
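The two-step projection above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the function name `project_points`, the numeric values, and the OpenCV-style assumption that points in front of the camera have positive \(z_c\) are all choices made for this example.

```python
import numpy as np

def project_points(points_w, w2c, K):
    """Project (N, 3) world points to (N, 2) pixel coordinates.

    w2c: (4, 4) extrinsic matrix, K: (3, 4) intrinsic matrix.
    """
    n = points_w.shape[0]
    # Homogeneous world coordinates, shape (N, 4).
    points_w_h = np.concatenate([points_w, np.ones((n, 1))], axis=1)
    # World -> camera coordinate system.
    points_c_h = points_w_h @ w2c.T
    # Camera -> image plane; the result is still scaled by the depth z_c.
    uvw = points_c_h @ K.T
    # Perspective divide by z_c.
    return uvw[:, :2] / uvw[:, 2:3]

# Hypothetical values, for illustration only.
K = np.array([[500.0, 0.0, 320.0, 0.0],
              [0.0, 500.0, 240.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])
w2c = np.eye(4)
w2c[:3, 3] = [0.0, 0.0, 2.0]  # the world origin sits 2 units in front of the camera
points_w = np.array([[0.0, 0.0, 0.0],
                     [0.1, 0.0, 0.0]])
pixels = project_points(points_w, w2c, K)
```

The world origin lands on the principal point, and the second point is shifted along \(u\) only.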
Intrinsic
A \(3 \times 4\) matrix used to project 3D points to 2D coordinates:
\(f_x, f_y\) are the focal lengths in pixels, usually \(f_x =f_y\).
\(\gamma\) is the skew coefficient, usually 0.
\((u_0, v_0)\) is the principal point (optical center) in pixels, ideally the center of the image, i.e., \((W/2, H/2)\).
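With the parameters listed above, the intrinsic matrix and the projection it performs are commonly written as follows. The zero fourth column is an assumption here, chosen so that the \(3 \times 4\) shape acts directly on homogeneous camera coordinates:

\[
\mathbf K = \begin{bmatrix} f_x & \gamma & u_0 & 0 \\ 0 & f_y & v_0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix},
\qquad
z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \mathbf K \begin{bmatrix} x_c \\ y_c \\ z_c \\ 1 \end{bmatrix}
\]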
Conversely, we can use the intrinsic to back-project pixel coordinates to 3D points in the camera's coordinate system.
Since a pixel corresponds to a whole ray of 3D points at different depths, we need to know the depth value \(z_c\) in advance.
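A minimal sketch of this back-projection, assuming zero skew (\(\gamma = 0\)). The function name `unproject_pixel` and the numbers are made up for illustration:

```python
import numpy as np

def unproject_pixel(u, v, z_c, fx, fy, u0, v0):
    """Back-project pixel (u, v) at known depth z_c into camera coordinates.

    Assumes zero skew, so u = fx * x_c / z_c + u0 and v = fy * y_c / z_c + v0.
    """
    x_c = (u - u0) * z_c / fx
    y_c = (v - v0) * z_c / fy
    return np.array([x_c, y_c, z_c])

# Hypothetical pixel, depth, and intrinsics.
point_c = unproject_pixel(345.0, 240.0, 2.0, fx=500.0, fy=500.0, u0=320.0, v0=240.0)
```

Changing `z_c` moves the result along the same viewing ray, which is why the depth has to be supplied.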
Extrinsic (w2c)
A \(4 \times 4\) matrix, a regular 3D transformation from world coordinate system to camera coordinate system.
It can be decomposed as:
- first rotate with \(\mathbf R\), then translate with \(\mathbf T\), or
- first translate with \(-\mathbf C\), then rotate with \(\mathbf R\).
\(\mathbf R\) is a rotation matrix. (orthogonal, \(\mathbf R^T = \mathbf R^{-1}\))
\(\mathbf T\) is the position of the world origin in the camera coordinate system,
NOT the camera position in the world coordinate system!
The camera center in the world coordinate system, \(\mathbf C=[x_0, y_0, z_0]^T\), maps to the origin of the camera coordinate system, so \(\mathbf R\mathbf C + \mathbf T = \mathbf 0\),
thus \(\mathbf C = -\mathbf R^{-1}\mathbf T = -\mathbf R^T\mathbf T\).
This also gives a way to calculate \(\mathbf{T} = -\mathbf{RC}\).
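The relation between \(\mathbf T\) and \(\mathbf C\), and the equivalence of the two decompositions, can be checked numerically. The rotation and translation below are arbitrary values picked for the example:

```python
import numpy as np

# Hypothetical extrinsic: a 90-degree rotation about the z axis plus a translation.
R = np.array([[0.0, -1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])
T = np.array([1.0, 2.0, 3.0])

# Camera center in world coordinates; R^{-1} = R^T for a rotation matrix.
C = -R.T @ T

# Going back from the camera center to the translation.
T_recovered = -R @ C

# Both decompositions map a world point to the same camera-frame point.
p_w = np.array([0.5, -1.0, 2.0])
p_c_rotate_first = R @ p_w + T          # rotate with R, then translate with T
p_c_translate_first = R @ (p_w - C)     # translate with -C, then rotate with R
```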
Pose (c2w)
Also a \(4 \times 4\) matrix, but it describes the 3D transformation from camera to world.
The camera pose (c2w) is the inverse of the extrinsic (w2c).
Note that the translation vector is now \(\mathbf{C}\), the camera's position in the world coordinate system.
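Written out with the symbols from the extrinsic section, and using \(\mathbf R^{-1} = \mathbf R^T\) and \(\mathbf C = -\mathbf R^T\mathbf T\), the inversion has a closed form:

\[
\begin{bmatrix}\mathbf R & \mathbf T \\ 0 & 1\end{bmatrix}^{-1}
= \begin{bmatrix}\mathbf R^T & -\mathbf R^T\mathbf T \\ 0 & 1\end{bmatrix}
= \begin{bmatrix}\mathbf R^T & \mathbf C \\ 0 & 1\end{bmatrix}
\]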
Construct by LookAt
The camera pose matrix in OpenGL is defined as:
Assuming you know the camera position \(\mathbf{C}\) and the target position \(\mathbf{O}\), the forward direction is \(\overrightarrow{OC}\), i.e., it points from the target to the camera.
To construct the camera pose matrix, you can calculate the normalized right, up, and forward vector, then simply concatenate them:
Or the camera extrinsic/view matrix similarly:
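A minimal look-at sketch under the OpenGL convention described above (forward points from target to camera, so the camera looks along its own \(-z\)). The function names, the default world-up axis of \(+y\), and the cross-product order are assumptions for this example; the construction breaks down when the viewing direction is parallel to the world-up axis.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def look_at_pose(camera_pos, target_pos, world_up=(0.0, 1.0, 0.0)):
    """Build a 4x4 camera-to-world pose with columns [right, up, forward, position]."""
    camera_pos = np.asarray(camera_pos, dtype=np.float64)
    target_pos = np.asarray(target_pos, dtype=np.float64)
    # Forward points from the target to the camera (the vector OC).
    forward = normalize(camera_pos - target_pos)
    # Right-handed basis: right = up x forward, then up = forward x right.
    right = normalize(np.cross(np.asarray(world_up, dtype=np.float64), forward))
    up = np.cross(forward, right)
    pose = np.eye(4)
    pose[:3, 0] = right
    pose[:3, 1] = up
    pose[:3, 2] = forward
    pose[:3, 3] = camera_pos
    return pose

pose = look_at_pose(camera_pos=[0.0, 0.0, 3.0], target_pos=[0.0, 0.0, 0.0])
view = np.linalg.inv(pose)  # extrinsic / view matrix (w2c)
```

In this example the target ends up on the camera's negative z axis in the view matrix, as expected for OpenGL.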
There are different world/camera coordinate conventions, which are really confusing:
Four common world coordinate conventions:
    OpenGL              OpenCV/Colmap       Blender             Unity
    Right-handed                                                Left-handed
        +y                  +z              +z  +y              +y  +z
        |                  /                |  /                |  /
        |                 /                 | /                 | /
        |______+x        /______+x          |/_____+x           |/_____+x
       /                 |
      /                  |
     /                   |
    +z                   +y
Two common camera coordinate conventions:
    OpenGL/Blender          OpenCV/Colmap
        up  target              forward & target
        |  /                   /
        | /                   /
        |/_____right         /______right
       /                     |
      /                      |
     /                       |
    forward                  up
A common color code: x/right = red, y/up = green, z/forward = blue (XYZ=RGB=RUF)
The camera's xyz axes follow the corresponding world coordinate system. However, the three directions (right, up, forward) can be defined differently:
- forward can be (camera --> target) or (target --> camera).
- up can align with the world-up-axis (y) or world-down-axis (-y).
- right can also be left, depending on whether it is computed as (up cross forward) or (forward cross up).
Many datasets mix different conventions. Inspect a few poses to work out which convention a dataset uses, then combine the following operations to convert:
- axis switching:
pose[[1, 2]] = pose[[2, 1]]
- axis inverting:
pose[1] *= -1
- forward inverting:
pose[:3, 2] *= -1
- up inverting:
pose[:3, 1] *= -1
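As one concrete combination of the operations above, converting a c2w pose from OpenGL camera axes to OpenCV camera axes amounts to an up inversion plus a forward inversion. This sketch changes only the camera axes and leaves the world coordinate system untouched; the function name is made up for the example.

```python
import numpy as np

def opengl_to_opencv_pose(pose_gl):
    """Convert a 4x4 c2w pose from OpenGL camera axes to OpenCV camera axes."""
    pose_cv = np.array(pose_gl, dtype=np.float64, copy=True)
    pose_cv[:3, 1] *= -1  # up inverting: camera +y flips from up to down
    pose_cv[:3, 2] *= -1  # forward inverting: camera +z now points at the target
    return pose_cv

# Hypothetical OpenGL pose: identity rotation, camera at (0, 0, 3).
pose_gl = np.eye(4)
pose_gl[:3, 3] = [0.0, 0.0, 3.0]
pose_cv = opengl_to_opencv_pose(pose_gl)
```

Flipping two axes at once keeps the rotation right-handed, and applying the same function again converts back.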