Auto-Calibration
Just as the world is projected onto our eyes, objects in the world can be projected onto the image plane of a camera. This projection is governed by the camera's intrinsic and extrinsic matrices: if the shape of an object in the 3D world is known, these matrices determine what it looks like on the image plane.
In particular, estimating the intrinsic matrix is the task of camera calibration, which can be done with or without aids such as a chessboard pattern.
Simply put, calibration helps decide which color should be painted at each pixel of the image plane. I wanted to investigate this for my 3D reconstruction project, one of my personal projects. For that, I needed calibration methods that run automatically, more technically called auto-calibration.
Requirement
Problem Definition
\begin{equation}K=
\begin{pmatrix}
f & 0 & o_x \\
0 & f & o_y \\
0 & 0 & 1
\end{pmatrix}
\end{equation}
where `f` is the focal length, and `o_x` and `o_y` are half the image width and height, i.e., the principal point is assumed to lie at the image center. Meanwhile, the top point `T` and bottom point `B` of a standing person are projected onto the image plane as `t = (u_t, v_t, 1)^T \cong KT` and `b = (u_b, v_b, 1)^T \cong KB`. Now, `T` and `B` can be written as follows:
\begin{equation}
T(x_t, y_t, z_t) = \left(\frac{(u_t-o_x)(H-h)}{f \sin\theta+(v_t-o_y) \cos\theta}, \frac{(v_t-o_y)(H-h)}{f \sin\theta+(v_t-o_y) \cos\theta}, \frac{f(H-h)}{f \sin\theta+(v_t-o_y) \cos\theta}\right)
\end{equation}
\begin{equation}
B(x_b, y_b, z_b) = \left(\frac{(u_b-o_x)H}{f \sin\theta+(v_b-o_y) \cos\theta}, \frac{(v_b-o_y)H}{f \sin\theta+(v_b-o_y) \cos\theta}, \frac{fH}{f \sin\theta+(v_b-o_y) \cos\theta}\right)
\end{equation}
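As a concrete sketch of these two back-projection formulas, the Python snippet below computes `T` and `B` in camera coordinates from their image points. The function name, the sample image points, and the camera values (`f`, `H`, `theta`) are my own illustrative assumptions, not from the original:

```python
import math

def backproject(u, v, f, H, theta, o_x, o_y, elevation):
    """Back-project image point (u, v) onto the horizontal plane
    `elevation` meters above the ground, following the formulas for
    T (elevation = h) and B (elevation = 0).
    All three components share the denominator f*sin(theta) + (v - o_y)*cos(theta)."""
    denom = f * math.sin(theta) + (v - o_y) * math.cos(theta)
    scale = (H - elevation) / denom
    return ((u - o_x) * scale, (v - o_y) * scale, f * scale)

# Hypothetical setup: 640x480 image, 45-degree tilt, camera 5 m above the ground
f, H, theta, h = 500.0, 5.0, math.radians(45.0), 1.8
o_x, o_y = 320.0, 240.0
T = backproject(330.0, 200.0, f, H, theta, o_x, o_y, h)    # top point, height h
B = backproject(335.0, 300.0, f, H, theta, o_x, o_y, 0.0)  # bottom point, on the ground
```

Note that the same function covers both points: `T` uses `H - h` in the numerators while `B` uses `H`, exactly as in the two equations above.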
- Cost Functions
1. x-axis: the top point is projected on the ground at the bottom one.
First, imagine that the top and bottom points in 3D camera coordinates are transformed into the default camera coordinates, whose tilt angle is zero. In the default camera coordinates, it is reasonable to assume that the top point projects straight down onto the ground at the bottom point, even though people do not always stand perfectly upright and the two points are distorted in the camera view. Therefore, the `x` components of the top and bottom points in the default camera coordinates should be equal, which yields the following cost function:
\begin{equation}
E_1(f, H, \theta) = \frac{(u_t-o_x)(H-h)}{f \tan\theta+v_t-o_y}-
\frac{(u_b-o_x)H}{f \tan\theta+v_b-o_y}
\end{equation}
2. y-axis: the top point is `h` meters above the bottom one.
Second, we assume a known human height `h`, here 1.8 meters, which provides another constraint: the `y` component of the top point in the default camera coordinates is `h` meters higher than that of the bottom one. This induces the following cost function:
\begin{equation}
E_2(f, H, \theta) = \frac{(v_t-o_y)(H-h)}{f \tan\theta+v_t-o_y} - \frac{(v_b-o_y)H}{f \tan\theta+v_b-o_y} - h \cos\theta
\end{equation}
3. z-axis: the top and bottom points have the same depth.
Third, it is natural that the top and bottom points have the same depth in the default camera coordinates, although the head may lean slightly ahead of the feet or vice versa. This yields the final cost function:
\begin{equation}
E_3(f, H, \theta) =
\frac{(f-(v_t-o_y) \tan\theta)(H-h)}{f \tan\theta+v_t-o_y} -
\frac{(f-(v_b-o_y) \tan\theta)H}{f \tan\theta+v_b-o_y}
\end{equation}
These three cost functions are written for a single human sample, but they can be applied to many samples by summing the cost over all of them, which is the more natural situation. For the summation, either the `L_1` or the `L_2` norm can be used. Many non-linear optimization methods and libraries, including the Ceres Solver, can minimize these cost functions.
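As an illustrative sketch (not the original implementation), the summed `L_2` cost over many samples can be minimized with SciPy's `least_squares`. The function names, sample image points, initial guesses, and parameter bounds below are my own assumptions:

```python
import numpy as np
from scipy.optimize import least_squares

def residuals(params, samples, o_x, o_y, h=1.8):
    """Stack E1, E2, E3 for every sample (u_t, v_t, u_b, v_b)."""
    f, H, theta = params
    t = np.tan(theta)
    res = []
    for u_t, v_t, u_b, v_b in samples:
        d_t = f * t + (v_t - o_y)  # shared denominator for the top point
        d_b = f * t + (v_b - o_y)  # shared denominator for the bottom point
        e1 = (u_t - o_x) * (H - h) / d_t - (u_b - o_x) * H / d_b
        e2 = ((v_t - o_y) * (H - h) / d_t - (v_b - o_y) * H / d_b
              - h * np.cos(theta))
        e3 = ((f - (v_t - o_y) * t) * (H - h) / d_t
              - (f - (v_b - o_y) * t) * H / d_b)
        res.extend([e1, e2, e3])
    return np.asarray(res)

# samples: detected (u_t, v_t, u_b, v_b) per person; these values are made up
samples = [(330.0, 200.0, 335.0, 300.0), (420.0, 180.0, 428.0, 310.0)]
o_x, o_y = 320.0, 240.0
x0 = np.array([500.0, 5.0, np.radians(30.0)])  # rough initial guess for (f, H, theta)
fit = least_squares(
    residuals, x0, args=(samples, o_x, o_y),
    bounds=([100.0, 1.0, np.radians(5.0)], [2000.0, 20.0, np.radians(80.0)]),
)
f_est, H_est, theta_est = fit.x
```

The default `least_squares` loss corresponds to the `L_2` norm; passing `loss='soft_l1'` or `loss='huber'` gives a robust alternative closer in spirit to the `L_1` summation mentioned above.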
- When only one parameter is unknown