[Auto]Stitching Photo Mosaics
Elana Ho

Overview

In this project, images are stitched together to create mosaics. The first part involves manually selecting correspondence points in two images, recovering the homography matrix $H$ from them, and using it to warp one image into alignment with the reference; the two aligned images are then blended into a seamless mosaic. The second part follows the same procedure, but rather than labeling points by hand, keypoints are defined automatically through corner detection and feature matching.


Shooting the Photos

Images for rectification

I captured photos of objects with rectangular faces: a power outlet in Cory Hall and a print of my artwork. Because the pictures were taken at an angle, each face, which is rectangular when viewed head-on, appears as a trapezoid.


outlet.jpg

butterflies.jpg

Images for mosaics

For each scene, I shot multiple photos from different angles by fixing the center of projection (COP) and rotating the camera.


leconte3.jpg

leconte2.jpg

soda1.jpg

soda0.jpg

moffitt0.jpg

moffitt1.jpg

Recovering Homographies

Using the images taken, correspondence points were defined with this labeling tool. In two images of the same scene shot from different angles, these points mark key common regions. The relation between these points is represented by the homography matrix $H$; more generally, a homography is a mapping between any two projective planes with the same COP. Given the correspondence points, the homography is recovered by setting up a system of linear equations and solving for $H$ via np.linalg.lstsq.


Manually-selected keypoints
on soda1.jpg

Manually-selected keypoints
on soda0.jpg

Suppose we have points $\vec{p}_i$ in im1 and $\vec{q}_i$ in im2.

$$\vec{p}_i = \begin{bmatrix} p_{xi} \\ p_{yi} \end{bmatrix} \quad \vec{q}_i = \begin{bmatrix} q_{xi} \\ q_{yi} \end{bmatrix}$$

Because $\vec{p}_i$ and $\vec{q}_i$ are both $2 \times 1$, homogeneous coordinates are added so that the projective transformation can be written as a single matrix multiplication, with $\vec{q'}_i$ defined only up to a scale factor $w$.

$$\vec{p'}_i = \begin{bmatrix} p_{xi} \\ p_{yi} \\ 1 \end{bmatrix} \quad \vec{q'}_i = \begin{bmatrix} w \cdot q_{xi} \\ w \cdot q_{yi} \\ w \end{bmatrix}$$ $$ H := \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & 1 \end{bmatrix} \quad \text{s.t.} \quad H\vec{p'}_i = \vec{q'}_i$$
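Expanding $H\vec{p'}_i = \vec{q'}_i$ and eliminating the scale $w$ gives two linear equations per correspondence in the eight unknowns $a$ through $h$, so four or more point pairs suffice. A minimal sketch of this solve (computeH is an illustrative name, not necessarily the project's exact code):

```python
import numpy as np

def computeH(pts_p, pts_q):
    """Least-squares homography from (N, 2) arrays of (x, y) points,
    N >= 4, mapping pts_p to pts_q."""
    A, b = [], []
    for (px, py), (qx, qy) in zip(pts_p, pts_q):
        # Two equations per correspondence, from eliminating w.
        A.append([px, py, 1, 0, 0, 0, -qx * px, -qx * py])
        A.append([0, 0, 0, px, py, 1, -qy * px, -qy * py])
        b += [qx, qy]
    h = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)[0]
    return np.append(h, 1).reshape(3, 3)  # unknowns a..h, plus the fixed 1
```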

Image Rectification

For each image to be rectified, the target rectangular coordinates were hardcoded. Then, the homography matrix $H$ between the original trapezoidal coordinates and the target coordinates was computed. Given the image and $H$, the function warpImage(im, H) maps the image to the target coordinates through the following process (a sketch in code follows the list):

  1. Obtain the coordinates of the four corners of the original image.
  2. Transform the four corners by $H$, storing the result in warp_corners.
  3. Given the minimum and maximum coordinates in warp_corners, determine the dimensions of the output image.
  4. Compute $H^{-1}$.
  5. Use inverse-warping to transform each point in the output image, mapping it to the points in the original image.
  6. Use scipy.ndimage.map_coordinates to interpolate the color values of the original image to these inverse-warped coordinates.
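A minimal sketch of this procedure, assuming a color image and that the warped bounding box does not extend into negative coordinates (a full implementation would also track a translation offset):

```python
import numpy as np
import scipy.ndimage

def warpImage(im, H):
    h, w = im.shape[:2]
    # 1. Corners of the original image, as homogeneous (x, y, 1) columns.
    corners = np.array([[0, w - 1, w - 1, 0],
                        [0, 0, h - 1, h - 1],
                        [1, 1, 1, 1]], dtype=float)
    # 2. Transform the corners by H and de-homogenize.
    warp_corners = H @ corners
    warp_corners /= warp_corners[2]
    # 3. Output dimensions from the warped corners' bounding box.
    out_w = int(np.ceil(warp_corners[0].max())) + 1
    out_h = int(np.ceil(warp_corners[1].max())) + 1
    # 4.-5. Inverse-warp every output pixel back into the source image.
    H_inv = np.linalg.inv(H)
    ys, xs = np.mgrid[0:out_h, 0:out_w]
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])
    src = H_inv @ pts
    src /= src[2]
    # 6. Interpolate each channel at the (row, col) source coordinates.
    out = np.zeros((out_h, out_w, im.shape[2]))
    for c in range(im.shape[2]):
        out[..., c] = scipy.ndimage.map_coordinates(
            im[..., c], [src[1], src[0]], order=1, cval=0
        ).reshape(out_h, out_w)
    return out
```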

original outlet.jpg

Initial vertices

Vertices of the target rectangle

Rectified outlet.jpg cropped to the dimensions of the original image

original butterflies.jpg

Initial vertices

Vertices of the target rectangle

Rectified butterflies.jpg cropped to the dimensions of the original image

Blending the Images into a Mosaic

Given a pair of images and correspondence points, one image is selected as the reference image ref and the other as the image to be warped, to_warp. With these defined, the images are merged into a mosaic through a process similar to rectification.

  1. Compute $H$ such that it maps the points in to_warp to the points in ref.
  2. result = warpImage(to_warp, H)
  3. Because the dimensions of result will generally differ from those of ref, pad both images to the same dimensions (the maximum of the two).
  4. Blend the images together.
The blending was done with a Laplacian stack and mask, as demonstrated in the previous project, Fun with Filters and Frequencies; a simplified stand-in for this step is sketched below.
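As a hypothetical, simplified stand-in for the Laplacian-stack blend, the padding plus a linear feathered blend across a vertical seam might look like this (pad_to and simple_blend are illustrative names; the actual project used a Gaussian-blurred mask):

```python
import numpy as np

def pad_to(im, h_out, w_out):
    """Zero-pad an (H, W, C) image to (h_out, w_out, C) at the origin."""
    out = np.zeros((h_out, w_out, im.shape[2]))
    out[:im.shape[0], :im.shape[1]] = im
    return out

def simple_blend(ref, result, r=20):
    """Feather across a vertical seam; a stand-in for the
    Laplacian-stack blend actually used."""
    h = max(ref.shape[0], result.shape[0])
    w = max(ref.shape[1], result.shape[1])
    a, b = pad_to(ref, h, w), pad_to(result, h, w)
    # Vertical mask: 1 on the left, 0 on the right, with a linear
    # ramp of width 2r around the center seam.
    ramp = np.clip((w // 2 + r - np.arange(w)) / (2 * r), 0, 1)
    mask = np.broadcast_to(ramp, (h, w))[..., None]
    return mask * a + (1 - mask) * b
```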

Vertical mask used to blend mosaics
$r = 20, \sigma=10$.

leconte3.jpg

leconte2.jpg

Blended mosaic

soda1.jpg

soda0.jpg

Blended mosaic

moffitt0.jpg

moffitt1.jpg

Blended mosaic

Harris Corner Detection

To automate the selection of correspondence points between two images, interest points are first identified in each image. For feature-based image stitching, salient features such as corners make useful interest points. A corner is the junction between two edges and can be recognized by looking through a small window: shifting this window in any direction should produce a significant change in intensity. This is the idea behind the Harris corner detection algorithm, which is implemented in the function get_harris_corners. This function identifies the coordinates of corners in the scene, discarding points along the borders of the image.
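A minimal sketch of the same idea using skimage (illustrative only, and not necessarily how the provided get_harris_corners is implemented):

```python
import numpy as np
from skimage.feature import corner_harris, peak_local_max

def harris_corners(gray, border=20):
    """Corner response + local maxima, discarding points near the
    image border. `gray` is a 2D grayscale image."""
    h_map = corner_harris(gray)                      # corner strength map
    coords = peak_local_max(h_map, min_distance=1)   # (row, col) peaks
    r, c = coords[:, 0], coords[:, 1]
    keep = ((r > border) & (r < gray.shape[0] - border) &
            (c > border) & (c < gray.shape[1] - border))
    return h_map, coords[keep]
```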


Corners detected on soda1.jpg

Corners detected on soda0.jpg

Adaptive Non-Maximal Suppression (ANMS)

Harris corner detection returns far more corners than are needed as candidate correspondence points. To select the strongest, well-distributed corners, Adaptive Non-Maximal Suppression (ANMS) is applied. The function dist2 computes the pairwise distances between points, and each point $x_i$ is compared to every other point $x_j \in I$ by its strength $f(x_i)$ (found via get_harris_corners). To ensure that a point is only suppressed by a sufficiently stronger neighbor, the neighbors' strengths are scaled by $c_{\text{robust}} = 0.9$. This yields, for each point, the radius $r_i$ of the region in which it is the local maximum; the point with the greatest radius is the global maximum.

$$r_i = \min_j \lVert x_i - x_j \rVert \quad \text{s.t.} \quad f(x_i) < c_{\text{robust}} f(x_j), \quad x_j \in I$$

After the radii for all points have been computed, the top threshold = 500 points with the greatest radii are kept.
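A sketch of ANMS under these definitions (anms is an illustrative name; the pairwise-distance step is what the provided dist2 computes):

```python
import numpy as np

def anms(coords, strengths, c_robust=0.9, n_keep=500):
    """coords is (N, 2); strengths is (N,) corner strengths f(x_i)."""
    # Pairwise squared distances between all points.
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    # dominated[i, j]: point j is sufficiently stronger than point i.
    dominated = strengths[:, None] < c_robust * strengths[None, :]
    # Suppression radius r_i: distance to the nearest dominating point
    # (infinite for the global maximum, so it is always kept).
    radii = np.where(dominated, d2, np.inf).min(axis=1)
    order = np.argsort(-radii)      # largest radius first
    return coords[order[:n_keep]]
```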


soda1.jpg keypoints after ANMS

soda0.jpg keypoints after ANMS

Feature Descriptor Extraction

Once the strongest keypoints have been selected, the points marking regions common to both images must be identified among them. To match common areas, features are first extracted. For each point, a $40 \times 40$ window of the local image around it is sampled. This window is downsampled to $8 \times 8$, normalized, and flattened into a $1 \times 64$ descriptor vector. This is done for all 3 color channels, and the three vectors are stacked into the final descriptor.
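A sketch of the extraction for a single keypoint, assuming the point lies at least 20 pixels from the border (extract_descriptor is an illustrative name):

```python
import numpy as np
from skimage.transform import resize

def extract_descriptor(im, y, x):
    """One descriptor per keypoint (y, x): 40x40 window -> 8x8,
    normalized and flattened per channel, channels concatenated."""
    patch = im[y - 20:y + 20, x - 20:x + 20]        # 40x40xC window
    small = resize(patch, (8, 8), anti_aliasing=True)
    vecs = []
    for c in range(small.shape[2]):
        v = small[..., c].ravel()                   # 1x64 per channel
        v = (v - v.mean()) / (v.std() + 1e-8)       # normalize
        vecs.append(v)
    return np.concatenate(vecs)                     # 3 * 64 = 192 values
```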


$40 \times 40$ feature descriptor from soda1.jpg

Feature Matching

Having extracted the features for each keypoint, these features are then compared between the two images. Each feature in ref is compared to all features in to_warp by sum of squared distances (SSD).

$$SSD = \sum^n_{i=1} (A_i - B_i)^2$$ where $A$ and $B$ are the two $n$-dimensional descriptor vectors being compared.

During this process, the nearest and second-nearest neighbors are recorded (the features producing the lowest error dist1 and the second-lowest error dist2), and the ratio dist1 / dist2 is computed. Following Lowe's technique, a valid match requires the nearest neighbor to be significantly better than the second-nearest; the ratio must therefore fall below a threshold, chosen to be $0.65$. If this inequality is not satisfied, the match is rejected.

These comparisons are done for all points in both images, and only mutual matches are kept.
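A sketch of this matching step (match_features is an illustrative name; desc_ref and desc_warp are the stacked descriptor matrices for the two images):

```python
import numpy as np

def match_features(desc_ref, desc_warp, ratio=0.65):
    """desc_ref is (N, D), desc_warp is (M, D); returns index pairs."""
    # SSD between every ref/to_warp descriptor pair.
    ssd = ((desc_ref[:, None, :] - desc_warp[None, :, :]) ** 2).sum(-1)

    def lowe_matches(d):
        nn = np.argsort(d, axis=1)                  # neighbors by ascending SSD
        dist1 = d[np.arange(len(d)), nn[:, 0]]      # nearest neighbor
        dist2 = d[np.arange(len(d)), nn[:, 1]]      # second nearest
        ok = dist1 / dist2 < ratio                  # Lowe's ratio test
        return {(int(i), int(nn[i, 0])) for i in np.flatnonzero(ok)}

    fwd = lowe_matches(ssd)                          # ref -> to_warp
    bwd = {(j, i) for (i, j) in lowe_matches(ssd.T)} # to_warp -> ref
    return sorted(fwd & bwd)                         # mutual matches only
```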


soda1.jpg keypoints after feature matching

soda0.jpg keypoints after feature matching

Random Sample Consensus (RANSAC)

Even after ANMS and feature matching, outliers may remain among the correspondence points. To filter them out, Random Sample Consensus (RANSAC) is implemented. At each step of RANSAC, $4$ random pairs of points are sampled and used to compute a candidate homography $H$. For every pair of points $p, q$ ($p$ from to_warp and $q$ from ref), the error between $q$ and $Hp$ is computed; if $q$ lies within $2$ pixels of $Hp$, the pair is counted as an inlier. This is repeated for $10,000$ iterations, and the largest inlier set found is kept.
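A sketch of this loop, reusing the least-squares homography solver from earlier (ransac and the compute_H parameter are illustrative names):

```python
import numpy as np

def ransac(pts_warp, pts_ref, compute_H, n_iter=10000, tol=2.0):
    """pts_warp, pts_ref: matched (N, 2) point arrays; compute_H fits a
    homography to 4 correspondences. Returns a boolean inlier mask."""
    rng = np.random.default_rng()
    n = len(pts_warp)
    P = np.hstack([pts_warp, np.ones((n, 1))])  # homogeneous source points
    best = np.zeros(n, dtype=bool)
    for _ in range(n_iter):
        idx = rng.choice(n, 4, replace=False)   # random minimal sample
        H = compute_H(pts_warp[idx], pts_ref[idx])
        proj = P @ H.T
        proj = proj[:, :2] / proj[:, 2:3]       # de-homogenize H @ p
        err = np.linalg.norm(proj - pts_ref, axis=1)
        inliers = err < tol                     # within 2 pixels of q
        if inliers.sum() > best.sum():          # keep the largest set
            best = inliers
    return best
```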


soda1.jpg keypoints after RANSAC

soda0.jpg keypoints after RANSAC

These points are then used to compute $H$, which is applied to warp to_warp. After warping, to_warp and ref are stitched together to create the final mosaic.


leconte3.jpg

leconte2.jpg

Manually-stitched mosaic

Automatically-stitched mosaic

soda1.jpg

soda0.jpg

Manually-stitched mosaic

Automatically-stitched mosaic

moffitt0.jpg

moffitt1.jpg

Manually-stitched mosaic

Automatically-stitched mosaic

Conclusion

This project was very challenging, and honestly, I found it most rewarding to learn about image rectification. It was fascinating to see how the face of a skewed rectangular object can be recovered through linear algebra, and to learn about the applications of this tool in art history.