Prelim notes: Vision

Cameras

In most camera models, light passes through a pinhole and falls on a detecting screen on the opposite side. Ideally, this pinhole is a point, so it allows exactly one ray in each direction to pass through (otherwise it admits a small solid angle of rays). By convention we invert the coordinates on the screen. Then, to find the image $(x', y')$ of a point $(x, y, z)$ in the scene, on a screen at focal length $f$, we have $x'/f = x/z$, so $x' = f\,x/z$ (and likewise $y' = f\,y/z$).

Sometimes it’s useful to make approximations, esp. if it lets us get a linear model of the camera. One approximation is that all objects in the scene are at a roughly constant distance $z_0$ from the camera (we’re taking a picture of a mural). Then the image of every line segment is scaled by the same magnification factor: we have $x' = m x$ and $y' = m y$ for $m = f / z_0$. This is weak perspective. We might additionally assume that $m = 1$; then $x' = x$, $y' = y$. This is orthographic projection.
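
As a quick sketch (in numpy, with made-up point coordinates and illustrative values of $f$ and $z_0$), the three projection models look like:

```python
import numpy as np

def pinhole(P, f=1.0):
    """Perspective projection: x' = f x / z, y' = f y / z."""
    x, y, z = P
    return np.array([f * x / z, f * y / z])

def weak_perspective(P, f=1.0, z0=10.0):
    """All points assumed near the reference depth z0, so the
    magnification m = f / z0 is constant."""
    m = f / z0
    x, y, _ = P
    return m * np.array([x, y])

def orthographic(P):
    """Weak perspective with m = 1: just drop the depth coordinate."""
    x, y, _ = P
    return np.array([x, y])

P = np.array([2.0, 1.0, 10.0])   # a scene point (illustrative values)
print(pinhole(P), weak_perspective(P), orthographic(P))
```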

Or we can project onto the surface of a sphere. This has the nice property that the image of a sphere is always a circle (which is only true in planar perspective projection if the sphere lies at the center of the image; otherwise we get some other conic section). But the image of a line is also a circle in spherical projection, which is sometimes undesirable.

Real cameras of course have lenses; these allow them to gather more light (i.e. more than a single ray per scene point) while still maintaining a sharp image of points at some particular distance.

For small angles (paraxial optics), the following relation holds between a point $P$ at distance $d_1$ from a spherical refracting surface of radius $R$ and its image $P'$ at distance $d_2$, collinear with $P$ and the center of the surface:

$$\frac{n_1}{d_1} + \frac{n_2}{d_2} = \frac{n_2 - n_1}{R},$$

where $n_1$ is the index of refraction of the air, and $n_2$ is the index of refraction of the lens. TODO prove. Thus, in turn, for a thin lens we have, for a point at depth $-z$ and its image at depth $z'$,

$$\frac{1}{z'} - \frac{1}{z} = \frac{1}{f}, \qquad f = \frac{R}{2(n - 1)}.$$

Note that this looks just like pinhole perspective with the pinhole at the center of the lens (and, for distant objects, $z' \approx f$); rays behave as though they passed through that pinhole, but are only in sharp focus at the distance $z'$ given by the thin lens equation.

So far we’ve assumed an idealized setup in which the center of the image lies on the camera’s optical axis, everything is axis-aligned, and coordinates are continuous. What if we relax these assumptions (so the camera can move freely, the pixels can be non-square or skewed, etc.)? Divide these effects into intrinsic parameters, which relate actual properties of the measuring device to the ideal camera reference frame discussed above, and extrinsic parameters, which relate this reference frame to the reference frame of the scene.

Homogeneous coordinates include a constant term, letting us do translations with matrix multiplies. In particular, we can represent any change of coordinates by a rigid transformation as

$$\begin{pmatrix} P' \\ 1 \end{pmatrix} = \begin{pmatrix} R & t \\ 0^\top & 1 \end{pmatrix} \begin{pmatrix} P \\ 1 \end{pmatrix}$$

for a rotation matrix $R$ and translation vector $t$.
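
A minimal numpy sketch of this, with an arbitrary example rotation and translation:

```python
import numpy as np

theta = np.pi / 6                      # example rotation about the z-axis
R = np.array([[np.cos(theta), -np.sin(theta), 0],
              [np.sin(theta),  np.cos(theta), 0],
              [0,              0,             1]])
t = np.array([1.0, 2.0, 3.0])          # example translation

T = np.eye(4)                          # 4x4 homogeneous rigid transform
T[:3, :3] = R
T[:3, 3] = t

P = np.array([0.5, -1.0, 2.0, 1.0])    # point in homogeneous coordinates
P_new = T @ P                          # rotation and translation in one multiply
print(P_new[:3])
```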

Intrinsic parameters

We associate with our camera a normalized image plane parallel to the physical retina, but a unit distance from the pinhole. Then the image of a point $P = (x, y, z)$ (in homogeneous coordinates of the camera’s reference frame!) is $\hat p = (x/z,\ y/z,\ 1)$. Some modifications: pixels are not square, so we replace the single focal length with independent scale params $\alpha$ and $\beta$ (in pixel units); the origins are not centered, so we additionally need a translation by $u_0$ and $v_0$; finally, if the camera system is itself skewed, so the angle between the axes is $\theta \ne 90^\circ$, we need a skewing. Putting it all together, we have

$$p = \frac{1}{z} K \begin{pmatrix} I & 0 \end{pmatrix} P, \qquad K = \begin{pmatrix} \alpha & -\alpha \cot\theta & u_0 \\ 0 & \beta / \sin\theta & v_0 \\ 0 & 0 & 1 \end{pmatrix},$$

which projects points from a scene (for now still in the ideal camera frame) to the reference frame of the sensor.
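
A sketch of building this calibration matrix and projecting a camera-frame point to pixel coordinates (the parameter values below are placeholders):

```python
import numpy as np

def intrinsic_matrix(alpha, beta, u0, v0, theta=np.pi / 2):
    """K maps normalized image-plane coordinates to pixel coordinates:
    scaling by alpha, beta; skew for a non-right angle theta between the
    axes (theta = pi/2 means no skew); origin translated to (u0, v0)."""
    return np.array([[alpha, -alpha / np.tan(theta), u0],
                     [0.0,    beta / np.sin(theta),  v0],
                     [0.0,    0.0,                   1.0]])

def project(P_cam, K):
    """Perspective projection of a camera-frame point to pixel coordinates."""
    p = K @ P_cam          # homogeneous image point
    return p[:2] / p[2]    # divide by depth

K = intrinsic_matrix(alpha=800.0, beta=820.0, u0=320.0, v0=240.0)
print(project(np.array([0.2, -0.1, 2.0]), K))
```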

Extrinsic parameters

Assuming the change of coordinates is rigid, we can write this just as above, with a rotation $R$ and a translation $t$. The rotation has an explicit form (e.g. in terms of three angles), and in particular the rigid transformation has only six free parameters, but we won’t worry about the exact form for now. Then, to get a point from the world to the camera reference frame, we have $P_C = R\,P_W + t$; combining with the intrinsics, the full projection is $p = \frac{1}{z} K \begin{pmatrix} R & t \end{pmatrix} P_W$.

TODO this doesn’t look right.

Geometric camera calibration

Choose a set of reference points with known world coordinates and image projections, and find the least-squares solution to the above; this gives the full matrix $M = K \begin{pmatrix} R & t \end{pmatrix}$. Because this matrix is only defined up to scale, its 12 entries encode only 11 free camera parameters (5 intrinsic and 6 extrinsic), which we can then recover from the estimated entries.
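
A sketch of the least-squares step (often called the DLT), assuming we are handed $n \ge 6$ world points and their pixel projections: each correspondence contributes two homogeneous linear equations in the 12 entries of $M$, and the solution up to scale is the smallest right singular vector.

```python
import numpy as np

def calibrate_dlt(P_world, p_image):
    """Estimate the 3x4 projection matrix M (up to scale) from n >= 6
    correspondences: P_world is n x 3 world points, p_image is n x 2 pixels."""
    n = P_world.shape[0]
    A = np.zeros((2 * n, 12))
    for i in range(n):
        X = np.append(P_world[i], 1.0)          # homogeneous world point
        u, v = p_image[i]
        A[2 * i, 0:4] = X                       # row for  m1.X - u (m3.X) = 0
        A[2 * i, 8:12] = -u * X
        A[2 * i + 1, 4:8] = X                   # row for  m2.X - v (m3.X) = 0
        A[2 * i + 1, 8:12] = -v * X
    # Least-squares solution of A m = 0 subject to ||m|| = 1: the right
    # singular vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 4)
```

From the recovered $M$ one can then factor out the individual intrinsic and extrinsic parameters (e.g. via an RQ decomposition of its left $3\times 3$ block).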

Radiometry

Our models of a measuring device will usually involve a sphere onto which light falls. We can describe the “picture” a source creates on the surface of this ball by measuring its solid angle. In two dimensions, if I have a small line segment of length $dl$ at distance $r$ from the center of a circle, its projection onto the circumference of the circle subtends an angle $d\phi = \frac{dl \cos\theta}{r}$, where $\theta$ is the angle between the normal to the circle at the projection and the normal to the segment. In three dimensions, similarly, we have $d\omega = \frac{dA \cos\theta}{r^2}$ for a patch of area $dA$. We can also write $d\omega = \sin\theta \, d\theta \, d\phi$ in spherical coordinates.

Radiance is a directional measurement of the distribution of light in space, defined as the amount of energy arriving at (or leaving) some point in a specified direction, per unit time, per unit area perpendicular to the direction of travel, per unit solid angle. Think of the area in the denominator as belonging to the radiant surface, and the solid angle as belonging to the measuring device. For any pair of points $P$ and $Q$, the radiance leaving $P$ in the direction of $Q$ is equal to the radiance arriving at $Q$ from the direction of $P$. TODO proof.

Irradiance is the incident power per unit area, not foreshortened: the radiance arriving from a direction, multiplied by $\cos\theta \, d\omega$. Integrating over the whole input hemisphere gives the total incident power per unit area.

The bidirectional reflectance distribution function (BRDF) is the ratio of radiance in the outgoing direction to the incident irradiance.

Radiosity is the total power leaving a surface per unit area. It is the integral of the outgoing radiance, weighted by $\cos\theta$, over the exit hemisphere.

Directional hemispheric reflectance is the fraction of the irradiance arriving from some input direction that gets reflected (in any outgoing direction). It is the integral of the BRDF, weighted by $\cos\theta_o$, over the exit hemisphere. Surfaces whose reflected radiance does not depend on the outgoing direction (i.e. for which the BRDF is constant) are called ideal diffuse or Lambertian surfaces. For these, directional hemispheric reflectance is independent of the illumination direction and is called albedo.
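
For a Lambertian surface with constant BRDF $\rho$, the integral above has a closed form, giving the usual relation between the BRDF and the albedo $\rho_d$ (writing $d\omega = \sin\theta_o \, d\theta_o \, d\phi_o$):

$$\rho_d = \int_{\Omega} \rho \cos\theta_o \, d\omega = \rho \int_0^{2\pi}\!\!\int_0^{\pi/2} \cos\theta_o \sin\theta_o \, d\theta_o \, d\phi_o = \rho \cdot 2\pi \cdot \tfrac{1}{2} = \pi\rho .$$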

To summarize: Radiance is for light travelling in free space, irradiance is for light arriving at a surface, and radiosity is for light leaving a diffuse surface. The BRDF represents direction-dependent reflection off of general surfaces, directional hemispheric reflectance represents reflection off of surfaces where direction is unimportant, and albedo is DHR for a diffuse surface.

Geometric image features

What is the image of a solid on a flat surface? In general this projection is a projective transformation (linear in homogeneous coordinates), so algebraic (polynomial) curves remain algebraic curves. Curved surfaces are more complicated, as parts of a surface not on the boundary can nonetheless create occlusion. The occluding contour of a smooth surface is a smooth curve formed by fold points (where the viewing ray is tangent to the surface) and cusps (where two folds meet; only visible in transparent objects).

(Differential geometry)

Linear filters

For various tasks we want to obtain representations of an image where each pixel is replaced with a linear combination of its neighbors—we can use this to do smoothing, edge detection and feature extraction.

We call the set of weights for the neighbors a filter, and the process of applying the filter at each pixel a convolution. For a filter $H$ and image $F$, the convolution has the form

$$(H * F)(i, j) = \sum_{u, v} H(u, v)\, F(i - u, j - v).$$

Observe that convolution is a linear operator.
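
A direct (and deliberately slow) sketch of the discrete convolution above; in practice one would use a library routine such as scipy.signal.convolve2d.

```python
import numpy as np

def convolve2d(H, F):
    """Direct 2D convolution of filter H with image F ('same'-sized output,
    zero padding outside the image, kernel origin at its center)."""
    kh, kw = H.shape
    ph, pw = kh // 2, kw // 2
    Fp = np.pad(F, ((ph, ph), (pw, pw)))
    out = np.zeros_like(F, dtype=float)
    for i in range(F.shape[0]):
        for j in range(F.shape[1]):
            # correlation with the flipped kernel = convolution
            patch = Fp[i:i + kh, j:j + kw]
            out[i, j] = np.sum(H[::-1, ::-1] * patch)
    return out

F = np.random.rand(8, 8)
H = np.ones((3, 3)) / 9.0          # 3x3 box (mean) filter
print(convolve2d(H, F).shape)
```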

So far we’ve been working in a basis of single pixels (or Dirac deltas, in the continuous case), but it’s often convenient to work in a Fourier basis instead. Recall the Fourier transform

$$G(u, v) = \int\!\!\int g(x, y)\, e^{-2\pi i (ux + vy)} \, dx \, dy,$$

i.e. an inner product between the function $g(x, y)$ and a complex sinusoid at each spatial frequency $(u, v)$.

Important fact (“convolution theorem”): convolution in the spatial (or time) domain is pointwise multiplication in the frequency domain, and vice versa. TODO prove.
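
A quick sketch of one direction in 1-d (the 2-d case is identical): swap the order of integration and substitute $x' = x - \tau$,

$$\mathcal{F}\{f * g\}(u) = \int\!\!\int f(\tau)\, g(x - \tau)\, e^{-2\pi i u x}\, d\tau\, dx = \int f(\tau)\, e^{-2\pi i u \tau} \left( \int g(x')\, e^{-2\pi i u x'}\, dx' \right) d\tau = F(u)\, G(u).$$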

Most signals that we want to work with have an underlying continuous representation, but we only get discrete samples. What happens when we try to take the Fourier transform of a signal discretized in this fashion? We’ll represent our sampled signal as a sum of $\delta$s scaled appropriately, that is, as a pointwise product between the original function and a function with $\delta$s at all the sampling points (a “Dirac comb”). The convolution theorem says its Fourier transform is the convolution of the Fourier transforms of the two signals.

Important fact 2: the FT of a Dirac comb is a Dirac comb. TODO prove.

Thus the Fourier transform of a sampled signal is a sum of copies of the Fourier transform of the original function, displaced from one another by the sampling frequency (i.e. by an amount inversely proportional to the sample spacing). The Nyquist sampling theorem gives conditions under which none of these copies overlap (sample at more than twice the highest frequency present), making it possible to exactly recover the Fourier transform of the original function. Otherwise we get aliasing: high frequencies appear to be low frequencies, and the reconstructed image is distorted.
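
A tiny numeric illustration of aliasing, with made-up frequencies: a 9 Hz sinusoid sampled at 10 Hz produces exactly the same samples as a sinusoid at the aliased frequency 9 − 10 = −1 Hz.

```python
import numpy as np

fs = 10.0                               # sampling frequency (Hz)
t = np.arange(0, 1, 1 / fs)             # sample times over one second
high = np.sin(2 * np.pi * 9 * t)        # 9 Hz signal, above the Nyquist rate fs/2
low = np.sin(2 * np.pi * (-1) * t)      # its alias: 9 Hz folds down to -1 Hz
print(np.allclose(high, low))           # True: the samples are identical
```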

Edge detection

Useful feature for downstream vision tasks: all of the edges in an image. What is an edge? Heuristically, places with a large change in the intensity of the input image—that is, points where the derivative of the input signal is large.

In the spatial domain, differentiation is a convolution. In the Fourier domain, it corresponds to multiplying each component of the signal by a factor proportional to its frequency (so the constant term gets zeroed out, and higher frequencies get amplified; this is high school calculus). The discrete convolution for one partial derivative might look like the finite difference $\partial f / \partial x \approx f_{i+1, j} - f_{i, j}$, i.e. a filter with weights $(1, -1)$. This is extremely sensitive to noise, so we might want to smooth with a Gaussian before differentiating (more on this later).
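
A sketch with scipy of a symmetric finite-difference filter, with and without Gaussian pre-smoothing (the "image" here is just random noise, which makes the effect of smoothing easy to see):

```python
import numpy as np
from scipy import ndimage

image = np.random.rand(64, 64)                 # stand-in for a real image

# Symmetric finite-difference filter for d/dx: 0.5 * (f(x+1) - f(x-1)).
dx_filter = np.array([[0.5, 0.0, -0.5]])
grad_x = ndimage.convolve(image, dx_filter)

# Smoothing first suppresses noise; equivalently, convolve with the
# derivative of a Gaussian (smoothing and differentiation commute).
smoothed = ndimage.gaussian_filter(image, sigma=2.0)
grad_x_smooth = ndimage.convolve(smoothed, dx_filter)

print(np.abs(grad_x).mean(), np.abs(grad_x_smooth).mean())  # latter is smaller
```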

Instead of the first derivative, another heuristic is to look for zeros of the second derivative (corresponding, hopefully, to inflection points along boundaries). For this we can use the Laplacian $\nabla^2 f = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2}$, which is rotation-invariant (it has the same form as a squared Euclidean norm). Smoothing with a Gaussian, then applying the Laplacian, is the same as convolving with the Laplacian of the Gaussian. So doing this and then marking zero crossings gives us an edge detector. This filter behaves badly at corners and intersections.

More on noise: suppose we add (pointwise independent) Gaussian noise to an image, then filter. Linearity tells us that we only need to know what happens when we apply the filter to the noise signal alone; a little algebra tells us that the filtered noise has mean $\mu \sum_u H(u)$ and variance $\sigma^2 \sum_u H(u)^2$. Note in particular that smoothing with a probability distribution tends to suppress the noise, because the weights satisfy $\sum_u H(u) = 1$ with each $H(u) \le 1$, so $\sum_u H(u)^2 \le 1$.
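
A quick empirical check of the variance claim, using an arbitrary 5-tap mean filter on unit-variance noise:

```python
import numpy as np

rng = np.random.default_rng(0)
noise = rng.normal(0.0, 1.0, size=1_000_000)   # sigma^2 = 1
H = np.ones(5) / 5.0                            # mean filter; weights sum to 1

filtered = np.convolve(noise, H, mode="valid")
print(filtered.var(), np.sum(H ** 2))           # both approximately 0.2
```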

Modern way of doing edge detection: explicitly compute the gradient everywhere, then look for points where the gradient magnitude is a local maximum along the gradient direction. One way to link these into curves: pick a point that we think is an edge, compute the gradient, and then search around it for other points where the gradient is large and pointing in a consistent direction.

Misc. other useful smoothing techniques: median, erosion, dilation.

TODO named things: Canny, Sobel

Filters and features

Remember: a filter is just a dot product; the intensity at a given point is just a measurement of how much the region around that point looks like the filter. Can think of this as a change of basis.

First layers of the human visual system look a lot like a bunch of edge detectors.

TODO SIFT features TODO Hough transform

Stereopsis

Suppose I have a single point observed by a pair of cameras. Then it’s easy to figure out exactly where the point lies in the scene: just draw the rays outward from the two camera centers through the observed image points until they intersect. Assuming we have projection matrices for the pair of cameras, these define a system of linear equations that we can solve explicitly to find the depth of the point.

This becomes harder if we have multiple points, as the order (and even number) observed on each sensor might be different (easy example: small ball in front of big ball). Suppose we already have an alignment between points in the two pictures. Then we can also write down this system of equations, but it’s overconstrained, and if there’s any error in our estimation of the cameras’ projection matrices, it won’t have a solution. Easy fix: just find the least squares solution instead.
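
A sketch of that least-squares solution for a single matched point, assuming we are given the two $3\times 4$ projection matrices; each observation contributes two homogeneous linear equations in the point’s coordinates.

```python
import numpy as np

def triangulate(M1, M2, p1, p2):
    """Linear least-squares triangulation of one point from two views.
    M1, M2: 3x4 projection matrices; p1, p2: (u, v) pixel observations."""
    A = np.vstack([
        p1[0] * M1[2] - M1[0],    # u1 * (m3 . P) - (m1 . P) = 0
        p1[1] * M1[2] - M1[1],
        p2[0] * M2[2] - M2[0],
        p2[1] * M2[2] - M2[1],
    ])
    # Homogeneous least squares: smallest right singular vector of A.
    _, _, Vt = np.linalg.svd(A)
    P = Vt[-1]
    return P[:3] / P[3]           # dehomogenize
```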

Useful preprocessing step for stereo vision: rectify the images by making the two image planes lie parallel to a common baseline. We assume (as before) that each image plane is a unit distance from its origin (optical center). Then for any pair of aligned points in the two images, it’s useful to compute the disparity $d = x' - x$ between the left and right projections; from this we immediately have $z = -B/d$, where $B$ is the distance between the two origins $O$ and $O'$. (This seems wrong for the coordinate system in the book, but in any case it’s easy to reconstruct $z$ for an arbitrary such system.)

So how do we actually find correspondences between points? First thing we need is a similarity measure. Easy soln: define a sliding window, and measure the similarity of two points as the dot product between the windows centered on them. TODO modern approaches (sift, etc.).

Or look for edges and corners using previously discussed edge-finding techniques. Can also match features at a variety of scales (by smoothing first).

Once we have a similarity measurement, we need to align the points. In general this is a hard combinatorial matching problem, but if we make certain assumptions we can do better. In particular, if we assume monotonicity of the mapping (matches preserve left-to-right order along corresponding scanlines), then the dynamic program looks exactly like edit distance. Monotonicity is not always right (q.v. the small-ball example), but it lets us do this efficiently (and we can handle small deviations by increasing the state space).
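
A sketch of the edit-distance-style dynamic program over two scanlines of (one-dimensional) feature values, as mentioned above; the matching cost and the occlusion penalty are arbitrary choices.

```python
import numpy as np

def match_scanlines(left, right, occlusion_cost=1.0):
    """Ordered matching of two 1-D feature sequences, exactly the
    edit-distance recurrence: match, or skip a pixel in either image."""
    n, m = len(left), len(right)
    D = np.zeros((n + 1, m + 1))
    D[:, 0] = np.arange(n + 1) * occlusion_cost
    D[0, :] = np.arange(m + 1) * occlusion_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = D[i - 1, j - 1] + abs(left[i - 1] - right[j - 1])
            D[i, j] = min(match,
                          D[i - 1, j] + occlusion_cost,   # left[i-1] unmatched
                          D[i, j - 1] + occlusion_cost)   # right[j-1] unmatched
    return D[n, m]

print(match_scanlines([1.0, 2.0, 5.0, 7.0], [1.1, 5.2, 7.1]))
```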

Segmentation

One view of segmentation is that it’s basically just clustering. This means the usual clustering tools apply immediately. Start by defining some similarity measure between clusters, or some measure of cluster coherence. Then:

Mean shift algorithm: treat the pixels as data points in some feature space, form a kernel density estimate, move each point uphill toward a mode of the estimated density (the mean shift iteration), and group together the points that converge to the same mode.
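
A minimal mean shift sketch with a Gaussian kernel; the bandwidth, iteration count, and synthetic two-cluster data are arbitrary.

```python
import numpy as np

def mean_shift(points, bandwidth=1.0, iters=50):
    """Move every point uphill toward a mode of the kernel density estimate;
    points that end up (nearly) in the same place form one cluster."""
    shifted = points.copy()
    for _ in range(iters):
        for i, x in enumerate(shifted):
            w = np.exp(-np.sum((points - x) ** 2, axis=1) / (2 * bandwidth ** 2))
            shifted[i] = (w[:, None] * points).sum(axis=0) / w.sum()
    return shifted

data = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 6.0])
modes = mean_shift(data, bandwidth=1.0)
print(np.unique(np.round(modes), axis=0))   # roughly the two cluster centers
```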

Texture

What is texture? Collections of small objects, orderly patterns; hard to formalize. One approach that seems to line up well with our intuition: define a filter bank with a bunch of filters, apply these filters at different scales (and orientations), and describe the texture by the statistics of the responses.

Depth from range finding

Once we have depth data, we want to segment things into flat surfaces. Basically, just smooth and then look for rapid changes in the first and second derivatives. Or, may want to do it in the opposite direction—look for large discontinuities first, place boundaries at these, and then smooth to get the actual form of the reconstructed surfaces.

We may want to do object recognition based on these 3-d profiles. One way to get a good semi-local description of a surface is with spin images: at each oriented point, compute, for nearby points, the distance along the normal and the radial distance in the tangent plane, and construct a density map (2-d histogram) of these pairs. Kinect uses simple per-pixel depth-comparison features (in a decision forest) to classify points as belonging to different body parts. Can use this to compute joint positions.

Structure from motion

Given a moving camera (equivalently: a bunch of static cameras in different places) we want to reconstruct a 3-d scene. This is different from stereopsis because we don’t know the relationship of the cameras to the scene. Assuming the relevant coordinate transformations are rigid, we want to jointly recover the positions of the scene points, and a rotation and translation for every observing camera.

Note that this problem is naturally ambiguous; in particular, we cannot figure out the overall scale of the scene, and it is also ambiguous up to an arbitrary rigid transformation (the choice of world frame). Given $m$ cameras and $n$ points, we have $2mn$ equations in $6m$ camera unknowns and $3n$ geometric unknowns (of which we can fix 7 arbitrarily to account for the ambiguity), so the problem is only determined for $2mn \ge 6m + 3n - 7$. With two cameras, five points suffice. In general we can try to find a least-squares solution, minimizing the total squared reprojection error

$$E(\{R_i, t_i\}, \{P_j\}) = \sum_{i,j} \left\| p_{ij} - \frac{1}{z_{ij}} K_i \begin{pmatrix} R_i & t_i \end{pmatrix} P_j \right\|^2 .$$

This is nonconvex, so it’s important to have a good initial guess.

How do we do this?

Image classification

TODO deformable parts model

Camera calibration

Optimization techniques

— 4 August 2014