visionbook/representing_the_image.qmd at main · Invinsible-Coder/visionbook · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
# Representing the input

Briefly, let's consolidate the multiple ways in which we can represent
the input to a vision system.

An image can be represented in different ways making explicit certain
aspects of the information present on the input. Here we will discuss on
how can we describe the image itself, with a minimum processing. How can
we represented the array of pixel intensities recorded by the camera.

**Ordered array**. The simplest and most direct representation for an
image is an ordered array of pixel intensities or colors:
$$\mathbf{s} =
\left[
\begin{matrix}
s [1,1] & \dots & \\
\vdots & s [n,m] & \vdots \\
 & \dots  & s [N,M] \\
\end{matrix}
\right]$$ Each value is a sample on a regular spatial grid. In this
notation $s$ represents the pixel intensity at one location. This is the
representation we are most used to and the typical used when taking a
signal processing perspective.

**Unordered set of points**. Another representation is a collection of
points indicating its color and location explicitly:
$$S = \left\{ [s_i,x_i,y_i] : i \right\}$$ this representation is
commonly used when we want to make geometry explicit. $s_i$ is the pixel
intensity (or color) recorded at location $(x_i,y_i)$. With this
representation, we can apply geometric transformations easily by
directly working with the spatial coordinates. Although both previous
representations might seem equivalent, the set representation allows
easily dealing with other image geometries where the points are not on a
regular array.

**A function**. The image in represented as a continuous function whose
input is a location $(x, y)$ and it output is an intensity or a color,
$s$: $$s = f_{\theta}(x,y)$$ this representation is commonly used when
we want to make image priors more explicit. The function $f_{\theta}$ is
parameterized by the coefficients $\theta$. This function become
specially interesting when the parameters $\theta$ are different than
the original pixel values. They will have to be estimated from the
original image. But once learned, the function $f$ should be able to
take as input continuous spatial variables.

These three representations induce different ways of thinking about
architectures to process the visual input.

These representations are not opposed and can be used simultaneously.

The ordered array of pixels is the format that computers take as input
when visualizing images. Therefore, it is always important to be
familiar on how to transform any representation into an ordered array.

## Computing similarities