visionbook/visionscience.qmd at main · Invinsible-Coder/visionbook · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
# Looking at Images {#sec-looking_at_images}

## Introduction

Vision is about solving an amazing problem: Can we build a system
capable of perceiving its surrounding world using only the information
available in the small number of electromagnetic waves that fall into
the observer's sensor (eye or camera), after accidentally bouncing off
the objects surrounding the observer? How is it possible to sense the
world and its three-dimensional (3D) structure without touching it? It
seems quite likely that if humans did not have the sense of vision, we
would think that solving this problem is impossible.

The goal of this chapter is to present vision not as an engineering
problem that we want to solve, but as a set of scientific questions that
we want to answer. We want to understand how it is that our visual
system can interpret the world the way it does. We want to explain what
components of an image can be used to infer certain properties of the
world. We want to know what questions about the world can be answered by
just looking at an object with the light that reaches our eyes.

In this chapter we will not answer any of those questions, instead, we
will use images to motivate a set of questions that vision scientists
and computer vision engineers have been studying for many years. The
goal of this chapter is to help you ask questions about how do you
perceive the world around you.

## Looking at Individual Pixels

Before we build systems that can see, it is good to learn to look at
images and to pay attention to individual pixels. The image in
@fig-pexels-retha-ferguson is a tiny image with just $32 \times 32$
pixels.

:::{.column-margin}
What is the minimum number of pixels needed to form a recognizable image? The answer depends on the image. It is possible to create visually recognizable images with very few pixels @torralba_2009
:::

It is so small that we can actually look at each single pixel
and think about how it is that we can make sense of it. For instance,
let's look at the pixel in row 13 counting from the bottom, and column 12
counting from the left. That pixel corresponds to a dark gray pixel,
which is near a couple of other pixels with similar intensity,
surrounded by a set of white pixels. That pixel seems to be a pen, but
how do we know what it is? The pen is barely three pixels long and one
pixel wide. How can that provide enough information?

![A tiny image with 32 $\times$ 32 color pixels. Despite the very low
resolution, we can still recognize most of its content. *Source*:
Original image created with Dall-E, and downsampled with
Photoshop.](figures/visionscience/DALLE_32x32.png){#fig-pexels-retha-ferguson width="90%"}


:::{.column-margin}
Same image as in @fig-pexels-retha-ferguson but shown small and blurred to remove the artifacts due to the square pixels.

![](figures/visionscience/DALLE_32x32_blur.png){width="40%"}

Reducing the size and blurring helps better recognizing the objects despite containing the same visual information.
:::

Recognition of the meaning of that pixel can not come just from the
intensity of that pixel or the few pixels nearby. There are way too many
possible situations in which different objects in the world could map to
a similar set of pixel intensities. Instead, is the whole image and the
context of the pen that provides the necessary information to recognize
the pen. Taken in isolation, those pixels are impossible to recognize.
Here we show, from left to right, the hand, the blue mug, and the pen.

![Small image patches from in isolation. It is difficult to
recognize image patches outside their natural context.](figures/visionscience/dalle-small.png){#fig-small-dalle}


## The More You Look, the More You See

Visual perception cannot be formulated as a simple input-output function
over predefined domains. Vision is a dynamical system, even when looking
at a static image. The longer you look at an image, the more details you
see and the better you understand the scene. Just look at the image
shown in @fig-street_in_paris and try to write down everything you see
down to the smallest detail.

![A street in Paris. Is everything fine in this picture?](figures/visionscience/ADE_train_00015819_messed_small.jpg){#fig-street_in_paris width=".98"}

But this feeling of unlimited and effortless understanding of a picture
is also an illusion. Do you truly understand everything you see? For
instance, are you sure you can make sense of all the chair legs in
@fig-street_in_paris? In fact, this image has been manipulated so that
some of legs do not correspond to any chair.


:::{.column-margin}
When solving a task, as the **visual cognitive load**
increases we are more likely to make judgment mistakes.
:::


As you look around the image, notice that each patch you look at is
actually a small photo in its own right. In fact, a big image like this
contains thousands of tiny images within it (@fig-tiny_images). A single
photo can be *big data*; if you chop up this photo you get a big dataset
of tiny images. A common trick in computer vision is to take a method
that was developed for processing a dataset and instead apply it to the
set of patches in an image, or vice versa.

![Each picture contains lots of
images.](figures/visionscience/tiny_images.jpg){#fig-tiny_images
width="100%"}

## The Eye of the Artist

Learning to paint is a great way of learning to see. You start from a
white piece of canvas and a finite set of acrylic paints, and your goal
is to paint an object that should look real, with shadows, 3D shape, and
reflections. What mixtures of paint and what strokes can accomplish
that? Artists learn to see the world not by what it represents, but as
why it looks the way it does.


The following sequence @fig-agata_painting painted by Agata Lapedriza, shows different
steps in the painting process of a strawberry. Layer after layer, the
artist replicates the light that will fool your eye into seeing a
strawberry instead of acrylic paint. It is interesting to see that the
realism does not increase monotonically as the painting progresses. For
instance, from step 3 to 4, the right side of the strawberry has lost
realism. However, those new strokes will become the shading that will
make the strawberry pop out from the canvas.


![The following sequence shows different steps in the process
of painting a strawberry. Source: Original images by Agata Lapedriza.](figures/visionscience/strawberries.png){#fig-agata_painting}


## Tree Shadows and Image Formation

When walking under the shadow of a tree on a bright sunny day when the
sun is right above us (@fig-tree_pinholes\[left\]), we can see how some
light rays get through the holes left between leaves, creating spots on
the floor (@fig-tree_pinholes\[right\]).

![The shadow of this tree on a normal day projects dozens of
pictures of the sun on the ground.](figures/visionscience/shadow){#fig-tree_pinholes}

But something is a bit odd about the shape of the bright spots projected
on the floor. We would expect that the spots of light would have
irregular shapes as the openings between leaves will have all kinds of
irregular profiles (@fig-tree_pinholes\[left\]). Instead, as shown in
@fig-tree_pinholes(right), the spots appear to be circular and of
similar size. This observation already puzzled Aristotle. What could be
happening? As a hint, @fig-tree_shadow_eclipse shows the shadow of
leaves during a solar eclipse.

![During an eclipse, the shadow of the tree contains many copies of the
crescent
sun.](figures/visionscience/tree_shadow_eclipse2.jpg){#fig-tree_shadow_eclipse
width="90%"}

One of the most common situations that we often encounter is the
**pinhole cameras** formed by the spacing between the leaves of a tree.
The tiny holes between the leaves of a tree create a multitude of
pinholes. The pinholes created by the leaves project different copies of
the sun on the floor. When the openings between leaves are small enough,
the shape of the projected spot light is dominated by the shape of the
sun. This is something we see often but rarely think about the origin of
the spotlights that we see on the floor.

In fact, the leaves of a tree create pinholes that produce images in
many other situations. We will talk about pinhole cameras in .

## Horizontal or Vertical

Looking at the previous picture of the shadow of the tree, another
interesting question comes to mind: How do you know you are looking at a
picture of the ground? How do you know the orientation of the camera?

Look at the two drawings in @fig-vertical_or_horizontal. The lines in
the drawing on the left correspond to tiles in a floor, the line drawing
on the right is the same drawing rotated 90 degrees. Which one seems to
be an horizontal surface and which one seems vertical? Why? The
differences in the line orientations are subtle and yet it produces a
very different perception.

![Line drawings. The one on the left seems to correspond to an
horizontal surface while the one on the right looks like a vertical
surface.](figures/visionscience/vertical_or_horizontal.png){#fig-vertical_or_horizontal
width="100%"}

In the arrangement in the left, there are two **vanishing points** (the
points in which parallel lines intersect) and both are above the
drawing, compatible with the horizon line. However, on the drawing on
the right, one of the vanishing points is below the figure and gives the
impression that the camera is looking at a vertical wall (the camera
seems to be higher than the visible part of the wall). The horizon line
of the plane (the line that connects both vanishing points) is
horizontal in the left drawing and vertical on the right one.

## Motion Blur

When we are in a moving car and we take a picture from the passenger
window, often images will appear blurring, especially when taken at night
(@fig-Paris_from_moving_car). Blur is due to the fact that the camera
sensor is collecting light during a period of time while the car is
moving and the picture that results is the average of several translated
copies of the same image. However, note that not all of the objects are
equally blurry. Things that are far away might still look sharp while
things nearby appear very blurry.

![Picture of Paris taken from a moving
car.](figures/visionscience/motion_blur.jpg){#fig-Paris_from_moving_car
width="100%"}

The amount of blur is related to the car speed, the exposure time, and
the distance between the camera and each object in the scene. We will
talk more about how camera motion relates to distance in .

## Accidents Happen


:::{.column-margin}
Principle of continuity. In the following drawing, we
see one wave and one line:

![](figures/visionscience/principle_continuity_1.png){width="100%"}


Despite this, other groupings are also possible.

![](figures/visionscience/principle_continuity_2.png){width="100%"}


:::

 We learned about the generic view principle in , and this idea comes
up a lot in computer vision. But it turns out that in most photographs
it's easy to find accidents. This is because photos contain so many
things, thousands of tiny images, and there are so many coincidences
that could occur. @fig-visionscience-migrant_mother shows a famous
photo and we will look for some coincidences in it. The coincidences are
of a specific variety: generically, if you see smooth, continuous line
segments you assume that they lie on a single object because it's quite
a coincidence for two objects to line up just so their contours
perfectly align. The gestalt psychologists called this the **principle
of continuity**. The insets for @fig-visionscience-migrant_mother show
several locations in the photo "Migrant Mother" by Dorothea Lange where
exactly this happens. Can you find more?

![Source: Migrant Mother (1936), by Dorothea Lange. On the right are
four examples of coincidences that violate the Gestalt principle of
continuity. From top-left clockwise: finger boundary aligns with lip
contour, cloth aligns with eye, creases align on two different shirts,
sleeve boundary is tangent to line on a different
shirt.](figures/visionscience/accidents_migrant_mother.jpg){#fig-visionscience-migrant_mother
width="100%"}

You have to be careful about this when developing vision algorithms. If
you assume *no* accidents will happen, then you will be in for some
trouble. This is one reason why older, rule-based methods failed,
because they did not handle exceptions to the rules.

## Cues for Support

When can we tell by looking that one object supports another? Consider
the ball over a surface depicted in . The drop shadow provides a strong
cue that the ball is above the surface in @fig-cues_for_suppport (a) and @fig-cues_for_suppport (c),
while the slack string provides evidence for the ball being on the
surface in figures @fig-cues_for_suppport (b) and
@fig-cues_for_suppport (c). Only is a doctored photograph.

In which of these figures does the ball appear to be in contact with the
ground plane?


![Cues for support. The relative location between an object,
and its shadow on the ground is a powerful cue for contact.](figures/visionscience/cues.png){#fig-cues_for_suppport}

## Looking at Raindrops

When it rains you get to see a rare sight. Every raindrop refracts a
panorama of the scene around it (@fig-droplets). What you are seeing is
not just one photo but hundreds of images of the scene, one in each
raindrop, all from slightly different angles. It turns out specially
designed cameras have been made that do the same thing, and they are
called **light field cameras**. These cameras use an array of tiny
lenses rather than raindrops, but the result is similar: they capture
what the scene looks like from hundreds of different viewing angles, and
from that density of information they can achieve many interesting
tricks, such as triangulating the depth of objects in the scene and
synthetically refocusing the image after it has been taken.

![A naturally occurring light field camera. This picture of raindrops on
a window contains hundreds of tiny images of the scene, one in each
raindrop. The images to the right are individual zoomed in raindrops
from the photo on the
left.](figures/visionscience/droplets.jpg){#fig-droplets width="100%"}

## Plato's Cave

In the *Allegory of the Cave*, Plato imagined a group of prisoners
chained up in a cave whose only experience of the outside world is
through shadows cast on the cave's wall. When we look at images, we
think we are directly seeing reality, but actually the situation is much
more like Plato's cave. An image is a crude projection of the underlying
physical state. It collapses all the 3D geometry into a flat 2D grid of
pixels. From any given viewing angle, most of the scene is occluded.
Materials are only depicted in terms of how photons bounce off their
surfaces; what's below the surface may be completely omitted. When you
look at a picture, think about all the things you cannot see. There's
more you don't see than you do. One job of a vision system is to infer
everything you can't see given everything you can.


:::{.column-margin}
The allegory of the cave has several layers of
indirection: the shadows are cast by puppets that are controlled by
puppeteers who act out interpretations of the broader reality
outside.
:::


## How Do You Know Something Is Wet?

Our visual system allows us to recognize not just objects, but also
material. We can tell if something will feel hot or cold, soft or hard,
smooth or rugged. We can also recognize if something is wet or not
without touching it, as shown in @fig-wet_sand.

![Why does wet sand look dark? The answer to this question is
interesting, but not as much as the question
itself.](figures/visionscience/wet_sand_2.jpeg){#fig-wet_sand
width="80%"}

Why does sand look darker when it's wet? Does everything wet look darker
than when it is dry? How can we tell which parts of the sand are dark
because they are wet and which ones are dark because they are under a
shadow?

## Concluding Remarks

:::{.column-margin}
Your brain is like a detective looking at all the
clues that reveal what the world is made of.
:::

This chapter was about learning to ask questions about what we see, why
things look the way they do, and how do we recognize them. There are so
many things that we see every day that we recognize by the way they
look, and yet we seldom ask ourselves why they look that way. Why does
the sky appear blue? Why does the moon always look the same? Why do hot
things become redder? There are many other sources you can consult that
explore why things look the way they do in the world. A beautiful book
on this topic is by Marcel Minnaert @Minnaert1993, first published in
1937.

After reading this chapter, you will not know how to implement anything
new, nor will you have learned new math or new algorithms. The goal of
this chapter was to show you some of the questions you should ask about
the visual world that you experience every day. The exercise to do now
is to go around with your camera and take pictures of interesting visual
phenomena.