# A Simple Vision System---Revisited {#sec-simple_system_revisited}
## Introduction
The goal of this last chapter is to embrace the optimism of the 2020s,
when we think that large-scale pretrained models can do it all, and to use
a pretrained end-to-end vision system to solve the simple
visual world that we introduced in @sec-simplesystem.
In the early 2020s, a number of large pretrained models emerged (DALL-E @dalle1,
GPT, ChatGPT @openai2023gpt4, Autopilot) capable of solving a wide set of tasks
(natural language processing, image generation, object classification,
depth perception, programming, etc.). These models are trained
on datasets that cover a wide and diverse set of domains, which improves
their generalization, and they require massive amounts of computation. They
can be used as is, with little to no fine-tuning, or as components of
more complex systems. Researchers have shown many creative ways of
connecting these building blocks to produce astonishing results. However,
a close analysis shows that there is still an important gap to bridge in
order to build reliable and robust systems. Is the field of computer
vision in the 2020s naively optimistic like in the 1960s? It does not
seem to be the case, but only time will tell.
In this chapter we will revisit the simple-world problem that we analyzed
with a hand-crafted approach in @sec-simplesystem, this time by
looking for an appropriate high-performing pretrained model and applying
it out of the box to solve our task.
## A Simple Neural Network
When we introduced the simple world in chapter
@sec-simplesystem, our goal was to recover the
three-dimensional (3D) structure of the scene from a single image
(@fig-simplesystem_revisited_img1).
{width="40%" #fig-simplesystem_revisited_img1}
We derived an algorithm to recover the 3D scene structure from first
principles: we started by studying the image formation process, then
reasoned about where the information might be preserved, how to build a
representation that makes that information explicit, and finally how to
integrate all the evidence left in the image in order to recover the 3D
information that was lost during the image formation process.
In this chapter, we will instead download a pretrained model that has
been trained to solve the 3D perception problem from single images. The
current state of the art is MiDaS @Ranftl2021, @Ranftl2022. The system
is a neural network trained end-to-end on a diverse set of real
images with ground-truth depth information, as well as on stereo
images.
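
As a preview of how little user-side code is involved, here is a minimal sketch of downloading MiDaS through PyTorch Hub and computing a disparity map. The repository and model names (`intel-isl/MiDaS`, `DPT_Large`) follow the MiDaS documentation at the time of writing and may change; the input file name is a placeholder.

```python
import cv2
import torch

# Download the pretrained MiDaS model and its input transforms from PyTorch Hub.
midas = torch.hub.load("intel-isl/MiDaS", "DPT_Large")
midas.eval()
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")

# Load the input image and convert it to RGB (placeholder file name).
img = cv2.cvtColor(cv2.imread("simple_world.png"), cv2.COLOR_BGR2RGB)

with torch.no_grad():
    # The network predicts relative disparity (inverse depth up to scale and shift).
    prediction = midas(midas_transforms.dpt_transform(img))
    # Resize the prediction back to the input image resolution.
    disparity = torch.nn.functional.interpolate(
        prediction.unsqueeze(1), size=img.shape[:2],
        mode="bicubic", align_corners=False,
    ).squeeze().numpy()
```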
Maybe the neural net is not so simple but, in the end, from a user's
perspective, it is just a function call. Before running the system out
of the box, it is worth understanding how it was trained, what the
output means, and how it can be adapted to our task.
### How Was It Trained?
This model has been trained using a mix of four datasets with real
images. The images below (@fig-examples_training_midas) show one sample
from each dataset alongside their ground-truth depth.
{width="100%" #fig-examples_training_midas}
As the model is trained with a diverse collection of data sources,
there are ambiguities in the overall scale, the camera parameters, and the
stereo baselines. Therefore, the loss is defined so that it is
invariant to all of those ambiguities. The invariant loss makes the
training more robust, but it also makes the output less convenient to
use.
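
To make this concrete, here is a minimal sketch of the idea behind a scale-and-shift-invariant loss: before measuring the error, the prediction is aligned to the target with a per-image least-squares scale and shift. The actual MiDaS objective is more elaborate (it uses robust penalties and gradient-matching terms), so this is only an illustration of the invariance principle; the function name and tensor shapes are our own.

```python
import torch

def scale_shift_invariant_loss(pred, target, mask):
    # pred, target: (B, H, W) disparity maps; mask: (B, H, W) validity mask.
    pred, target = pred.flatten(1), target.flatten(1)
    mask = mask.flatten(1).float()
    n = mask.sum(dim=1).clamp(min=1)

    # Closed-form least squares for the scale s and shift t that minimize
    # || s * pred + t - target ||^2 over the valid pixels of each image.
    mean_p = (pred * mask).sum(1) / n
    mean_t = (target * mask).sum(1) / n
    cov = ((pred - mean_p[:, None]) * (target - mean_t[:, None]) * mask).sum(1) / n
    var = ((pred - mean_p[:, None]) ** 2 * mask).sum(1) / n
    s = cov / var.clamp(min=1e-8)
    t = mean_t - s * mean_p

    # Error after alignment: replacing the prediction by a*pred + b leaves this
    # value unchanged, so the network only needs to be right up to an unknown
    # scale and shift.
    aligned = s[:, None] * pred + t[:, None]
    return (((aligned - target) ** 2) * mask).sum() / mask.sum().clamp(min=1)
```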
To deal with the ambiguities mentioned before, the model is trained to
predict disparity (inverse depth up to scale and shift). This means that
we cannot use the output directly; we first need to convert it into
world coordinates.
### Adapting the Output {#sec-simple_system_revisited-adapting_the_output}
MiDaS returns disparity values, up to an unknown constant ($d_0$), but
what we need are world coordinates ($X,Y,Z$) for each pixel. To perform
this transformation, we will use what we learned in chapter
@sec-camera_parameters. We will need to know the intrinsic
and extrinsic camera parameters and also decide the value of $d_0$.
To translate the returned disparity for a given pixel
$d\left[n,m \right]$ into depth values $z\left[n,m \right]$, we need to
invert the disparity after adding the constant:
$$\begin{aligned}z \left[n,m \right] = \frac{1}{ d \left[n,m \right] + d_0}
\end{aligned}$$

Now we need to recover the world coordinates for every
pixel. Assuming a pinhole camera and using the depth
$z \left[n,m \right]$ and the camera matrix $\mathbf{K}$, we obtain the
3D coordinates for a pixel with image coordinates $(n,m)$ in the camera
reference frame as,
$$\begin{aligned}
\begin{bmatrix}
X \\
Y \\
Z
\end{bmatrix}
= z\left[n,m \right] \times \mathbf{K}^{-1} \times
\begin{bmatrix}
n \\
m \\
1
\end{bmatrix}
\end{aligned}
$${#eq-recoverworldcoord}
The camera parameters will have to account for the focal length, image
size, camera center, and the camera rotation with respect to the world
coordinates:
$$\mathbf{K} = \mathbf{M} \mathbf{R}
$$
The intrinsic camera parameters (using perspective projection) are:
$$
\begin{aligned}
\mathbf{M}
=
\begin{bmatrix}
a & 0 & n_0\\
0 & a & m_0 \\
0 & 0 & 1
\end{bmatrix}.
\end{aligned}
$$
:::{.column-margin}
{width="100%"}
:::
The extrinsic camera parameters will model the 3D rotation, which is
mainly around the $X$-axis (we will ignore the translation of the
origin):
$$
\begin{aligned}
\mathbf{R}
=
\begin{bmatrix}
1 & 0 & 0\\
0 & \cos(\theta) & -\sin(\theta) \\
0 & \sin(\theta) & \cos(\theta)
\end{bmatrix}.
\end{aligned}
$$
In order to apply @eq-recoverworldcoord, we need to define
certain parameters: $d_0$, $n_0$, $m_0$, $a$, and $\theta$. Although we
could determine these parameters using a calibration image, in this
case, we'll opt for a simpler heuristic approach:
- $d_0$: We will set this value to 0.
- $n_0,m_0$: These define the camera center point, with $n_0=N/2$ and
  $m_0=M/2$, where $N \times M$ is the input image size.
- $a$: Accounts for the image size and the camera focal length. One
standard setting is to make $a = N$, where $N$ is the image width.
- $\theta$: Represents the angle between the optical camera axis and
the ground plane. It would be possible to estimate this angle
accurately by determining the direction of the surface normal in
pixels corresponding to the ground and measuring the angle with
respect to the vertical axis. However, for our purposes, we will
simply set it to $\theta = \pi / 4$, as this is approximately the
angle at which the pictures were taken.
Now we are ready to use @eq-recoverworldcoord and the output
of the system.
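
Below is a minimal NumPy sketch of this conversion, using the heuristic parameter choices listed above. The function name and the small epsilon added to avoid dividing by zero are our own additions.

```python
import numpy as np

def disparity_to_world(disparity, d0=0.0, theta=np.pi / 4):
    # disparity[m, n]: row m in 0..M-1, column n in 0..N-1 (height M, width N).
    M, N = disparity.shape
    a, n0, m0 = N, N / 2.0, M / 2.0        # heuristic intrinsics from the text

    # Depth from disparity: z = 1 / (d + d0); epsilon avoids division by zero.
    z = 1.0 / (disparity + d0 + 1e-8)

    # Camera matrix K = M R (intrinsics times a rotation around the X-axis).
    M_int = np.array([[a, 0, n0],
                      [0, a, m0],
                      [0, 0, 1.0]])
    R = np.array([[1, 0, 0],
                  [0, np.cos(theta), -np.sin(theta)],
                  [0, np.sin(theta),  np.cos(theta)]])
    K_inv = np.linalg.inv(M_int @ R)

    # Homogeneous pixel coordinates [n, m, 1] for every pixel.
    n_grid, m_grid = np.meshgrid(np.arange(N), np.arange(M))
    pix = np.stack([n_grid, m_grid, np.ones_like(n_grid)], axis=-1)

    # @eq-recoverworldcoord: [X, Y, Z]^T = z * K^{-1} [n, m, 1]^T for each pixel.
    return z[..., None] * (pix @ K_inv.T)   # shape (M, N, 3)
```

Calling `disparity_to_world(disparity)` on the MiDaS output gives an $(X,Y,Z)$ triplet per pixel, which we can then treat as a point cloud.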
## From 2D Images to 3D
Let's start by running the model on the same input image that we used as a
running example in @sec-simplesystem. We just need to apply the neural
network to the input image, obtain the output disparity, and finally
transform the disparity into world coordinates. We do not have to detect
edges or any of the other features that we computed in @sec-simplesystem. Instead, the neural network has
learned to compute whatever features it deems necessary to solve the
task. We will not retrain the network using images from our simple
visual world.
Just as we did in @sec-simplesystem, to show that the algorithm for 3D
interpretation gives reasonable results, we can re-render the inferred
3D structure from different viewpoints.
{width="100%" #fig-views_midas}
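
One simple way to produce such re-renderings is to treat the recovered $(X,Y,Z)$ map as a colored point cloud and plot it from several virtual viewpoints. The sketch below uses Matplotlib and assumes `world` and `img` come from the previous snippets; it is only meant to illustrate the idea, not to reproduce the renderings in the figure.

```python
import matplotlib.pyplot as plt

# `world` is the (M, N, 3) array returned by disparity_to_world and `img` is
# the RGB input image; both come from the sketches above.
X, Y, Z = world[..., 0], world[..., 1], world[..., 2]
colors = img.reshape(-1, 3) / 255.0

fig = plt.figure(figsize=(9, 3))
for i, (elev, azim) in enumerate([(20, -60), (40, -90), (10, -120)]):
    ax = fig.add_subplot(1, 3, i + 1, projection="3d")
    # Plot the point cloud; axes are permuted so that "up" points upward.
    ax.scatter(X.ravel(), Z.ravel(), -Y.ravel(), c=colors, s=0.5)
    ax.view_init(elev=elev, azim=azim)   # a different virtual camera per panel
    ax.set_axis_off()
plt.show()
```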
It is quite remarkable that it works somewhat well even though this
system has been trained with natural images, which are quite dissimilar
to the simple world we are using here. However, despite the surprising
outputs, a number of errors are clearly visible. For instance,
the ground is not recovered as a flat surface; instead it seems to be
concave. The cubes' edges are not straight, and the faces are twisted
rather than flat.
We can compare this result with the 3D images recovered using the
algorithm we defined in @sec-simplesystem:
{width="100%" #fig-chatgpt_blockworld}
However, while the learning-based model may have lost accuracy in this
example, which fits our simple-world assumptions, it has gained in
robustness and generalization. The following figure compares the results
obtained by running MiDaS (the learning-based approach) and the simple-world
model (the hand-crafted approach).
{width="100%" #fig-comparison_midas_simpleworld}
When the images fit the assumptions of the world model, the hand-crafted
approach can produce beautiful results. However, it is very fragile:
poor illumination can break the output of the edge detectors, which
results in a cascade of errors that propagate to the output, and the
approach completely fails even if we make a small change in the scene,
such as changing the ground plane from white to black. Conversely,
the learning-based approach seems to produce results of good (although
never perfect) quality regardless of the conditions.
## Large Language Model-based Scene Understanding
Another option is to use GPT4-v @openai2023gpt4v to interpret the scene.
GPT4-v can take images as input, which allows us to ask questions about
them. @fig-chatgpt_blockworld shows the result of asking GPT4-v to
describe a simple world image.
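
A query like the one in the figure can also be sent programmatically. The sketch below uses the OpenAI Python client; the model name and the image file name are placeholders, and the exact API details may differ from the version available when you read this.

```python
import base64
from openai import OpenAI

client = OpenAI()  # expects an API key in the OPENAI_API_KEY environment variable

# Encode the simple-world image so it can be attached to the request.
with open("simple_world.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # placeholder model name; may change over time
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Describe the objects in this scene and their 3D arrangement."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```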
This method is remarkably accurate, and now you know how it works. At
one level, GPT4-v is a transformer trained on massive data, using an
autoregressive modeling objective (@sec-transformers). On another level, you can understand
what rules that transformer may have learned: it detects edges in its
first few layers, uses filters that act like wavelets tiling Fourier
space, and reasons about perspective geometry to find vanishing
points and estimate shape (as we have discussed throughout many chapters of
this book).
{width="100%" #fig-chatgpt_blockworld}
But it's not all solved: if we reconstruct the image from this
text, the result does not look quite right (@fig-chatgpt_blockworld2). What is
missing? How would you improve it? As the field moves forward quickly,
it is likely that this particular example will be fixed, but there will
be other failures. And some of those failures will give clues about
the next fundamental question that will need to be addressed.
{width="50%" #fig-chatgpt_blockworld2}
## Unsolved Solved Computer Vision Problems
Computer vision is making amazing progress, but beware of the **unsolved
solved computer vision problems**. For some vision tasks, it might appear
as if there is one method that perfectly solves the task. It might
appear as if some foundational problems in computer vision are solved.
It might often seem as if it is not possible to make further progress on
certain questions. But more often than not, this is just an illusion, and
the solved problem was not really solved at all. And even when a problem
was genuinely solved (and it probably was not), it's important to remember
that for every solved problem, there are many unsolved problems lurking
in the background.
:::{.column-margin}
When everything seems to be solved, remember that
advances are the result of exploring new directions, with an open mind,
and that creativity is a powerful tool.
:::
## Concluding Remarks
What have we learned in this chapter? We learned that using a large
pretrained model kind of works; it is not as clean as the results we had
in @sec-simplesystem, but it works in situations where our
simple heuristic method broke down. And we can expect the performance of
future pretrained models to keep improving.
With the hand-crafted model presented in @sec-simplesystem, we had a comprehensive understanding
of the mechanisms, assumptions, and limitations of our system. Another
interesting advantage of the hand-crafted model is that it required no
learning and no training data. Once the model was written down, it was
ready to work. It worked really well when the conditions were met, but
it was fragile and unreliable as soon as we deviated from those
conditions. Our hand-crafted model did not generalize outside the space
defined by our assumptions. When the system failed, it was easy to point
out exactly what went wrong with it, but knowing what was wrong did not
help us make it work better.
Conversely, the pretrained system for 3D reconstruction seems to have
generalized to our setting, at least to some degree, and it is able to
deal with configurations that were not part of our modeling assumptions.
The pretrained model is robust to changes in illumination and does not
rely on fragile operations such as edge detection or figure-ground
segmentation. But somehow, equivalent operations might be taking place
inside the network. Information about 3D is still mostly contained in
the edges, and it has to be propagated from the edges and
junctions to the flat surfaces. How does the pretrained model work
exactly? Has it learned some of the heuristics we introduced by hand, or
is it using other ones?
:::{.column-margin}
{width="100%"}
:::
GPT4-v seems to be capable of recognizing the geometric figures and
their spatial configuration, but the integration with the rendering
engine seems fragile, and it was not able to create a new image with the
same elements (although it is very close!). But why did it fail?
The precise answers to these questions do not matter right now. The
field is evolving quickly and many of those problems will be addressed
over time. What matters is that you should challenge existing
approaches, and find what the relevant questions are for whatever new
systems you encounter.
With learning-based systems, we find ourselves doing experiments similar
to the ones performed in psychophysics, trying to understand the
mechanisms implemented by the human visual system.
{width="6%" fig-align="center"}