Commit 45e0df2

doc: Add error handling of SXM
Signed-off-by: Vincent Liu <shuntian.liu2@cloud.com>
1 parent 8716d26 commit 45e0df2

1 file changed

Lines changed: 102 additions & 2 deletions

doc/content/xapi/storage/sxm.md

@@ -8,7 +8,14 @@ Title: Storage migration
88
- [But we have storage\_mux.ml](#but-we-have-storage_muxml)
99
- [Thought experiments on an alternative design](#thought-experiments-on-an-alternative-design)
1010
- [Design](#design)
11-
- [SMAPIv1 Migration](#smapiv1-migration)
11+
- [SMAPIv1 migration](#smapiv1-migration)
12+
- [SMAPIv3 migration](#smapiv3-migration)
13+
- [Error Handling](#error-handling)
14+
- [Preparation (SMAPIv1 and SMAPIv3)](#preparation-smapiv1-and-smapiv3)
15+
- [Snapshot and mirror failure (SMAPIv1)](#snapshot-and-mirror-failure-smapiv1)
16+
- [Mirror failure (SMAPIv3)](#mirror-failure-smapiv3)
17+
- [Copy failure (SMAPIv1)](#copy-failure-smapiv1)
18+
- [SMAPIv1 Migration implementation detail](#smapiv1-migration-implementation-detail)
1219
- [Receiving SXM](#receiving-sxm)
1320
- [Xapi code](#xapi-code)
1421
- [Storage code](#storage-code)
@@ -109,8 +116,100 @@ Note that later on storage_smapi{v1,v3}_migrate.ml will still have the flexibili
109116
to call remote SMAPIv2 functions, such as `Remote.VDI.attach dest_sr vdi`, and
110117
it will be handled just as before.
111118

119+
## SMAPIv1 migration
112120

113-
## SMAPIv1 Migration
121+
At a high level, mirror establishment for SMAPIv1 works as follows:
122+
123+
1. Take a snapshot of a VDI that is attached to VM1. This gives us an immutable
124+
copy of the current state of the VDI, with all the data up to the point we took
125+
the snapshot. This is illustrated in the diagram as a VDI and its snapshot connecting
126+
to a shared parent, which stores the shared content for the snapshot and the writable
127+
VDI from which we took the snapshot (snapshot)
128+
2. Mirror the writable VDI to the server hosts: this means that all writes that go to the
129+
client VDI will also be written to the mirrored VDI on the remote host (mirror)
130+
3. Copy the immutable snapshot from our local host to the remote (copy)
131+
4. Compose the mirror and the snapshot to form a single VDI
132+
5. Destroy the snapshot on the local host (cleanup)
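
The five steps above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not xapi code: a VDI is modelled as an array of optional blocks, and `migrate`, `compose` and the `writes` parameter are hypothetical names introduced for illustration. It shows why composing the mirror on top of the copied snapshot yields the same content as the live source VDI.

```ocaml
(* Toy model, not xapi code: a VDI is an array of optional blocks, where
   [None] means the block has never been written. *)
type vdi = string option array

(* Step 4: blocks present in the mirror take precedence over the snapshot. *)
let compose ~(mirror : vdi) ~(snapshot : vdi) : vdi =
  Array.mapi
    (fun i block -> match block with Some _ -> block | None -> snapshot.(i))
    mirror

let migrate (source : vdi) ~(writes : (int * string) list) : vdi =
  (* Step 1: an immutable copy of the current state. *)
  let snapshot = Array.copy source in
  (* Step 2: from now on every write goes to both the source and the mirror. *)
  let mirror = Array.make (Array.length source) None in
  List.iter
    (fun (i, data) ->
      source.(i) <- Some data ;
      mirror.(i) <- Some data)
    writes ;
  (* Step 3: copy the snapshot to the remote host. *)
  let remote_snapshot = Array.copy snapshot in
  (* Step 4: compose the mirror and the copied snapshot into a single VDI. *)
  let result = compose ~mirror ~snapshot:remote_snapshot in
  (* Step 5: the local snapshot is no longer needed; in this model it is
     simply dropped when it goes out of scope. *)
  result

let () =
  let source = [|Some "a"; Some "b"; Some "c"|] in
  let migrated = migrate source ~writes:[(1, "B")] in
  (* The destination ends up identical to the (still running) source. *)
  assert (migrated = source)
```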
133+
134+
135+
More detail to come...
136+
137+
## SMAPIv3 migration
138+
139+
More detail to come...
140+
141+
## Error Handling
142+
143+
Storage migration is a long-running process, and is prone to failures in each
144+
step. Hence it is important to specify what errors could be raised at each step
145+
and their significance. This is beneficial both for the user and for triaging.
146+
147+
There are two general cleanup functions in SXM: `MIRROR.receive_cancel` and
148+
`MIRROR.stop`. The former is for cleaning up whatever has been created by `MIRROR.receive_start`
149+
on the destination host (such as VDIs for receiving mirrored data). The latter is
150+
a more comprehensive function that attempts to "undo" all the side effects that
151+
were produced during the SXM, and also calls `receive_cancel` as part of its operation.
152+
153+
Currently, error handling is done by building up a list of cleanup functions in
154+
the `on_fail` list ref as the function executes. For example, once `receive_start`
155+
has completed successfully, `receive_cancel` is added to the list of cleanup functions.
156+
Whenever an exception is encountered, we just execute whatever has been added
157+
to the `on_fail` list ref. This is convenient, but does entangle all the error
158+
handling logic with the core SXM logic itself, making the code rather hard
159+
to understand and maintain.
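
A minimal sketch of the `on_fail` pattern described above may help. It is an illustration only: the functions here are hypothetical stand-ins that merely record their names, and the real `receive_start` and `receive_cancel` take arguments that are omitted.

```ocaml
(* Hypothetical stand-ins for the real operations; they only record a name. *)
let log = ref []

let record msg = log := msg :: !log

let receive_start () = record "receive_start"

let receive_cancel () = record "receive_cancel"

(* A later step that always fails, to trigger the cleanup path. *)
let start_mirror () = failwith "mirror failed"

let migrate () =
  (* Cleanup functions accumulated so far, most recent first. *)
  let on_fail : (unit -> unit) list ref = ref [] in
  try
    receive_start () ;
    on_fail := receive_cancel :: !on_fail ;
    start_mirror ()
  with e ->
    (* Run every registered cleanup, ignoring failures in the cleanups. *)
    List.iter (fun f -> try f () with _ -> ()) !on_fail ;
    raise e

let () =
  (try migrate () with Failure _ -> ()) ;
  assert (List.rev !log = ["receive_start"; "receive_cancel"])
```

Note how the registration of each cleanup sits in the middle of the main logic, which is the entanglement this section refers to.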
160+
161+
The idea to fix this is to introduce explicit "stages" during the SXM and define
162+
explicitly what error handling should be done if it fails at a certain stage. This
163+
helps separate the error handling logic into the `with` part of a `try with` block,
164+
which is where it is supposed to be. Since we need to accommodate the existing
165+
SMAPIv1 migration (which has more stages than SMAPIv3), the following stages are
166+
introduced: preparation (v1, v3), snapshot (v1), mirror (v1, v3), copy (v1). Note that
167+
each stage also roughly corresponds to a helper function that is called within `MIRROR.start`,
168+
which is the wrapper function that initiates storage migration. Each helper
169+
function also has error handling logic of its own as
170+
needed (e.g. see `Storage_smapiv1_migrate.receive_start`) to deal with exceptions
171+
that happen within that helper function.
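
The staged approach could look roughly like the following sketch for the SMAPIv1 flow. All names here (`stage`, `cleanup`, `cleanup_for`, `run_stage`, `start` and its labelled arguments) are hypothetical and chosen for illustration; the mapping from stage to cleanup follows the per-stage subsections of this document, with `Receive_cancel` and `Stop` standing in for `MIRROR.receive_cancel` and `MIRROR.stop`.

```ocaml
type stage = Preparation | Snapshot | Mirror | Copy

(* Carries the stage at which the original exception was raised. *)
exception Stage_failed of stage * exn

(* Hypothetical cleanup actions; [Receive_cancel] and [Stop] stand in for
   [MIRROR.receive_cancel] and [MIRROR.stop]. *)
type cleanup = Nothing | Receive_cancel | Stop

(* Which stage-level cleanup applies when a given stage fails. *)
let cleanup_for = function
  | Preparation -> Nothing
  | Snapshot | Mirror -> Receive_cancel
  | Copy -> Stop

(* Run one stage, tagging any exception with the stage it came from. *)
let run_stage stage (f : unit -> unit) =
  try f () with e -> raise (Stage_failed (stage, e))

let start ~prepare ~snapshot ~mirror ~copy ~do_cleanup =
  try
    run_stage Preparation prepare ;
    run_stage Snapshot snapshot ;
    run_stage Mirror mirror ;
    run_stage Copy copy
  with Stage_failed (stage, e) ->
    (* All stage-level error handling lives here, in the [with] part. *)
    do_cleanup (cleanup_for stage) ;
    raise e

let () =
  let performed = ref [] in
  let ok () = () in
  (try
     start ~prepare:ok ~snapshot:ok
       ~mirror:(fun () -> failwith "mirror failed")
       ~copy:ok
       ~do_cleanup:(fun c -> performed := c :: !performed)
   with Failure _ -> ()) ;
  assert (!performed = [Receive_cancel])
```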
172+
173+
### Preparation (SMAPIv1 and SMAPIv3)
174+
175+
The preparation stage generally corresponds to what is done in `receive_start`, and
176+
this function itself will handle exceptions when there are partial failures within
177+
it, such as an exception raised after the receiving VDI is created.
178+
It will use the old-style `on_fail` list, but only with a limited scope.
179+
180+
There is nothing to be done at a higher level (i.e. within `MIRROR.start`, which
181+
calls `receive_start`) if preparation has failed.
182+
183+
### Snapshot and mirror failure (SMAPIv1)
184+
185+
For SMAPIv1, the mirror is set up in a somewhat cumbersome way. The end goal is to establish
186+
connections between two tapdisk processes on the source and destination hosts.
187+
To achieve this goal, xapi will do two main jobs: 1. create a connection between two
188+
hosts and pass the connection to tapdisk; 2. create a snapshot as a starting point
189+
of the mirroring process.
190+
191+
Therefore, the handling of failures at these two stages is similar: clean up what was
192+
done in the preparation stage by calling `receive_cancel`, and that is almost it.
193+
Again, we will leave whatever is needed for partial failure handling within those
194+
functions themselves and only clean up at the stage level in `storage_migrate.ml`.
195+
196+
Note that `receive_cancel` is a multiplexed function for SMAPIv1 and SMAPIv3, which
197+
means different cleanup logic will be executed depending on what type of SR we
198+
are migrating from.
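
One way to picture this multiplexing is with first-class modules. This is a sketch under assumptions, not the actual xapi implementation: the `MIRROR` signature is reduced to a single function returning a description string, and `Smapiv1_mirror`, `Smapiv3_mirror`, `sr_backend`, `choose_backend` and the `mirror_id` parameter are hypothetical names.

```ocaml
(* Hypothetical, heavily reduced signature: the real interface is larger and
   [receive_cancel] does not return a string. *)
module type MIRROR = sig
  val receive_cancel : mirror_id:string -> string
end

module Smapiv1_mirror : MIRROR = struct
  let receive_cancel ~mirror_id = "SMAPIv1 cleanup for " ^ mirror_id
end

module Smapiv3_mirror : MIRROR = struct
  let receive_cancel ~mirror_id = "SMAPIv3 cleanup for " ^ mirror_id
end

type sr_backend = SMAPIv1 | SMAPIv3

(* Pick the implementation that matches the SR being migrated from. *)
let choose_backend = function
  | SMAPIv1 -> (module Smapiv1_mirror : MIRROR)
  | SMAPIv3 -> (module Smapiv3_mirror : MIRROR)

(* Callers use a single entry point; the backend decides what actually runs. *)
let receive_cancel ~backend ~mirror_id =
  let module M = (val choose_backend backend : MIRROR) in
  M.receive_cancel ~mirror_id

let () =
  print_endline (receive_cancel ~backend:SMAPIv1 ~mirror_id:"m1") ;
  print_endline (receive_cancel ~backend:SMAPIv3 ~mirror_id:"m1")
```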
199+
200+
### Mirror failure (SMAPIv3)
201+
202+
To be filled...
203+
204+
### Copy failure (SMAPIv1)
205+
206+
The final step of storage migration for SMAPIv1 is to copy the snapshot from the
207+
source to the destination. At this stage, most of the side-effectful work has been
208+
done, so we do need to call `MIRROR.stop` to clean things up if we experience a
209+
failure during copying.
210+
211+
212+
## SMAPIv1 Migration implementation detail
114213

115214
```mermaid
116215
sequenceDiagram
@@ -1873,3 +1972,4 @@ let pre_deactivate_hook ~dbg ~dp ~sr ~vdi =
18731972
s.failed <- true
18741973
)
18751974
```
1975+
