tree/ntuple/doc/Merging.md
Lines changed: 44 additions & 16 deletions
@@ -15,11 +15,13 @@ Please note that the RNTupleMerger is currently experimental and the content of

 Currently there is no guarantee for the user about which mode will be used to generate the merged RNTuple.
 At the moment, this is how it works:
-- if both compression and encoding of the target column match those of the source column, L1 is used;
-- otherwise, if compression matches but encoding doesn't, L2 is used;
-- otherwise L3 is used.
+- if the compression of the target column matches that of the source column, L1 is used;
+- otherwise, L2 is used.

-Note that L0 and L4 are currently never used.
+L0, L3 and L4 are currently never used.
+
+**NOTE**: prior to ROOT 6.42, if two columns had the same compression but different encoding they would undergo L3 merging (implying a recompression and resealing);
+from 6.42 onwards the RNTupleMerger will instead attach a new column to the parent field as a new representation and L1-merge them.
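
The selection rule in the added lines can be summarized in a few lines of C++. This is a minimal sketch, not the RNTupleMerger implementation: `MergeLevel`, `ColumnInfo` and `SelectMergeLevel` are hypothetical names, and the compression setting is modeled as a plain integer.

```cpp
#include <cstdint>

// Hypothetical names for illustration only; not part of the ROOT API.
enum class MergeLevel { L1, L2 };

struct ColumnInfo {
   std::uint32_t compression; // compression setting of the column's pages (assumed to be an integer)
};

// Rule as described for ROOT 6.42 onwards: only the compression is compared.
// A differing encoding no longer forces a higher level, because the merger
// attaches an additional column representation instead.
MergeLevel SelectMergeLevel(const ColumnInfo &source, const ColumnInfo &target)
{
   return source.compression == target.compression ? MergeLevel::L1 : MergeLevel::L2;
}
```

L0, L3 and L4 are deliberately absent from the enum, since the text states they are currently never selected.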

 ## Goal
 The goal of the RNTuple merging process is producing one output RNTuple from *N* input RNTuples that can be used as if it were produced directly in the merged state. This means that:
@@ -44,15 +46,16 @@ Consequences of R3 and R4:
 The following properties are currently true but they are subject to change:

 * P1: all output pages have the **same compression** (which may be different from the input pages' compression);
-* P2: all pages in the same output column have the **same encoding** (which may be different from the inputs' encoding);
-* P3: the output clusters are **the same as the input clusters**;
-* P4: the output RNTuple **always has 1 cluster group**
+* P2: the output clusters are **the same as the input clusters**;
+* P3: the output RNTuple **always has 1 cluster group**
+
+Note that these properties influence and are influenced by the level of merging used.
+E.g. P1 is currently true because we only support L1 merging of pages with identical compressions. This is a limitation that we intend to lift at some point (both for L1 and L0 if we ever support it).
+P2 and P3 would not necessarily be true with L4 support (which might be desirable in some cases, e.g. to group pages into smaller/larger clusters).

-Note that these properties influence and are influenced by the level of merging used.
-E.g. P1 and P2 are currently true because we only support L1 merging of pages with identical compressions. This is a limitation that we intend to lift at some point (both for L1 and L0 if we ever support it).
-P3 and P4 would not necessarily be true with L4 support (which might be desirable in some cases, e.g. to group pages into smaller/larger clusters).
+Also note that the output pages coming from matching columns of a field may use mixed encodings.

-Therefore we *will* want to drop these properties at some point, in order to improve the capabilities of the Merger.
+Therefore we *will* want to drop at least some of these properties at some point, in order to improve the capabilities of the Merger.

 ## High-level description
 The merging process requires at least 1 input, in the form of an `RPageSource`.
@@ -64,14 +67,15 @@ In `Union` mode only, we allow any subsequent input RNTuple to define new fields
 ## Descriptor compatibility and validation
 Whenever a new input is processed, we compare its descriptor with the output descriptor to verify that merging is possible.

-The comparison function does 3 main things:
+The comparison function does 4 main things:
 - collect all "extra destination fields" (i.e. fields that exist in the output but not in this input RNTuple)
 - collect all "extra source fields" from the input RNTuple
-- collect and validate all common fields.
+- collect and validate all common fields
+- collect all columns that need to be extended with additional representations.

-If the Merging Mode is set to **Filter** we require the "extra destination fields" list to be empty.
-If the Merging Mode is set to **Strict** we require both the "extra destination fields" and "extra source fields" lists to be empty.
-If the Merging Mode is set to **Union**, the "extra source fields" list is used to late model extend the destination model.
+If the merging mode is set to **Filter** we require the "extra destination fields" list to be empty.
+If the merging mode is set to **Strict** we require both the "extra destination fields" and "extra source fields" lists to be empty.
+If the merging mode is set to **Union**, the "extra source fields" list is used to late model extend the destination model.
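
The per-mode requirements on the two "extra fields" lists can be expressed as a small predicate. The following is only a sketch under stated assumptions: `MergingMode`, `DescriptorDiff` and `IsMergeAllowed` are hypothetical names, fields are identified by name only, and **Union** is accepted unconditionally because the text states no emptiness requirement for it.

```cpp
#include <string>
#include <vector>

// Hypothetical types for illustration only; not the actual merger interface.
enum class MergingMode { Filter, Strict, Union };

struct DescriptorDiff {
   std::vector<std::string> extraDstFields; // fields in the output but not in this input
   std::vector<std::string> extraSrcFields; // fields in this input but not in the output
};

// Returns true if the field lists satisfy the requirement of the given mode.
bool IsMergeAllowed(MergingMode mode, const DescriptorDiff &diff)
{
   switch (mode) {
   case MergingMode::Filter: return diff.extraDstFields.empty();
   case MergingMode::Strict: return diff.extraDstFields.empty() && diff.extraSrcFields.empty();
   case MergingMode::Union:
      // Assumption: no list has to be empty; the extra source fields are
      // later used to extend the destination model.
      return true;
   }
   return false;
}
```

Validation of the common fields and the collection of columns needing extra representations are separate steps and are not modeled here.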

 As for common fields, they are matched by name and validated as follows:
 - any field that is projected in the destination must be also projected in the source and must be projected to the same field;
@@ -90,3 +94,27 @@ As for common fields, they are matched by name and validated as follows:


 <sup>1</sup>: these restrictions will likely not be required for L4 merging.
+
+## Column representation extension
+In all merging modes, we allow new column representations to be attached to the source fields. This is done to allow for L1 merging of columns with different encodings, which would otherwise require recompressing.
+These new column representations are added to the output RNTuple's footer and become part of its Schema Extension section. Note that in general these columns will be added as deferred *and* suppressed.
+
+**Technical note**: this is *not* done via the regular late model extension API, but uses internal functionality.
+
+We add new (physical) column representations in the following cases:
+
+- when one or more columns of a field have a different type than their matching counterparts in the destination RNTuple;
+- when one or more columns of a field have the same type but different metadata than their matching counterparts in the destination RNTuple (e.g. in case of a Real32Quant column, different bit width or value range).
+
+Whenever we extend a physical column that is referred to by one or more alias columns in some projected fields, we also add a corresponding new alias column in those fields.
+
+#### Example
+Suppose we merge source RNTuples **S1** and **S2**, each with the following fields:
+
+1. `foo` of type `int`
+1. `fooProj` projecting onto field `foo`
+
+Suppose that S1 is compressed and thus its `foo` field is represented by a column of type `kSplitInt32`, whereas S2 is uncompressed and its `foo` field is represented by a column `kInt32`.
+When merging S1 and S2 we collate those two representations under the same field `foo`, so that it will now have representations: `{kSplitInt32, kInt32}`.
+At the same time, we add a second alias column to the field `fooProj`, which will now have its first column aliasing the `kSplitInt32` column (column 0 of field `foo`) and its second one aliasing the `kInt32` one (column 1 of field `foo`).
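
The bookkeeping in this example can be modeled with a toy data structure. This is a sketch for illustration, not ROOT code: `ColumnType`, `Field`, `ProjectedField` and `ExtendRepresentation` are hypothetical stand-ins, each field is assumed to have a single column per representation, and every entry in `projections` is assumed to project onto the given field.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; these are not ROOT types.
enum class ColumnType { kSplitInt32, kInt32 };

struct Field {
   std::string name;
   std::vector<ColumnType> representations; // physical column representations
};

struct ProjectedField {
   std::string name;
   std::vector<std::size_t> aliasedColumns; // indices into the target field's representations
};

// Attach `type` to `field` as a new representation unless it is already present,
// and add a matching alias column to every projection of that field.
// Returns the index of the representation within `field`.
std::size_t ExtendRepresentation(Field &field, std::vector<ProjectedField> &projections, ColumnType type)
{
   auto &reps = field.representations;
   const auto it = std::find(reps.begin(), reps.end(), type);
   if (it != reps.end())
      return static_cast<std::size_t>(it - reps.begin());

   reps.push_back(type);
   const std::size_t newIndex = reps.size() - 1;
   for (auto &proj : projections)
      proj.aliasedColumns.push_back(newIndex);
   return newIndex;
}
```

Starting from `foo` with representations `{kSplitInt32}` and `fooProj` aliasing column 0, calling `ExtendRepresentation` with `kInt32` leaves `foo` with `{kSplitInt32, kInt32}` and `fooProj` aliasing columns 0 and 1, as in the example above.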