
Commit 4697f42

[ntuple][NFC] update Merging.md
1 parent 4b99138 commit 4697f42

1 file changed

Lines changed: 44 additions & 16 deletions


tree/ntuple/doc/Merging.md

@@ -15,11 +15,13 @@ Please note that the RNTupleMerger is currently experimental and the content of
1515

1616
Currently there is no guarantee for the user about which mode will be used to generate the merged RNTuple.
1717
At the moment, this is how it works:
18-
- if both compression and encoding of the target column match those of the source column, L1 is used;
19-
- otherwise, if compression matches but encoding doesn't, L2 is used;
20-
- otherwise L3 is used.
18+
- if the compression of the target column matches that of the source column, L1 is used;
19+
- otherwise, L2 is used.
2120

22-
Note that L0 and L4 are currently never used.
21+
L0, L3 and L4 are currently never used.
22+
23+
**NOTE**: prior to ROOT 6.42, if two columns had the same compression but different encoding they would undergo L3 merging (implying a recompression and resealing);
24+
from 6.42 onwards the RNTupleMerger will instead attach a new column to the parent field as a new representation and L1-merge them.
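
The updated selection rule can be sketched in a few lines of standalone C++. This is only an illustration of the rule as described above, not the RNTupleMerger code: `EMergeLevel`, `ColumnInfo` and `SelectMergeLevel` are hypothetical names, and the integer fields stand in for the real compression settings and column types.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names for illustration only; these are not ROOT types.
enum class EMergeLevel { kL1, kL2 };

struct ColumnInfo {
   std::uint32_t fCompression; // stand-in for the compression settings of the column's pages
   std::uint32_t fEncoding;    // stand-in for the on-disk column type
};

// Rule described above (ROOT 6.42 onwards): only the compression is compared.
// A differing encoding no longer changes the level, because the source column
// is attached to the field as an additional representation.
inline EMergeLevel SelectMergeLevel(const ColumnInfo &src, const ColumnInfo &dst)
{
   return src.fCompression == dst.fCompression ? EMergeLevel::kL1 : EMergeLevel::kL2;
}

// Usage: a few checks that spell out the rule.
inline void CheckExamples()
{
   // same compression, different encoding: still L1
   assert(SelectMergeLevel(ColumnInfo{505, 1}, ColumnInfo{505, 2}) == EMergeLevel::kL1);
   // different compression: L2, whatever the encoding
   assert(SelectMergeLevel(ColumnInfo{505, 1}, ColumnInfo{0, 1}) == EMergeLevel::kL2);
}
```

The `fEncoding` member is kept in the sketch only to make explicit that it does not take part in the decision.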
2325

2426
## Goal
2527
The goal of the RNTuple merging process is producing one output RNTuple from *N* input RNTuples that can be used as if it were produced directly in the merged state. This means that:
@@ -44,15 +46,16 @@ Consequences of R3 and R4:
4446
The following properties are currently true but they are subject to change:
4547

4648
* P1: all output pages have the **same compression** (which may be different from the input pages' compression);
47-
* P2: all pages in the same output column have the **same encoding** (which may be different from the inputs' encoding);
48-
* P3: the output clusters are **the same as the input clusters**;
49-
* P4: the output RNTuple **always has 1 cluster group**
49+
* P2: the output clusters are **the same as the input clusters**;
50+
* P3: the output RNTuple **always has 1 cluster group**
51+
52+
Note that these properties influence and are influenced by the level of merging used.
53+
E.g. P1 is currently true because we only support L1 merging of pages with identical compressions. This is a limitation that we intend to lift at some point (both for L1 and L0 if we ever support it).
54+
P2 and P3 would not necessarily be true with L4 support (which might be desirable in some cases, e.g. to group pages into smaller/larger clusters).
5055

51-
Note that these properties influence and are influenced by the level of merging used.
52-
E.g. P1 and P2 are currently true because we only support L1 merging of pages with identical compressions. This is a limitation that we intend to lift at some point (both for L1 and L0 if we ever support it).
53-
P3 and P4 would not necessarily be true with L4 support (which might be desirable in some cases, e.g. to group pages into smaller/larger clusters).
56+
Also note that the output pages coming from matching columns of a field may use mixed encodings.
5457

55-
Therefore we *will* want to drop these properties at some point, in order to improve the capabilities of the Merger.
58+
Therefore we *will* want to drop at least some of these properties at some point, in order to improve the capabilities of the Merger.
5659

5760
## High-level description
5861
The merging process requires at least 1 input, in the form of an `RPageSource`.
@@ -64,14 +67,15 @@ In `Union` mode only, we allow any subsequent input RNTuple to define new fields
6467
## Descriptor compatibility and validation
6568
Whenever a new input is processed, we compare its descriptor with the output descriptor to verify that merging is possible.
6669

67-
The comparison function does 3 main things:
70+
The comparison function does 4 main things:
6871
- collect all "extra destination fields" (i.e. fields that exist in the output but not in this input RNTuple)
6972
- collect all "extra source fields" from the input RNTuple
70-
- collect and validate all common fields.
73+
- collect and validate all common fields
74+
- collect all columns that need to be extended with additional representations.
7175

72-
If the Merging Mode is set to **Filter** we require the "extra destination fields" list to be empty.
73-
If the Merging Mode is set to **Strict** we require both the "extra destination fields" and "extra source fields" lists to be empty.
74-
If the Merging Mode is set to **Union**, the "extra source fields" list is used to late model extend the destination model.
76+
If the merging mode is set to **Filter** we require the "extra destination fields" list to be empty.
77+
If the merging mode is set to **Strict** we require both the "extra destination fields" and "extra source fields" lists to be empty.
78+
If the merging mode is set to **Union**, the "extra source fields" list is used to late model extend the destination model.
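
As a rough illustration of how the collected lists interact with the merging mode, here is a standalone sketch. It compares fields by name only and leaves out the validation of common fields and the collection of columns needing extra representations. All names (`EMode`, `Comparison`, `CompareFieldNames`, `IsMergeAllowed`) are hypothetical, and accepting any input in `Union` mode is an assumption of the sketch: the text above only says how the "extra source fields" list is used in that mode.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified model; not the RNTupleMerger data structures.
enum class EMode { kFilter, kStrict, kUnion };

struct Comparison {
   std::vector<std::string> fExtraDstFields; // in the output but not in this input
   std::vector<std::string> fExtraSrcFields; // in this input but not in the output
   std::vector<std::string> fCommonFields;   // present in both, matched by name
};

// Splits the field names of destination and source into the three lists.
inline Comparison CompareFieldNames(const std::vector<std::string> &dst, const std::vector<std::string> &src)
{
   Comparison res;
   for (const auto &name : dst) {
      if (std::find(src.begin(), src.end(), name) != src.end())
         res.fCommonFields.push_back(name);
      else
         res.fExtraDstFields.push_back(name);
   }
   for (const auto &name : src) {
      if (std::find(dst.begin(), dst.end(), name) == dst.end())
         res.fExtraSrcFields.push_back(name);
   }
   return res;
}

// Applies the per-mode requirements on the two "extra" lists.
inline bool IsMergeAllowed(EMode mode, const Comparison &cmp)
{
   switch (mode) {
   case EMode::kFilter: return cmp.fExtraDstFields.empty();
   case EMode::kStrict: return cmp.fExtraDstFields.empty() && cmp.fExtraSrcFields.empty();
   case EMode::kUnion: return true; // assumption: extra source fields extend the destination model
   }
   assert(false && "unhandled merging mode");
   return false;
}
```

For instance, with destination fields `{"a"}` and source fields `{"a", "c"}`, the sketch accepts the input in `Filter` and `Union` mode and rejects it in `Strict` mode.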
7579

7680
As for common fields, they are matched by name and validated as follows:
7781
- any field that is projected in the destination must be also projected in the source and must be projected to the same field;
@@ -90,3 +94,27 @@ As for common fields, they are matched by name and validated as follows:
9094

9195

9296
<sup>1</sup>: these restrictions will likely not be required for L4 merging.
97+
98+
## Column representation extension
99+
In all merging modes, we allow new column representations to be attached to the source fields. This is done to allow for L1 merging of columns with different encodings, which would otherwise require recompressing.
100+
These new column representations are added to the output RNTuple's footer and become part of its Schema Extension section. Note that in general these columns will be added as deferred *and* suppressed.
101+
102+
**Technical note**: this is *not* done via the regular late model extension API, but uses internal functionality.
103+
104+
We add new (physical) column representations in the following cases:
105+
106+
- when one or more columns of a field have a different type than their matching counterparts in the destination RNTuple;
107+
- when one or more columns of a field have the same type but different metadata than their matching counterparts in the destination RNTuple (e.g. for a Real32Quant column, a different bit width or value range).
108+
109+
Whenever we extend a physical column that is referred to by one or more alias columns in some projected fields, we also add a corresponding new alias column in those fields.
110+
111+
#### Example
112+
Suppose we merge source RNTuples **S1** and **S2**, each with the following fields:
113+
114+
1. `foo` of type `int`
115+
1. `fooProj` projecting onto field `foo`
116+
117+
Suppose that S1 is compressed and thus its `foo` field is represented by a column of type `kSplitInt32`, whereas S2 is uncompressed and its `foo` field is represented by a column of type `kInt32`.
118+
When merging S1 and S2 we collate those two representations under the same field `foo`, so that it now has the representations `{kSplitInt32, kInt32}`.
119+
At the same time, we add a second alias column to the field `fooProj`, which will now have its first column aliasing the `kSplitInt32` column (column 0 of field `foo`) and its second one aliasing the `kInt32` one (column 1 of field `foo`).
120+
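
The bookkeeping in this example can be mimicked with a small standalone model. It is a sketch under simplifying assumptions, not the internal functionality mentioned above: `PhysicalField`, `ProjectedField`, `Schema`, `AddRepresentation` and `MergeExample` are hypothetical names, column types are plain strings, and every projection in the list is assumed to alias the given field.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified model; not the ROOT descriptor classes.
struct PhysicalField {
   std::string fName;
   std::vector<std::string> fRepresentations; // one column type per representation
};

struct ProjectedField {
   std::string fName;
   std::vector<std::size_t> fAliasTargets; // indices into the target field's representations
};

struct Schema {
   PhysicalField fFoo;
   std::vector<ProjectedField> fProjections; // assumed to all project onto fFoo
};

// Attaches `columnType` to `field` unless it is already present. For a newly added
// representation, every projection gets a matching alias column.
// Returns the index of the representation within the field.
inline std::size_t
AddRepresentation(PhysicalField &field, std::vector<ProjectedField> &projections, const std::string &columnType)
{
   assert(!columnType.empty());
   for (std::size_t i = 0; i < field.fRepresentations.size(); ++i) {
      if (field.fRepresentations[i] == columnType)
         return i; // already known: nothing to extend
   }
   field.fRepresentations.push_back(columnType);
   const std::size_t newIndex = field.fRepresentations.size() - 1;
   for (auto &proj : projections)
      proj.fAliasTargets.push_back(newIndex);
   return newIndex;
}

// Replays the example: start from S1's schema, then merge in S2's representation of `foo`.
inline Schema MergeExample()
{
   Schema out;
   out.fFoo.fName = "foo";
   out.fFoo.fRepresentations.push_back("kSplitInt32"); // from S1 (compressed)
   ProjectedField proj;
   proj.fName = "fooProj";
   proj.fAliasTargets.push_back(0); // aliases column 0 of `foo`
   out.fProjections.push_back(proj);
   AddRepresentation(out.fFoo, out.fProjections, "kInt32"); // from S2 (uncompressed)
   return out;
}
```

After `MergeExample()`, `foo` holds the two representations `kSplitInt32` and `kInt32`, and `fooProj` holds two alias columns pointing at columns 0 and 1 of `foo`, as in the example.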
