Commit e1eff0f

GH-3561 Harden variant decoding
Test files are added to parquet-format project with commentary.
1 parent 5a6cf84 commit e1eff0f

7 files changed

Lines changed: 28 additions & 0 deletions

bad_data/README.md

@@ -33,3 +33,31 @@ These are files used for reproducing various bugs that have been reported.
* ARROW-GH-43605.parquet: dictionary index page uses rle encoding but 0 as rle bit-width.
* ARROW-GH-45185.parquet: test case of https://github.com/apache/arrow/issues/45185
  where repetition levels start with a 1 instead of 0.

## Directory `variants`

This subdirectory contains files with malformed variant structures.

Robust implementations of variant decoders SHOULD reject these.

| File                                                         | Malformed Structure                                                         |
|--------------------------------------------------------------|-----------------------------------------------------------------------------|
| `variants/int_overflow_in_bounds_check.parquet`              | Triggers an overflow if 32-bit multiplication is used to calculate ranges.  |
| `variants/out_of_range_dictionary_size.parquet`              | The dictionary is declared as larger than the data.                         |
| `variants/malformed_child_inside_well_formed_parent.parquet` | Parent is well formed; child is malformed.                                  |
| `variants/out_of_range_child_offset.parquet`                 | The offset of a child element is out of range.                              |
| `variants/out_of_range_element_count.parquet`                | The number of declared array elements is larger than the data.              |
| `variants/over_deep_nested_children.parquet`                 | The hierarchy is excessively deep.                                          |

The first of these is the most critical: it can trigger a memory allocation of many GiB, which may affect the operation of other worker threads in a shared process. An oversized dictionary may also trigger excessive memory allocation.
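
The overflow risk can be illustrated with a minimal sketch. It is not taken from any Parquet implementation: the function names and the header values are hypothetical, and `to_int32` only simulates the wrap-around that a 32-bit `int` multiplication would produce in a fixed-width language. Python integers do not overflow, so the division-based check is shown because it is the form that stays safe when ported.

```python
INT32_MASK = 0xFFFFFFFF


def to_int32(value: int) -> int:
    """Wrap an integer to a signed 32-bit value, simulating fixed-width int arithmetic."""
    value &= INT32_MASK
    return value - 0x1_0000_0000 if value & 0x8000_0000 else value


def unsafe_in_bounds(count: int, width: int, available: int) -> bool:
    """Flawed check: the product wraps around before it is compared."""
    return to_int32(count * width) <= available


def safe_in_bounds(count: int, width: int, available: int) -> bool:
    """Overflow-free check: compare the count against available // width."""
    if count < 0 or width <= 0 or available < 0:
        return False
    return count <= available // width


# Hypothetical declared values: 2**30 entries of 4 bytes each in a 500-byte buffer.
declared_count, entry_width, buffer_len = 1 << 30, 4, 500
print("unsafe check accepts:", unsafe_in_bounds(declared_count, entry_width, buffer_len))
print("safe check accepts:", safe_in_bounds(declared_count, entry_width, buffer_len))
```

Running the check before any buffer is allocated means a declared size is never trusted until it has been compared against the bytes actually present.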

The out-of-range child and element files contain metadata referring to content past the end of the actual data field.
In languages with strict range checks, this will fail on read; extra verification simply changes when the failure is detected.
In languages where range checks are not automatic, there is a risk of variant data referencing other data on the stack or in the heap.
As this data is read-only, there is no _direct_ threat to the integrity of the process, but it is still highly dangerous because unrelated memory contents may be exposed.
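
An explicit offset check could look like the sketch below. It assumes child start offsets have already been decoded into a list of integers; `check_child_offsets` and `MalformedVariantError` are hypothetical names, not part of any Parquet library. Note that Python slicing silently truncates out-of-range slices rather than failing, so an explicit check is useful even in a language with bounds-checked indexing.

```python
from typing import Sequence


class MalformedVariantError(ValueError):
    """Raised when variant metadata points outside the value buffer."""


def check_child_offsets(starts: Sequence[int], data_len: int) -> None:
    """Verify that every child start offset lies inside a buffer of data_len bytes."""
    for index, start in enumerate(starts):
        if not 0 <= start < data_len:
            raise MalformedVariantError(
                f"child {index} starts at offset {start}, "
                f"outside a {data_len}-byte value buffer"
            )


# A 5-byte buffer: slicing past the end yields empty bytes instead of an error.
value_buffer = bytes(5)
assert value_buffer[10:20] == b""

check_child_offsets([0, 3, 4], len(value_buffer))  # passes silently
```

Calling `check_child_offsets([0, 9], 5)` raises `MalformedVariantError`, which moves detection to the point where the metadata is parsed rather than where the child is read.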

One notable file is `bad_data/variants/over_deep_nested_children.parquet`, which verifies that variants with children nested more than 500 levels deep are rejected. This number is subjective; it was chosen to be consistent with the JSON parser `org.apache.parquet.variant.VariantJsonParser`.
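
A depth limit of this kind can be sketched as follows. Nested Python lists and dicts stand in for decoded variant arrays and objects, and depth is counted as the number of enclosing containers; both are assumptions made for illustration, as are the names `check_depth`, `VariantTooDeepError`, and `nest`. The traversal uses an explicit stack so that the check itself cannot exhaust the call stack on hostile input.

```python
MAX_NESTING_DEPTH = 500  # the limit cited in the text above


class VariantTooDeepError(ValueError):
    """Raised when a value is nested more deeply than the permitted limit."""


def check_depth(value, max_depth: int = MAX_NESTING_DEPTH) -> int:
    """Return the deepest nesting level found, raising if it exceeds max_depth."""
    deepest = 0
    stack = [(value, 0)]  # (node, number of enclosing containers)
    while stack:
        node, depth = stack.pop()
        if depth > max_depth:
            raise VariantTooDeepError(f"nesting depth exceeds {max_depth}")
        deepest = max(deepest, depth)
        if isinstance(node, list):
            stack.extend((child, depth + 1) for child in node)
        elif isinstance(node, dict):
            stack.extend((child, depth + 1) for child in node.values())
    return deepest


def nest(levels: int):
    """Build a scalar wrapped in the given number of single-element lists."""
    value = 0
    for _ in range(levels):
        value = [value]
    return value
```

With these definitions, `check_depth(nest(500))` is accepted while `check_depth(nest(501))` raises `VariantTooDeepError`.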

These tests currently exclude any file that exercises an explicit limit on the size of a variant.
Apache Spark places a limit of 128 MiB on each of the metadata and value fields.
Six binary files added (not shown): 524 bytes, 538 bytes, 501 bytes, 501 bytes, 508 bytes, 3.57 KB.
