Commit f39f275

GH-48868: [Doc] Document security model for the Arrow formats (#48870)
### Rationale for this change

Accessing Arrow data or any of the formats can have non-trivial security implications, this is an attempt at documenting those.

### What changes are included in this PR?

Add a Security Considerations page in the Format section.

**Doc preview:** https://s3.amazonaws.com/arrow-data/pr_docs/48870/format/Security.html

### Are these changes tested?

N/A

### Are there any user-facing changes?

No.

* GitHub Issue: #48868

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
1 parent d31644a commit f39f275

5 files changed, +306 -4 lines changed

docs/source/developers/cpp/fuzzing.rst

Lines changed: 24 additions & 4 deletions
@@ -26,10 +26,10 @@ Fuzzing Arrow C++
2626
To make the handling of invalid input more robust, we have enabled
2727
fuzz testing on several parts of the Arrow C++ feature set, currently:
2828

29-
* the IPC stream format
30-
* the IPC file format
31-
* the Parquet file format
32-
* the CSV file format
29+
* the IPC stream reader
30+
* the IPC file reader
31+
* the Parquet file reader
32+
* the CSV file reader
3333

3434
We welcome any contribution to expand the scope of fuzz testing and cover
3535
areas ingesting potentially invalid or malicious data.
@@ -110,3 +110,23 @@ dependencies, you may need to install these before building the fuzz targets:
110110
111111
$ conda install clang clangxx compiler-rt
112112
$ cmake .. --preset=fuzzing
113+
114+
115+
.. _fuzz-regression-files:
116+
117+
Regression files
118+
================
119+
120+
When a fuzzer-detected bug is found and fixed, the corresponding reproducer
121+
must be stored in the `arrow-testing <https://github.com/apache/arrow-testing/>`__
122+
repository to ensure that the code doesn't regress.
123+
124+
The locations for these files are as follows:
125+
126+
* IPC streams: in the ``data/arrow-ipc-stream`` `directory <https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc-stream>`__
127+
* IPC files: in the ``data/arrow-ipc-file`` `directory <https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc-file>`__
128+
* Parquet files: in the ``data/parquet/fuzzing`` `directory <https://github.com/apache/arrow-testing/tree/master/data/parquet/fuzzing>`__
129+
* CSV files: in the ``data/csv/fuzzing`` `directory <https://github.com/apache/arrow-testing/tree/master/data/csv/fuzzing>`__
130+
131+
Most of those files are invalid files for their respective formats and stress
132+
proper error detection and reporting in the implementation code.

docs/source/format/CanonicalExtensions.rst

Lines changed: 2 additions & 0 deletions
@@ -285,6 +285,8 @@ UUID
285285
A specific UUID version is not required or guaranteed. This extension represents
286286
UUIDs as FixedSizeBinary(16) with big-endian notation and does not interpret the bytes in any way.
287287

288+
.. _opaque_extension:
289+
288290
Opaque
289291
======
290292

docs/source/format/Integration.rst

Lines changed: 2 additions & 0 deletions
@@ -561,6 +561,8 @@ in ``datagen.py``):
561561
* Extension Types
562562

563563

564+
.. _format-gold-integration-files:
565+
564566
Gold File Integration Tests
565567
~~~~~~~~~~~~~~~~~~~~~~~~~~~
566568

docs/source/format/Security.rst

Lines changed: 277 additions & 0 deletions
@@ -0,0 +1,277 @@
1+
.. Licensed to the Apache Software Foundation (ASF) under one
2+
.. or more contributor license agreements. See the NOTICE file
3+
.. distributed with this work for additional information
4+
.. regarding copyright ownership. The ASF licenses this file
5+
.. to you under the Apache License, Version 2.0 (the
6+
.. "License"); you may not use this file except in compliance
7+
.. with the License. You may obtain a copy of the License at
8+
9+
.. http://www.apache.org/licenses/LICENSE-2.0
10+
11+
.. Unless required by applicable law or agreed to in writing,
12+
.. software distributed under the License is distributed on an
13+
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
14+
.. KIND, either express or implied. See the License for the
15+
.. specific language governing permissions and limitations
16+
.. under the License.
17+
18+
.. _format_security:
19+
20+
***********************
21+
Security Considerations
22+
***********************
23+
24+
This document describes security considerations when reading Arrow
25+
data from untrusted sources. It focuses specifically on data passed in a
26+
standardized serialized form (such as an IPC stream), as opposed to an
27+
implementation-specific native representation (such as ``arrow::Array`` in C++).
28+
29+
.. note::
30+
Implementation-specific concerns, such as bad API usage, are out of scope
31+
for this document. Please refer to the implementation's own documentation.
32+
33+
34+
Who should read this
35+
====================
36+
37+
You should read this document if you belong to either of these two categories:
38+
39+
1. *users* of Arrow: that is, developers of third-party libraries or applications
40+
that don't directly implement the Arrow formats or protocols, but instead
41+
call language-specific APIs provided by an Arrow library (as defined below);
42+
43+
2. *implementors* of Arrow libraries: that is, libraries that provide APIs
44+
abstracting away from the details of the Arrow formats and protocols; such
45+
libraries include, but are not limited to, the official Arrow implementations
46+
documented on https://arrow.apache.org.
47+
48+
49+
Columnar Format
50+
===============
51+
52+
Invalid data
53+
------------
54+
55+
The Arrow :ref:`columnar format <format_columnar>` is a binary
56+
representation with a focus on performance and efficiency. While the format
57+
does not store raw pointers, the contents of Arrow buffers are often
58+
combined and converted to pointers into the process' address space.
59+
Invalid Arrow data may therefore cause invalid memory accesses
60+
(potentially crashing the process) or access to non-Arrow data
61+
(potentially allowing an attacker to exfiltrate confidential information).
62+
63+
For instance, to read a value from a Binary array, one needs to 1) read the
64+
values' offsets from the array's offsets buffer, and 2) read the range of bytes
65+
delimited by these offsets in the array's data buffer. If the offsets are
66+
invalid (deliberately or not), then step 2) can access memory outside of the
67+
data buffer's range.
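
The offsets check described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the API of any Arrow implementation; the function name and its parameters are hypothetical, and it assumes 32-bit little-endian offsets as used by the Binary layout.

```python
import struct


def validate_binary_offsets(offsets_buf: bytes, data_len: int, length: int) -> None:
    """Raise ValueError if the int32 offsets are unsafe for a data buffer of data_len bytes.

    Hypothetical helper: `length` is the number of values in the array,
    so `length + 1` offsets are expected.
    """
    needed = (length + 1) * 4
    if length < 0 or len(offsets_buf) < needed:
        raise ValueError("offsets buffer too small for array length")
    offsets = struct.unpack(f"<{length + 1}i", offsets_buf[:needed])
    if offsets[0] < 0:
        raise ValueError("negative first offset")
    # Each value spans [offsets[i], offsets[i + 1]), so offsets must never decrease.
    for prev, cur in zip(offsets, offsets[1:]):
        if cur < prev:
            raise ValueError("offsets are not monotonically non-decreasing")
    if offsets[-1] > data_len:
        raise ValueError("last offset exceeds data buffer size")
```

Once these checks pass, every slice taken in step 2) is guaranteed to stay inside the data buffer.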
68+
69+
Another instance of invalid data lies in the values themselves. For example,
70+
a String array is only allowed to contain valid UTF-8 data, but an untrusted
71+
source might have emitted invalid UTF-8 under the disguise of a String array.
72+
An unsuspecting algorithm that is only specified for valid UTF-8 inputs might
73+
lead to dangerous behavior (for example by reading memory out of bounds when
74+
looking for a UTF-8 character boundary).
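
As a rough sketch of the value-level check, the following standard-library Python decodes each slot strictly and reports the first invalid one. The function name is hypothetical, and the sketch assumes the offsets have already been validated against the data buffer.

```python
def validate_utf8_values(offsets: list[int], data: bytes) -> None:
    """Raise ValueError if any value delimited by `offsets` is not valid UTF-8.

    Hypothetical helper; assumes `offsets` were already checked to be
    sorted and within the bounds of `data`.
    """
    for i in range(len(offsets) - 1):
        try:
            # Each value must be valid UTF-8 on its own: a multi-byte
            # character cannot straddle two adjacent values.
            data[offsets[i]:offsets[i + 1]].decode("utf-8")
        except UnicodeDecodeError as exc:
            raise ValueError(f"value {i} is not valid UTF-8") from exc
```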
75+
76+
Fortunately, given its schema, Arrow data can be validated up front,
77+
so that reading this data will not pose any danger later on.
78+
79+
.. TODO:
80+
For each layout, we should list the associated security risks and the recommended
81+
steps to validate (perhaps in Columnar.rst)
82+
83+
Advice for users
84+
''''''''''''''''
85+
86+
Arrow implementations often assume inputs follow the specification to provide
87+
high speed processing. It is **extremely recommended** that your application
88+
explicitly validates any Arrow data it receives in serialized form
89+
from untrusted sources. Many Arrow implementations provide explicit APIs to
90+
perform such validation.
91+
92+
.. TODO: link to some validation APIs for the main implementations here?
93+
94+
Advice for implementors
95+
'''''''''''''''''''''''
96+
97+
It is **recommended** that you provide dedicated APIs to validate Arrow arrays
98+
and/or record batches. Users will be able to utilize those APIs to assert whether
99+
data coming from untrusted sources can be safely accessed.
100+
101+
A typical validation API must return a well-defined error, not crash, if the
102+
given Arrow data is invalid; it must always be safe to execute regardless of
103+
whether the data is valid or not.
104+
105+
Uninitialized data
106+
------------------
107+
108+
A less obvious pitfall is when some parts of an Arrow array are left uninitialized.
109+
For example, if an element of a primitive Arrow array is marked null through its
110+
validity bitmap, the corresponding value slot in the values buffer can be ignored
111+
for all purposes. It is therefore tempting, when creating an array with null
112+
values, to not initialize the corresponding value slots.
113+
114+
However, this then introduces a serious security risk if the Arrow data is
115+
serialized and published (e.g. using IPC or Flight) such that it can be
116+
accessed by untrusted users. Indeed, the uninitialized value slot can
117+
reveal data left by a previous memory allocation made in the same process.
118+
Depending on the application, this data could contain confidential information.
119+
120+
Advice for users and implementors
121+
'''''''''''''''''''''''''''''''''
122+
123+
When creating an Arrow array, it is **recommended** that you never leave any
124+
data uninitialized in a buffer if the array might be sent to, or read by, an
125+
untrusted third party, even when the uninitialized data is logically
126+
irrelevant. The easiest way to do this is to zero-initialize any buffer that
127+
will not be populated in full.
128+
129+
If it is determined, through benchmarking, that zero-initialization imposes
130+
an excessive performance cost, a library or application may instead decide
131+
to use uninitialized memory internally as an optimization; but it should then
132+
ensure all such uninitialized values are cleared before passing the Arrow data
133+
to another system.
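
One possible way to clear such values before export is sketched below in standard-library Python. The function name and signature are hypothetical; the sketch assumes a fixed-width primitive array and relies on the columnar format's least-significant-bit ordering for validity bitmaps.

```python
import struct


def clear_null_slots(validity: bytes, values: bytearray, length: int, width: int) -> None:
    """Zero every value slot whose validity bit is unset.

    Hypothetical helper: `width` is the size in bytes of one value slot.
    """
    for i in range(length):
        # Validity bitmaps are LSB-first: slot i lives in bit (i % 8) of byte (i // 8).
        is_valid = (validity[i // 8] >> (i % 8)) & 1
        if not is_valid:
            values[i * width:(i + 1) * width] = bytes(width)


# Slot 1 is null but still holds leftover bytes from an earlier allocation.
validity = bytes([0b00000101])
values = bytearray(struct.pack("<3i", 7, 12345, 9))
clear_null_slots(validity, values, length=3, width=4)
```

Padding bytes past the last slot are also part of the buffer and deserve the same treatment.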
134+
135+
.. note::
136+
Sending Arrow data out of the current process can happen *indirectly*,
137+
for example if you produce it over the C Data Interface and the consumer
138+
persists it using the IPC format on some public storage.
139+
140+
141+
C Data Interface
142+
================
143+
144+
The C Data Interface contains raw pointers into the process' address space.
145+
It is generally not possible to validate that those pointers are legitimate;
146+
reading from such a pointer may crash the process or access unrelated or bogus data.
147+
148+
Advice for users
149+
----------------
150+
151+
You should **never** consume a C Data Interface structure from an untrusted
152+
producer, as it is by construction impossible to guard against dangerous
153+
behavior in this case.
154+
155+
Advice for implementors
156+
-----------------------
157+
158+
When consuming a C Data Interface structure, you can assume that it comes from
159+
a trusted producer, for the reason explained above. However, it is still
160+
**recommended** that you validate it for soundness (for example that the right
161+
number of buffers is passed for a given datatype), as a trusted producer can
162+
have bugs anyway.
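
A soundness check on the number of buffers might look like the sketch below. The table covers only a handful of C Data Interface format strings, and the function name is hypothetical; a real consumer would cover every type it supports.

```python
# Expected buffer counts for a few format strings, following the
# physical layouts of the columnar format (illustrative subset only).
EXPECTED_BUFFERS = {
    "n": 0,   # null: no buffers
    "i": 2,   # int32: validity, data
    "g": 2,   # float64: validity, data
    "u": 3,   # utf-8 string: validity, offsets, data
    "z": 3,   # binary: validity, offsets, data
    "+l": 2,  # list: validity, offsets
    "+s": 1,  # struct: validity
}


def check_n_buffers(format_str: str, n_buffers: int) -> None:
    """Raise ValueError if `n_buffers` does not match the layout of `format_str`."""
    expected = EXPECTED_BUFFERS.get(format_str)
    if expected is None:
        raise ValueError(f"unsupported format string {format_str!r}")
    if n_buffers != expected:
        raise ValueError(
            f"format {format_str!r} expects {expected} buffers, got {n_buffers}"
        )
```

Such a check catches producer bugs early, but it cannot make an untrusted producer safe, since the buffer pointers themselves remain unverifiable.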
163+
164+
165+
IPC Format
166+
==========
167+
168+
The :ref:`IPC format <ipc-message-format>` is a serialization format for the
169+
columnar format with associated metadata. Reading an IPC stream or file from
170+
an untrusted source comes with similar caveats as reading the Arrow columnar
171+
format.
172+
173+
The additional signaling and metadata in the IPC format come with
174+
their own risks. For example, buffer offsets and sizes encoded in IPC messages
175+
may be out of bounds for the IPC stream; Flatbuffers-encoded metadata payloads
176+
may carry incorrect offsets pointing outside of the designated metadata area.
177+
178+
Advice for users
179+
----------------
180+
181+
Arrow libraries will typically ensure IPC streams are structurally valid
182+
but may not also validate the underlying Array data. It is **extremely recommended**
183+
that you use the appropriate APIs to validate the Arrow data read from an untrusted IPC stream.
184+
185+
Advice for implementors
186+
-----------------------
187+
188+
It is **extremely recommended** to run dedicated validation checks when decoding
189+
the IPC format, to make sure that the decoding cannot induce unwanted behavior.
190+
Failing those checks should return a well-known error to the caller, not crash.
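
As an illustration of one such check, the sketch below verifies that a buffer described by an IPC message lies entirely within the message body. The function name and parameters are hypothetical and not taken from any Arrow implementation.

```python
def check_buffer_in_body(offset: int, length: int, body_size: int) -> None:
    """Raise ValueError unless [offset, offset + length) fits within the body."""
    if offset < 0 or length < 0:
        raise ValueError("negative buffer offset or length")
    # Compare against `body_size - offset` instead of computing
    # `offset + length`: in languages with fixed-width integers the sum
    # could overflow and wrongly pass the check.
    if offset > body_size or length > body_size - offset:
        raise ValueError("buffer extends past the end of the message body")
```

The caller gets an ordinary error it can report, instead of a crash or an out-of-bounds read later on.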
191+
192+
193+
Extension Types
194+
===============
195+
196+
Extension types typically register a custom deserialization hook so that they
197+
can be automatically recreated when reading from an external source (for example
198+
using IPC). The deserialization hook has to decode the extension type's parameters
199+
from a string or binary payload specific to the extension type.
200+
:ref:`Typical examples <opaque_extension>` use a bespoke JSON representation
201+
with object fields representing the various parameters.
202+
203+
When reading data from an untrusted source, any registered deserialization hook
204+
could be called with an arbitrary payload. It is therefore of primary importance
205+
that the hook be safe to call on invalid, potentially malicious, data. This mandates
206+
the use of a robust metadata serialization schema (such as JSON, but not Python's
207+
`pickle <https://docs.python.org/3/library/pickle.html>`__ or R's
208+
`serialize() <https://stat.ethz.ch/R-manual/R-devel/library/base/html/serialize.html>`__,
209+
for example).
210+
211+
Advice for users and implementors
212+
---------------------------------
213+
214+
When designing an extension type, it is **extremely recommended** to choose a
215+
metadata serialization format that is robust against potentially malicious
216+
data.
217+
218+
When implementing an extension type, it is **recommended** to ensure that the
219+
deserialization hook is able to detect, and error out gracefully, if the
220+
serialized metadata payload is invalid.
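
A defensive deserialization hook could be sketched as follows, using only Python's standard ``json`` module. The function name is hypothetical, and the two string parameters are illustrative field names for an extension type that serializes its parameters as a JSON object.

```python
import json


def deserialize_params(payload: bytes) -> dict:
    """Decode extension type parameters, raising ValueError on any invalid payload."""
    try:
        obj = json.loads(payload.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError, RecursionError) as exc:
        # RecursionError guards against pathologically nested input.
        raise ValueError("extension metadata is not valid JSON") from exc
    if not isinstance(obj, dict):
        raise ValueError("extension metadata must be a JSON object")
    params = {}
    for key in ("type_name", "vendor_name"):  # illustrative parameter names
        value = obj.get(key)
        if not isinstance(value, str):
            raise ValueError(f"missing or non-string field {key!r}")
        params[key] = value
    return params
```

Parsing JSON only ever builds plain data, whereas a format such as pickle can execute code chosen by whoever produced the payload.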
221+
222+
223+
Testing for robustness
224+
======================
225+
226+
Advice for implementors
227+
-----------------------
228+
229+
For APIs that may process untrusted inputs, it is **extremely recommended**
230+
that your unit tests exercise your APIs against typical kinds of invalid data.
231+
For example, your validation APIs will have to be tested against invalid Binary
232+
or List offsets, invalid UTF-8 data in a String array, etc.
233+
234+
Testing against known regression files
235+
''''''''''''''''''''''''''''''''''''''
236+
237+
The `arrow-testing <https://github.com/apache/arrow-testing/>`__ repository
238+
contains regression files for various formats, such as the IPC format.
239+
240+
Two categories of files are especially noteworthy and can serve to exercise
241+
an Arrow implementation's robustness:
242+
243+
1. :ref:`gold integration files <format-gold-integration-files>` that are valid
244+
files to exercise compliance with Arrow IPC features;
245+
2. :ref:`fuzz regression files <fuzz-regression-files>` that have been automatically
246+
generated each time a fuzzer finds a bug triggered by a specific (usually invalid)
247+
input for a given format.
248+
249+
Fuzzing
250+
'''''''
251+
252+
It is **recommended** that you go one step further and set up some kind of
253+
automated robustness testing against unforeseen inputs. One typical approach
254+
is through fuzzing, possibly coupled with a runtime instrumentation framework
255+
that detects dangerous behavior (such as Address Sanitizer in C++ or
256+
Rust).
257+
258+
A reasonable way of setting up fuzzing for Arrow is using the IPC format as
259+
a binary payload; the fuzz target should not only attempt to decode the IPC
260+
stream as Arrow data, but it should then validate the Arrow data.
261+
This will strengthen both the IPC decoder and the validation routines
262+
against invalid, potentially malicious data. Finally, if validation comes out
263+
successfully, the fuzz target may exercise some important core functionality,
264+
such as printing the data for human display; this will help ensure that the
265+
validation routine did not let through invalid data that may lead to dangerous
266+
behavior.
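
The decode, validate, then exercise structure of such a fuzz target can be sketched with a toy format standing in for IPC. Everything below is hypothetical and uses only the Python standard library; a real target would call the implementation's own IPC reader and validation API, and a coverage-guided fuzzer would supply the payloads instead of a random generator.

```python
import random
import struct


class InvalidData(Exception):
    """The only exception the target treats as an expected outcome."""


def decode(payload: bytes):
    # Toy stand-in for an IPC decoder: <u32 count><(count + 1) i32 offsets><data>.
    if len(payload) < 4:
        raise InvalidData("truncated header")
    (count,) = struct.unpack_from("<I", payload, 0)
    end = 4 + (count + 1) * 4
    if end > len(payload):
        raise InvalidData("truncated offsets")
    offsets = struct.unpack_from(f"<{count + 1}i", payload, 4)
    return offsets, payload[end:]


def validate(offsets, data) -> None:
    if offsets[0] < 0 or offsets[-1] > len(data):
        raise InvalidData("offsets out of bounds")
    if any(b < a for a, b in zip(offsets, offsets[1:])):
        raise InvalidData("offsets not sorted")


def fuzz_one(payload: bytes) -> None:
    try:
        offsets, data = decode(payload)
        validate(offsets, data)
    except InvalidData:
        return  # rejected cleanly: the desired outcome for bad input
    # Validation passed, so exercising the data must now be safe.
    values = [data[a:b] for a, b in zip(offsets, offsets[1:])]
    repr(values)


# Naive random driver; any exception other than InvalidData is a finding.
rng = random.Random(0)
for _ in range(1000):
    fuzz_one(rng.randbytes(rng.randrange(0, 32)))
```

Catching only the dedicated error type is the important design choice: any other exception, or a crash, escapes the target and is reported by the fuzzer.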
267+
268+
269+
Non-Arrow formats and protocols
270+
===============================
271+
272+
Arrow data can also be sent or stored using third-party formats such as Apache
273+
Parquet. Those formats may or may not present the same security risks as listed
274+
above (for example, the precautions around uninitialized data may not apply
275+
in a format like Parquet that does not create any value slots for null elements).
276+
We suggest you refer to these projects' own documentation for more concrete
277+
guidelines.

docs/source/format/index.rst

Lines changed: 1 addition & 0 deletions
@@ -37,5 +37,6 @@ Specifications
3737
Flight
3838
FlightSql
3939
ADBC
40+
Security
4041
Integration
4142
Glossary
