Commit 068cc96

GH-48868: [Doc] Document security model for the Arrow formats
1 parent 8010794 commit 068cc96

3 files changed

Lines changed: 254 additions & 0 deletions

File tree

docs/source/format/CanonicalExtensions.rst

Lines changed: 2 additions & 0 deletions
@@ -285,6 +285,8 @@ UUID
285285
A specific UUID version is not required or guaranteed. This extension represents
286286
UUIDs as FixedSizeBinary(16) with big-endian notation and does not interpret the bytes in any way.
287287

288+
.. _opaque_extension:
289+
288290
Opaque
289291
======
290292

docs/source/format/Security.rst

Lines changed: 251 additions & 0 deletions
@@ -0,0 +1,251 @@
1+
.. Licensed to the Apache Software Foundation (ASF) under one
2+
.. or more contributor license agreements. See the NOTICE file
3+
.. distributed with this work for additional information
4+
.. regarding copyright ownership. The ASF licenses this file
5+
.. to you under the Apache License, Version 2.0 (the
6+
.. "License"); you may not use this file except in compliance
7+
.. with the License. You may obtain a copy of the License at
8+
9+
.. http://www.apache.org/licenses/LICENSE-2.0
10+
11+
.. Unless required by applicable law or agreed to in writing,
12+
.. software distributed under the License is distributed on an
13+
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
14+
.. KIND, either express or implied. See the License for the
15+
.. specific language governing permissions and limitations
16+
.. under the License.
17+
18+
.. _format_security:
19+
20+
***********************
21+
Security Considerations
22+
***********************
23+
24+
This document describes potential security concerns with the various Arrow
25+
specifications in contexts where data cannot be fully trusted.
26+
27+
28+
Who should read this
29+
====================
30+
31+
This document targets two categories of readers:
32+
33+
1. *implementors* of Arrow libraries: that is, libraries that provide APIs
34+
abstracting away from the details of the Arrow formats and protocols; such
35+
libraries include the official Arrow implementations documented on
36+
https://arrow.apache.org, but are not limited to them.
37+
38+
2. *users* of Arrow: that is, developers of third-party libraries or applications
39+
that use some of the Arrow formats or protocols by calling into Arrow libraries
40+
as defined above.
41+
42+
43+
Columnar Format
44+
===============
45+
46+
Invalid data
47+
------------
48+
49+
The Arrow :ref:`columnar format <format_columnar>` is an efficient binary
50+
representation with a focus on performance. While the format
51+
does not store raw pointers, the contents of Arrow buffers are often
52+
combined and converted to pointers into the process' address space.
53+
Invalid Arrow data may therefore cause invalid memory accesses
54+
(potentially crashing the process) or access to non-Arrow data
55+
(potentially allowing an attacker to exfiltrate confidential information).
56+
57+
For instance, to read a value from a Binary array, you need to 1) read the
58+
values' offsets from array buffer #2, and 2) read the range of bytes
59+
delimited by these offsets in array buffer #3. If the offsets are invalid
60+
(deliberately or not), then step 2) can access memory outside of the buffers'
61+
range.
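
The offsets check described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the API of any Arrow implementation: the helper name ``validate_binary_offsets`` and its parameters are hypothetical, and it assumes 32-bit little-endian offsets as used by the Binary layout.

```python
import struct


def validate_binary_offsets(offsets_buf: bytes, data_len: int, length: int) -> None:
    """Raise ValueError if the int32 offsets cannot safely index the data buffer.

    Hypothetical helper: `length` is the number of array elements, so the
    offsets buffer must hold `length + 1` little-endian int32 values.
    """
    if length < 0 or len(offsets_buf) < (length + 1) * 4:
        raise ValueError("offsets buffer too small for array length")
    offsets = struct.unpack_from(f"<{length + 1}i", offsets_buf)
    if offsets[0] < 0:
        raise ValueError("first offset is negative")
    # Each offset must be >= the previous one, otherwise a value would have
    # a negative size.
    for prev, cur in zip(offsets, offsets[1:]):
        if cur < prev:
            raise ValueError("offsets are not monotonically non-decreasing")
    # Since offsets never decrease, checking the last one bounds all of them.
    if offsets[-1] > data_len:
        raise ValueError("last offset points past the end of the data buffer")
```

Only after these checks pass is it safe to slice the data buffer with a pair of adjacent offsets.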
62+
63+
Another instance of invalid data lies in the values themselves. For example,
64+
a String array is only allowed to contain valid UTF-8 data, but an untrusted
65+
source might have emitted invalid UTF-8 under the guise of a String array.
66+
An unsuspecting algorithm that is only specified for valid UTF-8 inputs might
67+
lead to dangerous behavior (for example by reading memory out of bounds when
68+
looking for a UTF-8 character boundary).
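
As an illustration, a strict decoder can be used to detect such values before any algorithm relies on them. The sketch below assumes the string values have already been extracted safely as ``bytes`` objects (that is, the offsets were validated first); ``find_invalid_utf8`` is a hypothetical name.

```python
from typing import List, Sequence


def find_invalid_utf8(values: Sequence[bytes]) -> List[int]:
    """Return the indices of the values that are not valid UTF-8."""
    bad = []
    for i, raw in enumerate(values):
        try:
            # Strict decoding rejects stray continuation bytes, truncated
            # sequences and other malformed input.
            raw.decode("utf-8", errors="strict")
        except UnicodeDecodeError:
            bad.append(i)
    return bad
```

A production implementation would more likely validate the whole data buffer at once for speed, but the accept or reject decision is the same.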
69+
70+
Fortunately, given its schema, Arrow data can be validated up front,
71+
so that reading this data will not pose any danger later on.
72+
73+
.. TODO:
74+
For each layout, we should list the associated security risks and the recommended
75+
steps to validate (perhaps in Columnar.rst)
76+
77+
Advice for users
78+
''''''''''''''''
79+
80+
Arrow implementations often assume inputs follow the specification to provide
81+
high speed processing. It is **extremely recommended** that your application
82+
either trusts or validates the Arrow data it receives from other sources. Many
83+
Arrow implementations provide APIs to validate Arrow data for soundness.
84+
85+
.. TODO: link to some validation APIs for the main implementations here?
86+
87+
Advice for implementors
88+
'''''''''''''''''''''''
89+
90+
It is **recommended** that you provide dedicated APIs to validate Arrow arrays
91+
and/or record batches. Users can then use those APIs to check whether
92+
data coming from untrusted sources can be safely accessed.
93+
94+
A typical validation API must return a well-defined error, not crash, if the
95+
given Arrow data is invalid; it must always be safe to execute regardless of
96+
whether the data is valid or not.
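
One possible shape for such an API is sketched below, under the assumption of a simple primitive int32 layout. The names ``ValidationResult`` and ``validate_int32_array`` are hypothetical and not taken from any Arrow library; the point is that every failure is reported as a value the caller can inspect, and no buffer is read before its size has been checked.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ValidationResult:
    ok: bool
    error: Optional[str] = None


def validate_int32_array(length, validity: Optional[bytes], values: bytes) -> ValidationResult:
    """Check buffer sizes for an int32 array without touching its contents."""
    if not isinstance(length, int) or length < 0:
        return ValidationResult(False, "length must be a non-negative integer")
    # The validity bitmap needs one bit per element, rounded up to whole bytes.
    if validity is not None and len(validity) < (length + 7) // 8:
        return ValidationResult(False, "validity bitmap too small")
    # Each int32 value occupies 4 bytes.
    if len(values) < length * 4:
        return ValidationResult(False, "values buffer too small")
    return ValidationResult(True)
```

Returning a result object rather than raising is a design choice of this sketch; raising a dedicated exception type would satisfy the same requirement.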
97+
98+
Uninitialized data
99+
------------------
100+
101+
A less obvious pitfall is when some parts of an Arrow array are left uninitialized.
102+
For example, if an element of a primitive Arrow array is marked null through its
103+
validity bitmap, the corresponding value slot in the values buffer can be ignored
104+
for all purposes. It is therefore tempting, when creating an array with null
105+
values, to not initialize the corresponding value slots.
106+
107+
However, this then introduces a serious security risk if the Arrow data is
108+
serialized and published (e.g. using IPC or Flight) such that it can be
109+
accessed by untrusted users. Indeed, the uninitialized value slot can
110+
reveal data left by a previous memory allocation made in the same process.
111+
Depending on the application, this data could contain confidential information.
112+
113+
Advice for users and implementors
114+
'''''''''''''''''''''''''''''''''
115+
116+
When creating an Arrow array, it is **recommended** that you never leave any
117+
data uninitialized in a buffer if the array might be sent to, or read by, an
118+
untrusted third party, even when the uninitialized data is logically
119+
irrelevant. The easiest way to do this is to zero-initialize any buffer that
120+
will not be populated in full.
121+
122+
If it is determined, through benchmarking, that zero-initialization imposes
123+
an excessive performance cost, a library or application may instead decide
124+
to use uninitialized memory internally as an optimization; but it should then
125+
ensure all such uninitialized values are cleared before passing the Arrow data
126+
to another system.
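
The clearing step could look like the following sketch for a fixed-width primitive array. ``clear_null_slots`` is a hypothetical helper; it assumes the validity bitmap uses least-significant-bit numbering, as in the columnar format, and that ``width`` is the byte width of one value.

```python
def clear_null_slots(values: bytearray, validity: bytes, length: int, width: int) -> None:
    """Overwrite the value slots of null elements with zero bytes, in place."""
    if length < 0 or width <= 0:
        raise ValueError("invalid length or width")
    if len(values) < length * width or len(validity) < (length + 7) // 8:
        raise ValueError("buffers too small for the given length")
    for i in range(length):
        # Bit i of the bitmap is 1 when element i is valid (non-null).
        is_valid = (validity[i // 8] >> (i % 8)) & 1
        if not is_valid:
            values[i * width:(i + 1) * width] = bytes(width)
```

Padding bytes past the last element are not handled here; a complete implementation would clear those as well.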
127+
128+
.. note::
129+
Sending Arrow data out of the current process can happen *indirectly*,
130+
for example if you produce it over the C Data Interface and the consumer
131+
persists it using the IPC format on some public storage.
132+
133+
134+
C Data Interface
135+
================
136+
137+
The C Data Interface contains raw pointers into the process' address space.
138+
It is generally not possible to validate that those pointers are legitimate;
139+
reading from such a pointer may crash the process or access unrelated or bogus data.
140+
141+
Advice for users
142+
''''''''''''''''
143+
144+
You should **never** consume a C Data Interface structure from an untrusted
145+
producer, as it is by construction impossible to guard against dangerous
146+
behavior in this case.
147+
148+
Advice for implementors
149+
'''''''''''''''''''''''
150+
151+
When consuming a C Data Interface structure, you can assume that it comes from
152+
a trusted producer, for the reason explained above. However, it is still
153+
**recommended** that you cursorily validate it for soundness (for example that
154+
the right number of buffers is passed for a given datatype), as a trusted producer
155+
can have bugs anyway.
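
The buffer-count check mentioned above might be sketched as follows. The table is deliberately partial and only covers a handful of C Data Interface format strings; ``EXPECTED_N_BUFFERS`` and ``check_n_buffers`` are hypothetical names, and a real consumer would cover every supported type and also check child and dictionary arrays.

```python
# Partial, illustrative mapping from C Data Interface format strings to the
# number of buffers expected for the corresponding layout.
EXPECTED_N_BUFFERS = {
    "n": 0,   # null: no buffers
    "b": 2,   # boolean: validity, data
    "i": 2,   # int32: validity, data
    "g": 2,   # float64: validity, data
    "z": 3,   # binary: validity, offsets, data
    "u": 3,   # utf-8 string: validity, offsets, data
    "+l": 2,  # list: validity, offsets
    "+s": 1,  # struct: validity
}


def check_n_buffers(format_string: str, n_buffers: int) -> None:
    """Raise ValueError if n_buffers does not match the declared format."""
    expected = EXPECTED_N_BUFFERS.get(format_string)
    if expected is None:
        raise ValueError(f"format string not covered by this sketch: {format_string!r}")
    if n_buffers != expected:
        raise ValueError(
            f"format {format_string!r} expects {expected} buffers, got {n_buffers}"
        )
```

Such a check catches producer bugs early; it does not make an untrusted producer safe, since the buffer pointers themselves still cannot be verified.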
156+
157+
158+
IPC Format
159+
==========
160+
161+
The :ref:`IPC format <ipc-message-format>` is a serialization format for the
162+
columnar format with associated metadata. Reading an IPC stream or file from
163+
an untrusted source comes with similar caveats as reading the Arrow columnar
164+
format.
165+
166+
The additional signaling and metadata in the IPC format come with
167+
their own risks. For example, buffer offsets and sizes encoded in IPC messages
168+
may be out of bounds for the IPC stream; Flatbuffers-encoded metadata payloads
169+
may carry incorrect offsets pointing outside of the designated metadata area.
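
For example, a decoder could check each buffer's declared offset and size against the size of the message body before slicing it. The following is a minimal sketch; ``check_buffer_in_body`` is a hypothetical name and the arguments stand for values read from the IPC metadata.

```python
def check_buffer_in_body(offset: int, length: int, body_size: int) -> None:
    """Raise ValueError unless [offset, offset + length) lies within the body."""
    if offset < 0 or length < 0:
        raise ValueError("negative buffer offset or length")
    # Compare against `body_size - offset` instead of computing
    # `offset + length`: in languages with fixed-width integers the sum
    # could overflow and wrongly pass the check.
    if offset > body_size or length > body_size - offset:
        raise ValueError("buffer extends past the end of the message body")
```

Python integers do not overflow, but the comparison is written in the overflow-safe form so that the sketch translates directly to C, C++ or Rust.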
170+
171+
Advice for users
172+
''''''''''''''''
173+
174+
The Arrow library you choose should be safe against an invalid IPC stream. However,
175+
if the IPC stream is structurally valid, the library will typically not validate the Arrow
176+
data itself. Therefore, as with the columnar format, it is **extremely recommended**
177+
that you validate the Arrow data read from an untrusted IPC stream.
178+
179+
Advice for implementors
180+
'''''''''''''''''''''''
181+
182+
It is **extremely recommended** to run dedicated validation checks when decoding
183+
the IPC format, to make sure that the decoding cannot induce unwanted behavior.
184+
A failed check should return a well-defined error to the caller, not crash.
185+
186+
187+
Extension Types
188+
===============
189+
190+
Extension types typically register a custom deserialization hook so that they
191+
can be automatically recreated when reading from an external source (for example
192+
using IPC). The deserialization hook has to decode the extension type's parameters
193+
from a string or binary payload specific to the extension type.
194+
:ref:`Typical examples <opaque_extension>` use a bespoke JSON representation
195+
with object fields representing the various parameters.
196+
197+
When reading data from an untrusted source, any registered deserialization hook
198+
could be called with an arbitrary payload. It is therefore of primary importance
199+
that the hook be safe to call on invalid, potentially malicious, data. This mandates
200+
the use of a robust metadata serialization schema (such as JSON, but not Python's
201+
`pickle <https://docs.python.org/3/library/pickle.html>`__, for example).
202+
203+
Advice for users and implementors
204+
'''''''''''''''''''''''''''''''''
205+
206+
When designing an extension type, it is **extremely recommended** to choose a
207+
metadata serialization format that is robust against potentially malicious
208+
data.
209+
210+
When implementing an extension type, it is **recommended** to ensure that the
211+
deserialization hook is able to detect, and error out, if the serialized metadata
212+
payload is invalid.
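
A defensive deserialization hook might look like the sketch below. It assumes a JSON payload with two string fields, ``type_name`` and ``vendor_name``, chosen here to resemble the parameters of the Opaque extension type; the function name ``parse_opaque_metadata`` and the size cap are hypothetical.

```python
import json

MAX_METADATA_BYTES = 4096  # arbitrary cap chosen for this sketch


def parse_opaque_metadata(payload: bytes) -> dict:
    """Decode extension metadata, raising ValueError on any invalid input."""
    if len(payload) > MAX_METADATA_BYTES:
        raise ValueError("metadata payload too large")
    try:
        decoded = json.loads(payload.decode("utf-8"))
    except (ValueError, RecursionError) as exc:
        # ValueError covers both invalid UTF-8 and invalid JSON;
        # RecursionError covers pathologically nested documents.
        raise ValueError("metadata is not valid UTF-8 encoded JSON") from exc
    if not isinstance(decoded, dict):
        raise ValueError("metadata must be a JSON object")
    params = {}
    for key in ("type_name", "vendor_name"):
        value = decoded.get(key)
        if not isinstance(value, str):
            raise ValueError(f"missing or non-string field: {key}")
        params[key] = value
    return params
```

Only the expected fields are copied into the result, so unexpected keys in the payload cannot influence how the type is reconstructed.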
213+
214+
215+
Testing for robustness
216+
======================
217+
218+
Advice for implementors
219+
'''''''''''''''''''''''
220+
221+
For APIs that may process untrusted inputs, it is **extremely recommended**
222+
that your unit tests exercise your APIs against typical kinds of invalid data.
223+
For example, your validation APIs will have to be tested against invalid Binary
224+
or List offsets, invalid UTF-8 data in a String array, etc.
225+
226+
It is **recommended** that you go one step further and set up some kind of
227+
automated robustness testing against unforeseen inputs. One typical approach
228+
is through fuzzing, possibly coupled with a runtime instrumentation framework
229+
that detects dangerous behavior (such as Address Sanitizer in C++ or
230+
Rust).
231+
232+
A reasonable way of setting up fuzzing for Arrow is using the IPC format as
233+
a binary payload; the fuzz target should not only attempt to decode the IPC
234+
stream as Arrow data, but it should then validate the Arrow data.
235+
This will strengthen both the IPC decoder and the validation routines
236+
against invalid, potentially malicious data. Finally, if validation comes out
237+
successfully, the fuzz target may exercise some important core functionality,
238+
such as printing the data for human display; this will help ensure that the
239+
validation routine did not let through invalid data that may lead to dangerous
240+
behavior.
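
The decode, validate, then exercise structure of such a fuzz target can be illustrated with a toy example. The sketch below does not use the Arrow IPC format: ``decode_toy_stream`` parses a made-up length-prefixed format and ``validate_records`` stands in for Arrow data validation. A real setup would also use a coverage-guided fuzzer rather than the plain random generator shown here.

```python
import random
import struct


class DecodeError(Exception):
    """The only failure mode the toy decoder is allowed to have."""


def decode_toy_stream(payload: bytes) -> list:
    """Decode a toy format: repeated (uint32 little-endian length, bytes) records."""
    records, pos = [], 0
    while pos < len(payload):
        if len(payload) - pos < 4:
            raise DecodeError("truncated length prefix")
        (size,) = struct.unpack_from("<I", payload, pos)
        pos += 4
        if size > len(payload) - pos:
            raise DecodeError("record extends past end of payload")
        records.append(payload[pos:pos + size])
        pos += size
    return records


def validate_records(records) -> bool:
    """Stand-in for data validation: every record must be valid UTF-8."""
    try:
        for record in records:
            record.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def fuzz_one(payload: bytes) -> None:
    """Fuzz target: any exception other than DecodeError is a finding."""
    try:
        records = decode_toy_stream(payload)
    except DecodeError:
        return  # invalid input rejected cleanly
    if validate_records(records):
        # Exercise core functionality on data that passed validation.
        repr([record.decode("utf-8") for record in records])


def run_fuzz(iterations: int = 1000, seed: int = 0) -> None:
    rng = random.Random(seed)
    for _ in range(iterations):
        size = rng.randrange(0, 64)
        fuzz_one(bytes(rng.randrange(256) for _ in range(size)))
```

If the validation routine wrongly accepted a record, the final decoding step would raise inside ``fuzz_one`` and surface the gap, which is the purpose of exercising functionality after validation.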
241+
242+
243+
Non-Arrow formats and protocols
244+
===============================
245+
246+
Arrow data can also be sent or stored using third-party formats such as Apache
247+
Parquet. Those formats may or may not present the same security risks as listed
248+
above (for example, the precautions around uninitialized data may not apply
249+
in a format like Parquet that does not create any value slots for null elements).
250+
We suggest you refer to these projects' own documentation for more concrete
251+
guidelines.

docs/source/format/index.rst

Lines changed: 1 addition & 0 deletions
@@ -37,5 +37,6 @@ Specifications
3737
Flight
3838
FlightSql
3939
ADBC
40+
Security
4041
Integration
4142
Glossary
