|
| 1 | +.. Licensed to the Apache Software Foundation (ASF) under one |
| 2 | +.. or more contributor license agreements. See the NOTICE file |
| 3 | +.. distributed with this work for additional information |
| 4 | +.. regarding copyright ownership. The ASF licenses this file |
| 5 | +.. to you under the Apache License, Version 2.0 (the |
| 6 | +.. "License"); you may not use this file except in compliance |
| 7 | +.. with the License. You may obtain a copy of the License at |
| 8 | +
|
| 9 | +.. http://www.apache.org/licenses/LICENSE-2.0 |
| 10 | +
|
| 11 | +.. Unless required by applicable law or agreed to in writing, |
| 12 | +.. software distributed under the License is distributed on an |
| 13 | +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| 14 | +.. KIND, either express or implied. See the License for the |
| 15 | +.. specific language governing permissions and limitations |
| 16 | +.. under the License. |
| 17 | +
|
| 18 | +.. _format_security: |
| 19 | + |
| 20 | +*********************** |
| 21 | +Security Considerations |
| 22 | +*********************** |
| 23 | + |
| 24 | +This document describes security considerations when reading Arrow |
| 25 | +data from untrusted sources. It focuses specifically on data passed in a |
| 26 | +standardized serialized form (such as an IPC stream), as opposed to an |
| 27 | +implementation-specific native representation (such as ``arrow::Array`` in C++). |
| 28 | + |
| 29 | +.. note:: |
| 30 | + Implementation-specific concerns, such as bad API usage, are out of scope |
| 31 | + for this document. Please refer to the implementation's own documentation. |
| 32 | + |
| 33 | + |
| 34 | +Who should read this |
| 35 | +==================== |
| 36 | + |
| 37 | +You should read this document if you belong to either of these two categories: |
| 38 | + |
| 39 | +1. *users* of Arrow: that is, developers of third-party libraries or applications |
| 40 | + that don't directly implement the Arrow formats or protocols, but instead |
| 41 | + call language-specific APIs provided by an Arrow library (as defined below); |
| 42 | + |
| 43 | +2. *implementors* of Arrow libraries: that is, developers of libraries that provide APIs |
| 44 | + abstracting away from the details of the Arrow formats and protocols; such |
| 45 | + libraries include, but are not limited to, the official Arrow implementations |
| 46 | + documented on https://arrow.apache.org. |
| 47 | + |
| 48 | + |
| 49 | +Columnar Format |
| 50 | +=============== |
| 51 | + |
| 52 | +Invalid data |
| 53 | +------------ |
| 54 | + |
| 55 | +The Arrow :ref:`columnar format <format_columnar>` is a binary |
| 56 | +representation with a focus on performance and efficiency. While the format |
| 57 | +does not store raw pointers, the contents of Arrow buffers are often |
| 58 | +combined and converted to pointers into the process' address space. |
| 59 | +Invalid Arrow data may therefore cause invalid memory accesses |
| 60 | +(potentially crashing the process) or access to non-Arrow data |
| 61 | +(potentially allowing an attacker to exfiltrate confidential information). |
| 62 | + |
| 63 | +For instance, to read a value from a Binary array, one needs to 1) read the |
| 64 | +values' offsets from the array's offsets buffer, and 2) read the range of bytes |
| 65 | +delimited by these offsets in the array's data buffer. If the offsets are |
| 66 | +invalid (deliberately or not), then step 2) can access memory outside of the |
| 67 | +data buffer's range. |
| 68 | + |
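The offsets check described above can be sketched as follows. This is a minimal illustration over plain Python sequences, not the validation routine of any Arrow library; the function names ``validate_binary_offsets`` and ``read_binary_value`` are hypothetical, and the sketch assumes an offsets buffer holding ``length + 1`` entries.

```python
def validate_binary_offsets(offsets, data_length):
    """Raise ValueError unless `offsets` can be safely used to slice a
    data buffer of `data_length` bytes (hypothetical helper)."""
    if len(offsets) == 0:
        # This sketch expects length + 1 offsets, hence at least one entry.
        raise ValueError("offsets buffer must contain at least one entry")
    if offsets[0] < 0:
        raise ValueError("first offset is negative")
    # Each offset must be greater than or equal to the previous one.
    for previous, current in zip(offsets, offsets[1:]):
        if current < previous:
            raise ValueError("offsets are not monotonically non-decreasing")
    # With monotonic offsets, only the last one needs a bounds check.
    if offsets[-1] > data_length:
        raise ValueError("last offset points past the end of the data buffer")


def read_binary_value(offsets, data, i):
    """Read value `i`; only safe once the offsets have been validated."""
    return bytes(data[offsets[i]:offsets[i + 1]])
```

Once the offsets have passed this check, every later read stays inside the data buffer, so the cost of validation is paid once rather than on each access.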
| 69 | +Another instance of invalid data lies in the values themselves. For example, |
| 70 | +a String array is only allowed to contain valid UTF-8 data, but an untrusted |
| 71 | +source might have emitted invalid UTF-8 under the guise of a String array. |
| 72 | +An unsuspecting algorithm that is only specified for valid UTF-8 inputs might |
| 73 | +then exhibit dangerous behavior (for example by reading memory out of bounds when |
| 74 | +looking for a UTF-8 character boundary). |
| 75 | + |
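As a rough sketch of value-level validation, the function below checks that every value of a String array decodes as UTF-8. The name ``validate_utf8_values`` is hypothetical, and the sketch assumes the offsets have already been checked against the size of the data buffer.

```python
def validate_utf8_values(offsets, data):
    """Raise ValueError if any value delimited by `offsets` is not valid
    UTF-8. Assumes `offsets` were already validated (hypothetical helper)."""
    for i in range(len(offsets) - 1):
        chunk = bytes(data[offsets[i]:offsets[i + 1]])
        try:
            # Strict decoding rejects truncated or malformed sequences.
            chunk.decode("utf-8")
        except UnicodeDecodeError as exc:
            raise ValueError(f"value {i} is not valid UTF-8") from exc
```

Checking each value separately also rejects a multi-byte character that has been split across two adjacent values.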
| 76 | +Fortunately, given its schema, Arrow data can be validated up front, |
| 77 | +so that reading this data will not pose any danger later on. |
| 78 | + |
| 79 | +.. TODO: |
| 80 | + For each layout, we should list the associated security risks and the recommended |
| 81 | + steps to validate (perhaps in Columnar.rst) |
| 82 | +
|
| 83 | +Advice for users |
| 84 | +'''''''''''''''' |
| 85 | + |
| 86 | +Arrow implementations often assume that inputs follow the specification, so as |
| 87 | +to provide high-speed processing. It is **extremely recommended** that your application |
| 88 | +explicitly validates any Arrow data it receives in serialized form |
| 89 | +from untrusted sources. Many Arrow implementations provide explicit APIs to |
| 90 | +perform such validation. |
| 91 | + |
| 92 | +.. TODO: link to some validation APIs for the main implementations here? |
| 93 | +
|
| 94 | +Advice for implementors |
| 95 | +''''''''''''''''''''''' |
| 96 | + |
| 97 | +It is **recommended** that you provide dedicated APIs to validate Arrow arrays |
| 98 | +and/or record batches. Users will be able to use those APIs to check whether |
| 99 | +data coming from untrusted sources can be safely accessed. |
| 100 | + |
| 101 | +A typical validation API must return a well-defined error, not crash, if the |
| 102 | +given Arrow data is invalid; it must always be safe to execute regardless of |
| 103 | +whether the data is valid or not. |
| 104 | + |
| 105 | +Uninitialized data |
| 106 | +------------------ |
| 107 | + |
| 108 | +A less obvious pitfall is when some parts of an Arrow array are left uninitialized. |
| 109 | +For example, if an element of a primitive Arrow array is marked null through its |
| 110 | +validity bitmap, the corresponding value slot in the values buffer can be ignored |
| 111 | +for all purposes. It is therefore tempting, when creating an array with null |
| 112 | +values, to not initialize the corresponding value slots. |
| 113 | + |
| 114 | +However, this then introduces a serious security risk if the Arrow data is |
| 115 | +serialized and published (e.g. using IPC or Flight) such that it can be |
| 116 | +accessed by untrusted users. Indeed, the uninitialized value slot can |
| 117 | +reveal data left by a previous memory allocation made in the same process. |
| 118 | +Depending on the application, this data could contain confidential information. |
| 119 | + |
| 120 | +Advice for users and implementors |
| 121 | +''''''''''''''''''''''''''''''''' |
| 122 | + |
| 123 | +When creating an Arrow array, it is **recommended** that you never leave any |
| 124 | +data uninitialized in a buffer if the array might be sent to, or read by, an |
| 125 | +untrusted third party, even when the uninitialized data is logically |
| 126 | +irrelevant. The easiest way to do this is to zero-initialize any buffer that |
| 127 | +will not be populated in full. |
| 128 | + |
| 129 | +If it is determined, through benchmarking, that zero-initialization imposes |
| 130 | +an excessive performance cost, a library or application may instead decide |
| 131 | +to use uninitialized memory internally as an optimization; but it should then |
| 132 | +ensure all such uninitialized values are cleared before passing the Arrow data |
| 133 | +to another system. |
| 134 | + |
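One way to clear such values before the data leaves the process is sketched below for a fixed-width primitive array. The helper name ``clear_null_slots`` is hypothetical; the sketch assumes a mutable values buffer (a ``bytearray``) and the least-significant-bit-first validity bitmap used by the columnar format.

```python
import struct


def clear_null_slots(validity, values, length, width):
    """Zero, in place, the value slot of every null element.

    `validity` is the validity bitmap, `values` a mutable buffer of
    `length` slots of `width` bytes each (hypothetical helper).
    """
    for i in range(length):
        # Bit i of the bitmap lives in byte i // 8, at bit position i % 8.
        is_valid = (validity[i // 8] >> (i % 8)) & 1
        if not is_valid:
            values[i * width:(i + 1) * width] = bytes(width)


# Three int32 slots; the middle one is null but still holds stale bytes.
values = bytearray(struct.pack("<3i", 1, 123456, 3))
clear_null_slots(bytes([0b00000101]), values, length=3, width=4)
```

This sketch only covers value slots of null elements; any other region of a buffer that was never written would need the same treatment.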
| 135 | +.. note:: |
| 136 | + Sending Arrow data out of the current process can happen *indirectly*, |
| 137 | + for example if you produce it over the C Data Interface and the consumer |
| 138 | + persists it using the IPC format on some public storage. |
| 139 | + |
| 140 | + |
| 141 | +C Data Interface |
| 142 | +================ |
| 143 | + |
| 144 | +The C Data Interface contains raw pointers into the process' address space. |
| 145 | +It is generally not possible to validate that those pointers are legitimate; |
| 146 | +reading from such a pointer may crash or access unrelated or bogus data. |
| 147 | + |
| 148 | +Advice for users |
| 149 | +---------------- |
| 150 | + |
| 151 | +You should **never** consume a C Data Interface structure from an untrusted |
| 152 | +producer, as it is by construction impossible to guard against dangerous |
| 153 | +behavior in this case. |
| 154 | + |
| 155 | +Advice for implementors |
| 156 | +----------------------- |
| 157 | + |
| 158 | +When consuming a C Data Interface structure, you can assume that it comes from |
| 159 | +a trusted producer, for the reason explained above. However, it is still |
| 160 | +**recommended** that you validate it for soundness (for example that the right |
| 161 | +number of buffers is passed for a given datatype), as a trusted producer can |
| 162 | +have bugs anyway. |
| 163 | + |
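The buffer-count soundness check mentioned above might look like the following sketch. It covers only a handful of C Data Interface format strings, and both the table ``EXPECTED_BUFFER_COUNTS`` and the function ``check_buffer_count`` are hypothetical names introduced for illustration.

```python
# Expected number of buffers for a few C Data Interface format strings.
EXPECTED_BUFFER_COUNTS = {
    "n": 0,   # null: no buffers
    "i": 2,   # int32: validity bitmap, values
    "u": 3,   # utf-8 string: validity bitmap, offsets, data
    "z": 3,   # binary: validity bitmap, offsets, data
    "+l": 2,  # list: validity bitmap, offsets
    "+s": 1,  # struct: validity bitmap
}


def check_buffer_count(format_string, n_buffers):
    """Raise ValueError if `n_buffers` does not match the given format."""
    expected = EXPECTED_BUFFER_COUNTS.get(format_string)
    if expected is None:
        raise ValueError(f"format {format_string!r} not covered by this sketch")
    if n_buffers != expected:
        raise ValueError(
            f"format {format_string!r} expects {expected} buffers, got {n_buffers}"
        )
```

Such a check cannot prove that the pointers themselves are legitimate; it only catches structural mistakes made by an otherwise trusted producer.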
| 164 | + |
| 165 | +IPC Format |
| 166 | +========== |
| 167 | + |
| 168 | +The :ref:`IPC format <ipc-message-format>` is a serialization format for the |
| 169 | +columnar format with associated metadata. Reading an IPC stream or file from |
| 170 | +an untrusted source comes with caveats similar to those of reading the Arrow columnar |
| 171 | +format. |
| 172 | + |
| 173 | +The additional signaling and metadata in the IPC format come with |
| 174 | +their own risks. For example, buffer offsets and sizes encoded in IPC messages |
| 175 | +may be out of bounds for the IPC stream; Flatbuffers-encoded metadata payloads |
| 176 | +may carry incorrect offsets pointing outside of the designated metadata area. |
| 177 | + |
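A bounds check for a buffer described by an IPC message could be sketched as below. The function name ``check_buffer_in_bounds`` is hypothetical, and the sketch assumes that the offset and length are interpreted relative to a message body of ``body_size`` bytes.

```python
def check_buffer_in_bounds(offset, length, body_size):
    """Raise ValueError unless [offset, offset + length) lies inside a
    message body of `body_size` bytes (hypothetical helper)."""
    if offset < 0 or length < 0:
        raise ValueError("negative buffer offset or length")
    # Comparing against `body_size - offset` avoids computing
    # `offset + length`, which could overflow with fixed-width integers.
    if offset > body_size or length > body_size - offset:
        raise ValueError("buffer extends past the end of the message body")
```

Python integers do not overflow, but writing the comparison this way keeps the logic correct if it is ported to a language with 64-bit integer arithmetic.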
| 178 | +Advice for users |
| 179 | +---------------- |
| 180 | + |
| 181 | +Arrow libraries will typically ensure IPC streams are structurally valid |
| 182 | +but may not validate the underlying Array data. It is **extremely recommended** |
| 183 | +that you use the appropriate APIs to validate the Arrow data read from an untrusted IPC stream. |
| 184 | + |
| 185 | +Advice for implementors |
| 186 | +----------------------- |
| 187 | + |
| 188 | +It is **extremely recommended** to run dedicated validation checks when decoding |
| 189 | +the IPC format, to make sure that the decoding cannot induce unwanted behavior. |
| 190 | +Failing those checks should return a well-defined error to the caller, not crash. |
| 191 | + |
| 192 | + |
| 193 | +Extension Types |
| 194 | +=============== |
| 195 | + |
| 196 | +Extension types typically register a custom deserialization hook so that they |
| 197 | +can be automatically recreated when reading from an external source (for example |
| 198 | +using IPC). The deserialization hook has to decode the extension type's parameters |
| 199 | +from a string or binary payload specific to the extension type. |
| 200 | +:ref:`Typical examples <opaque_extension>` use a bespoke JSON representation |
| 201 | +with object fields representing the various parameters. |
| 202 | + |
| 203 | +When reading data from an untrusted source, any registered deserialization hook |
| 204 | +could be called with an arbitrary payload. It is therefore of primary importance |
| 205 | +that the hook be safe to call on invalid, potentially malicious, data. This mandates |
| 206 | +the use of a robust metadata serialization format (such as JSON, but not Python's |
| 207 | +`pickle <https://docs.python.org/3/library/pickle.html>`__ or R's |
| 208 | +`serialize() <https://stat.ethz.ch/R-manual/R-devel/library/base/html/serialize.html>`__, |
| 209 | +for example). |
| 210 | + |
| 211 | +Advice for users and implementors |
| 212 | +--------------------------------- |
| 213 | + |
| 214 | +When designing an extension type, it is **extremely recommended** to choose a |
| 215 | +metadata serialization format that is robust against potentially malicious |
| 216 | +data. |
| 217 | + |
| 218 | +When implementing an extension type, it is **recommended** to ensure that the |
| 219 | +deserialization hook is able to detect, and error out gracefully, if the |
| 220 | +serialized metadata payload is invalid. |
| 221 | + |
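A defensive deserialization hook might be sketched as follows. The extension type, its single ``unit`` parameter, the ``ALLOWED_UNITS`` set and the function ``deserialize_params`` are all hypothetical and only serve to illustrate the checks; a real hook would plug into the registration mechanism of the Arrow library in use.

```python
import json

# Hypothetical set of accepted values for the hypothetical "unit" parameter.
ALLOWED_UNITS = {"s", "ms", "us", "ns"}


def deserialize_params(payload: bytes) -> dict:
    """Decode extension type parameters, raising ValueError on bad input."""
    try:
        params = json.loads(payload.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError, RecursionError) as exc:
        # Malformed bytes, malformed JSON, or excessively nested JSON.
        raise ValueError("extension metadata is not valid JSON") from exc
    if not isinstance(params, dict):
        raise ValueError("extension metadata must be a JSON object")
    unit = params.get("unit")
    if not isinstance(unit, str) or unit not in ALLOWED_UNITS:
        raise ValueError("missing or unsupported 'unit' parameter")
    # Return only the fields that were checked, ignoring anything else.
    return {"unit": unit}
```

The hook never evaluates the payload as code and reports every failure through a single error type, which makes it easy for callers to handle.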
| 222 | + |
| 223 | +Testing for robustness |
| 224 | +====================== |
| 225 | + |
| 226 | +Advice for implementors |
| 227 | +----------------------- |
| 228 | + |
| 229 | +For APIs that may process untrusted inputs, it is **extremely recommended** |
| 230 | +that your unit tests exercise your APIs against typical kinds of invalid data. |
| 231 | +For example, your validation APIs will have to be tested against invalid Binary |
| 232 | +or List offsets, invalid UTF-8 data in a String array, etc. |
| 233 | + |
| 234 | +Testing against known regression files |
| 235 | +'''''''''''''''''''''''''''''''''''''' |
| 236 | + |
| 237 | +The `arrow-testing <https://github.com/apache/arrow-testing/>`__ repository |
| 238 | +contains regression files for various formats, such as the IPC format. |
| 239 | + |
| 240 | +Two categories of files are especially noteworthy and can serve to exercise |
| 241 | +an Arrow implementation's robustness: |
| 242 | + |
| 243 | +1. :ref:`gold integration files <format-gold-integration-files>` that are valid |
| 244 | + files to exercise compliance with Arrow IPC features; |
| 245 | +2. :ref:`fuzz regression files <fuzz-regression-files>` that have been automatically |
| 246 | + generated each time a fuzzer finds a bug triggered by a specific (usually invalid) |
| 247 | + input for a given format. |
| 248 | + |
| 249 | +Fuzzing |
| 250 | +''''''' |
| 251 | + |
| 252 | +It is **recommended** that you go one step further and set up some kind of |
| 253 | +automated robustness testing against unforeseen inputs. One typical approach |
| 254 | +is through fuzzing, possibly coupled with a runtime instrumentation framework |
| 255 | +that detects dangerous behavior (such as Address Sanitizer in C++ or |
| 256 | +Rust). |
| 257 | + |
| 258 | +A reasonable way of setting up fuzzing for Arrow is using the IPC format as |
| 259 | +a binary payload; the fuzz target should not only attempt to decode the IPC |
| 260 | +stream as Arrow data, but also validate the decoded Arrow data. |
| 261 | +This will strengthen both the IPC decoder and the validation routines |
| 262 | +against invalid, potentially malicious data. Finally, if validation succeeds, |
| 263 | +the fuzz target may exercise some important core functionality, |
| 264 | +such as printing the data for human display; this will help ensure that the |
| 265 | +validation routine did not let through invalid data that may lead to dangerous |
| 266 | +behavior. |
| 267 | + |
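The three stages of such a fuzz target (decode, validate, exercise) can be outlined as below. This is only a structural sketch: ``decode``, ``validate`` and ``exercise`` are hypothetical callables standing in for the IPC reader, the validation API and some core functionality of the implementation under test, and the sketch assumes that both of the first two report invalid input by raising ``ValueError``.

```python
def fuzz_one_input(payload, decode, validate, exercise):
    """Run one fuzzing iteration on `payload` (hypothetical harness).

    Returns True if the payload was accepted and exercised, False if it
    was rejected with the expected error type.
    """
    try:
        data = decode(payload)   # stage 1: decode the IPC payload
        validate(data)           # stage 2: validate the decoded data
    except ValueError:
        # Rejecting invalid input with a well-defined error is the
        # expected outcome for most fuzzer-generated payloads.
        return False
    # Stage 3: data that passed validation must be safe to use. Any crash
    # or unexpected exception here points to a gap in validation.
    exercise(data)
    return True
```

Any exception other than the expected rejection is deliberately left to propagate, so that the fuzzing engine records the offending payload.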
| 268 | + |
| 269 | +Non-Arrow formats and protocols |
| 270 | +=============================== |
| 271 | + |
| 272 | +Arrow data can also be sent or stored using third-party formats such as Apache |
| 273 | +Parquet. Those formats may or may not present the same security risks as listed |
| 274 | +above (for example, the precautions around uninitialized data may not apply |
| 275 | +in a format like Parquet that does not create any value slots for null elements). |
| 276 | +We suggest you refer to these projects' own documentation for more concrete |
| 277 | +guidelines. |