GH-48868: [Doc] Document security model for the Arrow formats (#48870)

pitrou · web-flow · commit f39f2758df9a · 2026-02-05T16:39:04.000+01:00
### Rationale for this change Accessing Arrow data or any of the formats can have non-trivial security implications, this is an attempt at documenting those. ### What changes are included in this PR? Add a Security Considerations page in the Format section. **Doc preview:** https://s3.amazonaws.com/arrow-data/pr_docs/48870/format/Security.html ### Are these changes tested? N/A ### Are there any user-facing changes? No. * GitHub Issue: #48868 Authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Antoine Pitrou <antoine@python.org>
diff --git a/docs/source/developers/cpp/fuzzing.rst b/docs/source/developers/cpp/fuzzing.rst
@@ -26,10 +26,10 @@ Fuzzing Arrow C++
 To make the handling of invalid input more robust, we have enabled
 fuzz testing on several parts of the Arrow C++ feature set, currently:
 
-* the IPC stream format
-* the IPC file format
-* the Parquet file format
-* the CSV file format
+* the IPC stream reader
+* the IPC file reader
+* the Parquet file reader
+* the CSV file reader
 
 We welcome any contribution to expand the scope of fuzz testing and cover
 areas ingesting potentially invalid or malicious data.
@@ -110,3 +110,23 @@ dependencies, you may need to install these before building the fuzz targets:
 
    $ conda install clang clangxx compiler-rt
    $ cmake .. --preset=fuzzing
+
+
+.. _fuzz-regression-files:
+
+Regression files
+================
+
+When a fuzzer-detected bug is found and fixed, the corresponding reproducer
+must be stored in the `arrow-testing <https://github.com/apache/arrow-testing/>`__
+repository to ensure that the code doesn't regress.
+
+The locations for these files are as follows:
+
+* IPC streams: in the ``data/arrow-ipc-stream`` `directory <https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc-stream>`__
+* IPC files: in the ``data/arrow-ipc-file`` `directory <https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc-file>`__
+* Parquet files: in the ``data/parquet/fuzzing`` `directory <https://github.com/apache/arrow-testing/tree/master/data/parquet/fuzzing>`__
+* CSV files: in the ``data/csv/fuzzing`` `directory <https://github.com/apache/arrow-testing/tree/master/data/csv/fuzzing>`__
+
+Most of those files are invalid files for their respective formats and stress
+proper error detection and reporting in the implementation code.
diff --git a/docs/source/format/CanonicalExtensions.rst b/docs/source/format/CanonicalExtensions.rst
@@ -285,6 +285,8 @@ UUID
    A specific UUID version is not required or guaranteed. This extension represents
    UUIDs as FixedSizeBinary(16) with big-endian notation and does not interpret the bytes in any way.
 
+.. _opaque_extension:
+
 Opaque
 ======
 
diff --git a/docs/source/format/Integration.rst b/docs/source/format/Integration.rst
@@ -561,6 +561,8 @@ in ``datagen.py``):
 * Extension Types
 
 
+.. _format-gold-integration-files:
+
 Gold File Integration Tests
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
diff --git a/docs/source/format/Security.rst b/docs/source/format/Security.rst
@@ -0,0 +1,277 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. _format_security:
+
+***********************
+Security Considerations
+***********************
+
+This document describes security considerations when reading Arrow
+data from untrusted sources. It focuses specifically on data passed in a
+standardized serialized form (such as a IPC stream), as opposed to an
+implementation-specific native representation (such as ``arrow::Array`` in C++).
+
+.. note::
+   Implementation-specific concerns, such as bad API usage, are out of scope
+   for this document. Please refer to the implementation's own documentation.
+
+
+Who should read this
+====================
+
+You should read this document if you belong to either of these two categories:
+
+1. *users* of Arrow: that is, developers of third-party libraries or applications
+   that don't directly implement the Arrow formats or protocols, but instead
+   call language-specific APIs provided by an Arrow library (as defined below);
+
+2. *implementors* of Arrow libraries: that is, libraries that provide APIs
+   abstracting away from the details of the Arrow formats and protocols; such
+   libraries include, but are not limited to, the official Arrow implementations
+   documented on https://arrow.apache.org.
+
+
+Columnar Format
+===============
+
+Invalid data
+------------
+
+The Arrow :ref:`columnar format <_format_columnar>` is an efficient binary
+representation with a focus on performance and efficiency. While the format
+does not store raw pointers, the contents of Arrow buffers are often
+combined and converted to pointers into the process' address space.
+Invalid Arrow data may therefore cause invalid memory accesses
+(potentially crashing the process) or access to non-Arrow data
+(potentially allowing an attacker to exfiltrate confidential information).
+
+For instance, to read a value from a Binary array, one needs to 1) read the
+values' offsets from the array's offsets buffer, and 2) read the range of bytes
+delimited by these offsets in the array's data buffer. If the offsets are
+invalid (deliberately or not), then step 2) can access memory outside of the
+data buffer's range.
+
+Another instance of invalid data lies in the values themselves. For example,
+a String array is only allowed to contain valid UTF-8 data, but an untrusted
+source might have emitted invalid UTF-8 under the disguise of a String array.
+An unsuspecting algorithm that is only specified for valid UTF-8 inputs might
+lead to dangerous behavior (for example by reading memory out of bounds when
+looking for an UTF-8 character boundary).
+
+Fortunately, knowing its schema, it is possible to validate Arrow data up front,
+so that reading this data will not pose any danger later on.
+
+.. TODO:
+   For each layout, we should list the associated security risks and the recommended
+   steps to validate (perhaps in Columnar.rst)
+
+Advice for users
+''''''''''''''''
+
+Arrow implementations often assume inputs follow the specification to provide
+high speed processing. It is **extremely recommended** that your application
+explicitly validates any Arrow data it receives under serialized form
+from untrusted sources. Many Arrow implementations provide explicit APIs to
+perform such validation.
+
+.. TODO: link to some validation APIs for the main implementations here?
+
+Advice for implementors
+'''''''''''''''''''''''
+
+It is **recommended** that you provide dedicated APIs to validate Arrow arrays
+and/or record batches. Users will be able to utilize those APIs to assert whether
+data coming from untrusted sources can be safely accessed.
+
+A typical validation API must return a well-defined error, not crash, if the
+given Arrow data is invalid; it must always be safe to execute regardless of
+whether the data is valid or not.
+
+Uninitialized data
+------------------
+
+A less obvious pitfall is when some parts of an Arrow array are left uninitialized.
+For example, if an element of a primitive Arrow array is marked null through its
+validity bitmap, the corresponding value slot in the values buffer can be ignored
+for all purposes. It is therefore tempting, when creating an array with null
+values, to not initialize the corresponding value slots.
+
+However, this then introduces a serious security risk if the Arrow data is
+serialized and published (e.g. using IPC or Flight) such that it can be
+accessed by untrusted users. Indeed, the uninitialized value slot can
+reveal data left by a previous memory allocation made in the same process.
+Depending on the application, this data could contain confidential information.
+
+Advice for users and implementors
+'''''''''''''''''''''''''''''''''
+
+When creating a Arrow array, it is **recommended** that you never leave any
+data uninitialized in a buffer if the array might be sent to, or read by, an
+untrusted third-party, even when the uninitialized data is logically
+irrelevant. The easiest way to do this is to zero-initialize any buffer that
+will not be populated in full.
+
+If it is determined, through benchmarking, that zero-initialization imposes
+an excessive performance cost, a library or application may instead decide
+to use uninitialized memory internally as an optimization; but it should then
+ensure all such uninitialized values are cleared before passing the Arrow data
+to another system.
+
+.. note::
+   Sending Arrow data out of the current process can happen *indirectly*,
+   for example if you produce it over the C Data Interface and the consumer
+   persists it using the IPC format on some public storage.
+
+
+C Data Interface
+================
+
+The C Data Interface contains raw pointers into the process' address space.
+It is generally not possible to validate that those pointers are legitimate;
+read from such a pointer may crash or access unrelated or bogus data.
+
+Advice for users
+----------------
+
+You should **never** consume a C Data Interface structure from an untrusted
+producer, as it is by construction impossible to guard against dangerous
+behavior in this case.
+
+Advice for implementors
+-----------------------
+
+When consuming a C Data Interface structure, you can assume that it comes from
+a trusted producer, for the reason explained above. However, it is still
+**recommended** that you validate it for soundness (for example that the right
+number of buffers is passed for a given datatype), as a trusted producer can
+have bugs anyway.
+
+
+IPC Format
+==========
+
+The :ref:`IPC format <_ipc-message-format>` is a serialization format for the
+columnar format with associated metadata. Reading an IPC stream or file from
+an untrusted source comes with similar caveats as reading the Arrow columnar
+format.
+
+The additional signalisation and metadata in the IPC format come with
+their own risks. For example, buffer offsets and sizes encoded in IPC messages
+may be out of bounds for the IPC stream; Flatbuffers-encoded metadata payloads
+may carry incorrect offsets pointing outside of the designated metadata area.
+
+Advice for users
+----------------
+
+Arrow libraries will typically ensure IPC streams are structurally valid
+but may not also validate the underlying Array data. It is **extremely recommended**
+that you use the appropriate APIs to validate the Arrow data read from an untrusted IPC stream.
+
+Advice for implementors
+-----------------------
+
+It is **extremely recommended** to run dedicated validation checks when decoding
+the IPC format, to make sure that the decoding can not induce unwanted behavior.
+Failing those checks should return a well-known error to the caller, not crash.
+
+
+Extension Types
+===============
+
+Extension types typically register a custom deserialization hook so that they
+can be automatically recreated when reading from an external source (for example
+using IPC). The deserialization hook has to decode the extension type's parameters
+from a string or binary payload specific to the extension type.
+:ref:`Typical examples <opaque_extension>` use a bespoke JSON representation
+with object fields representing the various parameters.
+
+When reading data from an untrusted source, any registered deserialization hook
+could be called with an arbitrary payload. It is therefore of primary importance
+that the hook be safe to call on invalid, potentially malicious, data. This mandates
+the use of a robust metadata serialization schema (such as JSON, but not Python's
+`pickle <https://docs.python.org/3/library/pickle.html>`__ or R's
+`serialize() <https://stat.ethz.ch/R-manual/R-devel/library/base/html/serialize.html>`__,
+for example).
+
+Advice for users and implementors
+---------------------------------
+
+When designing an extension type, it is **extremely recommended** to choose a
+metadata serialization format that is robust against potentially malicious
+data.
+
+When implementing an extension type, it is **recommended** to ensure that the
+deserialization hook is able to detect, and error out gracefully, if the
+serialized metadata payload is invalid.
+
+
+Testing for robustness
+======================
+
+Advice for implementors
+-----------------------
+
+For APIs that may process untrusted inputs, it is **extremely recommended**
+that your unit tests exercise your APIs against typical kinds of invalid data.
+For example, your validation APIs will have to be tested against invalid Binary
+or List offsets, invalid UTF-8 data in a String array, etc.
+
+Testing against known regression files
+''''''''''''''''''''''''''''''''''''''
+
+The `arrow-testing <https://github.com/apache/arrow-testing/>`__ repository
+contains regression files for various formats, such as the IPC format.
+
+Two categories of files are especially noteworthy and can serve to exercise
+an Arrow implementation's robustness:
+
+1. :ref:`gold integration files <format-gold-integration-files>` that are valid
+   files to exercise compliance with Arrow IPC features;
+2. :ref:`fuzz regression files <fuzz-regression-files>` that have been automatically
+   generated each time a fuzzer founds a bug triggered by a specific (usually invalid)
+   input for a given format.
+
+Fuzzing
+'''''''
+
+It is **recommended** that you go one step further and set up some kind of
+automated robustness testing against unforeseen inputs. One typical approach
+is though fuzzing, possibly coupled with a runtime instrumentation framework
+that detects dangerous behavior (such as Address Sanitizer in C++ or
+Rust).
+
+A reasonable way of setting up fuzzing for Arrow is using the IPC format as
+a binary payload; the fuzz target should not only attempt to decode the IPC
+stream as Arrow data, but it should then validate the Arrow data.
+This will strengthen both the IPC decoder and the validation routines
+against invalid, potentially malicious data. Finally, if validation comes out
+successfully, the fuzz target may exercise some important core functionality,
+such as printing the data for human display; this will help ensure that the
+validation routine did not let through invalid data that may lead to dangerous
+behavior.
+
+
+Non-Arrow formats and protocols
+===============================
+
+Arrow data can also be sent or stored using third-party formats such as Apache
+Parquet. Those formats may or may not present the same security risks as listed
+above (for example, the precautions around uninitialized data may not apply
+in a format like Parquet that does not create any value slots for null elements).
+We suggest you refer to these projects' own documentation for more concrete
+guidelines.
diff --git a/docs/source/format/index.rst b/docs/source/format/index.rst
@@ -37,5 +37,6 @@ Specifications
    Flight
    FlightSql
    ADBC
+   Security
    Integration
    Glossary