
feat(encryption): parquet encryption via delta.encryption.* table properties #4

Draft
corwinjoy wants to merge 10 commits into G-Research-Forks:main from corwinjoy:encryption-table-properties

Conversation

@corwinjoy

Summary

Implements Parquet encryption driven by standard delta.encryption.* table properties stored in the delta log, following the Modular Encryption in Delta Lake RFC.

Key difference from previous approach: encryption is configured as table properties at creation time, not as a runtime API parameter. Once set, all operations (write, read, delete, update, merge, optimize) encrypt/decrypt automatically.

User-facing API

// 1. Register KMS factory once at startup
register_encryption_factory("my-kms", Arc::new(MyKmsFactory::new()));

// 2. Create encrypted table
table.create()
    .with_property("delta.encryption.kms.id",     "my-kms")
    .with_property("delta.encryption.footer.key",  "master-key-id")
    .with_property("delta.encryption.column.keys", "col-key:ssn,dob")
    .await?;

// 3. All subsequent operations are automatically encrypted/decrypted
table.write(batches).await?;   // encrypted
table.delete()...await?;       // decrypted read + encrypted write
table.scan_table().execute()?; // decrypted

Design

  • EncryptionConfig parsed from TableProperties.unknown_properties follows the same from_config(table_config) pattern as WriterStatsConfig
  • WriterPropertiesFactory async trait enables path-based KMS key derivation (AAD) — the file path is passed to create_writer_properties() before the parquet writer is created
  • Global EncryptionFactoryRegistry: delta-rs operations create their own internal DataFusion sessions; users call register_encryption_factory(id, factory) once and the global registry bridges the gap
  • DeltaScanConfig.table_parquet_options derived from snapshot table properties — reads decrypt via the registered factory
  • Compact and Z-order optimize use DataFusion scan when encryption is configured
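The global-registry idea in the list above can be sketched with the standard library alone. This is a minimal illustration, not the PR's code: the `EncryptionFactory` trait here is a hypothetical one-method stand-in for the real KMS-backed factory trait, `StaticFactory` and `resolve_encryption_factory` are made-up names, and a `RwLock<HashMap>` is assumed where the PR (per a later commit message) uses a `DashMap`.

```rust
use std::collections::HashMap;
use std::sync::{Arc, OnceLock, RwLock};

/// Hypothetical stand-in for the real KMS-backed factory trait.
pub trait EncryptionFactory: Send + Sync {
    fn name(&self) -> String;
}

type Registry = RwLock<HashMap<String, Arc<dyn EncryptionFactory>>>;

/// Process-wide registry, created lazily on first use.
fn registry() -> &'static Registry {
    static REGISTRY: OnceLock<Registry> = OnceLock::new();
    REGISTRY.get_or_init(|| RwLock::new(HashMap::new()))
}

/// Register a factory under the id stored in `delta.encryption.kms.id`.
pub fn register_encryption_factory(id: &str, factory: Arc<dyn EncryptionFactory>) {
    registry()
        .write()
        .expect("registry lock poisoned")
        .insert(id.to_string(), factory);
}

/// Look up a factory; a missing id is an error rather than a silent
/// fallback to unencrypted I/O.
pub fn resolve_encryption_factory(id: &str) -> Result<Arc<dyn EncryptionFactory>, String> {
    registry()
        .read()
        .expect("registry lock poisoned")
        .get(id)
        .cloned()
        .ok_or_else(|| format!("No EncryptionFactory registered for id '{id}'"))
}

/// Trivial factory used only to exercise the registry.
pub struct StaticFactory(pub &'static str);

impl EncryptionFactory for StaticFactory {
    fn name(&self) -> String {
        self.0.to_string()
    }
}
```

Because the registry is a process-wide static, any internally created session can resolve the factory by id without the caller threading it through each operation.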

Files changed

  • crates/core/src/table/config.rs — EncryptionConfig struct + EncryptionExt trait
  • crates/core/src/operations/write/encryption.rs — NEW: WriterPropertiesFactory, WriterEncryptionConfig, global registry
  • crates/core/src/operations/write/writer.rs — factory-based WriterConfig/LazyArrowWriter
  • crates/core/src/operations/write/execution.rs — derives factory from table config
  • crates/core/src/delta_datafusion/table_provider.rs — DeltaScanConfig.table_parquet_options
  • crates/core/src/delta_datafusion/table_provider/next/scan/mod.rs — factory lookup in read path
  • crates/core/src/operations/optimize.rs — encrypted compact/z-order reads
  • crates/core/src/writer/json.rs + record_batch.rs — per-file writer factory; the file path is passed through for AAD
  • crates/core/src/test_utils/kms_encryption.rs — NEW: MockKmsFactory for testing
  • crates/core/tests/commands_with_encryption.rs — NEW: integration tests
  • crates/deltalake/examples/basic_operations_encryption.rs — NEW: end-to-end example

Test plan

  • cargo build -p deltalake-core --features datafusion — clean
  • cargo test -p deltalake-core --features datafusion — all pass
  • cargo run --example basic_operations_encryption --features "datafusion integration-test" -p deltalake — end-to-end: create, write, Z-order, compact, read with KMS encryption

🤖 Generated with Claude Code

Implements Parquet encryption driven by standard delta.encryption.*
table properties stored in the delta log, following the RFC at
https://docs.google.com/document/d/1QVCUu1gAQtFux_63bvBBU1fzH-BQ2snXGVhq6ASfLjM

Key design:
- Users set delta.encryption.kms.id / .footer.key / .column.keys on the
  table; all subsequent operations encrypt/decrypt automatically with no
  per-operation API changes.
- EncryptionConfig parsed from TableProperties.unknown_properties follows
  the same pattern as WriterStatsConfig (from_config / from_snapshot).
- WriterPropertiesFactory async trait enables path-based KMS key
  derivation (AAD) for all write paths including LazyArrowWriter,
  JsonWriter, and RecordBatchWriter.
- Global EncryptionFactoryRegistry: users call
  register_encryption_factory(id, factory) once at startup; all delta-rs
  operations find it automatically even though they create internal
  DataFusion sessions.
- DeltaScanConfig.table_parquet_options is derived from snapshot table
  properties so reads decrypt without any per-scan configuration.
- Compact and Z-order optimize use DataFusion scan on encrypted tables.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions

ACTION NEEDED

delta-rs follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.


Copilot AI left a comment

Pull request overview

This PR introduces Parquet encryption support in delta-rs driven by standard delta.encryption.* Delta table properties, including a process-wide registry to make KMS factories available to internally-created DataFusion sessions.

Changes:

  • Adds EncryptionConfig parsing from table properties and propagates it into DataFusion scan options for decryption.
  • Introduces an async, path-aware WriterPropertiesFactory abstraction plus a global encryption factory registry for write-time key retrieval.
  • Updates write + optimize execution paths (and adds tests/example) to use encryption-aware readers/writers when encryption is configured.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| crates/deltalake/examples/basic_operations_encryption.rs | New end-to-end example showing encrypted create/write/optimize/read using table properties. |
| crates/deltalake/Cargo.toml | Adds an integration-test feature and wires the new example behind feature gates. |
| crates/core/tests/commands_with_encryption.rs | New integration tests covering encrypted create/read/optimize/delete/update. |
| crates/core/src/writer/record_batch.rs | Switches RecordBatchWriter to a per-file WriterProperties factory and AAD path precomputation. |
| crates/core/src/writer/json.rs | Switches JsonWriter to a per-file WriterProperties factory and AAD path precomputation. |
| crates/core/src/test_utils/mod.rs | Exposes kms_encryption test utils when datafusion feature is enabled. |
| crates/core/src/test_utils/kms_encryption.rs | Adds a mock KMS EncryptionFactory for tests/examples. |
| crates/core/src/table/config.rs | Adds EncryptionConfig parsing and conversion to DataFusion TableParquetOptions. |
| crates/core/src/operations/write/writer.rs | Refactors writer config to use async WriterPropertiesFactory (supports KMS + AAD). |
| crates/core/src/operations/write/mod.rs | Exposes new operations::write::encryption module. |
| crates/core/src/operations/write/execution.rs | Resolves writer factory from explicit props vs table encryption vs session defaults. |
| crates/core/src/operations/write/encryption.rs | New writer-side encryption module: factory abstraction + global registry. |
| crates/core/src/operations/optimize.rs | Uses encryption-aware read path and writer factory for compact/z-order when configured. |
| crates/core/src/delta_datafusion/table_provider.rs | Propagates table encryption options into scan config and scan-time decryption factory wiring. |
| crates/core/src/delta_datafusion/table_provider/next/scan/mod.rs | Passes table parquet options through to scans and wires decryption factory lookup on read. |
| crates/core/src/delta_datafusion/mod.rs | Re-exports FileSelection for optimize compact encrypted path. |
| crates/core/Cargo.toml | Enables Parquet encryption feature; adds dev-dependency. |
| Cargo.toml | Enables DataFusion parquet_encryption feature at workspace level. |
Comments suppressed due to low confidence (2)

crates/core/src/table/config.rs:597

  • EncryptionConfig::to_table_parquet_options() starts from TableParquetOptions::default() and only sets crypto fields. Call sites then use this value as a full replacement for the session’s parquet options, which can unintentionally reset non-crypto parquet settings (pushdown, schema options, etc.) for encrypted tables. Consider storing only the crypto overrides and merging them into the session’s TableParquetOptions at scan-build time, or populate opts.global from the session config when constructing scan options.
    pub fn to_table_parquet_options(&self) -> TableParquetOptions {
        let mut opts = TableParquetOptions::default();
        opts.crypto.factory_id = Some(self.kms_id.clone());
        opts.crypto.factory_options = self.factory_options();
        opts
    }
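The merge this comment suggests can be illustrated with simplified stand-in types. In this sketch `ParquetOptions`, `GlobalOptions`, `CryptoOptions` and `with_crypto_overrides` are all hypothetical names that only mirror the global/crypto split described above; they are not DataFusion's actual structs.

```rust
use std::collections::BTreeMap;

/// Stand-in for non-crypto session settings (pushdown, schema options, ...).
#[derive(Clone, Debug, Default, PartialEq)]
pub struct GlobalOptions {
    pub pushdown_filters: bool,
}

/// Stand-in for the crypto section derived from table properties.
#[derive(Clone, Debug, Default, PartialEq)]
pub struct CryptoOptions {
    pub factory_id: Option<String>,
    pub factory_options: BTreeMap<String, String>,
}

/// Stand-in for the full parquet options held by a session.
#[derive(Clone, Debug, Default, PartialEq)]
pub struct ParquetOptions {
    pub global: GlobalOptions,
    pub crypto: CryptoOptions,
}

/// Start from the session's options and overwrite only the crypto section,
/// so non-crypto settings survive for encrypted tables.
pub fn with_crypto_overrides(session: &ParquetOptions, crypto: &CryptoOptions) -> ParquetOptions {
    let mut merged = session.clone();
    merged.crypto = crypto.clone();
    merged
}
```

Starting from `ParquetOptions::default()` instead of the session value would reset `pushdown_filters` to `false`, which is the failure mode the comment describes.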

crates/core/src/table/config.rs:610

  • EncryptionConfig::factory_options() currently forwards only kms.configuration and footer.key. The parsed plaintext_footer and column_keys properties are never forwarded to the factory, so those table properties have no effect. Either encode these into EncryptionFactoryOptions (with stable option keys) or remove them from EncryptionConfig until supported to avoid misleading API behavior.
    /// Build [`EncryptionFactoryOptions`] from the KMS configuration string.
    #[cfg(feature = "datafusion")]
    pub fn factory_options(&self) -> EncryptionFactoryOptions {
        let mut opts = EncryptionFactoryOptions::default();
        if let Some(cfg) = &self.kms_configuration {
            opts.options
                .insert("kms.configuration".to_string(), cfg.clone());
        }
        opts.options
            .insert("footer.key".to_string(), self.footer_key.clone());
        opts
    }

Comment thread crates/core/tests/commands_with_encryption.rs
Comment thread crates/core/src/operations/write/execution.rs Outdated
Comment thread crates/core/src/operations/write/execution.rs Outdated
Comment thread crates/core/src/operations/optimize.rs Outdated
Comment thread crates/core/src/delta_datafusion/table_provider.rs Outdated
Comment thread crates/core/src/table/config.rs Outdated
Comment thread crates/core/src/operations/write/encryption.rs
Comment thread crates/core/src/writer/record_batch.rs Outdated
Comment thread crates/core/src/writer/json.rs Outdated
Comment thread crates/core/Cargo.toml Outdated
corwinjoy and others added 4 commits May 20, 2026 17:56
- Remove unused fields (footer_key, column_keys, plaintext_footer) from
  KmsWriterPropertiesFactory — these are already forwarded via factory_options;
  add WriterEncryptionConfig::from_global_registry for legacy writers without a session.
- EncryptionConfig::factory_options() now forwards plaintext_footer and column_keys
  so table properties actually reach the EncryptionFactory implementation.
- footer_key is now required (from_properties returns None if absent) to prevent
  misconfigured tables from appearing valid.
- Merge crypto settings into session's parquet options at scan-build time rather than
  replacing them entirely — preserves non-crypto settings (pushdown, schema options).
- write_exec_plan, write_data_plan, and optimize propagate errors when encryption is
  configured but factory is not registered; no more silent unencrypted writes.
- Old DeltaScanBuilder scan path now returns error instead of silently skipping
  decryption when factory is missing — consistent with next-provider behaviour.
- JsonWriter and RecordBatchWriter now resolve the KMS factory from the global registry
  via from_global_registry(), enabling encryption on legacy write paths.
- Tests: unique KMS ID per test (prevents interference), #![cfg(feature = "datafusion")],
  negative test verifying that an unregistered factory produces a clear error.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…peration

- Add CreateBuilder::with_property(key, value) convenience method for
  setting arbitrary string table properties (used by tests and example).
- Fix read_table to use table.table_provider() instead of scan_table().build().
- Add assert_all_parquets_encrypted() helper that opens every .parquet file
  in the table directory with the raw parquet reader (no decryption) and
  asserts it fails — proving the file has an encrypted footer.
- Call assert_all_parquets_encrypted after every operation test (write,
  compact, z-order, delete, update) so that silent encryption failure is
  caught for each code path independently.
- Keep test_parquet_files_are_physically_encrypted as a dedicated standalone
  test focused purely on the encryption check.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
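A lighter-weight check than the one this commit describes is sketched below. It is a different technique from the PR's helper (which opens each file with the raw parquet reader and expects failure): it relies only on the Parquet format's use of the trailing magic bytes `PARE` for files with an encrypted footer, versus `PAR1` otherwise. The function name `has_encrypted_footer` is hypothetical.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};
use std::path::Path;

/// Returns true if the file ends with the `PARE` magic, i.e. it was written
/// in encrypted-footer mode. Files in plaintext-footer mode end with `PAR1`
/// even when individual columns are encrypted, so this check cannot detect
/// column-only encryption.
pub fn has_encrypted_footer(path: &Path) -> io::Result<bool> {
    let mut file = File::open(path)?;
    if file.metadata()?.len() < 4 {
        return Ok(false);
    }
    // The magic occupies the last four bytes of the file.
    file.seek(SeekFrom::End(-4))?;
    let mut magic = [0u8; 4];
    file.read_exact(&mut magic)?;
    Ok(&magic == b"PARE")
}
```

The magic-byte check is cheap but weaker: it says nothing about whether the data pages can actually be decrypted, which the reader-based assertion in the tests does exercise.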
- Add `unregistered_factory_error` + `resolve_encryption_factory_or_err`
  helpers so the "No EncryptionFactory registered" message is defined once
  and reused in all four lookup sites
- Extract shared `build_factory` logic in `WriterEncryptionConfig` to
  eliminate copy-paste between `from_config` and `from_global_registry`
- Add `FACTORY_OPT_*` constants in `config.rs` to replace raw string
  literals in `factory_options()`
- Remove `WriterConfig::new_with_properties` dead code (zero call sites)
- Avoid `parquet_options.clone()` in `DeltaScanBuilder` by extracting
  `factory_id` before consuming the options
- Pass `read_session` into `create_merge_plan` from the caller (which
  already knows if encryption is active) rather than re-querying the
  table properties a second time inside the function
- Simplify `test_parquet_files_are_physically_encrypted` to call the
  existing `assert_all_parquets_encrypted` helper instead of duplicating
  its `find_parquet` logic and dead `parquet_files` variable
- Fix broken `command_optimize` test callers to use `factory_from_writer_properties`
  and pass the new `read_session` parameter
- Remove "what" comments that restate the immediately following code

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Move WriterPropertiesFactory trait and DefaultWriterPropertiesFactory
  to a new writer/writer_factory.rs module that has no datafusion dependency,
  so json.rs and record_batch.rs can import these types without requiring
  the datafusion feature (fixes default build E0433 errors)
- Gate WriterEncryptionConfig usage in json.rs and record_batch.rs behind
  #[cfg(feature = "datafusion")]; non-datafusion builds use the default
  SNAPPY factory with no encryption
- Fix command_optimize.rs test callers of create_merge_plan that were still
  using the old WriterProperties type and missing the new read_session arg
  (fixes E0061 arity errors in check and unit-test CI jobs)
- Fix CreateBuilder::with_property to automatically disable
  raise_if_key_not_exists, since the method is explicitly designed for
  custom properties not in the TableProperty enum

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

Copilot reviewed 22 out of 22 changed files in this pull request and generated 7 comments.

Comment thread crates/core/src/writer/record_batch.rs Outdated
Comment thread crates/core/src/writer/json.rs Outdated
Comment thread crates/core/src/operations/write/execution.rs Outdated
Comment thread crates/core/src/operations/optimize.rs Outdated
Comment thread crates/core/src/table/config.rs
Comment thread crates/core/src/table/config.rs
Comment thread crates/core/tests/commands_with_encryption.rs Outdated
corwinjoy and others added 2 commits May 20, 2026 21:42
Extend MockKmsFactory to properly handle column-level encryption and
plaintext-footer mode by reading the options forwarded from
delta.encryption.* table properties:
- Key store changed from HashMap<filename, key> to HashMap<(filename, key_id), key>
  so each named key-id (footer key, column key) gets its own independent key
- parse_column_keys() helper extracts column→key_id mappings from the
  serialised "keyId:col1,col2" format
- get_file_encryption_properties() now sets plaintext_footer and calls
  with_column_key() for each configured column
- get_file_decryption_properties() provides matching per-column keys

Add test_encrypted_columnar_plaintext_footer:
- Encrypts only "int" and "string" columns via delta.encryption.column.keys
- Leaves the parquet footer in plaintext (delta.encryption.plaintext.footer=true)
- Verifies the round-trip read succeeds with the factory registered
- Verifies each parquet file has a readable footer (plaintext) but that
  reading column data WITHOUT decryption keys fails, confirming the
  column encryption was applied

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
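A parser for the serialised "keyId:col1,col2" format mentioned in this commit might look roughly like the following. The commit message only shows a single entry, so the `;` separator between multiple key entries is an assumption here (borrowed from the parquet-mr property of the same shape), and the error type is simplified to `String`.

```rust
use std::collections::BTreeMap;

/// Parse "keyId:col1,col2[;keyId2:col3]" into a column -> key_id map.
/// The ';' entry separator is an assumption, not confirmed by the PR.
pub fn parse_column_keys(raw: &str) -> Result<BTreeMap<String, String>, String> {
    let mut by_column = BTreeMap::new();
    for entry in raw.split(';').map(str::trim).filter(|e| !e.is_empty()) {
        // Everything before the first ':' is the key id, the rest is the column list.
        let (key_id, columns) = entry
            .split_once(':')
            .ok_or_else(|| format!("missing ':' in column key entry '{entry}'"))?;
        let key_id = key_id.trim();
        if key_id.is_empty() {
            return Err(format!("empty key id in '{entry}'"));
        }
        for column in columns.split(',').map(str::trim).filter(|c| !c.is_empty()) {
            by_column.insert(column.to_string(), key_id.to_string());
        }
    }
    Ok(by_column)
}
```

With the value from the PR description, `parse_column_keys("col-key:ssn,dob")` maps both `ssn` and `dob` to `col-key`.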
1. Deterministic factory_options(): use BTreeMap and sort column names when
   serialising delta.encryption.column.keys back to wire format so the encoded
   string is stable across runs (important if factories use it as a cache key).

2. Misconfiguration detection: add EncryptionConfig::try_from_properties that
   returns an error when kms.id is set but footer.key is missing/empty, rather
   than silently returning None and writing plaintext. Used in
   WriterEncryptionConfig::from_config so write paths fail fast.

3. Propagate encryption errors in RecordBatchWriter::try_new: change
   unwrap_or_default() to ? so a missing factory causes a clear error instead
   of silently falling back to unencrypted writes.

4. Table encryption always wins in write_execution_plan_v2: when an encrypted
   table has WriterProperties supplied by the caller, the encrypted factory is
   still used — preventing accidental plaintext output for encrypted tables.

5. Fix is_encrypted in OptimizeBuilder: derive it from table properties directly
   rather than from the writer_factory variant, so read_session is always set for
   encrypted tables even when the caller supplies a writer_properties override
   (without this fix, encrypted compact reads would fail with a raw parquet reader).

6. Fix schema passed to create_writer_properties in json.rs: was passing the
   full table schema (with partition columns) but the Parquet writer uses the
   schema without partitions; now passes the actual file schema so column
   encryption settings match what is written.

7. Fix find_parquet in tests to use path.extension() == Some("parquet") instead
   of to_string_lossy().contains(".parquet"), avoiding false matches on files
   like file.parquet.crc or directories named ".parquet".

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
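Item 1 of this commit (deterministic serialisation) can be sketched as follows, assuming the column→key_id storage that was in place at this point in the PR. `serialize_column_keys` is a hypothetical name, and the `;` separator between key entries is again an assumption rather than something the PR states.

```rust
use std::collections::{BTreeMap, HashMap};

/// Serialise a column -> key_id map back to "keyId:col1,col2;..." form.
/// HashMap iteration order is unspecified, so the entries are regrouped in a
/// BTreeMap (sorted key ids) and each column list is sorted before joining.
pub fn serialize_column_keys(by_column: &HashMap<String, String>) -> String {
    let mut by_key: BTreeMap<&str, Vec<&str>> = BTreeMap::new();
    for (column, key_id) in by_column {
        by_key.entry(key_id.as_str()).or_default().push(column.as_str());
    }
    by_key
        .into_iter()
        .map(|(key_id, mut columns)| {
            columns.sort_unstable();
            format!("{key_id}:{}", columns.join(","))
        })
        .collect::<Vec<_>>()
        .join(";")
}
```

Two maps with the same contents always yield the same string, which is what makes the result usable as a cache key.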
@corwinjoy changed the title from "feat(encryption): Parquet encryption via delta.encryption.* table properties" to "feat(encryption): parquet encryption via delta.encryption.* table properties" on May 21, 2026
corwinjoy and others added 3 commits May 20, 2026 22:22
write_execution_plan_v2 now uses the DataFusion session parquet config
(ZSTD level 3 by default) instead of a bare SNAPPY default when no
writer_properties are provided. This makes the write path consistent
with write_exec_plan, and produces slightly smaller parquet files.

Update the hardcoded byte offsets in test_delta_scan_uses_parquet_column_pruning
to reflect the actual (ZSTD-compressed) file layout.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1. column_keys storage: change from col→key_id (HashMap<String,String>) to
   key_id→Vec<col> (HashMap<String,Vec<String>>) matching the wire format
   directly. Eliminates the inversion step in factory_options() on every write.

2. Read path misconfiguration detection: DeltaScanConfigBuilder::build and
   TableProviderBuilder::build now call try_from_properties instead of
   from_properties so a table with kms.id but no footer.key produces a
   clear error at scan time rather than silently reading without decryption.

3. Remove futures::executor::block_on from async test: replace with proper
   .await in test_encrypted_columnar_plaintext_footer, avoiding potential
   deadlock on single-threaded runtimes.

4. Remove paste! macro from test_matrix_create_and_read: the paste! wrapper
   added no parameterisation — replace with a plain #[tokio::test]. Also
   remove the unused paste dev-dependency from Cargo.toml.

5. Remove unused EncryptionExt import from encryption.rs (from_config now
   calls try_from_properties directly, not through the trait).

6. Suppress pre-existing dead_code warning on TableProviderBuilder::with_file_selection
   which has no non-test callers outside the delta_datafusion module.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
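The fail-fast behaviour that `try_from_properties` is described as providing can be sketched like this. The two property names come from the PR description; the reduced `EncryptionConfig` struct, the free-function form, and the `String` error type are simplifications for illustration only.

```rust
use std::collections::HashMap;

/// Reduced stand-in for the real EncryptionConfig (only two fields).
#[derive(Debug, Clone, PartialEq)]
pub struct EncryptionConfig {
    pub kms_id: String,
    pub footer_key: String,
}

/// Ok(None)  -> table is not encrypted (no kms.id property).
/// Ok(Some)  -> table is encrypted and fully configured.
/// Err       -> kms.id is set but footer.key is missing or empty.
pub fn try_from_properties(
    props: &HashMap<String, String>,
) -> Result<Option<EncryptionConfig>, String> {
    let Some(kms_id) = props.get("delta.encryption.kms.id") else {
        return Ok(None);
    };
    match props.get("delta.encryption.footer.key").map(|s| s.trim()) {
        Some(key) if !key.is_empty() => Ok(Some(EncryptionConfig {
            kms_id: kms_id.clone(),
            footer_key: key.to_string(),
        })),
        _ => Err(format!(
            "delta.encryption.kms.id is '{kms_id}' but delta.encryption.footer.key is missing or empty"
        )),
    }
}
```

Distinguishing `Ok(None)` from `Err` is the point: a caller that used a plain `Option` could not tell an unencrypted table from a misconfigured one and would read or write plaintext silently.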
- Extract snappy_writer_properties() from DefaultWriterPropertiesFactory::snappy()
  so encryption.rs::build_factory can reuse it instead of reimplementing the
  SNAPPY + created_by WriterProperties inline

- Remove dead _DefaultFactory re-export alias from encryption.rs (was never
  referenced by any caller)

- Extract find_parquet() to module level in commands_with_encryption.rs,
  eliminating the identical inner-function duplicate in
  test_encrypted_columnar_plaintext_footer

- Move encryption factory lookup outside the per-store loop in
  get_read_plan (scan/mod.rs) — the factory_id is table-scoped and
  identical for every object-store group, so resolving it once avoids
  redundant DashMap shard-lock acquisitions for multi-store tables

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
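A module-level `find_parquet` helper using the extension check from the earlier test fix might look like the sketch below. The signature and the `is_parquet_file` helper are assumptions; only the function name and the switch from substring matching to `path.extension()` come from the commit messages.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// True only for regular files whose final extension is exactly "parquet".
/// `file.parquet.crc` has extension "crc", so it no longer matches, unlike
/// a `contains(".parquet")` substring test.
pub fn is_parquet_file(path: &Path) -> bool {
    path.is_file() && path.extension().and_then(|ext| ext.to_str()) == Some("parquet")
}

/// Recursively collect all parquet files under `dir`, in sorted order.
pub fn find_parquet(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut found = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            found.extend(find_parquet(&path)?);
        } else if is_parquet_file(&path) {
            found.push(path);
        }
    }
    found.sort();
    Ok(found)
}
```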