feat(encryption): parquet encryption via delta.encryption.* table properties #4
corwinjoy wants to merge 10 commits into
Implements Parquet encryption driven by standard delta.encryption.* table properties stored in the delta log, following the RFC at https://docs.google.com/document/d/1QVCUu1gAQtFux_63bvBBU1fzH-BQ2snXGVhq6ASfLjM

Key design:
- Users set delta.encryption.kms.id / .footer.key / .column.keys on the table; all subsequent operations encrypt/decrypt automatically with no per-operation API changes.
- EncryptionConfig, parsed from TableProperties.unknown_properties, follows the same pattern as WriterStatsConfig (from_config / from_snapshot).
- The async WriterPropertiesFactory trait enables path-based KMS key derivation (AAD) for all write paths, including LazyArrowWriter, JsonWriter, and RecordBatchWriter.
- Global EncryptionFactoryRegistry: users call register_encryption_factory(id, factory) once at startup; all delta-rs operations find it automatically even though they create internal DataFusion sessions.
- DeltaScanConfig.table_parquet_options is derived from snapshot table properties, so reads decrypt without any per-scan configuration.
- Compact and Z-order optimize use a DataFusion scan on encrypted tables.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
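The process-wide registry described above can be pictured with a small, std-only sketch. Everything named here (`KeyFactory`, `register_factory`, `resolve_factory`, `FixedKey`) is hypothetical and stands in for the PR's actual types; the PR mentions a DashMap-backed registry, while this sketch swaps in a plain `RwLock<HashMap>` to stay dependency-free.

```rust
use std::collections::HashMap;
use std::sync::{Arc, OnceLock, RwLock};

/// Hypothetical stand-in for an encryption factory (not the DataFusion trait).
pub trait KeyFactory: Send + Sync {
    /// Returns key material for one file; a real KMS could derive it from the path.
    fn key_for(&self, file_path: &str) -> Vec<u8>;
}

type Registry = RwLock<HashMap<String, Arc<dyn KeyFactory>>>;

/// Lazily initialised, process-wide registry.
fn registry() -> &'static Registry {
    static REGISTRY: OnceLock<Registry> = OnceLock::new();
    REGISTRY.get_or_init(|| RwLock::new(HashMap::new()))
}

/// Called once at startup by the application.
pub fn register_factory(id: &str, factory: Arc<dyn KeyFactory>) {
    registry()
        .write()
        .expect("registry lock poisoned")
        .insert(id.to_string(), factory);
}

/// Called by internal code paths that only know the id from table properties.
pub fn resolve_factory(id: &str) -> Result<Arc<dyn KeyFactory>, String> {
    registry()
        .read()
        .expect("registry lock poisoned")
        .get(id)
        .cloned()
        .ok_or_else(|| format!("no encryption factory registered for id '{id}'"))
}

/// Toy factory that ignores the path and always returns the same key.
pub struct FixedKey(pub Vec<u8>);

impl KeyFactory for FixedKey {
    fn key_for(&self, _file_path: &str) -> Vec<u8> {
        self.0.clone()
    }
}
```

Because lookup goes through a global, code that builds its own session internally can still find the factory by the id stored in the table properties, and a missing registration surfaces as an error rather than an unencrypted write.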
ACTION NEEDED delta-rs follows the Conventional Commits specification for release automation. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.
Pull request overview
This PR introduces Parquet encryption support in delta-rs driven by standard delta.encryption.* Delta table properties, including a process-wide registry to make KMS factories available to internally-created DataFusion sessions.
Changes:
- Adds `EncryptionConfig` parsing from table properties and propagates it into DataFusion scan options for decryption.
- Introduces an async, path-aware `WriterPropertiesFactory` abstraction plus a global encryption factory registry for write-time key retrieval.
- Updates write + optimize execution paths (and adds tests/example) to use encryption-aware readers/writers when encryption is configured.
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| crates/deltalake/examples/basic_operations_encryption.rs | New end-to-end example showing encrypted create/write/optimize/read using table properties. |
| crates/deltalake/Cargo.toml | Adds an integration-test feature and wires the new example behind feature gates. |
| crates/core/tests/commands_with_encryption.rs | New integration tests covering encrypted create/read/optimize/delete/update. |
| crates/core/src/writer/record_batch.rs | Switches RecordBatchWriter to a per-file WriterProperties factory and AAD path precomputation. |
| crates/core/src/writer/json.rs | Switches JsonWriter to a per-file WriterProperties factory and AAD path precomputation. |
| crates/core/src/test_utils/mod.rs | Exposes kms_encryption test utils when datafusion feature is enabled. |
| crates/core/src/test_utils/kms_encryption.rs | Adds a mock KMS EncryptionFactory for tests/examples. |
| crates/core/src/table/config.rs | Adds EncryptionConfig parsing and conversion to DataFusion TableParquetOptions. |
| crates/core/src/operations/write/writer.rs | Refactors writer config to use async WriterPropertiesFactory (supports KMS + AAD). |
| crates/core/src/operations/write/mod.rs | Exposes new operations::write::encryption module. |
| crates/core/src/operations/write/execution.rs | Resolves writer factory from explicit props vs table encryption vs session defaults. |
| crates/core/src/operations/write/encryption.rs | New writer-side encryption module: factory abstraction + global registry. |
| crates/core/src/operations/optimize.rs | Uses encryption-aware read path and writer factory for compact/z-order when configured. |
| crates/core/src/delta_datafusion/table_provider.rs | Propagates table encryption options into scan config and scan-time decryption factory wiring. |
| crates/core/src/delta_datafusion/table_provider/next/scan/mod.rs | Passes table parquet options through to scans and wires decryption factory lookup on read. |
| crates/core/src/delta_datafusion/mod.rs | Re-exports FileSelection for optimize compact encrypted path. |
| crates/core/Cargo.toml | Enables Parquet encryption feature; adds dev-dependency. |
| Cargo.toml | Enables DataFusion parquet_encryption feature at workspace level. |
Comments suppressed due to low confidence (2)
crates/core/src/table/config.rs:597
`EncryptionConfig::to_table_parquet_options()` starts from `TableParquetOptions::default()` and only sets crypto fields. Call sites then use this value as a full replacement for the session’s parquet options, which can unintentionally reset non-crypto parquet settings (pushdown, schema options, etc.) for encrypted tables. Consider storing only the crypto overrides and merging them into the session’s `TableParquetOptions` at scan-build time, or populate `opts.global` from the session config when constructing scan options.
pub fn to_table_parquet_options(&self) -> TableParquetOptions {
    let mut opts = TableParquetOptions::default();
    opts.crypto.factory_id = Some(self.kms_id.clone());
    opts.crypto.factory_options = self.factory_options();
    opts
}
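The merge-instead-of-replace suggestion can be illustrated with a minimal sketch. The structs below are simplified, hypothetical stand-ins for the session and table option types (they are not DataFusion's `TableParquetOptions`), and `merge_crypto` is an illustrative name.

```rust
/// Hypothetical, reduced view of the crypto part of the parquet options.
#[derive(Clone, Debug, Default, PartialEq)]
pub struct CryptoOptions {
    pub factory_id: Option<String>,
}

/// Hypothetical, reduced view of session-level parquet options.
#[derive(Clone, Debug, Default, PartialEq)]
pub struct ParquetOptions {
    pub pushdown_filters: bool,
    pub crypto: CryptoOptions,
}

/// Start from the session's options and overwrite only the crypto section,
/// so non-crypto settings chosen by the user survive.
pub fn merge_crypto(session: &ParquetOptions, table_crypto: &CryptoOptions) -> ParquetOptions {
    let mut merged = session.clone();
    merged.crypto = table_crypto.clone();
    merged
}
```

Starting from the session value rather than from `Default::default()` is the whole point: a setting such as filter pushdown stays as configured while the table's factory id is applied on top.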
crates/core/src/table/config.rs:610
`EncryptionConfig::factory_options()` currently forwards only `kms.configuration` and `footer.key`. The parsed `plaintext_footer` and `column_keys` properties are never forwarded to the factory, so those table properties have no effect. Either encode these into `EncryptionFactoryOptions` (with stable option keys) or remove them from `EncryptionConfig` until supported to avoid misleading API behavior.
/// Build [`EncryptionFactoryOptions`] from the KMS configuration string.
#[cfg(feature = "datafusion")]
pub fn factory_options(&self) -> EncryptionFactoryOptions {
    let mut opts = EncryptionFactoryOptions::default();
    if let Some(cfg) = &self.kms_configuration {
        opts.options
            .insert("kms.configuration".to_string(), cfg.clone());
    }
    opts.options
        .insert("footer.key".to_string(), self.footer_key.clone());
    opts
}
- Remove unused fields (footer_key, column_keys, plaintext_footer) from KmsWriterPropertiesFactory; these are already forwarded via factory_options. Add WriterEncryptionConfig::from_global_registry for legacy writers without a session.
- EncryptionConfig::factory_options() now forwards plaintext_footer and column_keys so table properties actually reach the EncryptionFactory implementation.
- footer_key is now required (from_properties returns None if absent) to prevent misconfigured tables from appearing valid.
- Merge crypto settings into the session's parquet options at scan-build time rather than replacing them entirely; this preserves non-crypto settings (pushdown, schema options).
- write_exec_plan, write_data_plan, and optimize propagate errors when encryption is configured but the factory is not registered; no more silent unencrypted writes.
- The old DeltaScanBuilder scan path now returns an error instead of silently skipping decryption when the factory is missing, consistent with next-provider behaviour.
- JsonWriter and RecordBatchWriter now resolve the KMS factory from the global registry via from_global_registry(), enabling encryption on legacy write paths.
- Tests: unique KMS ID per test (prevents interference), #![cfg(feature = "datafusion")], and a negative test verifying that an unregistered factory produces a clear error.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…peration

- Add CreateBuilder::with_property(key, value) convenience method for setting arbitrary string table properties (used by tests and example).
- Fix read_table to use table.table_provider() instead of scan_table().build().
- Add assert_all_parquets_encrypted() helper that opens every .parquet file in the table directory with the raw parquet reader (no decryption) and asserts that it fails, proving the file has an encrypted footer.
- Call assert_all_parquets_encrypted after every operation test (write, compact, z-order, delete, update) so that silent encryption failure is caught for each code path independently.
- Keep test_parquet_files_are_physically_encrypted as a dedicated standalone test focused purely on the encryption check.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
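A lighter-weight variant of the "is this file physically encrypted" check can be sketched without a Parquet reader. This is not the PR's helper, which opens files with the raw reader; it is a swapped-in technique that relies on the Parquet format's trailing magic bytes: `PAR1` for a plaintext footer and `PARE` for an encrypted footer. The names `FooterKind` and `footer_kind` are hypothetical.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};
use std::path::Path;

/// What the trailing 4-byte magic says about the footer.
#[derive(Debug, PartialEq, Eq)]
pub enum FooterKind {
    Plaintext,
    Encrypted,
    Unknown,
}

/// Classify a file by its last four bytes only; no Parquet parsing is done.
pub fn footer_kind(path: &Path) -> io::Result<FooterKind> {
    let mut file = File::open(path)?;
    if file.metadata()?.len() < 4 {
        return Ok(FooterKind::Unknown);
    }
    file.seek(SeekFrom::End(-4))?;
    let mut magic = [0u8; 4];
    file.read_exact(&mut magic)?;
    Ok(match &magic {
        b"PAR1" => FooterKind::Plaintext,
        b"PARE" => FooterKind::Encrypted,
        _ => FooterKind::Unknown,
    })
}
```

Note that a file written in plaintext-footer mode still ends in `PAR1` even when its columns are encrypted, so this check only covers the encrypted-footer case; verifying column encryption still requires attempting to read column data without keys.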
- Add `unregistered_factory_error` + `resolve_encryption_factory_or_err` helpers so the "No EncryptionFactory registered" message is defined once and reused in all four lookup sites
- Extract shared `build_factory` logic in `WriterEncryptionConfig` to eliminate copy-paste between `from_config` and `from_global_registry`
- Add `FACTORY_OPT_*` constants in `config.rs` to replace raw string literals in `factory_options()`
- Remove `WriterConfig::new_with_properties` dead code (zero call sites)
- Avoid `parquet_options.clone()` in `DeltaScanBuilder` by extracting `factory_id` before consuming the options
- Pass `read_session` into `create_merge_plan` from the caller (which already knows if encryption is active) rather than re-querying the table properties a second time inside the function
- Simplify `test_parquet_files_are_physically_encrypted` to call the existing `assert_all_parquets_encrypted` helper instead of duplicating its `find_parquet` logic and dead `parquet_files` variable
- Fix broken `command_optimize` test callers to use `factory_from_writer_properties` and pass the new `read_session` parameter
- Remove "what" comments that restate the immediately following code

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Move WriterPropertiesFactory trait and DefaultWriterPropertiesFactory to a new writer/writer_factory.rs module that has no datafusion dependency, so json.rs and record_batch.rs can import these types without requiring the datafusion feature (fixes default build E0433 errors)
- Gate WriterEncryptionConfig usage in json.rs and record_batch.rs behind #[cfg(feature = "datafusion")]; non-datafusion builds use the default SNAPPY factory with no encryption
- Fix command_optimize.rs test callers of create_merge_plan that were still using the old WriterProperties type and missing the new read_session arg (fixes E0061 arity errors in check and unit-test CI jobs)
- Fix CreateBuilder::with_property to automatically disable raise_if_key_not_exists, since the method is explicitly designed for custom properties not in the TableProperty enum

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Extend MockKmsFactory to properly handle column-level encryption and plaintext-footer mode by reading the options forwarded from delta.encryption.* table properties:
- Key store changed from HashMap<filename, key> to HashMap<(filename, key_id), key> so each named key-id (footer key, column key) gets its own independent key
- parse_column_keys() helper extracts column→key_id mappings from the serialised "keyId:col1,col2" format
- get_file_encryption_properties() now sets plaintext_footer and calls with_column_key() for each configured column
- get_file_decryption_properties() provides matching per-column keys

Add test_encrypted_columnar_plaintext_footer:
- Encrypts only "int" and "string" columns via delta.encryption.column.keys
- Leaves the parquet footer in plaintext (delta.encryption.plaintext.footer=true)
- Verifies the round-trip read succeeds with the factory registered
- Verifies each parquet file has a readable footer (plaintext) but that reading column data WITHOUT decryption keys fails, confirming the column encryption was applied

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
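For readers unfamiliar with the "keyId:col1,col2" format, here is a rough sketch of what a parser along the lines of parse_column_keys() might look like. It is not the PR's implementation: the function body is illustrative, and the use of `;` to separate multiple key entries is an assumption that the source does not state.

```rust
use std::collections::BTreeMap;

/// Parse "keyId:col1,col2" entries into a column -> key_id map.
/// Assumption: multiple entries are separated by ';'.
pub fn parse_column_keys(spec: &str) -> Result<BTreeMap<String, String>, String> {
    let mut out = BTreeMap::new();
    for entry in spec.split(';').map(str::trim).filter(|e| !e.is_empty()) {
        // Everything before the first ':' is the key id, the rest is the column list.
        let (key_id, cols) = entry
            .split_once(':')
            .ok_or_else(|| format!("missing ':' in entry '{entry}'"))?;
        let key_id = key_id.trim();
        if key_id.is_empty() {
            return Err(format!("empty key id in entry '{entry}'"));
        }
        for col in cols.split(',').map(str::trim).filter(|c| !c.is_empty()) {
            out.insert(col.to_string(), key_id.to_string());
        }
    }
    Ok(out)
}
```

With this shape, a value such as `k1:int,string` maps both `int` and `string` to `k1`, and a malformed entry yields an error rather than being silently ignored.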
1. Deterministic factory_options(): use BTreeMap and sort column names when
serialising delta.encryption.column.keys back to wire format so the encoded
string is stable across runs (important if factories use it as a cache key).
2. Misconfiguration detection: add EncryptionConfig::try_from_properties that
returns an error when kms.id is set but footer.key is missing/empty, rather
than silently returning None and writing plaintext. Used in
WriterEncryptionConfig::from_config so write paths fail fast.
3. Propagate encryption errors in RecordBatchWriter::try_new: change
unwrap_or_default() to ? so a missing factory causes a clear error instead
of silently falling back to unencrypted writes.
4. Table encryption always wins in write_execution_plan_v2: when an encrypted
table has WriterProperties supplied by the caller, the encrypted factory is
still used — preventing accidental plaintext output for encrypted tables.
5. Fix is_encrypted in OptimizeBuilder: derive it from table properties directly
rather than from the writer_factory variant, so read_session is always set for
encrypted tables even when the caller supplies a writer_properties override
(without this fix, encrypted compact reads would fail with a raw parquet reader).
6. Fix schema passed to create_writer_properties in json.rs: was passing the
full table schema (with partition columns) but the Parquet writer uses the
schema without partitions; now passes the actual file schema so column
encryption settings match what is written.
7. Fix find_parquet in tests to use path.extension() == Some("parquet") instead
of to_string_lossy().contains(".parquet"), avoiding false matches on files
like file.parquet.crc or directories named ".parquet".
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
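Item 1 in the list above (deterministic serialisation) can be sketched as follows. The function name `encode_column_keys` and the `;` separator between key entries are assumptions for illustration; only the "keyId:col1,col2" shape comes from the source.

```rust
use std::collections::{BTreeMap, HashMap};

/// Serialise a column -> key_id map as "keyId:col1,col2;..." with a stable order.
/// Assumption: entries for different key ids are joined with ';'.
pub fn encode_column_keys(col_to_key: &HashMap<String, String>) -> String {
    // BTreeMap iterates in sorted key order, so key ids come out sorted.
    let mut by_key: BTreeMap<&str, Vec<&str>> = BTreeMap::new();
    for (col, key_id) in col_to_key {
        by_key.entry(key_id.as_str()).or_default().push(col.as_str());
    }
    by_key
        .into_iter()
        .map(|(key_id, mut cols)| {
            // HashMap iteration order is arbitrary, so sort columns explicitly.
            cols.sort_unstable();
            format!("{key_id}:{}", cols.join(","))
        })
        .collect::<Vec<_>>()
        .join(";")
}
```

Since neither the key order nor the column order depends on `HashMap` iteration, the same configuration always encodes to the same string, which matters if a factory uses that string as a cache key.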
write_execution_plan_v2 now uses the DataFusion session parquet config (ZSTD level 3 by default) instead of a bare SNAPPY default when no writer_properties are provided. This makes the write path consistent with write_exec_plan, and produces slightly smaller parquet files. Update the hardcoded byte offsets in test_delta_scan_uses_parquet_column_pruning to reflect the actual (ZSTD-compressed) file layout. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
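Taken together with the earlier "table encryption always wins" change, the writer-property resolution order appears to be: table encryption first, then caller-supplied properties, then session defaults. A minimal sketch of that precedence, using hypothetical names (`WriterSource`, `choose_writer_source`) rather than the PR's types:

```rust
/// Where the writer properties for a write come from (illustrative only).
#[derive(Debug, PartialEq)]
pub enum WriterSource {
    TableEncryption { kms_id: String },
    CallerProperties,
    SessionDefaults,
}

/// Table encryption wins over caller-supplied properties; session defaults are
/// used only when neither is present.
pub fn choose_writer_source(
    table_kms_id: Option<&str>,
    caller_supplied_properties: bool,
) -> WriterSource {
    match (table_kms_id, caller_supplied_properties) {
        (Some(id), _) => WriterSource::TableEncryption { kms_id: id.to_string() },
        (None, true) => WriterSource::CallerProperties,
        (None, false) => WriterSource::SessionDefaults,
    }
}
```

Putting the encrypted case first in the match makes it impossible for a caller override to produce plaintext output on an encrypted table.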
1. column_keys storage: change from col→key_id (HashMap<String,String>) to key_id→Vec<col> (HashMap<String,Vec<String>>), matching the wire format directly. Eliminates the inversion step in factory_options() on every write.
2. Read path misconfiguration detection: DeltaScanConfigBuilder::build and TableProviderBuilder::build now call try_from_properties instead of from_properties, so a table with kms.id but no footer.key produces a clear error at scan time rather than silently reading without decryption.
3. Remove futures::executor::block_on from async test: replace with proper .await in test_encrypted_columnar_plaintext_footer, avoiding potential deadlock on single-threaded runtimes.
4. Remove paste! macro from test_matrix_create_and_read: the paste! wrapper added no parameterisation, so replace it with a plain #[tokio::test]. Also remove the unused paste dev-dependency from Cargo.toml.
5. Remove unused EncryptionExt import from encryption.rs (from_config now calls try_from_properties directly, not through the trait).
6. Suppress pre-existing dead_code warning on TableProviderBuilder::with_file_selection, which has no non-test callers outside the delta_datafusion module.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
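The fail-fast behaviour of try_from_properties can be pictured with a small sketch. The struct `EncryptionSettings` and the free function below are hypothetical simplifications; only the property names `delta.encryption.kms.id` and `delta.encryption.footer.key` and the three outcomes (absent, valid, misconfigured) come from the source.

```rust
use std::collections::HashMap;

const KMS_ID: &str = "delta.encryption.kms.id";
const FOOTER_KEY: &str = "delta.encryption.footer.key";

/// Hypothetical, reduced encryption settings.
#[derive(Debug, PartialEq)]
pub struct EncryptionSettings {
    pub kms_id: String,
    pub footer_key: String,
}

/// Ok(None): table is not encrypted. Ok(Some): valid configuration.
/// Err: kms.id is set but footer.key is missing or empty.
pub fn try_from_properties(
    props: &HashMap<String, String>,
) -> Result<Option<EncryptionSettings>, String> {
    let Some(kms_id) = props.get(KMS_ID) else {
        return Ok(None);
    };
    match props.get(FOOTER_KEY).map(|v| v.trim()) {
        Some(key) if !key.is_empty() => Ok(Some(EncryptionSettings {
            kms_id: kms_id.clone(),
            footer_key: key.to_string(),
        })),
        _ => Err(format!("{} is set but {} is missing or empty", KMS_ID, FOOTER_KEY)),
    }
}
```

Returning `Result<Option<_>, _>` keeps "not encrypted" and "misconfigured" distinct, which is what lets both the write path and the scan path refuse to proceed instead of silently treating the table as plaintext.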
- Extract snappy_writer_properties() from DefaultWriterPropertiesFactory::snappy() so encryption.rs::build_factory can reuse it instead of reimplementing the SNAPPY + created_by WriterProperties inline
- Remove dead _DefaultFactory re-export alias from encryption.rs (was never referenced by any caller)
- Extract find_parquet() to module level in commands_with_encryption.rs, eliminating the identical inner-function duplicate in test_encrypted_columnar_plaintext_footer
- Move encryption factory lookup outside the per-store loop in get_read_plan (scan/mod.rs): the factory_id is table-scoped and identical for every object-store group, so resolving it once avoids redundant DashMap shard-lock acquisitions for multi-store tables

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
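A module-level find_parquet helper with the extension-based match described earlier might look roughly like this. It is a std-only sketch rather than the test file's actual code; the recursion and sorting are illustrative choices.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Recursively collect files whose extension is exactly "parquet".
pub fn find_parquet(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut found = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            found.extend(find_parquet(&path)?);
        } else if path.extension().and_then(|e| e.to_str()) == Some("parquet") {
            // "file.parquet.crc" has extension "crc", so it is skipped here.
            found.push(path);
        }
    }
    // Sort so callers get a stable order regardless of directory iteration.
    found.sort();
    Ok(found)
}
```

Comparing `path.extension()` instead of searching the path string for ".parquet" avoids matching checksum sidecar files and oddly named directories.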
Summary
Implements Parquet encryption driven by standard `delta.encryption.*` table properties stored in the delta log, following the Modular Encryption in Delta Lake RFC.

Key difference from previous approach: encryption is configured as table properties at creation time, not as a runtime API parameter. Once set, all operations (write, read, delete, update, merge, optimize) encrypt/decrypt automatically.

User-facing API

Design
- `EncryptionConfig` parsed from `TableProperties.unknown_properties` follows the same `from_config(table_config)` pattern as `WriterStatsConfig`
- `WriterPropertiesFactory` async trait enables path-based KMS key derivation (AAD): the file path is passed to `create_writer_properties()` before the parquet writer is created
- `EncryptionFactoryRegistry`: delta-rs operations create their own internal DataFusion sessions; users call `register_encryption_factory(id, factory)` once and the global registry bridges the gap
- `DeltaScanConfig.table_parquet_options` derived from snapshot table properties; reads decrypt via the registered factory

Files changed
- `crates/core/src/table/config.rs`: `EncryptionConfig` struct + `EncryptionExt` trait
- `crates/core/src/operations/write/encryption.rs`: NEW: `WriterPropertiesFactory`, `WriterEncryptionConfig`, global registry
- `crates/core/src/operations/write/writer.rs`: factory-based `WriterConfig` / `LazyArrowWriter`
- `crates/core/src/operations/write/execution.rs`: derives factory from table config
- `crates/core/src/delta_datafusion/table_provider.rs`: `DeltaScanConfig.table_parquet_options`
- `crates/core/src/delta_datafusion/table_provider/next/scan/mod.rs`: factory lookup in read path
- `crates/core/src/operations/optimize.rs`: encrypted compact/z-order reads
- `crates/core/src/writer/json.rs` + `record_batch.rs`: factory + path-based for AAD
- `crates/core/src/test_utils/kms_encryption.rs`: NEW: `MockKmsFactory` for testing
- `crates/core/tests/commands_with_encryption.rs`: NEW: integration tests
- `crates/deltalake/examples/basic_operations_encryption.rs`: NEW: end-to-end example

Test plan
- `cargo build -p deltalake-core --features datafusion`: clean
- `cargo test -p deltalake-core --features datafusion`: all pass
- `cargo run --example basic_operations_encryption --features "datafusion integration-test" -p deltalake`: end-to-end: create, write, Z-order, compact, read with KMS encryption

🤖 Generated with Claude Code