feat: credential vending for REST catalog#6
Conversation
7ff854f to
317cb43
Compare
f919c85 to
a218f7e
Compare
| - CATALOG_WAREHOUSE=file:///tmp/iceberg_warehouse | ||
| - CATALOG_INCLUDE__CREDENTIALS=true | ||
| - CATALOG_S3_ACCESS__KEY__ID=dummy-access-key | ||
| - CATALOG_S3_SECRET__ACCESS__KEY=dummy-secret-key |
There was a problem hiding this comment.
Design note: credential vending in the test fixture
We enable include-credentials with dummy S3 keys so the fixture's RESTServerCatalogAdapter injects them into LoadTableResponse.config. This lets us verify the full vending pipeline (header → server response → ResolveTableFileIO → FileIO::properties()) without needing MinIO.
Follow-up: For a true end-to-end test (reading data from MinIO using only vended credentials), we'd need to add MinIO here + build with ICEBERG_S3=ON in CI so the Arrow S3 FileIO can actually connect. Left as a follow-up since it requires CI config changes.
|
|
||
| // Merge order: catalog props < table config < storage credentials (highest priority) | ||
| auto merged = config_.configs(); | ||
| for (const auto& [k, v] : result.config) { |
There was a problem hiding this comment.
Merge ordering: catalog props < table config < storage credentials
This matches all three reference implementations:
- Java (
RESTSessionCatalog.java): merges catalog properties → table config → storage credentials - Python (
pyiceberg/catalog/rest/__init__.py):{**table_response.metadata.properties, **table_response.config, **credential_config} - Rust (
catalog/rest/src/catalog.rs): mergesresponse.configintoself.user_config.props(though Rust currently discardsstorage_credentials)
| best = &cred; | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Longest prefix match on metadata_location
This follows Python's approach (_resolve_storage_credentials() in pyiceberg/catalog/rest/__init__.py), which matches against metadata_location.
Java takes a different approach: it passes ALL credentials to the FileIO via SupportsStorageCredentials.setCredentials(), and the FileIO itself (e.g., S3FileIO) does per-file longest-prefix matching at read time. This is more correct for tables whose data files span multiple prefixes, but more complex.
We chose the Python pattern for simplicity. The spec's StorageCredential schema defines prefix + config but doesn't mandate the matching algorithm.
| /// Engines that need to configure their own storage access (e.g., for credential | ||
| /// vending) can read these properties to obtain the resolved credentials. | ||
| const std::unordered_map<std::string, std::string>& properties() const { | ||
| return properties_; |
There was a problem hiding this comment.
Why FileIO::properties() instead of Table::io_properties()?
We initially had io_properties() on the Table class, but after reviewing all three reference implementations:
- Java: engines use
table.io()directly; credentials live on FileIO viaSupportsStorageCredentials - Python: engines read
table.io.properties(a dict on FileIO) — e.g., the DataFusion integration explicitly passesself.io.properties - Rust: credentials flow through
table.file_io().config().props()
No implementation exposes credentials on the Table object itself. FileIO::properties() matches the Python pattern and is the standard surface for engines that need to configure their own storage access.
| } | ||
| return factory(properties); | ||
| ICEBERG_ASSIGN_OR_RAISE(auto io, factory(properties)); | ||
| io->properties_ = properties; |
There was a problem hiding this comment.
FileIORegistry::Load() automatically populates properties_ on every FileIO it creates. This means any FileIO from the registry (including per-table FileIOs from ResolveTableFileIO) will have the merged properties available via properties() without each factory needing to set them manually.
| /// \brief Returns the configuration properties used to initialize this FileIO. | ||
| /// | ||
| /// Engines that need to configure their own storage access (e.g., for credential | ||
| /// vending) can read these properties to obtain the resolved credentials. |
There was a problem hiding this comment.
Real-world engine validation: Trino
Trino (a major Java query engine) validates this exact pattern. Its IcebergRestCatalogFileSystemFactory extracts vended credentials by calling baseTable.io().properties() directly — it does NOT use Java's SupportsStorageCredentials interface.
The flow: BaseTable.io().properties() → IcebergTableCredentials → IcebergRestCatalogFileSystemFactory.create(identity, fileIoProperties) → extracts S3/GCS/Azure keys → injects them into Trino's ConnectorIdentity.extraCredentials → Trino's filesystem factories use the vended creds.
See: trino/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/catalog/rest/IcebergRestCatalogFileSystemFactory.java
048ad43 to
bcbdc9c
Compare
| auto merged = | ||
| MergeTableProperties(catalog_props, result.config, cred ? cred->config : kEmpty); | ||
|
|
||
| // Detect FileIO type: explicit io-impl > warehouse scheme > metadata_location scheme. |
There was a problem hiding this comment.
matches pyiceberg?
| // No io-impl or warehouse configured. Fall back to a local FileIO as a | ||
| // default — the per-table ResolveTableFileIO will resolve the real FileIO | ||
| // when a table is loaded (using the table's location to detect S3, etc.). |
There was a problem hiding this comment.
rev:
| // No io-impl or warehouse configured. Fall back to a local FileIO as a | |
| // default — the per-table ResolveTableFileIO will resolve the real FileIO | |
| // when a table is loaded (using the table's location to detect S3, etc.). | |
| // No io-impl or warehouse configured. Fall back to a local FileIO as a | |
| // default — enabling per-table ResolveTableFileIO (vending). |
| // TODO: Python also merges result.metadata->properties into the chain here | ||
| // (metadata props < catalog < table config < credentials). Java does not. | ||
| // Consider adding this for parity with Python if needed. |
There was a problem hiding this comment.
rev: remove python comment and just mention our chain
then, PR comment the python thing as a TODO to flag please
There was a problem hiding this comment.
Done — removed the Python comment from code.
TODO for follow-up: Python also merges result.metadata->properties into the chain (metadata props < catalog < table config < credentials). Java and Rust do not. We match Java/Rust. Worth revisiting if we find a case where table-metadata-level io-impl is needed.
| #include "iceberg/result.h" | ||
|
|
||
| /// \file iceberg/catalog/rest/rest_file_io_internal.h | ||
| /// Internal helpers for per-table FileIO resolution in the REST catalog. |
There was a problem hiding this comment.
rev: this file isn't written for just this it happens to be that. internal helpers for REST file IO construction or smth pls
| /// | ||
| /// Merges catalog properties, table config, and the best-matching storage | ||
| /// credential (longest prefix match on metadata_location), then creates a | ||
| /// per-table FileIO via FileIORegistry. Falls back to \p catalog_io when |
There was a problem hiding this comment.
rev: spaces between Fa and \p intentional? fine if so
| ICEBERG_REST_EXPORT Result<std::shared_ptr<FileIO>> ResolveTableFileIO( | ||
| const std::shared_ptr<FileIO>& catalog_io, | ||
| const std::unordered_map<std::string, std::string>& catalog_props, | ||
| const std::string& warehouse, const LoadTableResult& result); |
There was a problem hiding this comment.
rev: should warehouse be stringview?
| } | ||
|
|
||
| private: | ||
| friend class FileIORegistry; |
There was a problem hiding this comment.
rev: PR comment on this please. friend is controversial
There was a problem hiding this comment.
friend class FileIORegistry is intentional: properties_ should only be populated by the registry (which creates FileIO instances with merged config). Making it protected would let any subclass set arbitrary properties, breaking the invariant that properties reflect the config used to create the FileIO. The friend constrains this to the single correct code path.
If a second factory path emerges later (e.g., custom FileIO outside the registry), we can add a protected setter at that point.
| namespace iceberg { | ||
|
|
||
| // --------------------------------------------------------------------------- | ||
| // DetectBuiltinFileIO tests |
There was a problem hiding this comment.
rev:
wait... is DetectBuiltinFileIO basically just a helper function too that's public I guess?
I don't see it in src/iceberg/catalog/rest/rest_file_io_internal.h so I guess it's on the main class? has it always been public?
then, let's maybe ditch the whole internal model and new file and just put resolve in that class too. I was concerned about visibility, but if there's precedence then fine. if we do that, then after comment on this PR please saying that we did it for this reason (otherwise you ask "why is this not internal"?)
There was a problem hiding this comment.
(ofc i'm talking about ResolveTableFileIO function
There was a problem hiding this comment.
Yes — DetectBuiltinFileIO, BuiltinFileIOName, and MakeCatalogFileIO are all already public in rest_file_io.h. Moved ResolveTableFileIO there too for consistency. Deleted rest_file_io_internal.h.
| .tv_sec = 1, | ||
| .tv_usec = 0, | ||
| struct timeval timeout { | ||
| .tv_sec = 1, .tv_usec = 0, |
There was a problem hiding this comment.
rev: this change is unrelated to the PR, did linter really flag it? wouldn't have thought so
There was a problem hiding this comment.
if linter let's us, let's try and not change things unrelated to the PR please
| // | ||
| // The REST fixture is configured with CATALOG_INCLUDE__CREDENTIALS=true and dummy | ||
| // S3 credentials (CATALOG_S3_ACCESS__KEY__ID / CATALOG_S3_SECRET__ACCESS__KEY). | ||
| // The fixture's RESTServerCatalogAdapter injects these into LoadTableResponse.config. | ||
| // These tests verify that the vended credentials flow through ResolveTableFileIO | ||
| // and are accessible on the per-table FileIO via FileIO::properties(). |
There was a problem hiding this comment.
rev: this comment doesn't add much, remove
Implement per-table credential vending in RestCatalog, following the pattern from iceberg-java's RESTSessionCatalog and PyIceberg's REST catalog. - Set X-Iceberg-Access-Delegation header at session level on HttpClient - ResolveTableFileIO merges catalog properties with per-table config and storage credentials (longest-prefix match), creating per-table FileIO via FileIORegistry when credentials are vended - StorageCredential struct with JSON serde and prefix-based resolution - Applied consistently to LoadTable, CreateTable, RegisterTable, StageCreateTable - Table exposes io_properties() for consumers that need vended configuration - Table::Refresh() updates io_properties_ to prevent stale credentials - Comprehensive unit tests (property merging, storage credential resolution, io_properties lifecycle) and REST integration tests Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
bcbdc9c to
620041b
Compare
Issue
Closes #10
iceberg-cpp's REST catalog has no support for per-table credential vending. When a REST catalog server returns vended credentials (temporary S3 access keys) in the
LoadTableResponse, iceberg-cpp ignores them and uses the catalog-level FileIO for all tables.Consumers using iceberg-cpp to read tables from a credential-vending REST catalog get authentication failures when trying to read data files:
The catalog discovery and table metadata loading succeed (those use the catalog-level credentials), but reading the actual data files fails because the per-table S3 credentials from the REST catalog response are never extracted or applied.
PR
Implement per-table credential vending in RestCatalog, following the pattern from iceberg-java's RESTSessionCatalog and PyIceberg's REST catalog.
Credential vending pipeline
X-Iceberg-Access-Delegation: vended-credentialsheader on all table-returning operations (LoadTable, CreateTable, RegisterTable, StageCreateTable)ResolveTableFileIOmerges catalog properties < table config < storage credentials (longest prefix match), then creates a per-table FileIO viaFileIORegistryFileIO::properties(), matching Python'stable.io.propertiespatternAPI design:
FileIO::properties()We initially exposed
io_properties()on theTableclass, but after reviewing how engines actually consume vended credentials across the ecosystem:How each implementation surfaces credentials to engines:
FileIO.properties()— Trino'sIcebergRestCatalogFileSystemFactorycallsbaseTable.io().properties()directly to extract S3/GCS/Azure credentials and inject them into its own filesystem layer. It does NOT use theSupportsStorageCredentialsinterface.table.io.properties— PyIceberg's DataFusion integration passesself.io.propertiesto configure its native filesystemtable.file_io().config().props()— credentials flow through FileIO's configNo implementation exposes credentials on the Table object.
FileIO::properties()is the standard surface. So we:properties()accessor to theFileIObase classFileIORegistry::Load()automatically populatesFileIO::properties_with the merged configio_propertiesfromTable/StagedTableentirelyResolvedTableIOstruct (no longer needed since properties live on FileIO)Integration tests
The REST fixture is configured with
CATALOG_INCLUDE__CREDENTIALS=trueand dummy S3 credentials. The fixture'sRESTServerCatalogAdapterinjects these intoLoadTableResponse.configwhen credential vending is enabled. Integration tests verify that vended credentials flow throughResolveTableFileIOand are accessible viatable->io()->properties()on all table-returning operations.A full end-to-end test (reading from MinIO with vended credentials via Arrow S3 FileIO) would require
ICEBERG_S3=ONin CI — left as a follow-up.Known limitations
VendedCredentialsProviderwith expiry tracking)metadata_location(like Python); Java defers matching to the FileIO per file access