feat(elt-common): Configure iceberg warehouse based on pipeline directory name#376
Conversation
📝 WalkthroughWalkthrough
ChangesWarehouse-aware catalog connection and manifest routing
Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
elt-common/src/elt_common/iceberg/catalog.py (1)
27-41: 🗄️ Data Integrity & Integration | 🔵 Trivial | ⚡ Quick winCopy the catalog config before overriding
warehouse.get_catalog_config()returns a live mapping, so mutatingconfhere can leak the warehouse into later calls and silently retarget writes.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@elt-common/src/elt_common/iceberg/catalog.py` around lines 27 - 41, The catalog configuration returned by IcebergCatalogConfig.get_catalog_config() is being mutated in place when setting "warehouse", which can leak the override into later calls. In the catalog setup flow that loads config and assigns conf["warehouse"], first make a copy of the mapping before modifying it, then apply the warehouse override and continue using the copied config so the original catalog state remains unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@elt-common/src/elt_common/pipeline.py`:
- Line 19: The pipeline initialization in the class using `_warehouse` should
not rely on `root.parts[-1]`, because `Path(".")` can produce an empty parts
tuple and break setup. Update the assignment to use `root.name` with a safe
resolved fallback for cases like the current directory, so the logic in the
pipeline/root handling remains robust regardless of how ROOT is passed in.
In `@elt-pipelines/README.md`:
- Around line 52-54: The README text has a possessive typo in the description of
the transform pipelines. Update the wording in the relevant documentation
sentence to use “its final state” instead of “it’s final state” so the
`transform` subdirectory description reads correctly.
---
Nitpick comments:
In `@elt-common/src/elt_common/iceberg/catalog.py`:
- Around line 27-41: The catalog configuration returned by
IcebergCatalogConfig.get_catalog_config() is being mutated in place when setting
"warehouse", which can leak the override into later calls. In the catalog setup
flow that loads config and assigns conf["warehouse"], first make a copy of the
mapping before modifying it, then apply the warehouse override and continue
using the copied config so the original catalog state remains unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 1b4f41a2-6011-45ca-ab73-7b769781d060
📒 Files selected for processing (9)
elt-common/src/elt_common/iceberg/catalog.pyelt-common/src/elt_common/pipeline.pyelt-common/src/elt_common/runner.pyelt-common/src/elt_common/typing.pyelt-common/tests/unit_tests/iceberg/test_catalog.pyelt-common/tests/unit_tests/test_extract.pyelt-common/tests/unit_tests/test_pipeline.pyelt-common/tests/unit_tests/test_runner.pyelt-pipelines/README.md
| raise ValueError(f"Invalid project. Ingest directory '{ingest_dir}' does not exist.") | ||
|
|
||
| self._root = root | ||
| self._warehouse = root.parts[-1] |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
python - <<'PY'
from pathlib import Path
for raw in [".", "", "./", "warehouse", "/"]:
p = Path(raw)
try:
last = p.parts[-1]
except Exception as exc:
last = f"{type(exc).__name__}: {exc}"
print(
f"{raw!r}: parts={p.parts!r}, name={p.name!r}, "
f"resolved_name={p.resolve().name!r}, parts[-1]={last!r}"
)
PYRepository: ISISNeutronMuon/analytics-data-platform
Length of output: 621
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Show the relevant file with line numbers and nearby context.
sed -n '1,220p' elt-common/src/elt_common/pipeline.py | cat -n
printf '\n---\n'
# Find where PipelinesProject is constructed and how root is passed.
rg -n "PipelinesProject\(" -S .
printf '\n---\n'
# Find any references to the warehouse derivation / _warehouse usage.
rg -n "_warehouse|destination_warehouse|warehouse_name" elt-common/src -SRepository: ISISNeutronMuon/analytics-data-platform
Length of output: 7101
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect the call sites to see what kind of Path is passed as `root`.
sed -n '1,140p' elt-common/src/elt_common/cli.py | cat -n
printf '\n---\n'
sed -n '1,140p' elt-common/tests/unit_tests/test_pipeline.py | cat -nRepository: ISISNeutronMuon/analytics-data-platform
Length of output: 7138
Use root.name with a resolved fallback here. Path(".") is accepted by the CLI, and root.parts[-1] raises IndexError for that input, so project initialisation breaks when ROOT is the current directory.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@elt-common/src/elt_common/pipeline.py` at line 19, The pipeline
initialization in the class using `_warehouse` should not rely on
`root.parts[-1]`, because `Path(".")` can produce an empty parts tuple and break
setup. Update the assignment to use `root.name` with a safe resolved fallback
for cases like the current directory, so the logic in the pipeline/root handling
remains robust regardless of how ROOT is passed in.
| - Data from ingest pipelines is considered 'raw' data, and is loaded into a warehouse suffixed with `_landing`. | ||
| - Under construction: Each warehouse will also have a `transform` subdirectory containing pipelines for converting the raw data into it's final state in the target warehouse. | ||
| - Under construction: Each warehouse will also have a `transform` subdirectory containing pipelines for converting the | ||
| raw data into it's final state in the target warehouse. |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Fix the possessive in the README.
Line 54 should use its final state, not it's final state.
Proposed fix
-- Under construction: Each warehouse will also have a `transform` subdirectory containing pipelines for converting the
- raw data into it's final state in the target warehouse.
+- Under construction: Each warehouse will also have a `transform` subdirectory containing pipelines for converting the
+ raw data into its final state in the target warehouse.📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| - Data from ingest pipelines is considered 'raw' data, and is loaded into a warehouse suffixed with `_landing`. | |
| - Under construction: Each warehouse will also have a `transform` subdirectory containing pipelines for converting the raw data into it's final state in the target warehouse. | |
| - Under construction: Each warehouse will also have a `transform` subdirectory containing pipelines for converting the | |
| raw data into it's final state in the target warehouse. | |
| - Data from ingest pipelines is considered 'raw' data, and is loaded into a warehouse suffixed with `_landing`. | |
| - Under construction: Each warehouse will also have a `transform` subdirectory containing pipelines for converting the | |
| raw data into its final state in the target warehouse. |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@elt-pipelines/README.md` around lines 52 - 54, The README text has a
possessive typo in the description of the transform pipelines. Update the
wording in the relevant documentation sentence to use “its final state” instead
of “it’s final state” so the `transform` subdirectory description reads
correctly.
Don't think this is likely to be a thing we want to do, but it's probably a little less fragile
martyngigg
left a comment
There was a problem hiding this comment.
Looks good to me. I tried with and without a warehouse name in my local yaml config and the log message works as expected.
This is a minor thing for now but might get more irritating, I get a lot of
2026-06-30 11:35:39 INFO pyiceberg.io — Loaded FileIO: pyiceberg.io.fsspec.FsspecFileIO
2026-06-30 11:35:39 INFO pyiceberg.io — Loaded FileIO: pyiceberg.io.fsspec.FsspecFileIO
2026-06-30 11:35:40 INFO pyiceberg.io.fsspec — Loading signer S3V4RestSigner
2026-06-30 11:35:41 INFO pyiceberg.io.fsspec — Loading signer S3V4RestSigner
2026-06-30 11:35:41 INFO pyiceberg.io.fsspec — Loading signer S3V4RestSigner
in the logs that feel more low level than INFO messages. Maybe we can look at lowering their priority by default at some point..
Might even be worth seeing if we can get this changed upstream, apache/iceberg-python@ceeb084 added a few logs that feel like they should be debug level. It looks like those actually make up the majority of logging calls from pyiceberg. |
Sounds good to me. I've had to put the odd PR here and there into Lakekeeper and Superset. The maintainers of pyiceberg seem welcoming of contributions. |
ref #321
Instead of requiring the
warehousevalue to be configured for pyiceberg, set it based on the pipeline directory name. This is as per the most recent couple of comments in the issue.I suspect
is_ingest_jobmay not be the best way to handle what it does, but probably best to revisit when we add thetransformfunctionality and have to actually start thinking about non-ingest jobs.Summary by CodeRabbit