Don't install data table entries for non-DM tool installs#22679
Conversation
|
I would love it if there was just one way this works, is that not possible ? I dread the debugging sessions coming up with this additional setting ... |
|
IMO the option should not exist at all but I added it to maintain existing functionality. There rationale is in the linked issue. I am happy to simply remove the option and the functionality for non-DM tools entirely. |
|
Sounds good to me |
Adds DataTableColumnMismatch and DataTableFileConflict, raised by ToolDataTableManager.assert_data_table_consistency. Wired into load_from_config_file so the invariant also fires at server startup. Refs galaxyproject#21448.
Gate install_tool_data_tables and handle_missing_data_table_entry on the repository being a Data Manager. Non-DM repos no longer mutate shed_tool_data_table_conf.xml with .loc.sample contents. Refs galaxyproject#21448.
Copy DM .loc files to tool_data_path/<basename> instead of a per-revision subdir, and rewrite <file path> to match. Run assert_data_table_consistency before persisting so column or file conflicts fail the install. Refs galaxyproject#21448, fixes galaxyproject#20151.
62c036a to
0618a7f
Compare
|
Do you plan to "fix" entries in non data manager data tables on usegalaxy.org, e.g. move them? |
|
Pushed a second pass but I have to run for the day so just wanted to get tests to run. But I have not reviewed the code all that carefully and obviously there's immediately plenty to fix. Don't feel like you need to review it. @bernt-matthias I don't actually have any DM data on usegalaxy.org, it's all in CVMFS. So I will most likely remove the entire contents of my |
Galaxy ships picard_indexes and srma_indexes pointing at the same loc file, so the cross-table file-path check raised on every startup. Keep only the same-name column-mismatch protection and update the production-shed test to expect no per-revision tool_data_table_conf.xml for non-DM repos.
install_tool_data_tables used to create tool_data_path as a side effect of os.makedirs(target_dir); skipping it for non-DM repos means copy_sample_files now blows up with FileNotFoundError. Explicitly mkdir(exist_ok=True) before copying. The non-DM integration test was asserting against the in-memory tool_data_tables dict, which is populated as tools are loaded regardless of registration; switch to verify_no_installed_repository_data_table_entries, which inspects shed_tool_data_table_conf.xml.
4a1679a to
07c5d44
Compare
|
The integration test errors are relevant. Tools can ship data tables with default entries, will that continue to work ? Is deduplication and rewriting the file on install maybe a better option ? |
No, my plan was to fix any tools that did this to hardcode into the tools as discussed in the linked issue, but if this is functionality we want to keep I guess I can make the installation of non-DM tool loc files conditional on whether they contain entries or not. Although I don't love it when the goal was to simplify and make things more deterministic. |
|
I'm all for deterministic behavior, so branching on something being a dm or not irks me. dm's shouldn't be the way we distribute data tables. Isn't the right determinism to merge and de-dup the data table entries ? |
|
I guess. DMs are very special in lots of ways so it doesn't bug me that we have another way in which they're special, although I agree that all the casing for them scattered around the code is never good. I still think we just should not support non-DM tools doing this at all, but if you want to keep it I'll defer. First one is just a copy (so you get the comments in the sample of whichever repo you install first?) and then everything else is merged and deduped? |
Replace the non-Data-Manager skip with a uniform install path: every repo's tables get registered. First install copies the .loc.sample to tool_data_path and writes the <table> entry without a <tool_shed_repository> sub-element; subsequent installs of the same table append unique rows from their .loc.sample to the shared loc file, prefixed by a comment_char attribution line. get_filename_for_source falls back to a no-repo_info filename so DMs on legacy installs keep matching their existing <tool_shed_repository>-tagged files. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The Toolshed tests caught test_0025_verify_sample_files still asserting the old "non-DM repos skip data-table registration" behavior. The design pivot reverted that — non-DM repos now register tables — so the test goes back to its original positive assertion and the unused verify_no_installed_repository_data_table_entries helper is removed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
When install_tool_data_tables hits a table that's already registered
(e.g. __dbkeys__ from Galaxy's shipped refgenie config, or bwa_indexes
from the shipped tabular sample), we skip adding another <table> to
shed_tool_data_table_conf.xml — but that also skipped registering the
shared loc file path in the existing table's `filenames` dict. Data
Managers then couldn't locate where to persist entries
("Unable to determine filename for persisting data table '__dbkeys__'")
because get_filename_for_source had nothing to fall back to.
Register the shared loc path on the existing table with
tool_shed_repository=None so the new fallback in get_filename_for_source
finds it. Also fix the integration test for devteam/bwa: bwa_indexes /
bwa_mem_indexes already exist in tool_data_table_conf.xml.sample, so
their absence from shed_tool_data_table_conf.xml is correct dedup;
assert the shared loc file lands at tool_data_path instead.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Not sure this latest change is the way... I wonder if we should at least have a separate dir (e.g. |
Galaxy-managed loc files for shed-installed tools now land under `tool_data_path/shed/` instead of the `tool_data_path` root. This keeps them clearly separate from admin-configured loc files and from anything shipped via `tool_data_table_conf.xml.sample`, avoiding overlap with predefined tables. Also fixes the DM persist failure (`Unable to determine filename for persisting data table '__dbkeys__'`) by writing the (stamp-less) `<table>` entry to `shed_tool_data_table_conf.xml` so the shared loc file's association survives reload via `merge_tool_data_table`. Writes are deduped against the existing shed config to avoid bloat. Refgenie- backed and other non-tabular table types skip the `.loc.sample` row merge (their `parse_file_fields` is not designed to read a `.loc`).
The bwa@051eba708f43 repo only registers the bwa_mem_indexes table (bwa_mem_index.loc); the bwa_indexes table comes from a different repo. Use the actual filename the repo installs so the assertion can find the expected shed-managed loc file.
`copy_sample_files` during tool registration was still landing shed-installed `.loc` files at the `tool_data_path` root alongside admin-configured ones — `install_tool_data_tables` had already been isolated under `tool_data_path/shed/`, but this second copy path was not. Direct both `copy_sample_files` invocations to the shed loc dir so all Galaxy-managed loc files stay isolated.
verify_installed_repository_data_table_entries was looking for the installed loc file at tool_data_path root, but shed-installed loc files now live under tool_data_path/shed/. Also collapse a black-fmt-flagged multi-line copy_sample_files call onto one line.
| repository_tools_tups = stdtm.handle_missing_data_table_entry( | ||
| relative_install_dir, tool_path, repository_tools_tups | ||
| ) | ||
| if is_data_manager: |
There was a problem hiding this comment.
Is that an oversight ? If we dedupe we should keep this i think
There was a problem hiding this comment.
I asked claude about this and it got very very confused. I tend to think that's not claude's fault 🤣
| source_loc_sample = loc_basename_to_source.get(basename) | ||
| if not source_loc_sample or not os.path.exists(source_loc_sample): | ||
| continue | ||
| # Keep `${__HERE__}` literal so the appended rows match the shared loc's existing format. |
There was a problem hiding this comment.
is HERE really a thing for sample data ? Wasn't that supposed to only be a thing for test tables ? Are we resolving and copying the files ? I think the answer should be no, but if we do want to support HERE we'd have to also move the files
| existing_values.add(fields[value_index]) | ||
| if not new_rows: | ||
| return self._loaded_content_version | ||
| with FileLock(filename): |
There was a problem hiding this comment.
We provide no guarantee whatsoever that this doesn't deadlock. It might be safer to write to a file next to the target, then do an atomic move ? I think we have a utility that does that.
There was a problem hiding this comment.
I guess we do use that pattern already, can you make a re-usable function and use that also in https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/tool_util/data/__init__.py#L717-L726 ?
|
Did another pass with a code-review tool and a few additional things stood out beyond what's already been discussed:
Test gaps worth filling:
|
|
@natefoo here's what addressing the review looks like: natefoo#12 if you're happy with that let's |
|
I'll open a followup ... |
|
This PR was merged without a "kind/" label, please correct. |
This does not make it impossible, it just disables it by default.
Fixes #21448, which also contains rationale. I'll follow up and fix the specific tool that @bernt-matthias mentioned in the issue if it hasn't already been done.
How to test the changes?
(Select all options that apply)
License