Don't install data table entries for non-DM tool installs by natefoo · Pull Request #22679 · galaxyproject/galaxy

natefoo · 2026-05-12T18:12:52Z

This does not make it impossible, it just disables it by default.

Fixes #21448, which also contains rationale. I'll follow up and fix the specific tool that @bernt-matthias mentioned in the issue if it hasn't already been done.

How to test the changes?

(Select all options that apply)

I've included appropriate automated tests.
This is a refactoring of components with existing test coverage.
Instructions for manual testing are as follows:
1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

mvdbeek · 2026-05-12T18:45:55Z

I would love it if there was just one way this works, is that not possible ? I dread the debugging sessions coming up with this additional setting ...

natefoo · 2026-05-12T19:19:47Z

IMO the option should not exist at all but I added it to maintain existing functionality. There rationale is in the linked issue. I am happy to simply remove the option and the functionality for non-DM tools entirely.

mvdbeek · 2026-05-12T19:22:45Z

Sounds good to me

Adds DataTableColumnMismatch and DataTableFileConflict, raised by ToolDataTableManager.assert_data_table_consistency. Wired into load_from_config_file so the invariant also fires at server startup. Refs galaxyproject#21448.

Gate install_tool_data_tables and handle_missing_data_table_entry on the repository being a Data Manager. Non-DM repos no longer mutate shed_tool_data_table_conf.xml with .loc.sample contents. Refs galaxyproject#21448.

Copy DM .loc files to tool_data_path/<basename> instead of a per-revision subdir, and rewrite <file path> to match. Run assert_data_table_consistency before persisting so column or file conflicts fail the install. Refs galaxyproject#21448, fixes galaxyproject#20151.

bernt-matthias · 2026-05-12T21:15:32Z

Do you plan to "fix" entries in non data manager data tables on usegalaxy.org, e.g. move them?

natefoo · 2026-05-12T21:20:06Z

Pushed a second pass but I have to run for the day so just wanted to get tests to run. But I have not reviewed the code all that carefully and obviously there's immediately plenty to fix. Don't feel like you need to review it.

@bernt-matthias I don't actually have any DM data on usegalaxy.org, it's all in CVMFS. So I will most likely remove the entire contents of my shed_tool_data_table_conf.xml.

Galaxy ships picard_indexes and srma_indexes pointing at the same loc file, so the cross-table file-path check raised on every startup. Keep only the same-name column-mismatch protection and update the production-shed test to expect no per-revision tool_data_table_conf.xml for non-DM repos.

install_tool_data_tables used to create tool_data_path as a side effect of os.makedirs(target_dir); skipping it for non-DM repos means copy_sample_files now blows up with FileNotFoundError. Explicitly mkdir(exist_ok=True) before copying. The non-DM integration test was asserting against the in-memory tool_data_tables dict, which is populated as tools are loaded regardless of registration; switch to verify_no_installed_repository_data_table_entries, which inspects shed_tool_data_table_conf.xml.

mvdbeek · 2026-05-14T08:42:56Z

The integration test errors are relevant. Tools can ship data tables with default entries, will that continue to work ? Is deduplication and rewriting the file on install maybe a better option ?

natefoo · 2026-05-14T14:26:29Z

Tools can ship data tables with default entries, will that continue to work ?

No, my plan was to fix any tools that did this to hardcode into the tools as discussed in the linked issue, but if this is functionality we want to keep I guess I can make the installation of non-DM tool loc files conditional on whether they contain entries or not. Although I don't love it when the goal was to simplify and make things more deterministic.

mvdbeek · 2026-05-14T14:49:45Z

I'm all for deterministic behavior, so branching on something being a dm or not irks me. dm's shouldn't be the way we distribute data tables. Isn't the right determinism to merge and de-dup the data table entries ?

natefoo · 2026-05-14T15:06:42Z

I guess. DMs are very special in lots of ways so it doesn't bug me that we have another way in which they're special, although I agree that all the casing for them scattered around the code is never good. I still think we just should not support non-DM tools doing this at all, but if you want to keep it I'll defer.

First one is just a copy (so you get the comments in the sample of whichever repo you install first?) and then everything else is merged and deduped?

Replace the non-Data-Manager skip with a uniform install path: every repo's tables get registered. First install copies the .loc.sample to tool_data_path and writes the <table> entry without a <tool_shed_repository> sub-element; subsequent installs of the same table append unique rows from their .loc.sample to the shared loc file, prefixed by a comment_char attribution line. get_filename_for_source falls back to a no-repo_info filename so DMs on legacy installs keep matching their existing <tool_shed_repository>-tagged files. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The Toolshed tests caught test_0025_verify_sample_files still asserting the old "non-DM repos skip data-table registration" behavior. The design pivot reverted that — non-DM repos now register tables — so the test goes back to its original positive assertion and the unused verify_no_installed_repository_data_table_entries helper is removed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

When install_tool_data_tables hits a table that's already registered (e.g. __dbkeys__ from Galaxy's shipped refgenie config, or bwa_indexes from the shipped tabular sample), we skip adding another <table> to shed_tool_data_table_conf.xml — but that also skipped registering the shared loc file path in the existing table's `filenames` dict. Data Managers then couldn't locate where to persist entries ("Unable to determine filename for persisting data table '__dbkeys__'") because get_filename_for_source had nothing to fall back to. Register the shared loc path on the existing table with tool_shed_repository=None so the new fallback in get_filename_for_source finds it. Also fix the integration test for devteam/bwa: bwa_indexes / bwa_mem_indexes already exist in tool_data_table_conf.xml.sample, so their absence from shed_tool_data_table_conf.xml is correct dedup; assert the shared loc file lands at tool_data_path instead. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

natefoo · 2026-05-14T19:51:38Z

Not sure this latest change is the way... I wonder if we should at least have a separate dir (e.g. <tool_data_path>/shed) for shed-installed tables/locs that are distinct form admin-configured tables/locs.

Galaxy-managed loc files for shed-installed tools now land under `tool_data_path/shed/` instead of the `tool_data_path` root. This keeps them clearly separate from admin-configured loc files and from anything shipped via `tool_data_table_conf.xml.sample`, avoiding overlap with predefined tables. Also fixes the DM persist failure (`Unable to determine filename for persisting data table '__dbkeys__'`) by writing the (stamp-less) `<table>` entry to `shed_tool_data_table_conf.xml` so the shared loc file's association survives reload via `merge_tool_data_table`. Writes are deduped against the existing shed config to avoid bloat. Refgenie- backed and other non-tabular table types skip the `.loc.sample` row merge (their `parse_file_fields` is not designed to read a `.loc`).

The bwa@051eba708f43 repo only registers the bwa_mem_indexes table (bwa_mem_index.loc); the bwa_indexes table comes from a different repo. Use the actual filename the repo installs so the assertion can find the expected shed-managed loc file.

`copy_sample_files` during tool registration was still landing shed-installed `.loc` files at the `tool_data_path` root alongside admin-configured ones — `install_tool_data_tables` had already been isolated under `tool_data_path/shed/`, but this second copy path was not. Direct both `copy_sample_files` invocations to the shed loc dir so all Galaxy-managed loc files stay isolated.

verify_installed_repository_data_table_entries was looking for the installed loc file at tool_data_path root, but shed-installed loc files now live under tool_data_path/shed/. Also collapse a black-fmt-flagged multi-line copy_sample_files call onto one line.

mvdbeek · 2026-05-19T08:18:37Z

-                repository_tools_tups = stdtm.handle_missing_data_table_entry(
-                    relative_install_dir, tool_path, repository_tools_tups
-                )
+                if is_data_manager:


Is that an oversight ? If we dedupe we should keep this i think

I asked claude about this and it got very very confused. I tend to think that's not claude's fault 🤣

mvdbeek · 2026-05-19T08:22:31Z

+            source_loc_sample = loc_basename_to_source.get(basename)
+            if not source_loc_sample or not os.path.exists(source_loc_sample):
+                continue
+            # Keep `${__HERE__}` literal so the appended rows match the shared loc's existing format.


is HERE really a thing for sample data ? Wasn't that supposed to only be a thing for test tables ? Are we resolving and copying the files ? I think the answer should be no, but if we do want to support HERE we'd have to also move the files

mvdbeek · 2026-05-19T08:24:30Z

+                existing_values.add(fields[value_index])
+        if not new_rows:
+            return self._loaded_content_version
+        with FileLock(filename):


We provide no guarantee whatsoever that this doesn't deadlock. It might be safer to write to a file next to the target, then do an atomic move ? I think we have a utility that does that.

I guess we do use that pattern already, can you make a re-usable function and use that also in https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/tool_util/data/__init__.py#L717-L726 ?

mvdbeek · 2026-05-19T09:05:25Z

Did another pass with a code-review tool and a few additional things stood out beyond what's already been discussed:

lib/galaxy/tool_shed/tools/data_table_manager.py:175 — getattr(existing_table, "type_key", "tabular") on an attribute that's a class attribute on ToolDataTable. isinstance(existing_table, TabularToolDataTable) is already used 121 lines below as the same gate; one or the other.

lib/galaxy/tool_util/data/__init__.py:1071 — getattr(existing, "columns", None) in assert_data_table_consistency. columns is declared on ToolDataTable, just access it directly.

_shed_config_has_matching_entry (data_table_manager.py:194-220) — re-parses shed_tool_data_table_config from disk on every iteration of the elem loop in install_tool_data_tables. For a repo declaring N tables this is N XML parses per install. Hoist the parse outside the loop, or — better — move the dedup into ToolDataTableManager, which already owns this file's lifecycle, so the shed install manager doesn't reach in and re-read it.

_merge_loc_sample_entries dedup (tool_util/data/__init__.py:809-810) — keyed only on the value column. If another column for the same key differs (e.g. updated path for the same build ID), the new row is silently dropped with no log. At minimum a log.debug would help debugging; ideally surface it.

Test gaps worth filling:

test/unit/tool_shed/test_install_manager.py mocks ShedToolDataTableManager and then asserts handle_missing_data_table_entry.assert_not_called() — the gating logic itself is mocked, so this test would still pass if the gate were inverted. Worth swapping for the real manager against tmp_path (the pattern in test_data_table_manager.py::_make_stdtm).
No test for _shed_config_has_matching_entry with same name but different <file path> (the case where we should not dedupe).
test/integration/test_repository_operations.py only asserts the absence of the <tool_shed_repository> sub-element. It'd be good to also positively assert that the non-DM install produced a <table> entry pointing under shed/ — otherwise a regression where we stopped registering at all would slip through.

mvdbeek · 2026-05-19T15:08:50Z

@natefoo here's what addressing the review looks like: natefoo#12 if you're happy with that let's

mvdbeek · 2026-05-20T17:12:03Z

I'll open a followup ...

github-actions · 2026-05-20T19:33:11Z

This PR was merged without a "kind/" label, please correct.

github-project-automation Bot added this to Galaxy Dev - weeklies May 12, 2026

github-project-automation Bot moved this to Needs Review in Galaxy Dev - weeklies May 12, 2026

github-actions Bot added area/documentation area/UI-UX area/testing area/API area/admin labels May 12, 2026

github-actions Bot added this to the 26.1 milestone May 12, 2026

mvdbeek reviewed May 12, 2026

View reviewed changes

Comment thread client/src/components/Toolshed/RepositoryDetails/InstallationSettings.vue Outdated

Comment thread doc/source/admin/galaxy_options.rst Outdated

natefoo added 4 commits May 12, 2026 17:01

Add shared data table consistency check

94a027c

Adds DataTableColumnMismatch and DataTableFileConflict, raised by ToolDataTableManager.assert_data_table_consistency. Wired into load_from_config_file so the invariant also fires at server startup. Refs galaxyproject#21448.

Skip data table registration for non-Data-Manager shed repos

cf33b2a

Gate install_tool_data_tables and handle_missing_data_table_entry on the repository being a Data Manager. Non-DM repos no longer mutate shed_tool_data_table_conf.xml with .loc.sample contents. Refs galaxyproject#21448.

Update tool shed install tests for non-DM data table skip

0618a7f

natefoo force-pushed the issue-21448-skip-shed-tool-data-tables branch from 62c036a to 0618a7f Compare May 12, 2026 21:14

natefoo added 2 commits May 13, 2026 14:41

natefoo force-pushed the issue-21448-skip-shed-tool-data-tables branch from 4a1679a to 07c5d44 Compare May 13, 2026 21:08

mvdbeek reviewed May 14, 2026

View reviewed changes

Comment thread lib/galaxy/tool_shed/tools/data_table_manager.py Outdated

natefoo and others added 3 commits May 14, 2026 14:07

natefoo added 4 commits May 15, 2026 10:51

mvdbeek reviewed May 19, 2026

View reviewed changes

Comment thread lib/galaxy/tool_util/data/__init__.py

mvdbeek merged commit 18d76ed into galaxyproject:dev May 20, 2026
57 of 58 checks passed

github-project-automation Bot moved this from Needs Review to Done in Galaxy Dev - weeklies May 20, 2026

nsoranzo added the kind/bug label May 20, 2026

mvdbeek mentioned this pull request May 21, 2026

Fix shed install column mismatch from synthesized name alias #22733

Open

2 tasks

Conversation

natefoo commented May 12, 2026

How to test the changes?

License

Uh oh!

Uh oh!

Uh oh!

mvdbeek commented May 12, 2026

Uh oh!

natefoo commented May 12, 2026

Uh oh!

mvdbeek commented May 12, 2026

Uh oh!

bernt-matthias commented May 12, 2026

Uh oh!

natefoo commented May 12, 2026

Uh oh!

Uh oh!

mvdbeek commented May 14, 2026

Uh oh!

natefoo commented May 14, 2026

Uh oh!

mvdbeek commented May 14, 2026

Uh oh!

natefoo commented May 14, 2026

Uh oh!

natefoo commented May 14, 2026

Uh oh!

mvdbeek May 19, 2026

Choose a reason for hiding this comment

Uh oh!

mvdbeek May 19, 2026

Choose a reason for hiding this comment

Uh oh!

mvdbeek May 19, 2026

Choose a reason for hiding this comment

Uh oh!

mvdbeek May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mvdbeek May 19, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

mvdbeek commented May 19, 2026

Uh oh!

mvdbeek commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mvdbeek commented May 20, 2026

Uh oh!

Uh oh!

github-actions Bot commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

mvdbeek May 19, 2026 •

edited

Loading

mvdbeek commented May 19, 2026 •

edited

Loading