Skip to content

Don't install data table entries for non-DM tool installs#22679

Merged
mvdbeek merged 13 commits into
galaxyproject:devfrom
natefoo:issue-21448-skip-shed-tool-data-tables
May 20, 2026
Merged

Don't install data table entries for non-DM tool installs#22679
mvdbeek merged 13 commits into
galaxyproject:devfrom
natefoo:issue-21448-skip-shed-tool-data-tables

Conversation

@natefoo
Copy link
Copy Markdown
Member

@natefoo natefoo commented May 12, 2026

This does not make it impossible, it just disables it by default.

Fixes #21448, which also contains rationale. I'll follow up and fix the specific tool that @bernt-matthias mentioned in the issue if it hasn't already been done.

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

Comment thread client/src/components/Toolshed/RepositoryDetails/InstallationSettings.vue Outdated
Comment thread doc/source/admin/galaxy_options.rst Outdated
@mvdbeek
Copy link
Copy Markdown
Member

mvdbeek commented May 12, 2026

I would love it if there was just one way this works, is that not possible ? I dread the debugging sessions coming up with this additional setting ...

@natefoo
Copy link
Copy Markdown
Member Author

natefoo commented May 12, 2026

IMO the option should not exist at all but I added it to maintain existing functionality. There rationale is in the linked issue. I am happy to simply remove the option and the functionality for non-DM tools entirely.

@mvdbeek
Copy link
Copy Markdown
Member

mvdbeek commented May 12, 2026

Sounds good to me

natefoo added 4 commits May 12, 2026 17:01
Adds DataTableColumnMismatch and DataTableFileConflict, raised by
ToolDataTableManager.assert_data_table_consistency. Wired into
load_from_config_file so the invariant also fires at server startup.

Refs galaxyproject#21448.
Gate install_tool_data_tables and handle_missing_data_table_entry on
the repository being a Data Manager. Non-DM repos no longer mutate
shed_tool_data_table_conf.xml with .loc.sample contents.

Refs galaxyproject#21448.
Copy DM .loc files to tool_data_path/<basename> instead of a per-revision
subdir, and rewrite <file path> to match. Run assert_data_table_consistency
before persisting so column or file conflicts fail the install.

Refs galaxyproject#21448, fixes galaxyproject#20151.
@natefoo natefoo force-pushed the issue-21448-skip-shed-tool-data-tables branch from 62c036a to 0618a7f Compare May 12, 2026 21:14
@bernt-matthias
Copy link
Copy Markdown
Contributor

Do you plan to "fix" entries in non data manager data tables on usegalaxy.org, e.g. move them?

@natefoo
Copy link
Copy Markdown
Member Author

natefoo commented May 12, 2026

Pushed a second pass but I have to run for the day so just wanted to get tests to run. But I have not reviewed the code all that carefully and obviously there's immediately plenty to fix. Don't feel like you need to review it.

@bernt-matthias I don't actually have any DM data on usegalaxy.org, it's all in CVMFS. So I will most likely remove the entire contents of my shed_tool_data_table_conf.xml.

natefoo added 2 commits May 13, 2026 14:41
Galaxy ships picard_indexes and srma_indexes pointing at the same loc
file, so the cross-table file-path check raised on every startup.
Keep only the same-name column-mismatch protection and update the
production-shed test to expect no per-revision tool_data_table_conf.xml
for non-DM repos.
install_tool_data_tables used to create tool_data_path as a side effect
of os.makedirs(target_dir); skipping it for non-DM repos means
copy_sample_files now blows up with FileNotFoundError. Explicitly
mkdir(exist_ok=True) before copying.

The non-DM integration test was asserting against the in-memory
tool_data_tables dict, which is populated as tools are loaded
regardless of registration; switch to verify_no_installed_repository_data_table_entries,
which inspects shed_tool_data_table_conf.xml.
@natefoo natefoo force-pushed the issue-21448-skip-shed-tool-data-tables branch from 4a1679a to 07c5d44 Compare May 13, 2026 21:08
Comment thread lib/galaxy/tool_shed/tools/data_table_manager.py Outdated
@mvdbeek
Copy link
Copy Markdown
Member

mvdbeek commented May 14, 2026

The integration test errors are relevant. Tools can ship data tables with default entries, will that continue to work ? Is deduplication and rewriting the file on install maybe a better option ?

@natefoo
Copy link
Copy Markdown
Member Author

natefoo commented May 14, 2026

Tools can ship data tables with default entries, will that continue to work ?

No, my plan was to fix any tools that did this to hardcode into the tools as discussed in the linked issue, but if this is functionality we want to keep I guess I can make the installation of non-DM tool loc files conditional on whether they contain entries or not. Although I don't love it when the goal was to simplify and make things more deterministic.

@mvdbeek
Copy link
Copy Markdown
Member

mvdbeek commented May 14, 2026

I'm all for deterministic behavior, so branching on something being a dm or not irks me. dm's shouldn't be the way we distribute data tables. Isn't the right determinism to merge and de-dup the data table entries ?

@natefoo
Copy link
Copy Markdown
Member Author

natefoo commented May 14, 2026

I guess. DMs are very special in lots of ways so it doesn't bug me that we have another way in which they're special, although I agree that all the casing for them scattered around the code is never good. I still think we just should not support non-DM tools doing this at all, but if you want to keep it I'll defer.

First one is just a copy (so you get the comments in the sample of whichever repo you install first?) and then everything else is merged and deduped?

natefoo and others added 3 commits May 14, 2026 14:07
Replace the non-Data-Manager skip with a uniform install path: every
repo's tables get registered. First install copies the .loc.sample to
tool_data_path and writes the <table> entry without a
<tool_shed_repository> sub-element; subsequent installs of the same
table append unique rows from their .loc.sample to the shared loc file,
prefixed by a comment_char attribution line. get_filename_for_source
falls back to a no-repo_info filename so DMs on legacy installs keep
matching their existing <tool_shed_repository>-tagged files.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The Toolshed tests caught test_0025_verify_sample_files still asserting
the old "non-DM repos skip data-table registration" behavior. The
design pivot reverted that — non-DM repos now register tables — so the
test goes back to its original positive assertion and the unused
verify_no_installed_repository_data_table_entries helper is removed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
When install_tool_data_tables hits a table that's already registered
(e.g. __dbkeys__ from Galaxy's shipped refgenie config, or bwa_indexes
from the shipped tabular sample), we skip adding another <table> to
shed_tool_data_table_conf.xml — but that also skipped registering the
shared loc file path in the existing table's `filenames` dict. Data
Managers then couldn't locate where to persist entries
("Unable to determine filename for persisting data table '__dbkeys__'")
because get_filename_for_source had nothing to fall back to.

Register the shared loc path on the existing table with
tool_shed_repository=None so the new fallback in get_filename_for_source
finds it. Also fix the integration test for devteam/bwa: bwa_indexes /
bwa_mem_indexes already exist in tool_data_table_conf.xml.sample, so
their absence from shed_tool_data_table_conf.xml is correct dedup;
assert the shared loc file lands at tool_data_path instead.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@natefoo
Copy link
Copy Markdown
Member Author

natefoo commented May 14, 2026

Not sure this latest change is the way... I wonder if we should at least have a separate dir (e.g. <tool_data_path>/shed) for shed-installed tables/locs that are distinct form admin-configured tables/locs.

natefoo added 4 commits May 15, 2026 10:51
Galaxy-managed loc files for shed-installed tools now land under
`tool_data_path/shed/` instead of the `tool_data_path` root. This keeps
them clearly separate from admin-configured loc files and from anything
shipped via `tool_data_table_conf.xml.sample`, avoiding overlap with
predefined tables.

Also fixes the DM persist failure (`Unable to determine filename for
persisting data table '__dbkeys__'`) by writing the (stamp-less)
`<table>` entry to `shed_tool_data_table_conf.xml` so the shared loc
file's association survives reload via `merge_tool_data_table`. Writes
are deduped against the existing shed config to avoid bloat. Refgenie-
backed and other non-tabular table types skip the `.loc.sample` row
merge (their `parse_file_fields` is not designed to read a `.loc`).
The bwa@051eba708f43 repo only registers the bwa_mem_indexes table
(bwa_mem_index.loc); the bwa_indexes table comes from a different
repo. Use the actual filename the repo installs so the assertion can
find the expected shed-managed loc file.
`copy_sample_files` during tool registration was still landing
shed-installed `.loc` files at the `tool_data_path` root alongside
admin-configured ones — `install_tool_data_tables` had already been
isolated under `tool_data_path/shed/`, but this second copy path was
not. Direct both `copy_sample_files` invocations to the shed loc dir
so all Galaxy-managed loc files stay isolated.
verify_installed_repository_data_table_entries was looking for the
installed loc file at tool_data_path root, but shed-installed loc files
now live under tool_data_path/shed/. Also collapse a black-fmt-flagged
multi-line copy_sample_files call onto one line.
repository_tools_tups = stdtm.handle_missing_data_table_entry(
relative_install_dir, tool_path, repository_tools_tups
)
if is_data_manager:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that an oversight ? If we dedupe we should keep this i think

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I asked claude about this and it got very very confused. I tend to think that's not claude's fault 🤣

source_loc_sample = loc_basename_to_source.get(basename)
if not source_loc_sample or not os.path.exists(source_loc_sample):
continue
# Keep `${__HERE__}` literal so the appended rows match the shared loc's existing format.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is HERE really a thing for sample data ? Wasn't that supposed to only be a thing for test tables ? Are we resolving and copying the files ? I think the answer should be no, but if we do want to support HERE we'd have to also move the files

existing_values.add(fields[value_index])
if not new_rows:
return self._loaded_content_version
with FileLock(filename):
Copy link
Copy Markdown
Member

@mvdbeek mvdbeek May 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We provide no guarantee whatsoever that this doesn't deadlock. It might be safer to write to a file next to the target, then do an atomic move ? I think we have a utility that does that.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we do use that pattern already, can you make a re-usable function and use that also in https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/tool_util/data/__init__.py#L717-L726 ?

Comment thread lib/galaxy/tool_util/data/__init__.py
@mvdbeek
Copy link
Copy Markdown
Member

mvdbeek commented May 19, 2026

Did another pass with a code-review tool and a few additional things stood out beyond what's already been discussed:

lib/galaxy/tool_shed/tools/data_table_manager.py:175getattr(existing_table, "type_key", "tabular") on an attribute that's a class attribute on ToolDataTable. isinstance(existing_table, TabularToolDataTable) is already used 121 lines below as the same gate; one or the other.

lib/galaxy/tool_util/data/__init__.py:1071getattr(existing, "columns", None) in assert_data_table_consistency. columns is declared on ToolDataTable, just access it directly.

_shed_config_has_matching_entry (data_table_manager.py:194-220) — re-parses shed_tool_data_table_config from disk on every iteration of the elem loop in install_tool_data_tables. For a repo declaring N tables this is N XML parses per install. Hoist the parse outside the loop, or — better — move the dedup into ToolDataTableManager, which already owns this file's lifecycle, so the shed install manager doesn't reach in and re-read it.

_merge_loc_sample_entries dedup (tool_util/data/__init__.py:809-810) — keyed only on the value column. If another column for the same key differs (e.g. updated path for the same build ID), the new row is silently dropped with no log. At minimum a log.debug would help debugging; ideally surface it.

Test gaps worth filling:

  • test/unit/tool_shed/test_install_manager.py mocks ShedToolDataTableManager and then asserts handle_missing_data_table_entry.assert_not_called() — the gating logic itself is mocked, so this test would still pass if the gate were inverted. Worth swapping for the real manager against tmp_path (the pattern in test_data_table_manager.py::_make_stdtm).
  • No test for _shed_config_has_matching_entry with same name but different <file path> (the case where we should not dedupe).
  • test/integration/test_repository_operations.py only asserts the absence of the <tool_shed_repository> sub-element. It'd be good to also positively assert that the non-DM install produced a <table> entry pointing under shed/ — otherwise a regression where we stopped registering at all would slip through.

@mvdbeek
Copy link
Copy Markdown
Member

mvdbeek commented May 19, 2026

@natefoo here's what addressing the review looks like: natefoo#12 if you're happy with that let's :shipit:

@mvdbeek
Copy link
Copy Markdown
Member

mvdbeek commented May 20, 2026

I'll open a followup ...

@mvdbeek mvdbeek merged commit 18d76ed into galaxyproject:dev May 20, 2026
57 of 58 checks passed
@github-project-automation github-project-automation Bot moved this from Needs Review to Done in Galaxy Dev - weeklies May 20, 2026
@github-actions
Copy link
Copy Markdown

This PR was merged without a "kind/" label, please correct.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Development

Successfully merging this pull request may close these issues.

Do not add non-DM shed tools' tool_data_table_conf.xml.sample and associated loc file to shed_tool_data_table_conf.xml

4 participants