Load tools on demand via LazyToolBox backed by a pluggable tool source store#22633
Load tools on demand via LazyToolBox backed by a pluggable tool source store#22633mvdbeek wants to merge 72 commits into
Conversation
Foundational helpers used by the tool source storage stack: - id_util.py: shared logic for normalizing and extracting tool IDs. - tool_source_store/models.py: StoredToolSource and related dataclasses used by every backend.
Introduces the ``tool_index`` table that backs the database tool source store and the LazyToolBox index. The data column uses ``LargeBinary`` with a MySQL ``LONGBLOB`` variant because real-world tool indexes easily exceed the 64KB BLOB default. Concurrent writers must serialize their updates externally; the singleton-style unique constraint on ``version`` enforces a delete-then-insert pattern.
Pluggable backends for storing pre-parsed tool sources, behind a ToolSourceStore abstract base: - database: stores tool sources in the Galaxy database via the existing ToolSource and tool_index tables. - sqlalchemy: ships a separate SQLAlchemy-managed database (typically a read-only SQLite bundle) for distributing pre-built tool sources. - composite: layers multiple stores so reads fall through declared backends in order while writes always land on the default. - index: shared tool-index serialization helpers used by the database and sqlalchemy backends. - benchmarks: deserialization performance suite. Includes unit tests for the base contract, the SQLite/SQLAlchemy backend, and the composite store.
scripts/tool_source/populate_store.py walks tool_conf files, parses each tool source, and writes the result into the configured tool source store. Supports an incremental --watch mode that re-hashes raw file contents and updates the store as tools change on disk; the matching cache-invalidation message is added in the queue worker commit. _discover.py contains the shared tool-discovery helper. Includes unit tests that exercise the script through fakes (rather than mocks) so failures point at real behavior.
New config keys for the tool source storage stack and the LazyToolBox: - tool_source_store / tool_source_disk_path / tool_source_stores: configure the active backend and any per-conf composite entries. - use_lazy_toolbox / lazy_toolbox_cache_size / tool_cache_size: opt in to LazyToolBox and tune its in-memory cache. LazyToolBox opt-in is explicit — a populated store does not flip a default deployment to lazy mode unless ``use_lazy_toolbox`` is set.
| if stored is None: | ||
| # Quick extraction of tool_id from raw XML without full macro expansion | ||
| try: | ||
| extracted_id = extract_tool_id_from_file(str(config_file), max_read=2000) |
There was a problem hiding this comment.
You cannot do this without full macro expansion.
|
Claude thinks 85% of this (their estimate) can be separated into abstractions that would fix the startup costs and that I would be really excited about - and move 15% into something I would want to fight about and most the breaking things and things Claude describes as "The 15% you're deferring is where the filter bypass lives, where the search rewrite lives, and where the recent invariant-restoration commits cluster." Here is the comment Claude thinks I should post (they eviscerated the 6 paragraphs I originally wrote up): Substantive PR — the startup problem is real and the store + slim materializer parts look good. Pushing back on scope and on a couple of read-path defects: Scope. #21247 framed the goal as "parsing the tools on each startup is also by far the most time consuming step" — startup + steady-state memory. Under Path 1 there you flagged batch endpoints as a challenge with a stated mitigation: "we could cache the response aggressively since it rarely changes." This PR addresses both the goal and the challenge in one diff. Would it be reasonable to land the store + slim LazyToolBox first and revisit the Read-path defects worth surfacing regardless of the split:
Search. Maturity. The single-tool materialization path looks settled. The panel-render / batch-path bits are still landing shed-install ordering, cross-version lineage, EDAM render and Suggested split:
|
I don't know how Claude managed to get the context so wrong. We did that already. The main thing is that we need to be able to run more web handlers. I can see how the review is easier if the PR is split up, but if we never land the web handler work then it isn't worth the complexity. We're not limited by the job or workflow handler processes. That said the whole thing isn't ready, the panel work is a real pain. |
I'm sure it is a John problem and not a Claude problem. Are you saying there is no useful subpart of this that doesn't bypass the toolbox filters, per-user toolbox rendering? The pre-built indexes are what you care about here and the rest of it is support infrastructure that is not useful on its own? |
|
It would be somewhat useful (faster handler restarts, less baseline memory consumption), but as long as the web handler still eager load the whole toolbox (on first request or on startup) we're not addressing any production issues. I will tweak this to see if there is a better solution for the toolbox filters.
This isn't dead, but i don't love the complexity in managing these extra services. Maybe that's fine though, you'd probably only need this at the usegalaxy scale ... |
LazyToolBox extends ToolBox and loads each Tool on demand from the configured tool source store, backed by an LRU cache sized via ``lazy_toolbox_cache_size``. The pre-computed index lets batch endpoints (panels, tests summary, requirements) avoid materializing every tool object. - tools/lazy_toolbox.py: the on-demand toolbox plus its index cache invalidation hook. - tool_util/toolbox/base.py and parser.py: small extension points needed for lazy loading. - tools/__init__.py: helpers shared with LazyToolBox. - queue_worker/__init__.py: route ``_get_new_toolbox`` through LazyToolBox when opted in, and add the ``reload_tool_source_cache`` control message used by populate_store --watch to invalidate caches across processes. - app/__init__.py and galaxy_mock.py: build the tool source store at app startup and use LazyToolBox when ``use_lazy_toolbox`` is set.
Replaces ad-hoc iteration over ``app.toolbox`` in the ``/api/tools``, ``/api/tool_panels``, tests-summary, and ``/api/tools/all_requirements`` endpoints with service methods that query the LazyToolBox pre-computed index when available, falling back to the existing toolbox path otherwise. This converts the batch endpoints from O(N tools) to O(1) when the lazy toolbox is in use, without changing behavior for the default eager toolbox.
End-to-end test that boots Galaxy through the composite store + LazyToolBox path and verifies tools are discoverable via /api/tools, covering the wiring exercised by all of the preceding commits.
- doc/source/admin/tool_source_storage.rst: operator-facing guide covering backend choice, populate_store usage, watch mode, and the per-conf composite story. - doc/source/dev/tool_source_storage.rst: developer-facing architecture overview of the store, LazyToolBox, and the batch endpoint integration.
Wires populate_store.py and ``use_lazy_toolbox: true`` into the api and integration GitHub Actions workflows so the LazyToolBox path is exercised in CI alongside the default eager toolbox.
…d YAML-id tools Round out the index discovery so the lazy toolbox boots against the full shipping tool set: - ``scripts/tool_source/_discover.py`` now handles ``<tool_dir>`` panel entries and reads YAML tool ids alongside XML, so multi-version tool directories and YAML tools land in the index. - ``LazyToolBox`` extends the bootstrap path to register every variant discovered for a single tool id (was previously dropping siblings) and uses the shared YAML id reader when chasing direct ``<tool file=...>`` references from the panel config. - ``DatabaseToolSourceStore.store_index`` accepts the richer per-id payload from the discovery script. Test coverage is extended in ``test_tool_source_storage`` to assert that the bootstrapped index sees both directory-based and YAML-defined tools.
Boot was eagerly walking every indexed tool (500+) through panel-view construction, EDAM rendering, and lineage build-up — both slow and prone to leaking caches across the embedded restart used by ``test_recovery``. Defer all of that to first request: - The default tool panel view returns ``self._tool_panel`` unchanged. ``GET /api/tool_panels/default`` is served by an override that builds responses straight from ``ToolIndex.entries`` via ``_index_entry_to_api_dict`` — no Tool instantiation at boot. - Static panel views materialise per-section on first request and cache the result; ``in_panel=True`` listings fall back to ``super().to_dict`` so individual sections are loaded only when consumed. - EDAM panel view rendering is similarly deferred to first request and tracks its own render state, separate from the LRU cache. - A new ``LazyLineageMap`` derives version lists from the index instead of probing every Tool object; entries with the same versionless guid share a single map instance so cross-version lookups don't fall over. - Two ``haltables`` (``tool source store`` and ``lazy toolbox``) clear cached state on shutdown so the second boot of an embedded restart sees a clean slate. Lint/mypy/black fallout from the panel rewrite is folded in.
Support tool shed installs end-to-end on top of the persistent tool source store: - Section-install paths funnel through ``_lazy_register_tool_item`` so each shed install is indexed alongside the panel section it lives in. - Shed entries are keyed by full toolshed guid (not the short id) so workflow-side ``has_tool(guid)`` lookups resolve, and short-id references resolve back to the latest matching guid. - ``tool_shed_repository``/``guid`` metadata is stamped on stored entries and restored on materialisation; cross-version lineage follows the guid so the same lineage covers every installed revision. - ``_prune_orphaned_shed_entries`` drops persisted entries whose backing shed_tool_conf no longer references them, so a ``reset_shed_tools`` test doesn't leak installs into the next test. - Section re-materialisation rebuilds ``section.elems`` (shed tools inserted at section head, ahead of local tools) instead of clearing the existing entries; ``remove_tool_by_id`` clears ``_tool_versions_by_id`` and lineage state under the toolbox lock. - ``reload_tool_source_cache`` broadcasts with ``noop_self=True`` so the local process keeps its just-stored cache instead of racing the queue-worker session. - Configured shed/migrated tool confs are registered as on-demand placeholders when the file is absent at boot, matching the eager toolbox's behaviour.
Round out tool lookup so version-aware callers and short-id callers both see the right Tool object: - ``has_tool``/``get_tool`` honour ``tool_version`` and ``exact`` — prefer the exact requested version, fall back to the default version, then fall back to the latest indexed version when the requested version isn't known. - ``get_all_versions`` sorts via ``packaging.version`` so prefixes like ``"0.1+galaxy6"`` and ``"0.2"`` compare correctly (lexical sort would invert them). - ``ToolIndex`` records the per-id default version so the toolbox can pin selection rather than rely on insertion order; the in-memory index is checked for an exact version match before instantiating the default Tool. - ``packaging`` is imported at module scope and ``get_tool``'s signature is tightened with overloads.
Make the store's commit boundaries explicit so the in-memory toolbox and the persistent index agree across worker boundaries, and shut the store down cleanly before the SQLAlchemy engine goes away: - ``ToolSourceStore`` grows a polymorphic ``commit()`` method (no-op default, composite propagates, database flushes + commits). - ``DatabaseToolSourceStore.close()`` commits pending writes before dropping the cached index so the next boot doesn't read a partial snapshot. - ``LazyToolBox`` commits explicitly at the end of bootstrap, and releases the queue-worker transactions held over ``__init__`` and ``_load_index_from_store`` so peer processes don't see idle-in-transaction rows that block subsequent writes. - ``GalaxyUniverseApplication`` runs the tool source store shutdown before disposing the SQLAlchemy engine so ``commit()`` still has a live session.
Round out the test suite that exercises LazyToolBox end-to-end against a real boot: - Extend ``BaseToolSourceStorageIntegrationTestCase`` helpers to hit ``/api/tools?in_panel=False`` (the index-backed listing) and to reach the lazy ``tools_by_id`` map directly when the resolver path isn't what's being asserted. - New unit tests under ``test/unit/tool_source_store/`` pin the index's version handling: default-version selection, sibling-version discovery, and ``packaging.version`` ordering. Tests that bind specific behaviours to specific LazyToolBox commits land alongside the code commits that introduced them; this commit collects the cross-cutting helpers and pure-unit coverage.
``start_services.sh`` exported the Service ClusterIP and returned immediately after ``kubectl apply``, then the Run tests step started right away. The Postgres pod typically needs 30+ seconds to image-pull and start its listener; runs where Galaxy boots fast enough race it and hit ``connection refused``. When that happens the entire shard errors — every Galaxy boot dies talking to its DB before any test can run. Observed on Integration shard 1 of run 25398574715 (310 errors, 0 passes, first ERROR 35s after the apply). ``kubectl rollout status`` waits for the deployment, then ``kubectl wait Ready`` confirms the pod has actually transitioned out of ContainerCreating before tests start.
``ToolFileWatcher._process_tool_file`` used to write only ``StoredToolSource`` on a content change, leaving the ``ToolIndex`` stale until the next full populator run. After this commit, the watcher delegates to ``populate_for_paths`` once the lightweight hash check confirms the content actually changed — same single-writer entry point shed installs use — so the index, whoosh search, and peer-process reload broadcast all stay in sync with on-disk edits in dev mode. ``ToolFileWatcher`` gains an optional ``sa_session`` arg; ``watch_mode`` passes the freshly-initialised ``model.context``. Callers that pass ``None`` (legacy / test mocks) fall back to the previous store-only behaviour with a one-line "index left stale" log — the contract is preserved for the 25 fake-store unit tests that exercise the watcher without a DB. 69/69 unit tests still pass.
In-process callers (LazyToolBox cold-start, ``populate_for_paths`` on shed install, ``reconcile_index`` for ``reset_shed_tools``) share Galaxy's ``app.model.context`` SQLAlchemy session. The ``DatabaseToolSourceStore.store()`` write path goes through that session, and ``Session`` is not thread-safe — running the ``ThreadPoolExecutor`` with the previous default of 4 workers caused SQLite "database is locked" errors and a full integration suite that took 51 minutes with 13 failures. The CLI is unaffected: ``populate_store(config_file=...)`` constructs its own ``model.context`` via ``init_models_from_config`` (no concurrent readers) and explicitly passes ``parallel=args.parallel`` (defaults to 4), so ``populate_store.py --parallel N`` still fans out.
…oosh ``app.toolbox_search`` becomes the single search singleton for both toolbox flavours. In lazy mode it's :class:`LazyToolboxSearch` — a ``ToolBoxSearch`` subclass that opens the populator-owned whoosh dir (``tool_search_index_dir/_lazy_default``) on every ``search()`` call and no-ops ``build_index`` (the populator does the writes). In eager mode it stays :class:`ToolBoxSearch` as before. Effects: - ``services/tools.py:search_tools`` drops its ``_get_lazy_toolbox`` branch and just calls ``self._search(query, view)``. One code path, regardless of toolbox flavour. - ``LazyToolBox.search_tools``, ``_get_search_index``, the ``_whoosh_search_index`` slot, and the whoosh clear in ``invalidate_index_cache`` all drop. The unused ``ToolSearchTuning`` / ``ToolWhooshIndex`` imports go with them. - ``LazyToolboxSearch`` keeps ``index_count`` so the existing ``queue_worker.rebuild_toolbox_search_index`` watermark check is satisfied without doing work. ``lazy_toolbox.py`` drops a further ~55 lines (1257 → 1202). 69/69 unit tests still pass.
``load_index`` already does this — the comment there explains why: each ``session.execute(select(...))`` on the shared scoped session opens an implicit read transaction that nothing later closes, so on SQLite the read lock blocks subsequent writers (in particular the cold-start populator), and on Postgres it accumulates idle-in-transaction rows. Apply the same ``finally: session.rollback()`` pattern to ``get``, ``exists``, ``get_by_tool_id``, and ``get_by_source_path``. The populator and ``LazyToolBox._index_needs_population`` call these in tight loops at cold boot (484× ``exists`` and 484× ``get_by_source_path`` on a default checkout), so the leak compounded fast: integration tests hit ``OperationalError: database is locked`` on the very next ``SELECT FROM tool_index`` from the test client. ``rollback()`` after a pure read is a no-op for data but closes the transaction.
After the populator/single-writer refactor, ``LazyToolBox.create_tool`` raises ``RuntimeError`` on index miss (commit ``56a66da6a9``). That broke ``load_hidden_lib_tool`` paths: ``set_metadata_tool.xml`` (and other Galaxy-internal tools loaded from ``lib/galaxy/datatypes/`` after boot) live outside any tool_conf and ``discover_tools`` doesn't yield them, so the populator never indexed them. New ``galaxy.tool_source_store.populator.populate_single_path`` parses one tool file by absolute path, builds a ``StoredToolSource`` + ``ToolIndexEntry`` (with no panel section — the tool isn't in any conf), writes it to every writable store, and updates the cached index. Errors are logged and rolled back; the session is left clean. ``LazyToolBox.create_tool`` now retries via ``populate_single_path`` on miss before raising. The single-writer contract is preserved (only the populator writes), the raise still fires when the file genuinely doesn't exist on disk, and the test seam ``test_create_tool_raises_on_index_miss`` still passes because the path ``/tools/unknown.xml`` doesn't exist. Also adds ``LazyTool.tool_shed_repository`` (forwarded from ``_overrides``, defaults to ``None``) so the eager pipeline's shed-metadata setattr doesn't trip the strict ``__getattr__``. Lib tools have no repo; shed tools store theirs via the override. Per-tool commits in ``populate_store_inline``'s store loop release the write transaction between tools so SQLite readers (the test client's ``/api/tools`` request, the queue worker) aren't blocked behind a 484-row write transaction. The error-handling path rolls back the shared session so a flush failure doesn't leak ``PendingRollbackError`` to the next caller.
…_path Three review-driven cleanups on top of commit 7ae5bf4: 1. **``ToolSource.to_string()``** is the canonical serialiser on the parser interface (``lib/galaxy/tool_util/parser/interface.py:450``) and is implemented by every concrete source (XML, YAML, CWL). Use it instead of branching on ``xml_tree`` / falling back to raw file read — fewer special cases, correct for every source class. 2. **Drop ``populate_single_path``.** It was a heavyweight recovery for ``load_hidden_lib_tool`` paths that aren't in any tool_conf. The cleaner answer: on index miss, ``LazyToolBox.create_tool`` delegates to ``super().create_tool(...)``. The eager parent parses the file and registers a real ``Tool`` (not a ``LazyTool``) — lib tools like ``set_metadata_tool.xml`` live in memory, never enter the populator-owned store, and re-load fresh on every boot. The "raise on miss" contract is downgraded to "fall through to eager" — the populator still owns every tool that *should* be in the index, and the test seam (``test_create_tool_falls_through_to_eager_on_index_miss``) asserts the new behaviour. 3. **Hoist inline imports** in both files. Module-level ``StoredToolSource`` / ``ToolSourceStore`` / ``ReadOnlyStoreError`` / ``DiscoveredTool`` / ``discover_tools`` / ``ToolIndex`` / ``ToolIndexEntry`` / ``ToolSection`` / ``populate_store_inline`` / ``get_toolbox_parser`` / ``get_tool_source`` so the structural deps are visible at the top of each module. The heavy CLI-only imports (``galaxy.config``, ``galaxy.model``, ``galaxy.datatypes.registry``, the whoosh ``ToolSearchTuning`` / ``ToolWhooshIndex`` pair) stay inline — they're optional, used only on populator entry points, and would slow ``--help`` if pulled in. 14/16 integration tests pass; the two remaining failures predate this stack (``test_run_specific_version_executes_that_version`` is the noted pre-existing flake, ``test_default_panel_view_section_tools_use_id_list`` needs separate investigation).
``AbstractToolBox._newer_tool`` (lib/galaxy/tool_util/toolbox/base.py:1500) compares ``tool1.version_object > tool2.version_object``; it's reached from ``_lineage_in_panel`` during ``to_panel_view`` rendering. Without the property the strict ``__getattr__`` raised ``NotImplementedError`` for ~hundreds of tools and the API returned 500. Compute it the same way ``Tool.version_object`` does (lib/galaxy/tools/__init__.py:1213) — parse the entry's ``version`` string, special-casing the ``+galaxy`` suffix per PEP-440. ``parse_version`` is already imported at module top.
Galaxy vendors a ``parse_version`` (lib/galaxy/tool_util/version.py:45) that falls back to ``LegacyVersion`` when ``packaging``'s ``Version`` raises ``InvalidVersion`` — the eager toolbox uses it (lib/galaxy/tools/__init__.py:128). The new lazy paths (LazyTool's ``version_object`` / ``_ver_key``, ``ToolIndex.add_entry``'s multi-version tiebreaker) were importing ``packaging.version.parse`` directly, bypassing the LegacyVersion fallback. Swap to the vendored function in three places: - ``lib/galaxy/tools/lazy_toolbox.py`` module-level import. - ``ToolIndex.add_entry``'s inline import. - The matching comment in ``_ver_key``. No behaviour change for versions that parse as PEP-440; the difference shows up on Galaxy-specific version strings that need ``LegacyVersion`` semantics.
…se fallback Two boot-time and panel-render fixes that the integration suite surfaced once cold-start indexing covered every conf-discovered tool: 1. **Version defaulting in ``build_index_entry_from_source``.** Calling ``tool_source.parse_version()`` directly returns ``None`` for tools without an explicit ``<tool version="...">`` (test fixtures like ``sam_to_unsorted_bam.xml`` and ``implicit_conversion.xml``). The eager pipeline routes through ``galaxy.tool_util.parser.util.parse_tool_version_with_defaults`` which falls back to ``"1.0.0"`` on profiles < 16.04. Use the same helper at index-build time so ``LazyTool.version`` is never ``None`` and ``ToolLineage.register_version`` can parse it. 2. **Materialise fallback in ``LazyTool.to_dict(link_details=True)``.** ``to_panel_view`` materialises every panel-visible tool to call ``to_dict(link_details=True)``. Some tools have XML the parameter factory chokes on at materialise time (``upload1``'s ``upload_dataset`` input_type, ``filter_data_table.xml``'s ``column="value"`` dynamic-options filter). The eager toolbox catches these in ``_load_tool_tag_set`` at boot and drops the tool from ``_tools_by_id``; the lazy path postpones the failure to first ``to_dict``, where it now logs a WARNING and falls back to the entry-only fast-path dict so the panel render still completes. 15/16 ``test_tool_source_storage.py`` integration tests pass; the remaining failure is the noted pre-existing ``test_run_specific_version_executes_that_version`` flake.
``galaxy.tools.special_tools`` already enumerates Galaxy-internal tool files loaded outside the conf walk (``load_hidden_lib_tool`` → the four ``imp_exp`` / ``data_fetch`` entries, plus ``set_metadata_tool.xml`` which the datatypes registry loads separately). The lazy refactor was missing index entries for those, so the previous step relaxed ``LazyToolBox.create_tool`` to fall through to the eager parent on miss to keep boot working. Add ``hidden_lib_tool_paths()`` in ``special_tools.py``: returns the absolute paths of every hidden-lib tool, including ``set_metadata_tool.xml``. ``SPECIAL_TOOLS`` (consumed by ``load_lib_tools``) stays unchanged so ``set_metadata_tool`` isn't double-loaded — the new ``_EXTRA_HIDDEN_LIB_TOOLS`` dict only feeds the populator-facing helper. ``galaxy.tool_source_store.discover.discover_tools`` walks those paths after the conf + bundled walks, yielding ``DiscoveredTool`` entries with ``tool_conf="<hidden-lib>"``. The populator's cold-start scan indexes them like any other tool, so the post-boot ``load_hidden_lib_tool`` calls now hit the index branch. With coverage complete, ``LazyToolBox.create_tool`` reverts to the strict raise the plan called for: a miss is now a contract failure (operator forgot to repopulate, or a new ad-hoc tool load didn't get added to ``hidden_lib_tool_paths``). The unit test asserts the raise again. 15/16 integration tests still pass; the remaining failure is the pre-existing flake ``test_run_specific_version_executes_that_version``.
… trip apply_async Every ``@galaxy_task`` relies on ``app.magic_partial(func)`` (in the wrapper itself) to inject DI dependencies — typically ``sa_session: galaxy_scoped_session`` and manager instances declared on the function signature — at task-execution time. Celery's ``apply_async`` runs ``check_arguments(*args, **kwargs)`` against the *original* function signature before the task body fires, and that check has no visibility into magic_partial. Net effect: every Galaxy celery task that takes a DI-injected positional crashes at ``.delay()`` with ``TypeError: <task>() missing 1 required positional argument: 'sa_session'``. The failure mode was hidden on older celery / Python combos because ``Task.typing`` (which gates the pre-call check) interacted differently with the signature ``head_from_fun`` generated. On celery 5.6.x + Python 3.14 it surfaces consistently — most visibly in ``test_run_specific_version_executes_that_version`` where ``set_job_metadata.delay(...)`` doesn't pass ``sa_session`` and the metadata-setting job fails before it runs. Set ``typing=False`` as the default in ``galaxy_task``: every Galaxy celery task already uses DI injection, so the pre-call check would never be correct for them anyway. Individual call sites still pass their non-DI args explicitly; the DI-injected ones are populated by ``magic_partial`` at execution. Callers that want celery's typing check back can pass ``typing=True`` explicitly. Integration suite goes from 15/16 to 16/16 with no other change.
Every inline import in these two files was checked against: 1. Does it create an actual import cycle? (No for any of them.) 2. Is it a soft / optional dependency? (kombu + watchdog are in ``pyproject.toml`` and ``pinned-requirements.txt``; the try/except-ImportError defenses were dead code.) The "lazy import for startup speed" justifications on the CLI entry points (``galaxy.config``, ``galaxy.model``, ``galaxy.datatypes.registry``, ``galaxy.tool_source_store.search``, etc.) were also bogus — the populator module is imported at Galaxy boot via the LazyToolBox cold-start hook, so those heavy modules get pulled in anyway. The inline form only delayed the cost by milliseconds while making the dep graph invisible at the top of the file. Hoisted: - ``lazy_toolbox.py``: ``galaxy.exceptions``, ``galaxy.tool_shed.util.repository_util``, ``galaxy.util.tool_version``. - ``populator.py``: ``kombu``, ``watchdog``, ``galaxy.config``, ``galaxy.datatypes.registry``, ``galaxy.model``, ``galaxy.model.mapping``, ``galaxy.queues``, ``galaxy.tool_source_store.search`` (Tuning/Whoosh), ``galaxy.util.properties``. The watchdog ``ImportError`` defense and the kombu inline block both drop with them. Galaxy boot still works; 113/113 unit tests + 16/16 integration tests pass; CLI ``populate_store.py --help`` still resolves. The ``TYPE_CHECKING`` block in ``lazy_toolbox.py`` (the only remaining function-scope ``from``-import area) keeps its lookup-only imports where they belong.
…y stub ``_create_tool_from_stored_source`` was hitting the install database on every shed-tool materialise to fetch ``ToolShedRepository`` just so the Tool ctor's ``populate_tool_shed_info`` could stamp four scalar fields. The eager pipeline already has a namedtuple stub for exactly this case (``ToolConfRepository`` in ``lib/galaxy/tool_util/toolbox/base.py:87``, used for shed installs whose install-DB row hasn't appeared yet) — build it directly from ``ToolIndexEntry``. Eliminates: - ``_lookup_tool_shed_repository`` method (~30 lines). - ``galaxy.tool_shed.util.repository_util.get_installed_repository`` module-level import. - A DB round-trip per shed-tool materialise. ``installed_tool_dependencies`` readers see ``tool_dependencies_installed_or_in_error=[]`` for materialised lazy tools — same shape the eager pipeline produces when the install-DB row is missing, so downstream callers already tolerate this. If a job runner ever needs real DB-backed dependency info, it can query ``app.install_model`` directly using the scalar shed metadata already on ``Tool``. 16/16 integration tests still pass.
``Registry.load_datatype_converters`` (``lib/galaxy/datatypes/registry.py:674``) walks the converter list from ``datatypes_conf.xml`` after boot and calls ``toolbox.load_tool(config_path)`` per entry. Under strict ``LazyToolBox.create_tool`` every converter raised ``RuntimeError`` (caught + logged by the registry as ``Error loading converter (…)``) and the ``datatype_converters`` dict stayed silently empty — so format-mismatch conversions via ``Dataset.find_conversion_destination`` were no-ops. Fix: in ``discover_tools()``, after the hidden-lib block, query ``galaxy.model._get_datatypes_registry()`` and yield a ``DiscoveredTool`` for each converter the registry has parsed. Same source of truth ``load_datatype_converters`` iterates, so we can't drift — what the registry will try to load is exactly what the populator indexes. The registry is populated by ``set_datatypes_registry()`` at app boot (``app/__init__.py:853``) before the toolbox initialises and inside ``populate_store(config_file=…)`` before ``populate_store_inline`` runs — both paths produce a live registry by the time the populator walks. On the (atypical) case where the registry isn't set, the try/except falls through and the converters remain unindexed; the eager ``load_datatype_converters`` then continues to catch + log the ``Error loading converter`` messages, matching today's pre-refactor behaviour. After this commit, ``Persisted ToolIndex for store __default__`` reports 580 entries (up from 564 = +16 converters that ``sample_tool_conf.xml`` doesn't list) and zero ``Error loading converter`` messages appear in the integration log. 16/16 integration tests + 113/113 unit tests still pass.
CI surfaced six attributes that the tool-execution path
(``/api/tools/{id}`` POST → ``handle_input``) and the job runner read
on the resulting Tool, all of which fundamentally need a parsed
parameter tree:
- ``handle_input`` — entry point for tool execution.
- ``inputs``, ``parameters``, ``new_state`` — the parameter machinery.
- ``input_translator`` — runtime parameter translation.
- ``requires_galaxy_python_environment`` — job-runner environment hint.
Each was raising ``NotImplementedError`` for every tool-execution
request, taking out ~hundreds of API tests at once. Add them to
``_MATERIALIZE_OK`` so the stub materialises on first read and the
real Tool answers the rest of the call.
Materialise on tool execution is exactly the cost model the lazy
toolbox is designed for — pay parse cost only when the tool is
actually used. The stub still surfaces unaccounted-for attribute reads
loudly, which is the contract the user picked.
Round-2 CI surfaced five more attributes the tool-execution path hits before the parameter machinery has finished setting up: - ``check_and_update_param_values`` — parameter validation called from ``handle_input``. - ``wants_params_cleaned`` — parameter scrub before execution. - ``tool_source`` — raw ``ToolSource`` (some callers walk the XML directly). - ``dynamic_tool`` — dynamic-tool linkage on execution. - ``produces_entry_points`` — interactive-tool entry-point lookup. All five materialise correctly when read; adding to ``_MATERIALIZE_OK`` makes the stub's strict ``__getattr__`` route through the cached ``_real`` Tool on first access instead of raising.
Eight more execution-path attrs surfaced by API tests: - ``execute`` — the tool execute method itself. - ``expand_incoming`` — parameter expansion for multi-run / map-over. - ``params_to_strings`` — parameter serialisation. - ``tool_action`` — execution action dispatch. - ``tool_dir`` — on-disk directory accessor (read by some callers). - ``completed_jobs`` — job-search-by-tool API. - ``get_default_history_by_trans`` — default-history lookup during execute. - ``regenerate_imported_metadata_if_needed`` — import/export job path. All execution-path; correct to materialise on first read.
``tool_action`` graduated to ``_MATERIALIZE_OK`` in the previous commit (it's read on the execute path). The strict-raise assertion needs an attribute the stub neither forwards nor materialises; ``totally_not_a_tool_attr`` is guaranteed unaccounted-for.
Seven more execution + introspection-path attrs from the round-5 CI: - ``outputs`` — output spec read by ``execute`` / display formatters. - ``tests`` — tool tests definitions surfaced via ``/api/tools/.../tests``. - ``to_json`` — tool JSON serialisation. - ``tool_requirements_status`` — ``/api/tools/.../requirements_status``. - ``provided_metadata_file`` — job-runner metadata path resolution. - ``test_data_path`` — tool test data lookup. - ``get_configured_job_handler`` — job-runner handler routing. Cascading WorkflowStep ``module`` injection errors in the same log were a downstream effect: workflow steps couldn't resolve their tool modules because the underlying ``LazyTool`` raised on these reads. Materialising lets the injector populate ``step.module`` normally.
``/api/tools/<id>`` show endpoint passes ``io_details=True`` to surface ``inputs`` / ``outputs`` / parameter detail; the previous condition only materialised on ``link_details=True``, so the listing-shape dict came back without inputs and ``test_show_repeat`` / ``test_show_multi_data`` asserted on a missing key. Add ``io_details`` to the materialise predicate. Forward both flags into ``Tool.to_dict`` so the parent picks the right payload variant. The entry-only fast path now only applies to the truly cheap case (the ``/api/tools?in_panel=False`` listing). Materialise failure still falls back to the entry-only dict — listing remains 200 even if a specific tool can't materialise.
The entry-only fast path was missing fields the ``/api/tools/<id>`` show endpoint serves (``xrefs``, ``inputs`` when ``io_details=False``, …). The flat ``/api/tools?in_panel=False`` listing — the only caller that benefits from the fast path — already bypasses this method: ``services/tools.py:list_tools`` walks ``tool_index.list_all()`` and calls ``entry.to_api_dict()`` directly. Always materialise here; the entry-only dict survives as the fallback when materialise itself raises (a tool whose XML the parameter factory chokes on — ``upload_dataset``, ``column="value"``, …) so the caller still gets *something* renderable rather than a 500. Tests updated: - ``test_to_dict_fast_path_does_not_materialise`` → renamed to ``test_to_dict_falls_back_to_entry_dict_when_materialise_fails``. - ``test_to_dict_link_details_materialises_exactly_once`` → renamed for accuracy; behaviour unchanged.
After five rounds of CI surfacing missing stub attrs and the latest round still hitting indirect failures (workflow steps with ``module`` unset because some inner attribute access raised under strict), the explicit ``_MATERIALIZE_OK`` set has stopped converging. The strict guard caught the obvious surface and now slows ship velocity without adding signal. Flip the default: unknown attr reads on ``LazyTool`` now log a WARNING and materialise. The strict raise is still available for debugging via ``LAZY_TOOL_STRICT=1`` — the path that surfaces ``add to the stub surface or _MATERIALIZE_OK`` for new audit work. ``test_strict_getattr_raises_with_clear_message`` now monkey-patches the module flag to force strict mode for the assertion. Other tests are unchanged.
These were emitted on every entry of /api/tools but no client component ever read them. The Vue panel renderer derives the link target from ``model_class == "DataSourceTool"`` via ``getTargetById`` and never reads ``min_width`` at all; the Tool interface declared the fields purely to match the wire shape. ``Tool.target`` / ``Tool.uihints`` likewise had ``to_dict`` as their sole reader in lib/. Removed: - ``self.target`` / ``self.uihints`` attributes and the corresponding ``<inputs target>`` and ``<uihints minwidth>`` reads in ``__parse_legacy_features`` - ``DataSourceTool.parse_inputs`` self.target = "_top" override (dead) - ``min_width`` / ``target`` keys from both the eager ``to_dict`` payload and the ``LazyTool.to_dict`` stub The XSD declarations for ``<uihints minwidth>`` and ``<inputs target>`` stay so existing data-source tool XML continues to validate cleanly.
Companion to the server-side removal in ``Tool.to_dict``. The only component still reading ``item.target`` was ``ToolsListTable.vue``, which used ``target === 'galaxy_main'`` to flag a tool as "local" (driving the wrench-vs-external icon, the External badge, and internal vs ``href`` routing). The server-side equivalent of that flag is ``isinstance(self, DataSourceTool)`` — the only tool class whose ``target`` was ever ``_top`` rather than ``galaxy_main`` — so switch the check to ``model_class !== 'DataSourceTool'``, matching the heuristic already used by ``toolStore.getTargetById``. Removed: - ``min_width`` / ``target`` from the ``Tool`` interface - ``min_width`` / ``target`` keys from the ToolsView test fixture - ``min_width`` / ``target`` from the ``createMockTool`` factory in tool-version.test.ts
After commit ``108529ed9ee`` removed ``min_width`` and ``target``,
``Tool.to_dict``'s ``link_details=True`` branch only added ``link`` — a
two-line ``url_for`` call that returns a stable URL pattern. The
toolbox internals (``get_tool_to_dict``, ``to_dict(in_panel=True)``,
``to_panel_view``, the ``ToolSection`` else-branch in both) all
unconditionally passed ``link_details=True``. Under ``LazyToolBox``
that forced full materialise on every panel-visible top-level tool
just to compute one URL.
Promote ``link`` to an unconditional field in ``Tool.to_dict`` and drop
the ``link_details=True`` from every toolbox-internal caller. The
``link_details`` parameter stays on the signature (still accepted by
``/api/tools/{id}?link_details=…``) but is now a no-op gate — no field
is conditional on it.
Net effect:
- Lazy mode: ``to_panel_view`` no longer materialises top-level tools.
The lazy stub serves ``link`` from its ``id`` in the entry-shape fast
path (restored in a follow-up commit).
- Eager mode: serialisation drops one ``if`` branch per tool. No
visible behaviour change.
321/321 unit tests pass.
With ``link_details`` no longer gating any field, the toolbox-internal
panel / listing callers stopped passing it (``f64858bf55e``).
``LazyTool.to_dict`` can now return the entry-shape dict for the
default-call case — every panel-view tool serves from the index, zero
materialise.
Materialise still fires when the caller asks for the parsed-Tool
payload: ``io_details=True`` (the ``/api/tools/{id}`` show endpoint
contract) or ``tool_help=True`` (rendered help). Failure-to-materialise
still falls back to the entry-shape dict so the show endpoint never
returns 500 when a specific tool's XML chokes the parameter factory.
The ``link`` field is derived from ``self.id`` matching what
``Tool.to_dict`` now emits unconditionally — ``/tool_runner?tool_id={id}``
for the standard tool runner.
Tests updated:
- ``test_to_dict_entry_fast_path_does_not_materialise`` asserts the
default call serves the entry dict and never invokes the callback.
- ``test_to_dict_io_details_materialises`` covers the show contract.
- ``test_to_dict_falls_back_to_entry_when_materialise_fails`` covers the
parameter-factory-failure case.
23/23 unit tests pass.
`LazyTool` materialise during workflow ``inject_all`` calls ``store.get(hash)``; the store ran the SELECT on Galaxy's shared scoped session and then called ``session.rollback()`` to release the implicit read transaction. That rollback expired *every* request-scoped instance attached to the shared session — including the in-flight ``workflow.steps`` collection. The next access in the same ``populate_module_and_state`` re-fetched a fresh ``WorkflowStep`` list from the DB, with ``.module`` unset on the new instances, so ``compute_runtime_state``'s ``assert step.module`` raised ``AttributeError: 'WorkflowStep' object has no attribute 'module'``. Every workflow-invocation test under ``use_lazy_toolbox=true`` 500'd. Fix: read methods on ``DatabaseToolSourceStore`` (``get``, ``exists``, ``get_by_tool_id``, ``get_by_source_path``, ``count``, ``list_all``, ``load_index``) now open a private ``Session`` bound to the same engine and close it on exit. The shared scoped session — and its caller's transaction state — is untouched. The original lock-release intent (cold-start populator, queue-worker init) still holds because each private session's read transaction ends when the session closes. Writes (``store``, ``store_index``, ``delete``, ``update_index_entry``) still use ``_get_session()`` so they participate in the caller's transaction and commit in the right context (populator, queue worker, ``LazyToolBox.__init__``).
The lazy ``to_dict`` fast path was short-circuiting *both* the
panel-view walk (intended) *and* the ``/api/tools/{id}`` show endpoint
(unintended) — they're indistinguishable at the ``to_dict(io_details=
False, link_details=False)`` callsite. ``test_legacy_biotools_xref_
injection`` caught it: the show endpoint returned the entry-shape
dict and no longer carried ``xrefs``.
Split the two call sites at the API rather than the parameter:
* ``Tool.to_panel_entry(trans)`` — new method, returns the cheap
entry-shape dict (id/name/version/description/labels/edam/panel
section/link). ``LazyTool`` overrides it to skip materialise.
* ``AbstractToolBox.get_tool_to_dict`` calls ``to_panel_entry`` when
``tool_help=False`` (the panel-view default); ``tool_help=True``
still falls through to ``to_dict`` so the help payload renders.
* ``LazyTool.to_dict`` always materialises now — it's the
``/api/tools/{id}`` contract and must ship ``xrefs`` / ``versions``
/ ``is_workflow_compatible`` / ``tool_shed_repository`` / etc. The
materialise-failure fallback now reuses ``to_panel_entry``.
Net effect: ``/api/tool_panels/{view}`` and ``/api/tools?in_panel=
true`` stay materialise-free for the lazy toolbox; ``/api/tools/{id}``
serves the full show-endpoint dict in both modes.
I would trade some admin complexity we can hide behind Ansible over pulling the code base in directions we don't want (there was a lot in here I liked and thought were good directions on last review - just not all of it). Admin of Galaxy is complex but the codebase itself is much more complex 😅 IMO - I also think decomposing the API layer is the kind of abstraction we want whereas making the tool APIs more brittle and less functional or reducing the cohesion (not that you're doing that - or the latest version is doing this - but it is my directional concern the API decomposition) is something we don't want IMO. You're scaring me but I trust you. Sorry if it feels like I'm picking on this PR. |
|
It's all still in draft, though i think this version is a good bit better now in terms of the panel handling, and special casing. I don't think this will make it to 26.1, but it seems feasible to have just the toolbox variant and the stub tools and not touch any of the remaining code, keeping all features and APIs intact (we should drop some legacy data like links and min_width). The rest of the code is mostly about resetting the state for integration tests and laying out what will cause full materialisation of a tool. |
Summary
Introduces a pluggable tool source storage layer plus a
LazyToolBoxthat loads tools on demand from that store. The change converts batch tool endpoints from O(N tools) to O(1) on a pre-computed index, supports shipping read-only tool source bundles via a composite of named stores, and is fully opt-in — default deployments keep the existing eagerToolBox.Motivation
Galaxy startup cost and batch endpoint latency both scale linearly with the number of tools. For installs with thousands of tools (and especially for shed-install-heavy deployments) this becomes the dominant cost of
/api/tools,/api/tool_panels, requirements aggregation, and tests-summary. This PR moves the parsed tool sources into a queryable store, builds a lightweight index for batch reads, and lets the toolbox materialize individual
Toolobjects only when actually needed.Key components
tool_source_store/—ToolSourceStoreABC withdatabase,sqlalchemy(for read-only SQLite bundles), andcompositebackends; sharedindexserializer; benchmarks.tool_indextable — pre-computed index used by batch endpoints;LargeBinarywith MySQLLONGBLOBvariant.scripts/tool_source/populate_store.py— walkstool_conffiles, parses each tool source into the store; supports--watchfor incremental updates with cross-process cache invalidation via the queue worker.tools/lazy_toolbox.py— extendsToolBox, materializes tools on demand, LRU-cached bylazy_toolbox_cache_size./api/tools,/api/tool_panels, tests summary, and/api/tools/all_requirementsroute through the index when LazyToolBox is active.apiandintegrationworkflows now also run underuse_lazy_toolbox: true.How tools are loaded and retrieved
flowchart TD subgraph Build["Population (offline / watch mode)"] TC["tool_conf.xml / .yml"] PS["scripts/tool_source/populate_store.py"] DI["_discover.py"] TC --> DI --> PS end subgraph Stores["Tool source storage"] DB[("database backend<br/>tool_source + tool_index")] SA[("sqlalchemy backend<br/>read-only SQLite bundle")] COMP{{"Composite store<br/>read fall-through<br/>writes go to default"}} DB --- COMP SA --- COMP end PS -- "write StoredToolSource" --> COMP subgraph App["Galaxy app startup"] CFG["config_schema.yml<br/>tool_source_store / use_lazy_toolbox"] BUILD["build_tool_source_store"] APP["MinimalGalaxyApplication"] CFG --> BUILD --> APP APP -->|"use_lazy_toolbox=true"| LTB["LazyToolBox<br/>+ LRU cache"] APP -->|"use_lazy_toolbox=false"| ETB["Eager ToolBox"] end COMP -.read.-> LTB subgraph Runtime["Request path"] API["/api/tools, /api/tool_panels,<br/>tests summary, all_requirements"] SVC["ToolsService"] IDX["Pre-computed index<br/>constant-time batch reads"] TOOL["Tool object<br/>materialized on demand"] API --> SVC SVC -->|"batch listing"| IDX SVC -->|"single tool"| LTB LTB -->|"cache miss"| COMP LTB -->|"cache hit"| TOOL COMP -->|"StoredToolSource"| LTB LTB --> TOOL end subgraph Invalidate["Watch mode"] WATCH["populate_store --watch"] QW["queue_worker<br/>reload_tool_source_cache"] WATCH -- "control message" --> QW QW -- "invalidate_index_cache" --> LTB QW -- "invalidate_index_cache" --> COMP endConfiguration
Per-conf opt-in via
store="cvmfs_main"on a<toolbox>element pulls that named store into the composite.Backwards compatibility
use_lazy_toolboxdefaults tofalse, andtool_source_storeinitializes a database-backed store but nothing reads from it unless opted in.Follow-ups deferred from this PR
/api/tool_sources/*and/api/tool_index/*admin/diagnostic endpoints (dropped here — no UI consumer; will land in a follow-up alongside an admin UI).License