Skip to content

Defer dynamic destination evaluation across resubmits#22670

Merged
mvdbeek merged 6 commits into
galaxyproject:devfrom
nuwang:dynamic_destination_resubmit_chaining
May 19, 2026
Merged

Defer dynamic destination evaluation across resubmits#22670
mvdbeek merged 6 commits into
galaxyproject:devfrom
nuwang:dynamic_destination_resubmit_chaining

Conversation

@nuwang
Copy link
Copy Markdown
Member

@nuwang nuwang commented May 10, 2026

Chained dynamic destinations (e.g. gateway_1x -> gateway_1_5x -> gateway_2x, or any TPV setup with destination: tpv_dispatcher on resubmit) only resubmitted once. The resubmit handler in
state_handlers/resubmit.py was eagerly re-evaluating the chain via set_cached_job_destination and caching the resolved bottom. Subsequent failures read resubmit definitions from that bottom, losing any definitions set on intermediate dynamic hops.

Defer evaluation instead. The resubmit handler now persists only the dynamic intent on the job (e.g. destination_id="tpv_dispatcher", job_runner_name="dynamic") and exits without touching the mapper cache. When __handle_waiting_jobs picks the RESUBMITTED job up, __recover_job_wrapper(resubmit=True) calls cache_job_destination(...) to route through __determine_job_destination, which walks the entire chain afresh on every attempt.

Caching invariants are preserved within a single attempt: the mapper still caches the resolved destination on first evaluation and serves all subsequent reads from cache. Cross-attempt, the cache is naturally fresh because each pickup constructs a new JobWrapper/JobRunnerMapper.

Adds an integration test exercising a 3-link dynamic chain (initial_destination -> secondary_destination -> tertiary_destination) to verify multiple resubmits re-walk the chain end-to-end.

Refs: #7118, #15208. Adopts the approach proposed in #9747.

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

Chained dynamic destinations (e.g. gateway_1x -> gateway_1_5x -> gateway_2x,
or any TPV setup with `destination: tpv_dispatcher` on resubmit) only
resubmitted once. The resubmit handler in
`state_handlers/resubmit.py` was eagerly re-evaluating the chain via
`set_cached_job_destination` and caching the resolved bottom. Subsequent
failures read resubmit definitions from that bottom, losing any
definitions set on intermediate dynamic hops.

Defer evaluation instead. The resubmit handler now persists only the
dynamic *intent* on the job (e.g. `destination_id="tpv_dispatcher"`,
`job_runner_name="dynamic"`) and exits without touching the mapper
cache. When `__handle_waiting_jobs` picks the RESUBMITTED job up,
`__recover_job_wrapper(resubmit=True)` calls `cache_job_destination(...)`
to route through `__determine_job_destination`, which walks the entire
chain afresh on every attempt.

Caching invariants are preserved within a single attempt: the mapper
still caches the resolved destination on first evaluation and serves
all subsequent reads from cache. Cross-attempt, the cache is naturally
fresh because each pickup constructs a new `JobWrapper`/`JobRunnerMapper`.

Adds an integration test exercising a 3-link dynamic chain
(initial_destination -> secondary_destination -> tertiary_destination)
to verify multiple resubmits re-walk the chain end-to-end.

Refs: galaxyproject#7118, galaxyproject#15208. Adopts the approach proposed in galaxyproject#9747.
nuwang added 3 commits May 11, 2026 00:18
The deferred-evaluation change in the previous commit moved persistence
to the dynamic dispatcher (e.g. tpv_dispatcher with mostly-empty static
params), which had the side effect of overwriting `job.destination_params`
on resubmit. Dynamic rules that branch on prior-attempt context by
reading `job.destination_params` (e.g. TPV's pattern of escalating
memory via `job.destination_params.get("SCALING_FACTOR")`) lost that
context after the first resubmit and produced the same destination on
every attempt.

Carry the prior attempt's `destination_params` forward by merging them
under the resubmit destination's static params before persisting. The
static dispatcher's keys (`function`, `rules_module`, `type`, ...) win
on conflicts so chain re-walk can still find the right rule; everything
else from the prior attempt is preserved for the rule to read.
Five focused tests covering the behaviour introduced in the previous
two commits:

- The handler does NOT call set_cached_job_destination (a regression
  test for the deferred-evaluation invariant).
- It persists the dynamic dispatcher's id/runner as the *intent* on
  the job, not the resolved bottom of the prior chain walk.
- Prior attempt's destination_params are merged into the persisted
  destination so dynamic rules that branch on prior context (e.g.
  TPV's SCALING_FACTOR escalation) keep seeing them.
- Static dispatcher params (function, rules_module, ...) take
  precedence on key conflicts so the chain re-walk picks the right
  rule even if the prior resolved destination accidentally shadowed
  them.
- The __resubmit_delay_seconds flag is persisted alongside the merged
  params so is_ready_for_resubmission picks it up.

Tests use SimpleNamespace and unittest.mock to stand in for the Job
model, JobWrapper, and JobConfig — no DB or Galaxy app required.
The TestJobResubmissionDynamicMultipleIntegration test previously only
asserted the chained dynamic destinations (initial -> secondary ->
tertiary) walked end-to-end without crashing. Extend each rule in
resubmission_rules/rules.py to also read job.destination_params for
a `chain_attempt` counter set by the previous link, and raise
JobMappingException if it does not match the expected value.

This makes the same single integration test simultaneously verify
both behaviours of this PR:

1. Chain re-walk: every resubmit re-evaluates from the persisted
   dynamic intent rather than reusing the cached resolved destination
   of the prior attempt.
2. destination_params carry-forward: the prior attempt's resolved
   destination_params survive the resubmit handler so the next rule
   in the chain can read them on pickup.

A regression in either property surfaces as a JobMappingException
inside the rule, failing the integration test rather than producing
a silently-wrong result.

Verified passing locally (47s) alongside the other dynamic
resubmission integration tests:
- TestJobResubmissionDynamicIntegration::test_dynamic_resubmission
- TestJobResubmissionSmallMemoryResubmitsToLargeIntegration::test_dynamic_resubmission
Comment thread test/integration/resubmission_rules/rules.py Outdated
Comment thread test/unit/app/jobs/test_resubmit_state_handler.py Outdated
Remove redundant unit tests
Move import
@@ -0,0 +1,41 @@
<?xml version="1.0"?>
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bit of a nitpick but i like to pick up things from integration tests when i actually set up a server config. Could you write this in yaml to ease that ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Copy Markdown
Member

@mvdbeek mvdbeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome, thanks for working that out!

@nuwang nuwang marked this pull request as ready for review May 19, 2026 10:34
@github-actions github-actions Bot added this to the 26.1 milestone May 19, 2026
@nuwang nuwang marked this pull request as draft May 19, 2026 10:35
Address PR review feedback (galaxyproject#22670): integration
test job configs are sometimes lifted into real server configs, so YAML
is preferred over XML for readability.

Replaces resubmission_dynamic_multiple_job_conf.xml with an equivalent
.yml fixture and updates the test reference. No behavioural change;
the TestJobResubmissionDynamicMultipleIntegration test still passes
(verified locally, 22.7s).
@nuwang nuwang marked this pull request as ready for review May 19, 2026 10:40
@nuwang
Copy link
Copy Markdown
Member Author

nuwang commented May 19, 2026

Companion PR in TPV: galaxyproject/total-perspective-vortex#195 was tested with these changes, and the tests pass.

@mvdbeek mvdbeek merged commit 74b04d4 into galaxyproject:dev May 19, 2026
59 checks passed
@github-project-automation github-project-automation Bot moved this from Needs Review to Done in Galaxy Dev - weeklies May 19, 2026
@mvdbeek
Copy link
Copy Markdown
Member

mvdbeek commented May 19, 2026

Thank you @nuwang !

@nuwang nuwang deleted the dynamic_destination_resubmit_chaining branch May 19, 2026 15:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Development

Successfully merging this pull request may close these issues.

3 participants