Do not attempt to restore an allocation whose directory is missing#27933

Open
rodrigol-chan wants to merge 4 commits into
hashicorp:main from
rodrigol-chan:rzl/rerun-hooks-if-alloc-dir-is-missing

@rodrigol-chan
Contributor

@rodrigol-chan rodrigol-chan commented May 8, 2026

Description

We've recently noticed that tasks would sometimes encounter strange errors after reboot. The reproducer seems to be:

  • We have a system job comprising:
    • A one-shot raw_exec task running as root.
    • A sidecar docker task running as nobody.
    • A main docker task running as nobody.
  • Our node pool is mostly Google Cloud preemptible instances.
  • For performance reasons, we place client.alloc_dir on local SSDs. These SSDs are wiped on power-off and when the instance is preempted.
  • The client's data_dir is on persistent storage.
  • After the machine is preempted, comes back up, and attempts to restart the allocation, the nobody tasks have no permission to write to their $NOMAD_TASK_DIR because $NOMAD_TASK_DIR is root-owned.

I think the issue is that the allocation directory is gone entirely, but Nomad apparently doesn't recreate it the same way it originally creates it. This PR makes Nomad re-run the task hooks if the task directory is missing.

I'm not sure what the correct approach should be here. There are other hooks, both at the task level and the allocation level, but I'm not sure what the consequences of re-running them would be, so this is a minimal fix.

Contributor Checklist

  • Changelog Entry If this PR changes user-facing behavior, please generate and add a
    changelog entry using the make cl command.
  • Testing Please add tests to cover any new functionality or to demonstrate bug fixes and
    ensure regressions will be caught.
  • Documentation If the change impacts user-facing functionality such as the CLI, API, UI,
    and job configuration, please update the Nomad product documentation, which is stored in the
    web-unified-docs repo. Refer to the web-unified-docs contributor guide for docs guidelines.
    Please also consider whether the change requires notes within the upgrade guide.
    If you would like help with the docs, tag the nomad-docs team in this PR.

Reviewer Checklist

  • Backport Labels Please add the correct backport labels as described by the internal
    backporting document.
  • Commit Type Ensure the correct merge method is selected which should be "squash and merge"
    in the majority of situations. The main exceptions are long-lived feature branches or merges where
    history should be preserved.
  • Enterprise PRs If this is an enterprise only PR, please add any required changelog entry
    within the public repository.

@rodrigol-chan rodrigol-chan requested review from a team as code owners May 8, 2026 09:26
@rodrigol-chan rodrigol-chan force-pushed the rzl/rerun-hooks-if-alloc-dir-is-missing branch from 516e30d to b6aff48 Compare May 8, 2026 11:19
Member

@tgross tgross left a comment

@rodrigol-chan this is unlikely to be the approach we'd want to take here. If the alloc dir is gone, it's not going to be enough to re-run task hooks. If the alloc dir is missing, we should be failing these allocations during the "restore" process in client/client.go so that the entire allocation gets recreated from scratch.

Just generally speaking it's not a good idea from a security perspective to have Nomad touching a directory that the tasks have already been in. And there's no way to detect in this code path whether the allocdir just didn't ever exist or whether it was messed with by the allocation.

@rodrigol-chan
Contributor Author

That makes sense, thanks! Would you prefer that I close this PR and submit a new one with those changes, or that I rework this one to fail the allocation instead?

@tgross
Member

tgross commented May 8, 2026

You can rework this one, that's fine.

@rodrigol-chan rodrigol-chan force-pushed the rzl/rerun-hooks-if-alloc-dir-is-missing branch from 5c25635 to a5e27b9 Compare May 12, 2026 08:40
@rodrigol-chan rodrigol-chan changed the title Rerun task dir hooks if task directory is missing Do not attempt to restore an allocation whose directory is missing May 12, 2026
@rodrigol-chan rodrigol-chan force-pushed the rzl/rerun-hooks-if-alloc-dir-is-missing branch from a5e27b9 to 3856eee Compare May 19, 2026 14:36
@rodrigol-chan rodrigol-chan force-pushed the rzl/rerun-hooks-if-alloc-dir-is-missing branch from 3856eee to 8722bae Compare May 19, 2026 14:55
@rodrigol-chan
Contributor Author

Finally got back to this. Fixing the issue required updating all tests that attempt to restore an allocation. I chose to do this in the same commit that would break them but I can split it if you'd like.

@rodrigol-chan rodrigol-chan requested a review from tgross May 19, 2026 15:34
@tgross
Member

tgross commented May 20, 2026

Thanks @rodrigol-chan. I'm probably not going to get a chance to review this for a couple days but I wanted to let you know it's on my queue!

3 participants