Do not attempt to restore an allocation whose directory is missing#27933
Do not attempt to restore an allocation whose directory is missing#27933rodrigol-chan wants to merge 4 commits into
Conversation
516e30d to
b6aff48
Compare
tgross
left a comment
There was a problem hiding this comment.
@rodrigol-chan this is unlikely to be the approach we'd want to take here. If the allocir is gone, it's not going to be enough to re-run task hooks. If the alloc dir is missing, we should be failing these allocations during the "restore" process inclient/client.go so that the entire allocation gets recreated from scratch.
Just generally speaking it's not a good idea from a security perspective to have Nomad touching a directory that the tasks have already been in. And there's no way to detect in this code path whether the allocdir just didn't ever exist or whether it was messed with by the allocation.
|
That makes sense, thanks! Would you prefer if I close this PR and submit a new one with those changes or that I morph this one into failing the allocation? |
|
You can rework this one, that's fine. |
5c25635 to
a5e27b9
Compare
a5e27b9 to
3856eee
Compare
3856eee to
8722bae
Compare
|
Finally got back to this. Fixing the issue required updating all tests that attempt to restore an allocation. I chose to do this in the same commit that would break them but I can split it if you'd like. |
|
Thanks @rodrigol-chan. I'm probably not going to get a chance to review this for a couple days but I wanted to let you know it's on my queue! |
Description
We've recently noticed that tasks would sometimes encounter strange errors after reboot. The reproducer seems to be:
rootraw_exectask.nobodydockertask.nobodydockertask.client.alloc_diron local SSDs. These SSDs are wiped on power-off and when the instance is preempted.data_diris on persistent storage.nobodytasks have no permission to write to their$NOMAD_TASK_DIRbecause$NOMAD_TASK_DIRisroot-owned.I think the issue is that the allocation directory is gone entirely but Nomad apparently doesn't recreate it the same it it creates it. This PR makes Nomad makes Nomad re-run the task hooks if the task directory is missing.
I'm not sure what the correct approach should be here. There are other hooks, both at the task level and the allocation level, but I'm not sure what the consequences are, so this is a minimal fix.
Contributor Checklist
changelog entry using the
make clcommand.ensure regressions will be caught.
and job configuration, please update the Nomad product documentation, which is stored in the
web-unified-docsrepo. Refer to theweb-unified-docscontributor guide for docs guidelines.Please also consider whether the change requires notes within the upgrade
guide. If you would like help with the docs, tag the
nomad-docsteam in this PR.Reviewer Checklist
backporting document.
in the majority of situations. The main exceptions are long-lived feature branches or merges where
history should be preserved.
within the public repository.