Commit ea00e6c
Fix Custodian only killing MPI parent process but leaving VASP ghost processes blocking nodes (#414)
* fix(vasp): kill entire process group when terminating VASP jobs
Use os.killpg() to send SIGTERM/SIGKILL to the entire process group
instead of just the main process. This ensures all MPI child processes
are properly terminated when custodian needs to kill a VASP job.
- Import signal module
- Use os.killpg() with SIGTERM for graceful termination
- Use os.killpg() with SIGKILL for force kill after timeout
- Handle ProcessLookupError when process group is already dead
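
The core primitive described above can be sketched as follows. This is a minimal illustration, not the code in this commit: the helper name `signal_process_group` and its boolean return value are hypothetical, and it assumes a POSIX system where the job was launched in its own process group.

```python
import os
import signal
import subprocess


def signal_process_group(process: subprocess.Popen, sig: int) -> bool:
    """Send `sig` to the whole process group of `process` (POSIX only).

    Hypothetical helper for illustration. Returns False when the process
    group no longer exists, True when the signal was delivered.
    """
    try:
        # getpgid maps the parent PID to its group; killpg signals every member.
        os.killpg(os.getpgid(process.pid), sig)
    except ProcessLookupError:
        # The group is already dead, so there is nothing left to signal.
        return False
    return True
```

Note that `os.killpg` signals every process in the group. If the child shares the caller's process group, the caller receives the signal too, so this sketch presumes the job was started with something like `subprocess.Popen(..., start_new_session=True)`.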
* test(vasp): refactor VaspJob.terminate() tests for process group killing
Consolidate 13 tests across 3 classes into 9 focused tests in 1 class,
reducing ~180 lines to ~95 lines while maintaining full coverage.
Key changes:
- Use pytest-style class with fixture instead of unittest setUp/tearDown
- Use SimpleNamespace for cleaner mock access (mocks.process vs mocks["process"])
- Mock os.killpg and os.getpgid to test new process group termination logic
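
A rough sketch of the mocking pattern listed above, written with the standard library only. The names `terminate_mocks` and `send_sigterm_to_group` are hypothetical stand-ins (the real tests target `VaspJob.terminate()`); in a pytest suite the context manager would typically be wrapped in a fixture.

```python
import os
import signal
from contextlib import contextmanager
from types import SimpleNamespace
from unittest import mock


def send_sigterm_to_group(process) -> None:
    """Hypothetical stand-in for the code under test."""
    os.killpg(os.getpgid(process.pid), signal.SIGTERM)


@contextmanager
def terminate_mocks():
    """Yield mocks as attributes (mocks.process) instead of dict keys."""
    process = mock.Mock()
    process.pid = 4321
    process.poll.return_value = None  # pretend the job is still running
    # create=True lets the patch succeed even where os lacks these functions.
    with mock.patch("os.getpgid", create=True, return_value=1234) as getpgid:
        with mock.patch("os.killpg", create=True) as killpg:
            yield SimpleNamespace(process=process, getpgid=getpgid, killpg=killpg)
```

A test body can then read `mocks.killpg.assert_called_once_with(1234, signal.SIGTERM)` without any dictionary lookups.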
Tests now cover all code paths in terminate():
1. test_terminate_process_already_finished - early return when poll() is not None
2. test_terminate_graceful_success - SIGTERM succeeds, verifies wait(timeout=10)
3. test_terminate_force_kill_after_timeout - SIGTERM times out, SIGKILL + kill()
4. test_terminate_oserror_during_sigterm - OSError propagates from killpg
5. test_terminate_oserror_during_wait - OSError propagates from final wait()
6. test_terminate_process_lookup_error_during_sigterm - graceful handling
7. test_terminate_process_lookup_error_during_sigkill - falls back to kill()
8. test_terminate_multiple_calls - second call returns early
9. test_terminate_integration_with_real_process - verifies real process group killed
Removed duplicate tests:
- test_terminate_wait_with_different_timeout_behavior (same as force_kill test)
- test_terminate_kills_process_group_with_sigterm (same as graceful_success)
- test_terminate_force_kills_process_group_with_sigkill (same as force_kill)
- Redundant integration test (kept the one verifying process group is dead)
* Add configurable terminate_timeout and remove unused function
- Add terminate_timeout parameter to VaspJob (default 10s)
- Large MPI jobs can use longer timeouts before SIGKILL escalation
- Log message now includes timeout value for debugging
- Remove unused copy_contcar_to_poscar_if_valid() from utils.py
- Add test_custom_timeout to verify configurable timeout works
* Skip terminate tests on Windows (os.killpg/getpgid are POSIX-only)
- Code now falls back to process.terminate() on Windows
- Tests use create=True to mock POSIX-only functions on Windows
- Only integration test with real subprocess skipped on Windows
* apply @Andrew-S-Rosen's suggested `terminate` changes and update test expectations
* try fix windows CI
* VaspJob.terminate() with signal escalation + added back wait confirmation
- Add wait after SIGTERM to confirm process death before returning
- Use try/except/else pattern instead of sigterm_sent flag for cleaner flow
- Skip wait and go straight to SIGKILL if SIGTERM raises OSError
- Use terminate_timeout consistently (remove hardcoded 5s for SIGKILL)
- Catch OSError specifically instead of broad Exception
- Add ProcessLookupError handling for race conditions
- Fall back to parent process kill on Windows or if process group methods fail
- Add type hint to directory parameter and document it as unused
- Distinguish "killed with SIGKILL" from fallback "killed" in logs
- Patch custodian.vasp.jobs.os.* in tests to ensure POSIX path is tested on all platforms
- Add test for SIGTERM OSError skipping wait and going to SIGKILL
Flow: SIGTERM → wait(timeout) → SIGKILL → wait(timeout) → fallback terminate/kill
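
The flow above could look roughly like the sketch below. It is an approximation built from the bullet points in this message, not the actual `VaspJob.terminate()` body: the function name `terminate_with_escalation`, the returned labels, and the exact exception handling are assumptions made for illustration.

```python
import os
import signal
import subprocess


def terminate_with_escalation(process: subprocess.Popen, timeout: float = 10.0) -> str:
    """SIGTERM -> wait(timeout) -> SIGKILL -> wait(timeout) -> fallback.

    Hypothetical sketch; returns a label describing how the process ended.
    """
    if process.poll() is not None:
        return "already finished"

    # Process-group signalling exists only on POSIX.
    if hasattr(os, "killpg") and hasattr(os, "getpgid"):
        try:
            os.killpg(os.getpgid(process.pid), signal.SIGTERM)
        except ProcessLookupError:
            return "already gone"  # race: the group died before we signalled it
        except OSError:
            pass  # SIGTERM failed: skip the wait and go straight to SIGKILL
        else:
            try:
                process.wait(timeout=timeout)
                return "terminated with SIGTERM"
            except subprocess.TimeoutExpired:
                pass  # still alive after the grace period: escalate
        try:
            os.killpg(os.getpgid(process.pid), signal.SIGKILL)
            process.wait(timeout=timeout)
            return "killed with SIGKILL"
        except ProcessLookupError:
            return "already gone"
        except (OSError, subprocess.TimeoutExpired):
            pass  # group methods failed: fall back to the parent process

    # Windows, or the process-group path did not succeed.
    process.terminate()
    try:
        process.wait(timeout=timeout)
        return "terminated"
    except subprocess.TimeoutExpired:
        process.kill()
        process.wait()
        return "killed"
```

The `try/except/else` shape keeps the wait tied to a successful SIGTERM, so no separate flag is needed to remember whether the signal was sent. As with any use of `os.killpg`, the sketch assumes the job runs in its own process group.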
* try fix flaky CI
* return on ProcessLookupError after SIGKILL
* revert flaky test fix causing slow CI
* try fix windows CI (again)
* Update logging for process termination signals
Clarify the logging when a `ProcessLookupError` is hit
* Update warning messages for process group handling
---------
Co-authored-by: Andrew S. Rosen <asrosen93@gmail.com>
1 parent 5767f84 commit ea00e6c
3 files changed
Lines changed: 206 additions & 166 deletions