Skip to content

Commit 7db70bd

Browse files
committed
refactor(hooks): atomic staging-based lib swap + drop redundant chmod (#1490)
Addresses two EVAL findings from the PR #1490 review: **HIGH #1 — race window during lib sync** The previous implementation was rmtree-then-copytree: if target_lib.exists(): shutil.rmtree(target_lib) shutil.copytree(source_lib, target_lib, ...) which left ``target_lib`` missing for the full duration of the copytree (hundreds of milliseconds for a multi-megabyte lib). Any concurrent HUD subprocess that statted the installed script during that window would see ``ImportError`` and render the fallback face — ironic given this PR exists to fix exactly that symptom. Replaced with a staging/rename pattern: 1. ``copytree(source_lib, .lib.staging-<uuid>, ignore=...)`` 2. ``rename(target_lib, .lib.old-<uuid>)`` ← atomic syscall 3. ``rename(.lib.staging-<uuid>, target_lib)`` ← atomic syscall 4. ``rmtree(.lib.old-<uuid>)`` in ``finally`` The window during which ``target_lib`` does not exist now shrinks from ``O(copytree)`` to ``O(rename)`` — effectively atomic on POSIX. Concurrent installers from multiple simultaneous Claude Code sessions are safe because each uses a uuid-scoped staging dir. Rollback: if ``copytree`` raises mid-sync the except block removes the staging dir and restores ``target_lib`` from the archive, so an existing working install is never lost to a partial sync. **HIGH #2 — redundant chmod after rename in _install_hook_with_lib** ``_atomic_sync_with_lib`` already chmod's the synced script to 0o755 before ``_install_hook_with_lib`` renames it to HOOK_FILENAME, and POSIX rename preserves permission bits. The second chmod was defensive but triggered an unnecessary silent-failure path on filesystems that reject mode changes (NFS, read-only mounts). Removed with a code comment explaining why. **New regression tests (+2)**: - ``test_no_staging_leftovers_after_success`` — no stray ``.lib.staging-*`` / ``.lib.old-*`` dirs after a successful sync - ``test_rollback_preserves_old_lib_when_copytree_fails`` — monkeypatches ``shutil.copytree`` to raise OSError, then asserts the pre-existing target lib survives and no staging/archive dirs leak Local verification: 105 pytest suites pass (100 + 5 from prior commit + 2 new race-safety gates). Refs #1490
1 parent 903aded commit 7db70bd

2 files changed

Lines changed: 142 additions & 22 deletions

File tree

packages/claude-code-plugin/hooks/session-start.py

Lines changed: 79 additions & 22 deletions
Original file line numberDiff line numberDiff line change
@@ -426,50 +426,99 @@ def _atomic_sync_with_lib(
426426
(#1490) where renamed/removed modules from prior plugin versions
427427
remained in the target directory and caused import failures.
428428
429-
Behavior:
429+
Behavior (3-phase, near-atomic lib swap):
430430
1. ``mkdir -p target_dir``
431431
2. Copy ``source_script`` to ``target_dir/<source_script.name>``
432-
and ``chmod 0o755``
433-
3. If ``source_script.parent / "lib"`` exists, **rmtree** any
434-
existing ``target_dir/lib`` and then ``copytree`` the source
435-
lib so renamed modules cannot linger.
436-
437-
Why rmtree-then-copytree (and not ``dirs_exist_ok=True``):
438-
``dirs_exist_ok=True`` only writes; it does not remove files
439-
that existed before but are gone now. A renamed module
432+
and ``chmod 0o755``.
433+
3. If ``source_script.parent / "lib"`` exists:
434+
a. Copy it into a staging directory ``.lib.staging-<uuid>``
435+
b. Rename the existing ``lib`` aside to ``.lib.old-<uuid>``
436+
c. Rename the staging directory to ``lib`` (POSIX ``rename``
437+
is atomic, so the window during which importers see no
438+
lib is bounded by two ``rename`` syscalls — microseconds)
439+
d. ``rmtree`` the archived old lib
440+
441+
Why staging instead of rmtree-then-copytree:
442+
The naive pattern ``rmtree(target_lib); copytree(source_lib, target_lib)``
443+
leaves the target empty for the duration of ``copytree`` (which
444+
can be hundreds of milliseconds for a multi-megabyte lib). Any
445+
concurrent HUD subprocess that statted the script during that
446+
window would see import failures and render the fallback face.
447+
The staging/rename pattern shrinks the window from ``O(copytree)``
448+
to ``O(rename)`` — effectively atomic. It also tolerates
449+
concurrent installers from multiple simultaneous Claude Code
450+
sessions because each uses a uuid-scoped staging directory.
451+
452+
Why not ``dirs_exist_ok=True`` on its own:
453+
That mode only writes; it does not remove files that existed
454+
before but are gone now. A renamed module
440455
(e.g. ``hud_old.py`` → ``hud_new.py``) would remain in the
441456
target lib and could be imported first, causing subtle
442-
regressions. session-start runs once per Claude Code session,
443-
so the cost of the additional rmtree is negligible.
457+
regressions. The staging pattern gives the same stale-safety
458+
guarantees with atomicity on top.
444459
445460
Args:
446461
source_script: Path to the script file to install. Its parent
447462
directory is searched for a sibling ``lib/`` to mirror.
448463
target_dir: Destination directory. Created if missing.
449464
extra_ignore: Additional ignore-pattern tuple appended to the
450-
shared :data:`_HUD_SYNC_IGNORE_PATTERNS` list.
465+
shared :data:`_HUD_SYNC_IGNORE_PATTERNS` list. Reserved for
466+
future wave-specific excludes; both production callers
467+
currently pass ``None``. Runtime lib modules are assumed
468+
to follow the ``hud_*`` / ``tiny_actor_*`` naming
469+
convention — do NOT add runtime helpers named
470+
``test_*.py`` or the ignore filter will drop them.
451471
"""
472+
import uuid
473+
452474
target_dir.mkdir(parents=True, exist_ok=True)
453475

454476
# 1. Script
455477
target_script = target_dir / source_script.name
456478
shutil.copy(source_script, target_script)
457479
target_script.chmod(0o755)
458480

459-
# 2. Lib (atomic replace)
481+
# 2. Lib — staging/rename pattern for near-atomic swap
460482
source_lib = source_script.parent / "lib"
461-
if source_lib.is_dir():
462-
target_lib = target_dir / "lib"
463-
if target_lib.exists():
464-
shutil.rmtree(target_lib)
465-
ignore_patterns = _HUD_SYNC_IGNORE_PATTERNS
466-
if extra_ignore:
467-
ignore_patterns = ignore_patterns + tuple(extra_ignore)
483+
if not source_lib.is_dir():
484+
return
485+
486+
target_lib = target_dir / "lib"
487+
ignore_patterns = _HUD_SYNC_IGNORE_PATTERNS
488+
if extra_ignore:
489+
ignore_patterns = ignore_patterns + tuple(extra_ignore)
490+
491+
suffix = uuid.uuid4().hex[:8]
492+
staging_lib = target_dir / f".lib.staging-{suffix}"
493+
archive_lib = target_dir / f".lib.old-{suffix}"
494+
495+
try:
468496
shutil.copytree(
469497
source_lib,
470-
target_lib,
498+
staging_lib,
471499
ignore=shutil.ignore_patterns(*ignore_patterns),
472500
)
501+
# Atomic archive + promote. Each rename is a single syscall,
502+
# so the window during which ``target_lib`` does not exist is
503+
# bounded by one ``rename`` (~microseconds).
504+
if target_lib.exists():
505+
os.rename(str(target_lib), str(archive_lib))
506+
os.rename(str(staging_lib), str(target_lib))
507+
except Exception:
508+
# Leave target_lib in the best recoverable state: prefer the
509+
# old lib (from archive) over nothing. Staging is always
510+
# disposable.
511+
if staging_lib.exists():
512+
shutil.rmtree(staging_lib, ignore_errors=True)
513+
if archive_lib.exists() and not target_lib.exists():
514+
try:
515+
os.rename(str(archive_lib), str(target_lib))
516+
except OSError:
517+
pass
518+
raise
519+
finally:
520+
if archive_lib.exists():
521+
shutil.rmtree(archive_lib, ignore_errors=True)
473522

474523

475524
def _install_hook_with_lib(
@@ -484,6 +533,13 @@ def _install_hook_with_lib(
484533
that canonical name (``codingbuddy-mode-detect.py``) rather than
485534
the source name (``user-prompt-submit.py``).
486535
536+
Note on permission bits: ``_atomic_sync_with_lib`` has already
537+
chmod'd the synced script to ``0o755`` before we rename it, and
538+
POSIX ``rename`` preserves permission bits across the move. We
539+
therefore do not re-chmod after the rename; a redundant ``chmod``
540+
there would trigger an unnecessary silent failure path on
541+
filesystems that reject mode changes (NFS, read-only mounts).
542+
487543
Args:
488544
source_file: Path to the source hook script.
489545
hooks_dir: Target directory (e.g. ``~/.claude/hooks/``).
@@ -495,7 +551,8 @@ def _install_hook_with_lib(
495551
if target_file.exists():
496552
target_file.unlink()
497553
synced.rename(target_file)
498-
target_file.chmod(0o755)
554+
# No chmod here: _atomic_sync_with_lib already set 0o755 on
555+
# ``synced`` and ``rename`` preserves mode.
499556

500557

501558
CODINGBUDDY_MCP_ENTRY = {

packages/claude-code-plugin/tests/test_atomic_sync_with_lib.py

Lines changed: 63 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -199,3 +199,66 @@ def test_extra_ignore_argument_is_honored(
199199
target_lib = target_dir / "lib"
200200
assert (target_lib / "mod_a.py").is_file()
201201
assert not (target_lib / "shared_helper.py").exists()
202+
203+
def test_no_staging_leftovers_after_success(
204+
self, fake_source_with_lib, target_dir
205+
):
206+
"""Staging/archive directories must be cleaned up on success.
207+
208+
Regression gate for the v5.6.2 atomic-swap refactor: after a
209+
successful sync, the parent directory must NOT contain any
210+
stray ``.lib.staging-*`` or ``.lib.old-*`` entries, or the
211+
next session-start would race on them.
212+
"""
213+
session_start._atomic_sync_with_lib(fake_source_with_lib, target_dir)
214+
leftover_staging = list(target_dir.glob(".lib.staging-*"))
215+
leftover_old = list(target_dir.glob(".lib.old-*"))
216+
assert leftover_staging == [], (
217+
f"staging directories leaked: {leftover_staging}"
218+
)
219+
assert leftover_old == [], (
220+
f"archive directories leaked: {leftover_old}"
221+
)
222+
assert (target_dir / "lib" / "mod_a.py").is_file()
223+
224+
def test_rollback_preserves_old_lib_when_copytree_fails(
225+
self, tmp_path, monkeypatch
226+
):
227+
"""If copytree fails mid-sync, the existing lib must survive.
228+
229+
Simulates a source lib whose copytree raises partway through
230+
and asserts that the pre-existing target_lib is NOT lost.
231+
Protects users from losing a working HUD install if a future
232+
plugin ships a broken source tree or the disk fills up.
233+
"""
234+
src_hooks = tmp_path / "src_hooks"
235+
src_hooks.mkdir()
236+
(src_hooks / "script.py").write_text("# src")
237+
src_lib = src_hooks / "lib"
238+
src_lib.mkdir()
239+
(src_lib / "mod_a.py").write_text("VAL_A = 1")
240+
241+
target = tmp_path / "target"
242+
target.mkdir()
243+
target_lib = target / "lib"
244+
target_lib.mkdir()
245+
(target_lib / "prior_mod.py").write_text("# survivor")
246+
247+
real_copytree = shutil.copytree
248+
249+
def flaky_copytree(*args, **kwargs):
250+
raise OSError("simulated disk full")
251+
252+
monkeypatch.setattr(session_start.shutil, "copytree", flaky_copytree)
253+
254+
with pytest.raises(OSError, match="simulated disk full"):
255+
session_start._atomic_sync_with_lib(
256+
src_hooks / "script.py", target
257+
)
258+
259+
# Old lib must survive
260+
assert target_lib.exists()
261+
assert (target_lib / "prior_mod.py").read_text() == "# survivor"
262+
# No leaked staging or archive dirs
263+
assert list(target.glob(".lib.staging-*")) == []
264+
assert list(target.glob(".lib.old-*")) == []

0 commit comments

Comments
 (0)