Skip to content

feat(sync): detect signature drift in rsync and revert/drop#235

Open
mosheabr wants to merge 6 commits into
mainfrom
draft/sig-drift-detection
Open

feat(sync): detect signature drift in rsync and revert/drop#235
mosheabr wants to merge 6 commits into
mainfrom
draft/sig-drift-detection

Conversation

@mosheabr
Copy link
Copy Markdown
Collaborator

@mosheabr mosheabr commented Jun 3, 2026

Summary

Adds a "Detect signature drift in rsync" step to sync-skills.yml that catches the case where the source-repo team edits skill content without refreshing skill.oms.sig. Recovery is automatic and safe:

  • Existing skills: revert the catalog dir to its pre-rsync (signed) state. The signed copy stays live on main and across downstream channels (Skills.sh, Codex, Claude Code, Hermes, OpenClaw). The unsigned update is held until the source team re-signs.
  • New skills (no prior version on main): drop outright — never land in a drift state.

Both write to /tmp/dropped-skills.txt with a sig drift: reason prefix. The sync PR body and the rolling missing-compliance tracker issue now partition drops into Missing artifacts / Signature drift / Orphan / Other sections so each failure mode has its own recovery path.

Why now

We've hand-cherry-picked three sync PRs this week (#226#231, #232#233, #234→pending) because source-repo teams edit skill content without re-running /nvskills-ci. The unsigned content keeps reaching the bot sync PR; landing it would propagate to Skills.sh + Codex + Hermes + OpenClaw within ~15 minutes and break crypto verification on the next sweep. Automating the drop turns this into a self-service flow: the source team gets pinged, the catalog never goes through a broken-crypto state.

How drift is detected

For each catalog dir the rsync actually targeted (recorded in /tmp/rsynced-skill-dirs.txt during the rsync loop, so we never path-parse):

  1. Compute the set of files changed via git diff --name-only HEAD ∪ git ls-files --others --exclude-standard — captures both modifications/deletions and untracked additions.
  2. Check whether the root $dir/skill.oms.sig is in that set (exact match via grep -Fx --, not regex — nested sig files like $dir/docs/skill.oms.sig don't mask a stale root).
  3. If non-sig files changed but the root sig didn't → drift.
  4. Existing skill (git ls-tree HEAD -- "$dir" non-empty) → revert via git checkout HEAD -- "$dir/" + git clean -fdx -- "$dir/". New skill → rm -rf.

Repo-owner visibility

The rolling missing-compliance tracker issue is now auto-assigned (mosheabr by default; configurable via CATALOG_TRACKER_ASSIGNEES secret as a comma-separated list). Every drift change generates a GitHub notification regardless of repo-watch state. Idempotent — re-asserts on each run so manual unassign can't permanently mute notifications.

Test cases the design has been validated against

  1. Skill unchanged in rsync → skipped (no diff).
  2. Skill content + sig both changed → no drift (both moved together).
  3. Skill content changed, root sig unchanged → drift detected. Existing → revert; new → drop.
  4. Stray nested sig refresh (docs/skill.oms.sig) → still drift (root path is what matters).
  5. Multi-skill product with one drifted sibling → drift on that sibling only.
  6. New skill under an existing product → correctly classified as new (doesn't false-revert siblings).
  7. Rsync deletes SKILL.md without sig refresh → still caught (driver iterates exact rsync targets, not post-rsync find SKILL.md).
  8. Catalog dir with whitespace → newline-safe loop handles it.

Bonus fix in this PR

The pre-existing grep -qx "- $product" at the "mark changed components" line parsed the leading - as a grep option, so the de-dup check silently always fell through. Fixed in both the new drift step and the existing compliance step (grep -Fqx --). Happy to split into a separate fix-bug PR if reviewers prefer.

Follow-ups (intentionally out of scope here)

  1. Source-author notification. Auto-file a tracker issue on the source repo @-mentioning the PR author whose merge introduced the drifted content (resolved via git log -1 --pretty='%ae' -- <changed-files> on the source clone we already have during rsync). Anchored to commit identity, not a static contacts directory — survives team rotations. Tracked as backlog item CI Pipeline to verify skill counts match source repos #3.
  2. True crypto verification. This step is a file-level heuristic (changed-set membership vs root sig). Cryptographic model_signing verify is the job of the "Verify Signed Skills" workflow in PR ci: add Verify Signed Skills workflow #78 — the two are complementary (this catches authoring mistakes at sync time; ci: add Verify Signed Skills workflow #78 catches cryptographic mismatches periodically).

Review history

4 rounds of Codex review on a private draft branch (mosheabr/skills-drafts#1) prior to raising upstream. All findings addressed before this PR opened.

Stats

  • 1 file changed: .github/workflows/sync-skills.yml
  • ~230 lines added / 30 modified
  • 6 commits
  • YAML validates clean

mosheabr added 6 commits June 3, 2026 12:09
Adds a new "Detect signature drift in rsync" step to sync-skills.yml
that runs before the existing compliance check. Detects when the
rsync brought in file changes under skills/<dir>/ without a matching
skill.oms.sig refresh — landing those would break crypto verification
on the next "Verify Signed Skills" sweep.

Recovery rules:
- Skill already on main: revert the catalog dir to its pre-rsync
  (signed) state. The signed copy stays live; the unsigned update
  is held until the source team runs /nvskills-ci on a follow-up
  PR and the next sync picks up a fresh sig.
- Brand-new skill (no prior version on main): drop outright.

Both cases write to /tmp/dropped-skills.txt with a "sig drift"
reason. The "Track dropped skills" tracker-issue step is extended
to partition entries by reason (missing-artifact vs sig-drift) and
surface recovery steps for each.

Motivation: hand-cherry-picking clean skills out of sync PRs has
been a recurring task this week (PRs #226 -> #231, #232 -> #233,
#234) because the source teams' content changes outpace their sig
refresh. Automating the drop puts the recovery burden back on the
source team where it belongs, and stops drift from propagating to
Skills.sh / Codex / Hermes / OpenClaw via the catalog.

Test cases the reviewer should think about:
1. Skill unchanged in rsync → not touched (no diff).
2. Skill content + sig both changed → no drift (both moved together).
3. Skill content changed, sig unchanged → drift detected. If on
   main, revert to HEAD. If new, drop outright.
4. New skill missing sig entirely → caught by existing compliance
   step (not this one).
5. Concurrent drift across multiple skills in one product → all
   detected; product still marked changed for PR title.

Out of scope: full cryptographic verification of (sig, manifest)
pairs — that's what the "Verify Signed Skills" workflow (PR #78)
is meant to do periodically. This step catches the common authoring
mistake (forgot to /nvskills-ci) at sync time, which is what we've
been seeing all week.

Signed-off-by: Moshe Abramovitch <moshea@nvidia.com>
P1 fixes:
- Untracked files were invisible to `git diff --name-only HEAD`,
  which made the brand-new-skill drift branch unreachable and let
  existing-skill drift slip through when the change was an added
  file (not a modified one). Now unions diff + ls-files --others
  to capture the full rsync delta.
- Compliance step truncated /tmp/dropped-skills.txt, wiping drift
  records the new step had written. Switched to `touch` so both
  steps append to the same file.

P2 fix:
- `git clean -fd` doesn't remove ignored files, so the revert path
  could leave rsync-created ignored orphans behind. Use `git clean
  -fdx --` to match the "restore pre-rsync state" contract.

P3 fixes:
- `grep -qx "- $product"` parsed the leading "-" as a flag (silent
  failure -> always treated as "not found"). Use `grep -Fqx --`.
  Fix applied to both the new drift step and the pre-existing
  compliance step which had the same bug.
- Tracker-issue partition was sig-drift vs not-sig-drift, which
  lumped orphan drops into "Missing artifacts." Now uses explicit
  reason prefixes ("missing artifacts:", "orphan:", "sig drift:")
  with a 4th "Other" catch-all so nothing is silently mis-bucketed.
- Title generation refactored to build from non-empty buckets so
  it stays accurate as categories come and go.

Refactor:
- Tracker body now uses an `emit_section` shell function for the
  three (now four) categories, dropping duplicated code.

Signed-off-by: Moshe Abramovitch <moshea@nvidia.com>
P2 fix:
- Drift detection now drives the scan from "skill dirs the rsync
  touched" (git diff + ls-files --others, top-level skill-dir
  granularity) rather than "skill dirs containing SKILL.md after
  rsync." The latter missed cases where rsync deleted or renamed
  SKILL.md without refreshing skill.oms.sig — the existing signed
  copy on main would have been silently overwritten with an
  unsigned state.

P3 fix:
- Sync PR body now partitions /tmp/dropped-skills.txt by the same
  reason prefixes the tracker issue uses (missing artifacts:,
  sig drift:, orphan:, plus a catch-all). Drift entries no longer
  show up as "missing: sig drift: reverted..." nor get mis-labeled
  in the PR body.

Signed-off-by: Moshe Abramovitch <moshea@nvidia.com>
P1 fix:
- Drift scan no longer awk-extracts "skills/$1/$2" from git output —
  that collapsed multi-skill products to one entry, masking drift on
  sibling skills and misclassifying new skills under existing
  products as "existing on main." Now the rsync loop records the
  exact destination dir (skills/<catalog_dir>) into
  /tmp/rsynced-skill-dirs.txt as it goes, and the drift step reads
  that list directly. Nesting-safe (handles both flat and nested
  catalog layouts) and no path-parsing.

P2 fix:
- Loop now reads via `while IFS= read -r dir; do … done < file`
  instead of `for dir in $touched_dirs`, so paths with spaces or
  other word-splitting characters survive intact.

P3 fix:
- Title join used `IFS=', '; "${parts[*]}"`, but bash array
  expansion only uses the first IFS char as separator — output was
  "1 missing artifacts,2 sig drift" (no space). Now uses
  printf "%s, " + trailing-strip, which gives the intended
  ", "-separated string.

Signed-off-by: Moshe Abramovitch <moshea@nvidia.com>
P3 fix from Codex 4th review:
- sig_changed regex `(^|/)skill\.oms\.sig$` matched ANY path ending
  in `/skill.oms.sig`, including nested cases like
  `$dir/docs/skill.oms.sig`. Compliance only counts the root sig at
  `$dir/skill.oms.sig`, so drift detection must too. Otherwise a
  team adding a stray nested sig file would mask the stale root.
- Replaced regex with exact-match `grep -Fx -- "$sig_path"` against
  the canonical root path.

Signed-off-by: Moshe Abramovitch <moshea@nvidia.com>
Adds auto-assignment of the missing-compliance tracker issue to the
catalog PIC (default: mosheabr) so any drift change reliably
generates a GitHub notification — independent of repo-watch
settings. Without this, the rolling issue update was silent unless
the maintainer happened to be subscribed.

Configurable via a CATALOG_TRACKER_ASSIGNEES secret holding a
comma-separated list of GitHub handles, with a hardcoded fallback
so the workflow stays self-contained.

`--add-assignee` is idempotent and re-asserts assignment on every
run, so a manual unassign doesn't permanently mute notifications.

Signed-off-by: Moshe Abramovitch <moshea@nvidia.com>
@mosheabr mosheabr requested a review from sayalinvidia as a code owner June 3, 2026 18:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant