fep-report + run scripts: git_commit provenance always reaches a log file

Rockman6 · Rockman6 · commit ff6e7e494159 · 2026-04-22T18:42:56.000+08:00
Root cause: the shell runner scripts (run_freesolv_m5max.sh,
run_binding_streptavidin_gpu.sh, run_binding_egfr_gpu.sh) echoed the
header block, including `git commit: <HEAD>`, to stdout only.
`tee` picked up the python env block afterwards, not the header.
Result: `cellsim fep-report` parsed run.log and found no git commit,
so every tarball from the field came back with git_commit=None even
though the runner DID know the commit hash at launch.
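
The failure mode can be reproduced in miniature. The sketch below is a
Python stand-in for the old runner layout, not the shell scripts
themselves: `simulate_old_runner`, `commit_in_logs`, the regex, and the
log contents are all illustrative assumptions.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical pattern standing in for the parser's commit regex,
# which is not shown in this patch.
GIT_COMMIT_RE = re.compile(r"git commit:\s*([0-9a-fA-F]+)")


def simulate_old_runner(out_dir: Path, commit: str) -> None:
    """Mimic the pre-fix layout: header to stdout only, env block to env.log."""
    # The header reaches the terminal but is never written to a file.
    print(f"  git commit:  {commit}")
    # Only the later env block and the bench output land on disk.
    (out_dir / "env.log").write_text("openmm 8.2.0\n", encoding="utf-8")
    (out_dir / "run.log").write_text("[freesolv] methane C\n", encoding="utf-8")


def commit_in_logs(out_dir: Path) -> bool:
    """Return True if any *.log file in out_dir carries a `git commit:` line."""
    return any(
        GIT_COMMIT_RE.search(p.read_text(encoding="utf-8"))
        for p in sorted(out_dir.glob("*.log"))
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        run_dir = Path(tmp)
        simulate_old_runner(run_dir, "ff6e7e494159")
        print("commit recoverable from logs:", commit_in_logs(run_dir))
```

The hash is visible on the terminal at launch, yet nothing a parser can
read later contains it, which is the situation the tarballs were in.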

The fix is two-sided (items 1 and 2), so both in-flight and future runs
are covered; item 3 pins the behaviour with tests:

1. src/fep/report.py: _parse_run_log now falls back to env.log,
   doctor.log, and header.log (scanned in that order) when run.log is
   missing the `git commit:` line or is absent altogether. The same
   fallback fills in `platform`. run.log wins when both it and a
   sibling carry the line (most local).

2. Runner scripts: wrap the header `echo` block in `{...} | tee
   "${OUT_DIR}/env.log"` so the commit lands in env.log from launch
   onward. The python env block below now appends (`tee -a`) rather
   than overwriting. Applied to all three runner scripts.

3. 7/7 regression tests covering: run.log present, env.log fallback,
   no-run.log-at-all, doctor.log fallback, empty case, run.log-wins
   precedence, direct helper. Wired into smoke.yml.
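
The lookup order described in item 1 can be sketched as follows. This is
a simplified illustration, not the code in src/fep/report.py: it handles
only git_commit (the real helper also fills in platform), and
`GIT_COMMIT_RE`, `SEARCH_ORDER`, and `find_git_commit` are hypothetical
names with an assumed regex.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional

# Hypothetical pattern; the real _GIT_COMMIT_RE is not shown in this patch.
GIT_COMMIT_RE = re.compile(r"git commit:\s*([0-9a-fA-F]+)")

# run.log is consulted first, then the sibling logs in the patch's order.
SEARCH_ORDER = ("run.log", "env.log", "doctor.log", "header.log")


def find_git_commit(run_dir: Path) -> Optional[str]:
    """Return the first commit hash found in SEARCH_ORDER, or None."""
    for name in SEARCH_ORDER:
        path = run_dir / name
        if not path.exists():
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="replace")
        except OSError:
            # An unreadable log is skipped rather than treated as fatal.
            continue
        match = GIT_COMMIT_RE.search(text)
        if match:
            return match.group(1)
    return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        run_dir = Path(tmp)
        (run_dir / "env.log").write_text(
            "  git commit:  deadbeefcafe\n", encoding="utf-8")
        print("env.log only:", find_git_commit(run_dir))
        (run_dir / "run.log").write_text(
            "git commit: 0123abcd\n", encoding="utf-8")
        print("run.log added:", find_git_commit(run_dir))
```

Putting run.log first in a single ordered tuple keeps the "most local
log wins" rule and the sibling fallback in one place.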

For the M5 Max run in-flight (on an older checkout): git_commit will
still be None in its tarball, but the fix means it's recoverable if
the runner supplies the commit hash out-of-band — and every future
run is covered without operator intervention.
diff --git a/.github/workflows/smoke.yml b/.github/workflows/smoke.yml
@@ -172,6 +172,9 @@ jobs:
       - name: fep-report analyser smoke (tarball → PASS/FAIL verdict)
         run: python -u tests/fep/test_report_smoke.py
 
+      - name: fep-report provenance fallback (git_commit across sibling logs)
+        run: python -u tests/fep/test_report_provenance_smoke.py
+
       - name: fep-binding validate dry-run (YAML hygiene, <1s)
         run: python -u tests/fep/test_validate_smoke.py
 
diff --git a/scripts/run_binding_egfr_gpu.sh b/scripts/run_binding_egfr_gpu.sh
@@ -50,16 +50,20 @@ OUT_DIR="run/fep/egfr_${STAMP}"
 mkdir -p "${OUT_DIR}"
 CSV="${OUT_DIR}/egfr_results.csv"
 
-echo "============================================================"
-echo "CellSim — EGFR kinase binding FEP (Milestone B flagship)"
-echo "============================================================"
-echo "  started:     $(date)"
-echo "  machine:     $(uname -a)"
-echo "  ram:         $(sysctl -n hw.memsize 2>/dev/null | awk '{print $1/1024/1024/1024 " GB"}' || echo '?')"
-echo "  git commit:  $(git rev-parse HEAD)"
-echo "  git ref:     $(git describe --tags --always 2>/dev/null || echo '?')"
-echo "  output:      ${OUT_DIR}/"
-echo ""
+# Header block — mirror to env.log so `cellsim fep-report` extracts
+# `git commit:` for provenance. Previously stdout-only.
+{
+    echo "============================================================"
+    echo "CellSim — EGFR kinase binding FEP (Milestone B flagship)"
+    echo "============================================================"
+    echo "  started:     $(date)"
+    echo "  machine:     $(uname -a)"
+    echo "  ram:         $(sysctl -n hw.memsize 2>/dev/null | awk '{print $1/1024/1024/1024 " GB"}' || echo '?')"
+    echo "  git commit:  $(git rev-parse HEAD)"
+    echo "  git ref:     $(git describe --tags --always 2>/dev/null || echo '?')"
+    echo "  output:      ${OUT_DIR}/"
+    echo ""
+} | tee "${OUT_DIR}/env.log"
 
 # Env + platform report.
 python -c "
@@ -78,7 +82,7 @@ for i in range(Platform.getNumPlatforms()):
     print(f'  {p.getName()} (speed {p.getSpeed()})')
 print()
 print('Pipeline will prefer Metal -> CUDA -> OpenCL -> CPU.')
-" 2>&1 | tee "${OUT_DIR}/env.log"
+" 2>&1 | tee -a "${OUT_DIR}/env.log"
 echo ""
 
 echo "=== cellsim doctor ==="
diff --git a/scripts/run_binding_streptavidin_gpu.sh b/scripts/run_binding_streptavidin_gpu.sh
@@ -48,16 +48,21 @@ OUT_DIR="run/fep/streptavidin_${STAMP}"
 mkdir -p "${OUT_DIR}"
 CSV="${OUT_DIR}/streptavidin_results.csv"
 
-echo "============================================================"
-echo "CellSim — Streptavidin binding FEP (Milestone B)"
-echo "============================================================"
-echo "  started:     $(date)"
-echo "  machine:     $(uname -a)"
-echo "  ram:         $(sysctl -n hw.memsize 2>/dev/null | awk '{print $1/1024/1024/1024 " GB"}' || echo '?')"
-echo "  git commit:  $(git rev-parse HEAD)"
-echo "  git ref:     $(git describe --tags --always 2>/dev/null || echo '?')"
-echo "  output:      ${OUT_DIR}/"
-echo ""
+# Header block — mirror to env.log so `cellsim fep-report` extracts
+# `git commit:` for provenance. Previously stdout-only, which made
+# report.git_commit=None on the tarball.
+{
+    echo "============================================================"
+    echo "CellSim — Streptavidin binding FEP (Milestone B)"
+    echo "============================================================"
+    echo "  started:     $(date)"
+    echo "  machine:     $(uname -a)"
+    echo "  ram:         $(sysctl -n hw.memsize 2>/dev/null | awk '{print $1/1024/1024/1024 " GB"}' || echo '?')"
+    echo "  git commit:  $(git rev-parse HEAD)"
+    echo "  git ref:     $(git describe --tags --always 2>/dev/null || echo '?')"
+    echo "  output:      ${OUT_DIR}/"
+    echo ""
+} | tee "${OUT_DIR}/env.log"
 
 # Env + platform report (critical for reproducibility).
 python -c "
@@ -76,7 +81,7 @@ for i in range(Platform.getNumPlatforms()):
     print(f'  {p.getName()} (speed {p.getSpeed()})')
 print()
 print('Pipeline will prefer Metal -> CUDA -> OpenCL -> CPU.')
-" 2>&1 | tee "${OUT_DIR}/env.log"
+" 2>&1 | tee -a "${OUT_DIR}/env.log"
 echo ""
 
 # Cellsim doctor — fail fast if env is broken.
diff --git a/scripts/run_freesolv_m5max.sh b/scripts/run_freesolv_m5max.sh
@@ -39,18 +39,25 @@ STAMP=$(date +%Y%m%d_%H%M%S)
 OUT_DIR="run/fep/${STAMP}"
 mkdir -p "${OUT_DIR}"
 
-echo "============================================================"
-echo "CellSim — FreeSolv FEP gate (Milestone A)"
-echo "============================================================"
-echo "  started:     $(date)"
-echo "  machine:     $(uname -a)"
-echo "  ram:         $(sysctl -n hw.memsize 2>/dev/null | awk '{print $1/1024/1024/1024 " GB"}' || echo '?')"
-echo "  git commit:  $(git rev-parse HEAD)"
-echo "  git ref:     $(git describe --tags --always 2>/dev/null || echo '?')"
-echo "  output:      ${OUT_DIR}/"
-echo ""
+# Header block — mirror to env.log so `cellsim fep-report` can
+# extract the `git commit:` line for provenance. Earlier versions
+# of this script sent the header to stdout only, which left
+# report.git_commit=None on the tarball.
+{
+    echo "============================================================"
+    echo "CellSim — FreeSolv FEP gate (Milestone A)"
+    echo "============================================================"
+    echo "  started:     $(date)"
+    echo "  machine:     $(uname -a)"
+    echo "  ram:         $(sysctl -n hw.memsize 2>/dev/null | awk '{print $1/1024/1024/1024 " GB"}' || echo '?')"
+    echo "  git commit:  $(git rev-parse HEAD)"
+    echo "  git ref:     $(git describe --tags --always 2>/dev/null || echo '?')"
+    echo "  output:      ${OUT_DIR}/"
+    echo ""
+} | tee "${OUT_DIR}/env.log"
 
-# Env + platform report — critical for reproducibility
+# Env + platform report — critical for reproducibility. Appended
+# to env.log (not overwriting the header we just wrote).
 python -c "
 import openmm, openmmtools, openff.toolkit, pymbar, sys
 from openmm import Platform
@@ -66,7 +73,7 @@ for i in range(Platform.getNumPlatforms()):
     print(f'  {p.getName()} (speed {p.getSpeed()})')
 print()
 print('Pipeline will prefer Metal → OpenCL → CUDA → CPU.')
-" 2>&1 | tee "${OUT_DIR}/env.log"
+" 2>&1 | tee -a "${OUT_DIR}/env.log"
 echo ""
 
 # Cellsim doctor — fail fast if env is broken
diff --git a/src/fep/report.py b/src/fep/report.py
@@ -249,11 +249,20 @@ def _parse_run_log(log_path: Path, rows: list[CompoundRow]) -> dict:
     this section will just be empty and we report "no GHMC data in
     log (legacy format)". The parser still works when Phase-2 wires
     per-compound acceptance through the CSV.
+
+    Provenance fallback: if run.log doesn't carry `git commit:`
+    (depending on the runner-script version, the header block went to
+    stdout only, into run.log, or into env.log), also scan env.log,
+    doctor.log, and header.log in the same directory. Biologists care
+    that *some* log captured the commit, not which one.
     """
     meta: dict = {"platform": None, "git_commit": None,
                   "ghmc_means": [], "ghmc_mins": []}
     if not log_path.exists():
-        return meta
+        # No run.log: go straight to the sibling-log scan so we at least
+        # pick up git_commit / platform from env.log when present.
+        return _scan_sibling_logs_for_provenance(
+            log_path.parent, meta)
 
     name_to_row = {r.name: r for r in rows}
     current: Optional[CompoundRow] = None
@@ -288,6 +297,37 @@ def _parse_run_log(log_path: Path, rows: list[CompoundRow]) -> dict:
                 current.ghmc_accept_min = mn
             meta["ghmc_means"].append(mean)
             meta["ghmc_mins"].append(mn)
+    # If provenance still missing, look in sibling logs.
+    if not meta["git_commit"] or not meta["platform"]:
+        meta = _scan_sibling_logs_for_provenance(log_path.parent, meta)
+    return meta
+
+
+def _scan_sibling_logs_for_provenance(
+    dir_path: Path, meta: dict,
+) -> dict:
+    """Scan env.log, doctor.log, and header.log next to run.log for
+    git_commit and platform. These files are written by the shell
+    wrapper scripts; the header-tee pattern can land the provenance
+    in any of them depending on the script version."""
+    for sibling in ("env.log", "doctor.log", "header.log"):
+        p = dir_path / sibling
+        if not p.exists():
+            continue
+        try:
+            t = p.read_text(encoding="utf-8", errors="replace")
+        except OSError:
+            continue
+        if not meta["git_commit"]:
+            m = _GIT_COMMIT_RE.search(t)
+            if m:
+                meta["git_commit"] = m.group(1).strip()
+        if not meta["platform"]:
+            m = _PLATFORM_RE.search(t)
+            if m:
+                meta["platform"] = m.group(1).strip()
+        if meta["git_commit"] and meta["platform"]:
+            break
     return meta
 
 
diff --git a/tests/fep/test_report_provenance_smoke.py b/tests/fep/test_report_provenance_smoke.py
@@ -0,0 +1,145 @@
+"""Regression tests for fep-report provenance extraction.
+
+The shell wrapper scripts evolved over time in where they tee the
+`git commit:` line — earlier versions sent it to stdout only, then
+into run.log, then into env.log. `cellsim fep-report` has to find
+it wherever it landed, because a biologist handed a tarball from
+someone else's machine cannot re-run git commands: the commit hash
+IS the provenance.
+
+Pins: _parse_run_log scans run.log, env.log, and doctor.log and
+returns the first git_commit / platform it finds.
+"""
+from __future__ import annotations
+
+import sys
+import tempfile
+from pathlib import Path
+
+REPO = Path(__file__).resolve().parents[2]
+sys.path.insert(0, str(REPO))
+
+from src.fep.report import (
+    _parse_run_log,
+    _scan_sibling_logs_for_provenance,
+)
+
+
+def test_git_commit_picked_up_from_run_log():
+    with tempfile.TemporaryDirectory(prefix="fep_prov_") as tmp:
+        d = Path(tmp)
+        (d / "run.log").write_text(
+            "CellSim\n"
+            "  git commit:  abc1234567890abcdef\n"
+            "FEP sampling platform: CUDA\n"
+            "[freesolv] methane C\n")
+        meta = _parse_run_log(d / "run.log", rows=[])
+    assert meta["git_commit"] == "abc1234567890abcdef", meta
+    assert meta["platform"] == "CUDA", meta
+
+
+def test_git_commit_fallback_to_env_log():
+    """Runner scripts tee the header to stdout and env.log. run.log
+    contains only bench output. fep-report must still populate
+    git_commit by falling back to env.log."""
+    with tempfile.TemporaryDirectory(prefix="fep_prov_") as tmp:
+        d = Path(tmp)
+        (d / "env.log").write_text(
+            "CellSim — FreeSolv FEP gate\n"
+            "  git commit:  deadbeefcafe0000111122223333\n"
+            "  openmm 8.2.0\n")
+        (d / "run.log").write_text(
+            "[freesolv] methane C\n"
+            "...bench output without provenance...\n"
+            "FEP sampling platform: Metal\n")
+        meta = _parse_run_log(d / "run.log", rows=[])
+    assert meta["git_commit"] == "deadbeefcafe0000111122223333", meta
+    assert meta["platform"] == "Metal", meta
+
+
+def test_git_commit_fallback_when_no_run_log():
+    """Malformed tarball (no run.log) — only env.log has provenance.
+    Parser must still recover git_commit via sibling scan."""
+    with tempfile.TemporaryDirectory(prefix="fep_prov_") as tmp:
+        d = Path(tmp)
+        (d / "env.log").write_text(
+            "  git commit:  1111222233334444aaaabbbbccccdddd\n")
+        # Intentionally no run.log.
+        meta = _parse_run_log(d / "run.log", rows=[])
+    assert meta["git_commit"] == "1111222233334444aaaabbbbccccdddd", meta
+
+
+def test_doctor_log_as_last_resort():
+    """Some runs only landed provenance in doctor.log."""
+    with tempfile.TemporaryDirectory(prefix="fep_prov_") as tmp:
+        d = Path(tmp)
+        (d / "doctor.log").write_text(
+            "cellsim doctor\n"
+            "git commit: feedface99887766\n"
+            "OK\n")
+        meta = _parse_run_log(d / "run.log", rows=[])
+    assert meta["git_commit"] == "feedface99887766", meta
+
+
+def test_no_logs_at_all_returns_none():
+    """Nothing to find — git_commit stays None, doesn't crash."""
+    with tempfile.TemporaryDirectory(prefix="fep_prov_") as tmp:
+        d = Path(tmp)
+        meta = _parse_run_log(d / "run.log", rows=[])
+    assert meta["git_commit"] is None
+    assert meta["platform"] is None
+
+
+def test_run_log_wins_over_sibling():
+    """When both run.log and env.log have git commit, run.log wins
+    (most local to the bench process)."""
+    with tempfile.TemporaryDirectory(prefix="fep_prov_") as tmp:
+        d = Path(tmp)
+        (d / "run.log").write_text(
+            "git commit: 0000000aaaaaaaabbbbbbbbbbcccccc\n")
+        (d / "env.log").write_text(
+            "git commit: ffffffffffffffffffffffffffffffff\n")
+        meta = _parse_run_log(d / "run.log", rows=[])
+    assert meta["git_commit"] == "0000000aaaaaaaabbbbbbbbbbcccccc", meta
+
+
+def test_sibling_scan_direct_helper():
+    """_scan_sibling_logs_for_provenance is a public-ish helper;
+    cover it directly so the fallback policy (env > doctor > header)
+    is pinned."""
+    with tempfile.TemporaryDirectory(prefix="fep_prov_") as tmp:
+        d = Path(tmp)
+        (d / "env.log").write_text("git commit: aaaaaaaaaaaaaaaa\n")
+        (d / "doctor.log").write_text("git commit: bbbbbbbbbbbbbbbb\n")
+        meta = _scan_sibling_logs_for_provenance(
+            d, {"git_commit": None, "platform": None,
+                "ghmc_means": [], "ghmc_mins": []})
+    # env.log scanned first.
+    assert meta["git_commit"] == "aaaaaaaaaaaaaaaa", meta
+
+
+if __name__ == "__main__":
+    funcs = [
+        test_git_commit_picked_up_from_run_log,
+        test_git_commit_fallback_to_env_log,
+        test_git_commit_fallback_when_no_run_log,
+        test_doctor_log_as_last_resort,
+        test_no_logs_at_all_returns_none,
+        test_run_log_wins_over_sibling,
+        test_sibling_scan_direct_helper,
+    ]
+    fails = []
+    for f in funcs:
+        try:
+            f()
+            print(f"[PASS] {f.__name__}")
+        except AssertionError as e:
+            print(f"[FAIL] {f.__name__}: {e}")
+            fails.append(f.__name__)
+        except Exception as e:
+            import traceback
+            traceback.print_exc()
+            print(f"[ERROR] {f.__name__}: {e}")
+            fails.append(f.__name__)
+    print(f"{len(funcs) - len(fails)}/{len(funcs)} PASS")
+    sys.exit(0 if not fails else 1)