fix: pre-fetch task configs before QEMU reset to avoid stale socat bridge

abrichr · claude · abrichr · commit af69221e49d0 · 2026-03-01T17:46:23.000-05:00
The evaluate server (localhost:5050) goes through a socat bridge that
can become stale after container/VM restarts. Pre-fetching all task
configs before the QEMU reset ensures human-readable instructions are
cached in memory even if the bridge dies later. Falls back to live
fetch with retry on cache miss.
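
A minimal sketch of the cache-first lookup described above, not the code
in scripts/record_waa_demos.py. The helper names (prefetch_configs,
get_config, instruction_for) and the injected `fetch` callable are
hypothetical; `fetch` stands in for the GET to {evaluate_url}/task/{task_id}
so the sketch runs without a server.

```python
import time
from typing import Callable, Optional

# Hypothetical stand-in for the HTTP call: returns a config dict,
# returns None on a non-OK response, or raises on a connection error.
Fetch = Callable[[str], Optional[dict]]


def prefetch_configs(task_ids: list[str], fetch: Fetch) -> tuple[dict, list[str]]:
    """Fetch every config once, up front; return (cache, failed_ids)."""
    cache: dict[str, dict] = {}
    failed: list[str] = []
    for task_id in task_ids:
        try:
            config = fetch(task_id)
        except Exception:
            config = None
        if config:
            cache[task_id] = config
        else:
            failed.append(task_id)
    return cache, failed


def get_config(task_id: str, cache: dict, fetch: Fetch,
               attempts: int = 3, delay: float = 2.0) -> dict:
    """Prefer the cache; on a miss, retry a live fetch a few times."""
    config = cache.get(task_id)
    if config:
        return config
    for attempt in range(attempts):
        try:
            config = fetch(task_id)
            if config:
                cache[task_id] = config
                return config
        except Exception:
            pass  # treated the same as an empty response
        if attempt < attempts - 1:
            time.sleep(delay)
    return {}  # caller falls back to the task ID


def instruction_for(task_id: str, config: dict) -> str:
    """Human-readable instruction, falling back to the raw task ID."""
    return config.get("instruction", config.get("task", task_id))
```

Calling prefetch_configs before the reset and get_config inside the
per-task loop gives the same ordering as this change: the bridge only
needs to be healthy once, at startup.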

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
diff --git a/.beads/issues.jsonl b/.beads/issues.jsonl
@@ -11,7 +11,7 @@
 {"id":"openadapt-evals-dke","title":"SYSTEM: Create knowledge persistence workflow using Beads","description":"Every fix/approach must be logged as a Beads issue with:\n1. Problem description\n2. Attempted solution\n3. Result (worked/failed/partial)\n4. Root cause if known\n5. Files changed\n\nBefore any fix attempt, agent MUST:\n1. Run 'bd list --labels=fix,approach' to see prior attempts\n2. Review what was tried before\n3. Document new attempt BEFORE implementing\n\nAfter context compaction, first action:\n1. Run 'bd ready' for current tasks\n2. Run 'bd list --labels=recurring' for known recurring issues\n3. Check docs/RECURRING_ISSUES.md for patterns","status":"closed","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T19:00:18.155796-05:00","created_by":"Richard Abrich","updated_at":"2026-02-23T16:21:13.18811-05:00","closed_at":"2026-02-14T12:22:52.357373-05:00"}
 {"id":"openadapt-evals-gna","title":"Test simplified Dockerfile (Azure mode)","description":"Testing Dockerfile.simplified which uses vanilla WAA Azure mode: native OEM mechanism (C:\\oem), InstallFrom element for unattended install, VERSION=11e for no product key. Steps: 1) Delete current VM 2) Create fresh VM 3) Build simplified image 4) Test Windows installation via QEMU screenshots","notes":"2026-01-22: Confirmed the blocker is not just docker pull; even starting the existing 'winarena' container via az vm run-command timed out.\n\n- smoke-live tried to run docker start winarena via run-command and timed out (900s)\n- WAA server remained unreachable at http://172.171.112.41:5000\n- VM was deallocated after the attempt\n\nImplication: VM/docker state is unhealthy or container start is hanging (possibly due to incomplete image extraction / stuck daemon / disk pressure).\nNext: add/run a vm-debug command to capture docker/system logs and determine whether to rebuild VM/image, pin/mirror image (ACR), or adjust docker config.","status":"closed","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-21T12:47:15.12243-05:00","created_by":"Richard Abrich","updated_at":"2026-02-23T16:21:13.188539-05:00","closed_at":"2026-02-08T13:23:34.84444-05:00","labels":["testing","waa"],"comments":[{"id":3,"issue_id":"openadapt-evals-gna","author":"Richard Abrich","text":"Session Recovery 2026-01-22 17:58: Previous agents killed during compaction. VM state: Docker/containerd unhealthy, disk /mnt only 32GB (need 47GB+ for vanilla WAA). Git-lfs failing. User feedback: 1) use beads, 2) larger disk, 3) clean up CLI, 4) vanilla WAA config.","created_at":"2026-01-22T18:05:45Z"},{"id":4,"issue_id":"openadapt-evals-gna","author":"Richard Abrich","text":"Launched 3 parallel agents: ae159fc (VM disk upgrade), aabad47 (CLI cleanup), aee4e8a (fix containerd). 
Check /private/tmp/claude/-Users-abrichr-oa-src-openadapt-ml/tasks/*.output for results.","created_at":"2026-01-22T18:06:18Z"},{"id":5,"issue_id":"openadapt-evals-gna","author":"Richard Abrich","text":"WORKFLOW DOCUMENTED: VM config changes = delete VM -\u003e update code -\u003e relaunch. Added to CLAUDE.md. Default VM size now D8ds_v5 (300GB). Launching fresh VM now.","created_at":"2026-01-22T18:09:12Z"},{"id":6,"issue_id":"openadapt-evals-gna","author":"Richard Abrich","text":"2026-01-22 18:20: VM resources cleaned up, launched agent a9be1f8 to add auto-cleanup to CLI, WAA setup retrying in background (b04fcbe). Workflow documented in CLAUDE.md and STATUS.md.","created_at":"2026-01-22T18:11:56Z"},{"id":7,"issue_id":"openadapt-evals-gna","author":"Richard Abrich","text":"2026-01-22 18:30: VM created with D8s_v3 fallback (D8ds_v5 quota 0), IP 20.120.37.97. Restored waa_deploy symlink. Docker image building. W\u0026B integration agent a21c3ef running.","created_at":"2026-01-22T18:25:29Z"},{"id":8,"issue_id":"openadapt-evals-gna","author":"Richard Abrich","text":"2026-01-22 19:05: WAA Docker image built successfully! Container running. Windows booting. VM: 20.120.37.97, VNC: http://20.120.37.97:8006","created_at":"2026-01-22T18:47:03Z"}]}
 {"id":"openadapt-evals-hvm","title":"VL model fix PR #18 ready to merge","notes":"2026-02-08: openadapt-ml PR #18 was already merged on 2026-01-29. VL model fix is done.","status":"closed","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-29T16:17:03.491938-05:00","created_by":"Richard Abrich","updated_at":"2026-02-08T12:55:19.233249-05:00","closed_at":"2026-02-08T12:55:19.233249-05:00","close_reason":"PR #18 already merged 2026-01-29"}
-{"id":"openadapt-evals-mx8","title":"Analyze evaluation results and publish findings","description":"After demo-conditioned evaluation completes, analyze results: success rates, failure modes, demo impact. Create data-driven roadmap for improvements.","status":"open","priority":1,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-02-14T12:23:06.328838-05:00","created_by":"Richard Abrich","updated_at":"2026-02-14T12:23:06.328838-05:00"}
+{"id":"openadapt-evals-mx8","title":"Analyze evaluation results and publish findings","description":"After demo-conditioned evaluation completes, analyze results: success rates, failure modes, demo impact. Create data-driven roadmap for improvements.","notes":"wright repo (OpenAdaptAI/wright) scaffolding underway. Herald + consilium repos transferred to OpenAdaptAI org. Wright will be the orchestration layer for eval pipeline.","status":"open","priority":1,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-02-14T12:23:06.328838-05:00","created_by":"Richard Abrich","updated_at":"2026-03-01T17:46:08.553556-05:00"}
 {"id":"openadapt-evals-sz4","title":"RCA: Windows product key prompt recurring issue","status":"closed","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T18:59:36.266286-05:00","created_by":"Richard Abrich","updated_at":"2026-01-20T20:32:06.493102-05:00","closed_at":"2026-01-20T20:32:06.493102-05:00","close_reason":"RCA complete - root cause is VERSION mismatch (CLI=11, Dockerfile=11e). Fix documented in RECURRING_ISSUES.md and WINDOWS_PRODUCT_KEY_RCA.md"}
-{"id":"openadapt-evals-vcb","title":"Run demo-conditioned WAA evaluation","description":"Once demos are recorded, run WAA evaluation with demo-conditioned agents (RetrievalAugmentedAgent with real demos). Target: measure improvement over zero-shot baseline. Requires real demos from recording task.","notes":"2026-03-01: GPU grant applications reviewed and rewritten (11 files). Writing done, blocked on eval results (DC signal on harder tasks). Detailed status tracked in openadapt-internal (private repo).","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-02-14T12:23:04.624305-05:00","created_by":"Richard Abrich","updated_at":"2026-03-01T13:57:25.582064-05:00"}
+{"id":"openadapt-evals-vcb","title":"Run demo-conditioned WAA evaluation","description":"Once demos are recorded, run WAA evaluation with demo-conditioned agents (RetrievalAugmentedAgent with real demos). Target: measure improvement over zero-shot baseline. Requires real demos from recording task.","notes":"wright repo created (OpenAdaptAI/wright), scaffolding in progress. Herald + consilium transferred to OpenAdaptAI org.","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-02-14T12:23:04.624305-05:00","created_by":"Richard Abrich","updated_at":"2026-03-01T17:45:50.958358-05:00"}
 {"id":"openadapt-evals-wis","title":"Add pre-flight check to detect Windows install issues","status":"closed","priority":1,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T18:59:36.865052-05:00","created_by":"Richard Abrich","updated_at":"2026-01-20T20:32:06.757261-05:00","closed_at":"2026-01-20T20:32:06.757261-05:00","close_reason":"Duplicate of openadapt-evals-0dt"}
diff --git a/scripts/record_waa_demos.py b/scripts/record_waa_demos.py
@@ -1513,17 +1513,40 @@ def cmd_record_waa(
     if not connected:
         return
 
+    # Pre-fetch all task configs BEFORE QEMU reset.  The evaluate server
+    # (localhost:5050) goes through a socat bridge that can become stale
+    # after container/VM restarts.  Fetching early ensures we have the
+    # human-readable instructions cached even if the bridge dies later.
+    print("Pre-fetching task configs from evaluate server...")
+    task_configs_cache: dict[str, dict] = {}
+    fetch_failures = 0
+    for task_id in task_ids:
+        try:
+            resp = requests.get(f"{evaluate_url}/task/{task_id}", timeout=10)
+            resp.raise_for_status()  # count HTTP error responses as failures too
+            task_configs_cache[task_id] = resp.json()
+        except Exception:
+            fetch_failures += 1
+    if task_configs_cache:
+        print(f"  Cached {len(task_configs_cache)}/{len(task_ids)} task config(s).")
+    if fetch_failures:
+        print(f"  WARNING: {fetch_failures} task config(s) failed to fetch from {evaluate_url}.")
+        print("  Step generation will use task IDs instead of instructions for those tasks.")
+        print("  (Is the evaluate server / socat proxy running?)")
+    if not task_configs_cache and len(task_ids) > 0:
+        print(f"  ERROR: Could not fetch ANY task configs from {evaluate_url}.")
+        print("  The evaluate server may be down. Check socat proxy on the VM.")
+        answer = input("  Continue anyway? [y/N] ").strip().lower()
+        if answer not in ("y", "yes"):
+            return
+    print()
+
     # Pre-flight: verify all required apps are installed
     if verify:
         print("Verifying required apps across all tasks...")
         all_apps: set[str] = set()
-        for task_id in task_ids:
-            try:
-                resp = requests.get(f"{evaluate_url}/task/{task_id}", timeout=10)
-                if resp.ok:
-                    all_apps.update(resp.json().get("related_apps", []))
-            except Exception:
-                pass
+        for tc in task_configs_cache.values():
+            all_apps.update(tc.get("related_apps", []))
         if all_apps:
             resp = requests.post(
                 f"{evaluate_url}/setup",
@@ -1579,26 +1602,27 @@ def cmd_record_waa(
         task_dir = output_dir / task_id
         task_dir.mkdir(parents=True, exist_ok=True)
 
-        # Load task config from evaluate server (retry on transient errors)
-        instruction = task_id  # fallback
-        task_config = {}
-        for _attempt in range(3):
-            try:
-                task_resp = requests.get(
-                    f"{evaluate_url}/task/{task_id}", timeout=10
-                )
-                if task_resp.ok:
-                    task_config = task_resp.json()
-                    instruction = task_config.get(
-                        "instruction", task_config.get("task", task_id)
+        # Load task config: prefer pre-fetched cache, fall back to live fetch
+        task_config = task_configs_cache.get(task_id, {})
+        if not task_config:
+            # Try live fetch (cache miss or evaluate server was down earlier)
+            for _attempt in range(3):
+                try:
+                    task_resp = requests.get(
+                        f"{evaluate_url}/task/{task_id}", timeout=10
                     )
-                    break
-            except Exception as e:
-                if _attempt < 2:
-                    print(f"  Warning: task config fetch failed ({e}), retrying...")
-                    time.sleep(2)
-                else:
-                    print(f"  Warning: could not load task config after 3 attempts: {e}")
+                    task_resp.raise_for_status()  # HTTP errors get the same warn-and-retry path
+                    task_config = task_resp.json()
+                    break
+                except Exception as e:
+                    if _attempt < 2:
+                        print(f"  Warning: task config fetch failed ({e}), retrying...")
+                        time.sleep(2)
+                    else:
+                        print(f"  Warning: could not load task config after 3 attempts: {e}")
+        instruction = task_config.get(
+            "instruction", task_config.get("task", task_id)
+        )
 
         def _setup_task_env() -> None:
             """Run task setup config (download files, open apps, etc.)."""