Commit 9c86870

docs: add session learnings from 7 newly reviewed archive sessions
New gotcha sections: Python/Subprocess, Dashboard/Streamlit, LLM Judge, plus additions to Git/Auth, MCP Configuration, and Validation/Scoring. Key learnings: dict.get() None gotcha, Streamlit widget key uniqueness, stdio vs HTTP transport for Sourcegraph MCP, shallow clone push failures, task-type-aware LLM judge evaluation, and tool categorization order.
1 parent 5041ad3 commit 9c86870

5 files changed, 69 additions and 138 deletions

AGENTS.md

Lines changed: 22 additions & 0 deletions
@@ -74,6 +74,8 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
 - Claude Code requires the `--mcp-config` CLI flag to load MCP config -- it does not auto-detect.
 - Inject MCP usage instructions into the task prompt. Agents won't use MCP tools just because they're available.
 - Set `NODE_TLS_REJECT_UNAUTHORIZED=0` for Node.js SSL in Docker containers (curl working does not mean Node.js fetch will work).
+- Sourcegraph MCP uses **stdio transport** (`npx @sourcegraph/cody --stdio`), NOT HTTP endpoints. HTTP 405 from the endpoint means it exists but requires stdio.
+- Sourcegraph skills installed via `npx -y skills add` show empty `"skills": []` in headless/containerized mode. Embed skill prompt content directly in the task's CLAUDE.md instead.
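
To make the stdio point above concrete, here is a minimal sketch that writes a config file for the `--mcp-config` flag. The `mcpServers` / `command` / `args` layout is an assumption based on the config shape commonly used by MCP clients, the server name `sourcegraph` and the file name are hypothetical, and the `npx` command is copied from the bullet above rather than independently verified.

```python
import json
from pathlib import Path


def build_stdio_mcp_config(server_name: str, command: str, args: list[str]) -> dict:
    """Return a config dict that launches the MCP server as a stdio subprocess.

    Assumption: the client reads a top-level "mcpServers" mapping whose entries
    carry "command" and "args". No URL appears anywhere, because the transport
    is stdio rather than HTTP.
    """
    return {"mcpServers": {server_name: {"command": command, "args": list(args)}}}


def write_mcp_config(path: Path, config: dict) -> Path:
    """Serialize the config as indented JSON and return the path written."""
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return path


# Hypothetical server name; the command and args come from the note above.
config = build_stdio_mcp_config("sourcegraph", "npx", ["@sourcegraph/cody", "--stdio"])
```

The written file would then be passed explicitly through `--mcp-config`, since the notes above say the config is not auto-detected.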
 
 ### Harbor Result Format
 - Timing fields (`started_at`, `finished_at`) live at the **top level** of `result.json`, not nested under `timing`.
@@ -84,11 +86,31 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
 - `validators.py` is duplicated across `ccb_build` tasks. Changes must be applied to **all copies** (verify with `sha256sum`).
 - Install scripts that print "INSTALL_SUCCESS" regardless of actual outcome are common. Always verify the binary exists and is executable.
 - Agent completing in **<2 seconds** = agent never installed/ran (smoke test heuristic).
+- Trial directory names are truncated with hash suffixes (e.g., `c_api_graphql_expert_079_archite__pm9xcPn`). The real task name lives in `config.json` at `task.path`.
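
The trial-directory note above can be sketched as a small lookup helper. This assumes `task.path` means a nested `{"task": {"path": ...}}` object in `config.json` and that the last path component is the task name. Both are assumptions, and `resolve_task_name` is a hypothetical helper name.

```python
import json
from pathlib import Path
from typing import Optional


def resolve_task_name(trial_dir: Path) -> Optional[str]:
    """Return the task name recorded in config.json, or None if unavailable.

    The directory name itself is not parsed, because it may be truncated
    and carry a hash suffix.
    """
    config_file = trial_dir / "config.json"
    if not config_file.is_file():
        return None
    config = json.loads(config_file.read_text(encoding="utf-8"))
    task = config.get("task") or {}  # tolerate a missing or null "task"
    task_path = task.get("path")
    if not task_path:
        return None
    # Assumption: the final path component is the task name.
    return Path(task_path).name
```

Returning `None` instead of falling back to the directory name keeps truncated names from silently leaking into reports.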
 
 ### Git / Auth
 - `gh auth refresh` without `-s <scope>` is a no-op for adding scopes. Must use `gh auth refresh -h github.com -s write:packages` explicitly.
 - Environment variables must be **explicitly exported** for Harbor subprocesses. Use `set -a` before sourcing `.env.local`.
 - GitHub push protection blocks synthetic/fake API keys in test data. Use `git reset --soft origin/main` to squash intermediate commits that contained fake credentials.
+- Shallow clones (`--depth 1`) fail on push to GitHub with `remote: fatal: did not receive expected object`. Always use full clones for repos that will be pushed.
+- Some repos use `master` as default branch. Detect with `git symbolic-ref refs/remotes/origin/HEAD` and remap to `main` if needed.
+- GitHub secret scanning blocks pushes containing embedded secrets (Slack webhooks, API keys in source). Users must manually unblock via the provided `/security/secret-scanning/unblock-secret/` URL.
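
For the default-branch note above, a minimal sketch of the detection step follows. It shells out to the same `git symbolic-ref` command named in the bullet. The helper names are hypothetical, and the parsing assumes the command prints a ref of the form `refs/remotes/origin/<branch>`.

```python
import subprocess
from typing import Optional

REMOTE_PREFIX = "refs/remotes/origin/"


def parse_default_branch(symbolic_ref_output: str) -> Optional[str]:
    """Extract the branch name from `git symbolic-ref` output, or return None."""
    ref = symbolic_ref_output.strip()
    if not ref.startswith(REMOTE_PREFIX):
        return None
    # Slicing (rather than splitting on "/") keeps names like "release/1.0" intact.
    return ref[len(REMOTE_PREFIX):] or None


def detect_default_branch(repo_dir: str) -> Optional[str]:
    """Ask git which branch origin/HEAD points at; None if it is not set."""
    result = subprocess.run(
        ["git", "symbolic-ref", "refs/remotes/origin/HEAD"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    return parse_default_branch(result.stdout)
```

Keeping the parsing separate from the subprocess call makes the `master` versus `main` decision testable without a real repository.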
+
+### Python / Subprocess
+- `dict.get(key, default)` does **NOT** protect against `None` values. If key exists with value `None`, the default is not used. Use `data.get("key") or default_value` for Harbor result fields that may be `null`.
+- `with open(log) as f: subprocess.Popen(stdout=f)` closes the file handle immediately after `Popen()` returns. Use `open()` without context manager for long-running subprocesses.
+- macOS ships Bash 3.2 which lacks associative arrays (`declare -A`). Use pipe-delimited string arrays with `IFS='|' read -r` for compatibility.
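
The `dict.get()` gotcha above is easy to reproduce. The sketch below uses a hypothetical parsed `result.json` whose field names (`reward`, `tokens`) are illustrative only. It also shows one caveat of the `or` idiom: it replaces every falsy value, not only `None`, so an explicit `None` check may be preferable where `0` or `""` are legitimate values.

```python
from typing import Any, Mapping


def get_or_default(data: Mapping[str, Any], key: str, default: Any) -> Any:
    """Return data[key], falling back to default when the key is missing OR None."""
    value = data.get(key)
    return default if value is None else value


# Hypothetical parsed result.json: "reward" is JSON null, "tokens" is a real zero.
result = {"reward": None, "tokens": 0}

plain = result.get("reward", 0.0)             # None: the key exists, so the default is ignored
with_or = result.get("reward") or 0.0         # 0.0: None is falsy, so the default applies
safe = get_or_default(result, "reward", 0.0)  # 0.0: explicit None check

zero_lost = result.get("tokens") or -1            # -1: `or` also replaces the legitimate 0
zero_kept = get_or_default(result, "tokens", -1)  # 0: only None triggers the default
```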
+
+### Dashboard / Streamlit
+- Streamlit widget keys in loops must include an index or unique ID to avoid `DuplicateElementKey` errors (e.g., `key=f"nav_{idx}_{page}"` not `key=f"nav_{page}"`).
+- `st.session_state` cannot be modified after widget instantiation. Use `on_click` callback pattern that sets state before widget rerender.
+- Sidebar config below navigation menu is invisible without scrolling. Put critical UI controls in the main content area using `st.columns()`.
+- Always check actual dataclass field names before writing view code. Common mismatches: `agent_results` vs `agent_metrics`, `anomalies` vs `total_anomalies`, dict access vs object attributes.
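
The widget-key rule above can be illustrated without importing Streamlit. The sketch only builds the key strings and checks them for collisions. `make_widget_keys` and `find_duplicate_keys` are hypothetical helpers, and the page list is made up to include a repeated label.

```python
from collections import Counter


def make_widget_keys(prefix: str, labels: list[str]) -> list[str]:
    """Build one key per label, including the loop index so repeats stay unique."""
    return [f"{prefix}_{idx}_{label}" for idx, label in enumerate(labels)]


def find_duplicate_keys(keys: list[str]) -> list[str]:
    """Return the keys that occur more than once, sorted for stable output."""
    return sorted(key for key, count in Counter(keys).items() if count > 1)


# Hypothetical navigation entries where one label appears twice.
pages = ["Overview", "Runs", "Overview"]

naive_keys = [f"nav_{page}" for page in pages]  # collides on "nav_Overview"
indexed_keys = make_widget_keys("nav", pages)   # unique thanks to the index
```

In an app, the indexed key would be passed as the `key=` argument of each widget created in the loop, matching the `key=f"nav_{idx}_{page}"` form in the bullet above.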
+
+### LLM Judge
+- Always include "Respond with valid JSON only (escape all quotes and special characters)" in judge prompts. Unescaped quotes in LLM-generated JSON break parsing.
+- Judge should use task-type-aware evaluation: different rubrics for code implementation, architectural understanding, and bug fix tasks.
+- Tool categorization order matters: check MCP prefix (`mcp__`) before substring checks (e.g., `deep_search`) to avoid miscategorization of `mcp__deep_search`.
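
The ordering issue in the last bullet above can be shown with two small classifiers. The category labels (`"mcp"`, `"deep_search"`, `"other"`) and function names are hypothetical. Only the `mcp__` prefix and the `deep_search` substring come from the note.

```python
def categorize_tool(tool_name: str) -> str:
    """Classify a tool name, checking the MCP prefix before any substring match."""
    if tool_name.startswith("mcp__"):
        return "mcp"
    if "deep_search" in tool_name:
        return "deep_search"
    return "other"


def categorize_tool_wrong_order(tool_name: str) -> str:
    """Same checks in the opposite order, kept only to show the miscategorization."""
    if "deep_search" in tool_name:
        return "deep_search"
    if tool_name.startswith("mcp__"):
        return "mcp"
    return "other"


names = ["mcp__deep_search", "deep_search", "Bash"]
correct = [categorize_tool(name) for name in names]
wrong = [categorize_tool_wrong_order(name) for name in names]
```

With the prefix check first, `mcp__deep_search` is counted as an MCP tool. With the substring check first, it is absorbed into the `deep_search` bucket.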
 
 ## Maintenance
 - Root and local `AGENTS.md` / `CLAUDE.md` files are generated from sources in `docs/ops/`.

CLAUDE.md

Lines changed: 22 additions & 0 deletions
@@ -74,6 +74,8 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
 - Claude Code requires the `--mcp-config` CLI flag to load MCP config -- it does not auto-detect.
 - Inject MCP usage instructions into the task prompt. Agents won't use MCP tools just because they're available.
 - Set `NODE_TLS_REJECT_UNAUTHORIZED=0` for Node.js SSL in Docker containers (curl working does not mean Node.js fetch will work).
+- Sourcegraph MCP uses **stdio transport** (`npx @sourcegraph/cody --stdio`), NOT HTTP endpoints. HTTP 405 from the endpoint means it exists but requires stdio.
+- Sourcegraph skills installed via `npx -y skills add` show empty `"skills": []` in headless/containerized mode. Embed skill prompt content directly in the task's CLAUDE.md instead.
 
 ### Harbor Result Format
 - Timing fields (`started_at`, `finished_at`) live at the **top level** of `result.json`, not nested under `timing`.
@@ -84,11 +86,31 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
 - `validators.py` is duplicated across `ccb_build` tasks. Changes must be applied to **all copies** (verify with `sha256sum`).
 - Install scripts that print "INSTALL_SUCCESS" regardless of actual outcome are common. Always verify the binary exists and is executable.
 - Agent completing in **<2 seconds** = agent never installed/ran (smoke test heuristic).
+- Trial directory names are truncated with hash suffixes (e.g., `c_api_graphql_expert_079_archite__pm9xcPn`). The real task name lives in `config.json` at `task.path`.
 
 ### Git / Auth
 - `gh auth refresh` without `-s <scope>` is a no-op for adding scopes. Must use `gh auth refresh -h github.com -s write:packages` explicitly.
 - Environment variables must be **explicitly exported** for Harbor subprocesses. Use `set -a` before sourcing `.env.local`.
 - GitHub push protection blocks synthetic/fake API keys in test data. Use `git reset --soft origin/main` to squash intermediate commits that contained fake credentials.
+- Shallow clones (`--depth 1`) fail on push to GitHub with `remote: fatal: did not receive expected object`. Always use full clones for repos that will be pushed.
+- Some repos use `master` as default branch. Detect with `git symbolic-ref refs/remotes/origin/HEAD` and remap to `main` if needed.
+- GitHub secret scanning blocks pushes containing embedded secrets (Slack webhooks, API keys in source). Users must manually unblock via the provided `/security/secret-scanning/unblock-secret/` URL.
+
+### Python / Subprocess
+- `dict.get(key, default)` does **NOT** protect against `None` values. If key exists with value `None`, the default is not used. Use `data.get("key") or default_value` for Harbor result fields that may be `null`.
+- `with open(log) as f: subprocess.Popen(stdout=f)` closes the file handle immediately after `Popen()` returns. Use `open()` without context manager for long-running subprocesses.
+- macOS ships Bash 3.2 which lacks associative arrays (`declare -A`). Use pipe-delimited string arrays with `IFS='|' read -r` for compatibility.
+
+### Dashboard / Streamlit
+- Streamlit widget keys in loops must include an index or unique ID to avoid `DuplicateElementKey` errors (e.g., `key=f"nav_{idx}_{page}"` not `key=f"nav_{page}"`).
+- `st.session_state` cannot be modified after widget instantiation. Use `on_click` callback pattern that sets state before widget rerender.
+- Sidebar config below navigation menu is invisible without scrolling. Put critical UI controls in the main content area using `st.columns()`.
+- Always check actual dataclass field names before writing view code. Common mismatches: `agent_results` vs `agent_metrics`, `anomalies` vs `total_anomalies`, dict access vs object attributes.
+
+### LLM Judge
+- Always include "Respond with valid JSON only (escape all quotes and special characters)" in judge prompts. Unescaped quotes in LLM-generated JSON break parsing.
+- Judge should use task-type-aware evaluation: different rubrics for code implementation, architectural understanding, and bug fix tasks.
+- Tool categorization order matters: check MCP prefix (`mcp__`) before substring checks (e.g., `deep_search`) to avoid miscategorization of `mcp__deep_search`.
 
 ## Maintenance
 - Root and local `AGENTS.md` / `CLAUDE.md` files are generated from sources in `docs/ops/`.

docs/ops/ROOT_AGENT_GUIDE.md

Lines changed: 22 additions & 0 deletions
@@ -74,6 +74,8 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
 - Claude Code requires the `--mcp-config` CLI flag to load MCP config -- it does not auto-detect.
 - Inject MCP usage instructions into the task prompt. Agents won't use MCP tools just because they're available.
 - Set `NODE_TLS_REJECT_UNAUTHORIZED=0` for Node.js SSL in Docker containers (curl working does not mean Node.js fetch will work).
+- Sourcegraph MCP uses **stdio transport** (`npx @sourcegraph/cody --stdio`), NOT HTTP endpoints. HTTP 405 from the endpoint means it exists but requires stdio.
+- Sourcegraph skills installed via `npx -y skills add` show empty `"skills": []` in headless/containerized mode. Embed skill prompt content directly in the task's CLAUDE.md instead.
 
 ### Harbor Result Format
 - Timing fields (`started_at`, `finished_at`) live at the **top level** of `result.json`, not nested under `timing`.
@@ -84,11 +86,31 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
 - `validators.py` is duplicated across `ccb_build` tasks. Changes must be applied to **all copies** (verify with `sha256sum`).
 - Install scripts that print "INSTALL_SUCCESS" regardless of actual outcome are common. Always verify the binary exists and is executable.
 - Agent completing in **<2 seconds** = agent never installed/ran (smoke test heuristic).
+- Trial directory names are truncated with hash suffixes (e.g., `c_api_graphql_expert_079_archite__pm9xcPn`). The real task name lives in `config.json` at `task.path`.
 
 ### Git / Auth
 - `gh auth refresh` without `-s <scope>` is a no-op for adding scopes. Must use `gh auth refresh -h github.com -s write:packages` explicitly.
 - Environment variables must be **explicitly exported** for Harbor subprocesses. Use `set -a` before sourcing `.env.local`.
 - GitHub push protection blocks synthetic/fake API keys in test data. Use `git reset --soft origin/main` to squash intermediate commits that contained fake credentials.
+- Shallow clones (`--depth 1`) fail on push to GitHub with `remote: fatal: did not receive expected object`. Always use full clones for repos that will be pushed.
+- Some repos use `master` as default branch. Detect with `git symbolic-ref refs/remotes/origin/HEAD` and remap to `main` if needed.
+- GitHub secret scanning blocks pushes containing embedded secrets (Slack webhooks, API keys in source). Users must manually unblock via the provided `/security/secret-scanning/unblock-secret/` URL.
+
+### Python / Subprocess
+- `dict.get(key, default)` does **NOT** protect against `None` values. If key exists with value `None`, the default is not used. Use `data.get("key") or default_value` for Harbor result fields that may be `null`.
+- `with open(log) as f: subprocess.Popen(stdout=f)` closes the file handle immediately after `Popen()` returns. Use `open()` without context manager for long-running subprocesses.
+- macOS ships Bash 3.2 which lacks associative arrays (`declare -A`). Use pipe-delimited string arrays with `IFS='|' read -r` for compatibility.
+
+### Dashboard / Streamlit
+- Streamlit widget keys in loops must include an index or unique ID to avoid `DuplicateElementKey` errors (e.g., `key=f"nav_{idx}_{page}"` not `key=f"nav_{page}"`).
+- `st.session_state` cannot be modified after widget instantiation. Use `on_click` callback pattern that sets state before widget rerender.
+- Sidebar config below navigation menu is invisible without scrolling. Put critical UI controls in the main content area using `st.columns()`.
+- Always check actual dataclass field names before writing view code. Common mismatches: `agent_results` vs `agent_metrics`, `anomalies` vs `total_anomalies`, dict access vs object attributes.
+
+### LLM Judge
+- Always include "Respond with valid JSON only (escape all quotes and special characters)" in judge prompts. Unescaped quotes in LLM-generated JSON break parsing.
+- Judge should use task-type-aware evaluation: different rubrics for code implementation, architectural understanding, and bug fix tasks.
+- Tool categorization order matters: check MCP prefix (`mcp__`) before substring checks (e.g., `deep_search`) to avoid miscategorization of `mcp__deep_search`.
 
 ## Maintenance
 - Root and local `AGENTS.md` / `CLAUDE.md` files are generated from sources in `docs/ops/`.

docs/ops/SCRIPT_INDEX.md

Lines changed: 0 additions & 15 deletions
@@ -32,13 +32,9 @@ Generated from `scripts/registry.json` by `scripts/generate_script_index.py`.
 
 ## Analysis & Comparison
 
-- `scripts/analyze_harness_design.py` - Analysis/comparison script for analyze harness design.
 - `scripts/analyze_mcp_unique_haiku.py` - Analysis/comparison script for analyze mcp unique haiku.
-- `scripts/analyze_minimum_subset.py` - Analysis/comparison script for analyze minimum subset.
 - `scripts/analyze_paired_cost_official_raw.py` - Analysis/comparison script for analyze paired cost official raw.
-- `scripts/analyze_rq_power.py` - Analysis/comparison script for analyze rq power.
 - `scripts/analyze_run_coverage.py` - Analysis/comparison script for analyze run coverage.
-- `scripts/analyze_size_effects.py` - Analysis/comparison script for analyze size effects.
 - `scripts/audit_traces.py` - Analysis/comparison script for audit traces.
 - `scripts/compare_configs.py` - Compares benchmark outcomes across configs on matched task sets.
 - `scripts/comprehensive_analysis.py` - Analysis/comparison script for comprehensive analysis.
@@ -115,7 +111,6 @@ Generated from `scripts/registry.json` by `scripts/generate_script_index.py`.
 
 ## Infra & Mirrors
 
-- `scripts/build_conversation_db.py` - Infrastructure or mirror management script for build conversation db.
 - `scripts/build_core_manifest.py` - Infrastructure or mirror management script for build core manifest.
 - `scripts/build_daytona_registry.py` - Infrastructure or mirror management script for build daytona registry.
 - `scripts/build_linux_base_images.sh` - Infrastructure or mirror management script for build linux base images.
@@ -188,19 +183,15 @@ Generated from `scripts/registry.json` by `scripts/generate_script_index.py`.
 - `scripts/check_harness_readiness.py` - Utility script for check harness readiness.
 - `scripts/collect_repo_cloc.py` - Utility script for collect repo cloc.
 - `scripts/compare_contextbench_results.py` - Utility script for compare contextbench results.
-- `scripts/compare_old_new_ground_truth.py` - Utility script for compare old new ground truth.
-- `scripts/compute_analysis_ir_metrics.py` - Utility script for compute analysis ir metrics.
 - `scripts/compute_bootstrap_cis.py` - Utility script for compute bootstrap cis.
 - `scripts/context_retrieval_agent.py` - Utility script for context retrieval agent.
 - `scripts/control_plane.py` - Utility script for control plane.
 - `scripts/convert_harbor_to_contextbench.py` - Utility script for convert harbor to contextbench.
 - `scripts/cross_validate_gt.py` - Utility script for cross validate gt.
 - `scripts/cross_validate_oracles.py` - Utility script for cross validate oracles.
-- `scripts/daytona_cost_guard.py` - Utility script for daytona cost guard.
 - `scripts/daytona_curator_runner.py` - Utility script for daytona curator runner.
 - `scripts/daytona_poc_runner.py` - Utility script for daytona poc runner.
 - `scripts/daytona_runner.py` - Utility script for daytona runner.
-- `scripts/daytona_snapshot_cleanup.py` - Utility script for daytona snapshot cleanup.
 - `scripts/dependeval_eval_dr.py` - Utility script for dependeval eval dr.
 - `scripts/dependeval_eval_me.py` - Utility script for dependeval eval me.
 - `scripts/derive_n_repos.py` - Utility script for derive n repos.
@@ -209,8 +200,6 @@ Generated from `scripts/registry.json` by `scripts/generate_script_index.py`.
 - `scripts/doe_select_tasks.py` - Utility script for doe select tasks.
 - `scripts/ds_hybrid_retrieval.py` - Utility script for ds hybrid retrieval.
 - `scripts/ds_wrapper.sh` - Utility script for ds wrapper.
-- `scripts/export_conversation_blog_assets.py` - Utility script for export conversation blog assets.
-- `scripts/export_engineering_diary_assets.py` - Utility script for export engineering diary assets.
 - `scripts/export_official_results.py` - Utility script for export official results.
 - `scripts/extract_analysis_metrics.py` - Utility script for extract analysis metrics.
 - `scripts/extract_build_diary.py` - Utility script for extract build diary.
@@ -234,8 +223,6 @@ Generated from `scripts/registry.json` by `scripts/generate_script_index.py`.
 - `scripts/plot_build_diary.py` - Utility script for plot build diary.
 - `scripts/plot_build_diary_supplementary.py` - Utility script for plot build diary supplementary.
 - `scripts/plot_build_narrative.py` - Utility script for plot build narrative.
-- `scripts/plot_conversation_blog_svgs.py` - Utility script for plot conversation blog svgs.
-- `scripts/plot_csb_mcp_blog_figures.py` - Utility script for plot csb mcp blog figures.
 - `scripts/prepare_analysis_runs.py` - Utility script for prepare analysis runs.
 - `scripts/promote_agent_oracles.py` - Utility script for promote agent oracles.
 - `scripts/promote_blocked.py` - Utility script for promote blocked.
@@ -256,8 +243,6 @@ Generated from `scripts/registry.json` by `scripts/generate_script_index.py`.
 - `scripts/run_judge.py` - Utility script for run judge.
 - `scripts/run_missing_oracles.sh` - Utility script for run missing oracles.
 - `scripts/run_scaling_gap_oracles.sh` - Utility script for run scaling gap oracles.
-- `scripts/run_sg_local.sh` - Utility script for run sg local.
-- `scripts/run_sg_validation.py` - Utility script for run sg validation.
 - `scripts/scaffold_contextbench_tasks.py` - Utility script for scaffold contextbench tasks.
 - `scripts/scaffold_feature_tasks.py` - Utility script for scaffold feature tasks.
 - `scripts/scaffold_refactor_tasks.py` - Utility script for scaffold refactor tasks.
