
Commit 437f109

sjarmak and cursoragent committed
chore: clean up stale ralph dirs, orphaned docs, and add Cursor rules
- Remove all 20 completed ralph directories (all stories passed)
- Remove root-level ralph state (prd.json, progress.txt)
- Remove stale leaderboard snapshots (LEADERBOARD_RESULTS.md, leaderboard.json)
- Archive 10 orphaned docs (zero references) to docs/archive/
- Update .gitignore: drop ralph-ent-governance exception, add tmp/cache/eval_reports
- Add .cursor/rules/ with 10 skill files from ~/.claude/skills for Cursor IDE
- Fix false-positive docs consistency check in EXTENSIBILITY.md

Co-authored-by: Cursor <cursoragent@cursor.com>
1 parent f21632e commit 437f109

82 files changed: +1538 −7831 lines


.cursor/rules/agent-delegation.mdc

Lines changed: 139 additions & 0 deletions
@@ -0,0 +1,139 @@
---
description: Agent delegation skills — route and delegate tasks to the best AI agent (Codex, Cursor, Gemini, Copilot) based on task type and complexity. Use when asked to delegate work to other AI agents.
---

# Delegate Skill

Routes coding tasks to the optimal AI CLI agent and delegates execution.

## Usage
```
/delegate <task description>
```

## Routing

Run the CLI router:
```bash
python3 ~/agent-skills/router-service/route_cli.py "<task>"
```

Options: `--prefer-speed`, `--prefer-cost`, `--compact`, `--exclude codex cursor`

## Agent Capabilities

| Agent | Role | Best For | Model |
|-------|------|----------|-------|
| **copilot** | Opus Specialist | Complex reasoning, security analysis, deep reviews | claude-opus-4.5 |
| **cursor** | Structural Work | Planning, architecture, refactoring, debugging, code review | --model auto |
| **gemini** | Code Gen Engine | Code generation, research, documentation | gemini-3-pro |
| **codex** | Misc Coding | Algorithms, utility scripts | gpt-5.2-codex |

## Task Type Routing

| Task Type | Agent | Why |
|-----------|-------|-----|
| planning/architecture | Cursor | Plan mode for structured planning |
| code_review | Cursor | Ask mode for thorough review |
| security_review | Copilot | Opus for deep security analysis |
| refactoring | Cursor | Agent mode, repo-aware |
| debugging | Cursor | Systematic debugging |
| debugging_complex | Copilot | Race conditions, concurrency |
| code_generation | Gemini | Primary code gen |
| research/exploration | Gemini | Codebase analysis |
| documentation | Gemini | Clear, structured docs |
| algorithms | Codex | Algorithmic optimization |

## Execution Steps
1. Route task via CLI
2. Parse JSON response (`selected_agent`, `confidence`, `reasoning`)
3. Inform user and IMMEDIATELY delegate via Task tool
4. Wait for driver agent to complete (handles orchestration)
5. Return compressed results
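Step 2 can be sketched as follows. The field names `selected_agent`, `confidence`, and `reasoning` come from the step list above; the exact shape of `route_cli.py`'s output is otherwise an assumption, and the example reply is illustrative:

```python
import json

def parse_route(raw: str) -> tuple[str, float, str]:
    """Extract the routing decision from the router's JSON reply.

    Assumes the reply is a JSON object with at least the three
    fields named in step 2; other schema details are hypothetical.
    """
    decision = json.loads(raw)
    return (
        decision["selected_agent"],
        float(decision["confidence"]),
        decision["reasoning"],
    )

# Illustrative reply, not real router output:
reply = '{"selected_agent": "cursor", "confidence": 0.91, "reasoning": "refactoring task"}'
agent, confidence, reasoning = parse_route(reply)
```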
## Fallback
- Default to gemini for research/analysis
- Default to cursor for code modification

---

# Codex CLI Guide

## Command Templates
```bash
export PATH="/opt/homebrew/bin:$HOME/.local/bin:$PATH" && codex exec --skip-git-repo-check --sandbox <MODE> "<prompt>" 2>/dev/null
```

## Sandbox Modes
| Task Type | Mode |
|-----------|------|
| Analysis, Q&A | `--sandbox read-only` |
| Create/edit files | `--sandbox workspace-write --full-auto` |
| Network access | `--sandbox danger-full-access --full-auto` |

## Models
- Default: `-m gpt-5.2-codex`
- General: `-m gpt-5.2`
- Reasoning: `--config model_reasoning_effort="xhigh|high|medium|low"`

---

# Cursor CLI Guide

**The CLI command is `agent`, NOT `cursor`.**

## Command Templates
```bash
export PATH="$HOME/.local/bin:$PATH" && agent -p "<prompt>" --model auto --output-format text
```

## Modes
- `ask` — read-only analysis
- `agent` — full file editing
- `plan` — planning without execution
- Default: auto-selected by task type

## Models
- `--model auto` (recommended), `--model gpt-5.2`, `--model gemini-3-flash`, `--model opus-4.5-thinking`

---

# Copilot CLI Guide

## Command Templates
```bash
export PATH="$HOME/.local/share/mise/installs/node/24.13.0/bin:$HOME/.local/bin:$PATH" && copilot -p "<prompt>" --allow-all-paths
```

## Permission Modes
| Task Type | Flags |
|-----------|-------|
| Analysis, Q&A | (none) |
| File edits | `--allow-all-paths` |
| URL access | `--allow-all-urls` |

## Models
- Default: `--model claude-sonnet-4.5`
- Fast: `--model claude-haiku-4.5`
- Complex: `--model claude-opus-4.5`

---

# Gemini CLI Guide

## Command Templates
```bash
export PATH="$HOME/.local/share/mise/installs/node/24.13.0/bin:$HOME/.local/bin:$PATH" && NODE_OPTIONS="--no-warnings" gemini -p "<prompt>" --yolo
```

## Modes
- Read-only: no flags
- File edits: `--yolo`
- Controlled: `--approval-mode auto_edit`

## Models
- Default: `gemini-2.5-pro`
- Fast: `-m gemini-2.5-flash`

## Session Management
- `--list-sessions`, `--resume latest`, `--resume <id>`

.cursor/rules/ccb-analysis.mdc

Lines changed: 197 additions & 0 deletions
@@ -0,0 +1,197 @@
---
description: CCB analysis skills — compare configs, audit MCP usage, IR quality metrics, cost analysis, and trace evaluation. Use when analyzing benchmark results, comparing configurations, or investigating MCP impact.
globs:
- scripts/compare_configs.py
- scripts/mcp_audit.py
- scripts/ir_analysis.py
- scripts/cost_report.py
- scripts/audit_traces.py
---

# Compare Configs

Compare results between agent configurations to find signal about MCP tool impact.

## Steps

### 1. Run the comparison
```bash
cd ~/CodeContextBench && python3 scripts/compare_configs.py --format json
```

### 2. Present results as tables

**Overall pass rates** by config, **divergence analysis** (stable, all-fail, divergent), and a **divergent task detail table**.

Focus on: biggest winner, where MCP helps, where MCP hurts, all-fail tasks.
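The three divergence buckets can be sketched as a small decision rule. The label semantics here are my reading of the skill text (stable = every config passes, all-fail = every config fails, divergent = mixed); the real script's definitions may differ:

```python
def classify_divergence(results: dict[str, bool]) -> str:
    """Bucket one task given a mapping of config name -> passed."""
    outcomes = set(results.values())
    if outcomes == {True}:
        return "stable"
    if outcomes == {False}:
        return "all_fail"
    return "divergent"
```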
27+
28+
### 3. MCP-conditioned analysis (optional)
29+
30+
```bash
31+
python3 scripts/mcp_audit.py --paired-only --json --verbose 2>/dev/null
32+
```
33+
34+
Separates used-MCP vs zero-MCP tasks. Present reward delta table by intensity bucket.
35+
36+
### Variants
37+
```bash
38+
python3 scripts/compare_configs.py --suite ccb_pytorch --format json
39+
python3 scripts/compare_configs.py --divergent-only --format json
40+
python3 scripts/compare_configs.py --format table
41+
```
42+
43+
---
44+
45+
# MCP Audit
46+
47+
Analyze MCP (Sourcegraph) tool usage across benchmark runs.
48+
49+
## What This Does
50+
51+
`scripts/mcp_audit.py`:
52+
1. Collects `task_metrics.json` from paired_rerun batches
53+
2. Pairs baseline vs sourcegraph_full tasks
54+
3. Classifies by MCP usage: zero-MCP vs used-MCP (light/moderate/heavy)
55+
4. Computes reward and time deltas conditioned on actual MCP usage
56+
5. Identifies negative flips
57+
58+
## Steps
59+
60+
### 1. Run the audit
61+
```bash
62+
cd ~/CodeContextBench && python3 scripts/mcp_audit.py --json --verbose 2>/dev/null
63+
```
64+
65+
### 2. Present key findings
66+
67+
Tables: Overview, per-benchmark MCP adoption, reward deltas (used-MCP only), timing deltas.
68+
69+
### 3. Investigate zero-MCP tasks
70+
71+
Classify: trivially local, explicit file list, full local codebase, both configs failed, agent confusion.
72+
73+
### 4. Check for negative flips
74+
75+
Tasks where baseline passes but SG_full fails.
76+
77+
### 5. MCP tool distribution
78+
79+
Show which tools are most/least used.
80+
81+
### 6. Summary and recommendations
82+
83+
MCP value, MCP risk, optimization opportunities, cost-benefit.
84+
85+
### Variants
86+
```bash
87+
python3 scripts/mcp_audit.py --all-runs --json --verbose
88+
python3 scripts/mcp_audit.py --verbose # text output
89+
```
90+
91+
### Key Technical Notes
92+
- Transcript-first extraction: Tool counts from `claude-code.txt`, NOT `trajectory.json`
93+
- Paired reruns: BL and SF concurrent on same VM
94+
- MCP tool name variants: `sg_` prefix or not, script handles both
95+
---

# IR Analysis

Measure how well agents find the right files, comparing baseline vs MCP retrieval against ground truth.

## Steps

### 1. Ensure ground truth is built
```bash
cd ~/CodeContextBench && python3 scripts/ir_analysis.py --build-ground-truth
```

### 2. Run the IR analysis
```bash
cd ~/CodeContextBench && python3 scripts/ir_analysis.py --json 2>/dev/null
```

### 3. Present key findings

Per-benchmark IR scores, overall aggregates, statistical tests.

Key metrics: file recall, MRR, context efficiency, P@K.
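Two of these are standard IR quantities and can be sketched directly: file recall is the fraction of ground-truth files the agent retrieved, and MRR is the mean over tasks of the reciprocal rank of the first relevant file. Function names are mine, not `ir_analysis.py`'s:

```python
def file_recall(retrieved: list[str], ground_truth: list[str]) -> float:
    """Fraction of ground-truth files appearing anywhere in the retrieved list."""
    truth = set(ground_truth)
    if not truth:
        return 0.0
    return len(truth & set(retrieved)) / len(truth)

def reciprocal_rank(retrieved: list[str], ground_truth: list[str]) -> float:
    """1/rank of the first ground-truth file retrieved; 0.0 if none found.
    MRR is the mean of this value over all tasks."""
    truth = set(ground_truth)
    for rank, path in enumerate(retrieved, start=1):
        if path in truth:
            return 1.0 / rank
    return 0.0

file_recall(["a.py", "b.py"], ["b.py", "c.py"])      # -> 0.5
reciprocal_rank(["a.py", "b.py"], ["b.py", "c.py"])  # -> 0.5
```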
### Variants
```bash
python3 scripts/ir_analysis.py --per-task --json 2>/dev/null
python3 scripts/ir_analysis.py --suite ccb_swebenchpro 2>/dev/null
```

### Ground Truth Sources

| Benchmark | Strategy | Confidence |
|-----------|----------|:----------:|
| SWE-bench Pro | Patch headers | high |
| PyTorch | Diff headers | high |
| K8s Docs | Directory listing | high |
| Governance/Enterprise | Test script paths | medium |
| Others | Instruction regex | low |

---

# Cost Report

Analyze token usage and estimated cost across benchmark runs.

## Steps
```bash
cd ~/CodeContextBench && python3 scripts/cost_report.py
```

Shows: total cost/tokens/hours, per-suite and per-config breakdown, config cost comparison, top 10 most expensive tasks.

### Variants
```bash
python3 scripts/cost_report.py --suite ccb_pytorch
python3 scripts/cost_report.py --config sourcegraph_full
python3 scripts/cost_report.py --format json
```

---

# Evaluate Traces

Comprehensive evaluation of benchmark run traces: data integrity, output quality, efficiency analysis.

## Phases

### Phase 1: Scope Selection
- MANIFEST: `runs/official/MANIFEST.json`
- Audit script: `python3 scripts/audit_traces.py [--json] [--suite X] [--config X]`

### Phase 2: Data Integrity
- MCP adoption validation (transcript-first; check both `sg_` prefix variants)
- Baseline contamination check (zero `mcp__sourcegraph` calls)
- Infrastructure failure detection (zero-token, crash, null-token H3 bug)
- Dedup integrity

### Phase 3: Output Quality
- Per-suite reward analysis
- Cross-config comparison (matched tasks)
- Task-level quality patterns (MCP helps/hurts/neutral)

### Phase 4: Efficiency
- Token usage and cost estimates
- Wall clock time deltas
- MCP tool distribution
- Cost-effectiveness ratios

### Phase 5: Synthesis
Write the report to `docs/TRACE_AUDIT_<date>.md`.

## Known Patterns
1. Zero-token (int 0) = auth failures
2. Null-token + no trajectory + <=5 transcript lines = crash failures
3. Null-token + valid rewards = H3 token-logging bug (not failures)
4. MCP distraction on TAC
5. Deep Search unused (~1%)
6. SWE-Perf regression under SG_base
7. Subagent MCP calls hidden in trajectory.json (visible only in claude-code.txt)
8. Zero-MCP is ~80% rational
9. Monotonic MCP intensity-reward relationship: Light +2.2%, Moderate +3.6%, Heavy +6.1%
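Patterns 1-3 amount to a small decision rule over per-task metrics. A sketch, where the 5-line threshold and the zero/null distinction come from the list above but the function and its field names are hypothetical, not the audit script's API:

```python
def classify_token_pattern(total_tokens, has_trajectory, transcript_lines, reward):
    """Apply known patterns 1-3 to one task's recorded metrics."""
    if total_tokens == 0:                       # pattern 1: zero-token (int 0)
        return "auth_failure"
    if total_tokens is None:                    # null token count
        if not has_trajectory and transcript_lines <= 5:
            return "crash_failure"              # pattern 2
        if reward is not None:
            return "h3_token_logging_bug"       # pattern 3: not a real failure
    return "normal"
```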
