@@ -6,6 +6,63 @@ This repository contains **benchmark task definitions**, **evaluation configs**,
 
 ---
 
+## Quickstart (Public / First-Time Users)
+
+### Who this repo is for
+
+- Researchers evaluating coding agents on realistic software engineering tasks
+- Practitioners comparing baseline vs MCP-enabled agent configurations
+- Contributors authoring new benchmark tasks or extending evaluation tooling
+
+### What you can do without Harbor
+
+You can inspect task definitions, run validation and analysis scripts, and use the metrics/report pipeline on existing Harbor run outputs.
+
+```bash
+git clone https://github.com/sjarmak/CodeContextBench.git
+cd CodeContextBench
+
+# Fast repo sanity check (docs/config refs)
+python3 scripts/repo_health.py --quick
+
+# Explore task-based docs navigation
+sed -n '1,120p' docs/START_HERE_BY_TASK.md
+
+# Inspect available benchmark suites
+ls benchmarks
+```
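As one illustration of offline analysis on existing run outputs, the sketch below counts passing and failing tasks. The `runs/<run_id>/*.json` layout and the top-level `"passed"` field are assumptions of mine, not this repo's documented output format; adapt the paths and field names to whatever your Harbor runs actually produce.

```shell
# Hypothetical sketch -- the runs/<run_id>/*.json layout and the "passed"
# field are assumptions, not this repo's documented output format.
summarize_runs() {
  local dir="$1" pass=0 fail=0 f
  for f in "$dir"/*.json; do
    [ -e "$f" ] || continue  # skip when the glob matches nothing
    if python3 -c 'import json,sys; sys.exit(0 if json.load(open(sys.argv[1])).get("passed") else 1)' "$f"; then
      pass=$((pass + 1))
    else
      fail=$((fail + 1))
    fi
  done
  echo "passed=$pass failed=$fail"
}

# Example (hypothetical run directory):
#   summarize_runs runs/2024-01-01_baseline
```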
+
+### What requires Harbor (benchmark execution)
+
+Running benchmark tasks requires:
+
+- [Harbor](https://github.com/laude-institute/harbor/tree/main) installed and configured
+- Docker
+- Valid agent/runtime credentials used by your Harbor setup
+- A Max subscription (for the default harness path documented in this repo)
+
+Recommended pre-run checks:
+
+```bash
+python3 scripts/check_infra.py
+python3 scripts/validate_tasks_preflight.py --all
+```
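If you want these checks to gate a scripted run, a thin wrapper that stops at the first failure is enough. The `run_preflight` helper below is a sketch of mine, not a script shipped in this repo; it takes check commands as arguments, so the two scripts above would simply be passed in.

```shell
# Sketch of a preflight gate (not a repo script): run each command in
# order and abort on the first failure.
run_preflight() {
  local cmd
  for cmd in "$@"; do
    if ! eval "$cmd"; then
      echo "preflight failed: $cmd" >&2
      return 1
    fi
  done
  echo "all preflight checks passed"
}

# In this repo you would call:
#   run_preflight "python3 scripts/check_infra.py" \
#                 "python3 scripts/validate_tasks_preflight.py --all"
```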
+
+Then start with a dry run:
+
+```bash
+bash configs/run_selected_tasks.sh --dry-run
+```
+
+### First places to read
+
+- `docs/START_HERE_BY_TASK.md` for task-oriented navigation
+- `docs/CONFIGS.md` for the 2-config evaluation matrix
+- `docs/EVALUATION_PIPELINE.md` for scoring and reporting outputs
+- `docs/REPO_HEALTH.md` for the pre-push health gate
+
+---
+
 ## Benchmark Suites (SDLC-Aligned)
 
 Eight suites organized by software development lifecycle phase:
@@ -170,6 +227,8 @@ For the full multi-layer evaluation pipeline (verifier, LLM judge, statistical a
 
 ## Running with Harbor
 
+This section assumes Harbor is already installed and configured. If not, start with the Quickstart section above and run `python3 scripts/check_infra.py`.
+
 ### SDLC Tasks
 
 The unified runner executes all 170 SDLC tasks across the 2-config matrix:
@@ -218,8 +277,6 @@ bash configs/run_selected_tasks.sh --selection-file configs/selected_mcp_unique_
 
 All runners support `--baseline-only`, `--full-only`, `--task TASK_ID`, and `--parallel N` flags.
 
-Requires [Harbor](https://github.com/laude-institute/harbor/tree/main) installed and configured with a Max subscription.
-
 ---
 
 ## Quality Assurance & Validation