Benchmark framework that measures how CodeGuard rules affect the security quality of code written by AI agents.

Architecture:
- Docker container per scenario (OpenCode + qwen/qwen3-coder-next)
- Each scenario runs twice: with and without CodeGuard skills
- LLM judge (gpt-5.4-mini via OpenRouter) evaluates the full diff
- Async orchestration with configurable parallelism (up to 10)

Components:
- benchmarks/models.py: Pydantic models with token usage tracking
- benchmarks/config.py: YAML-first config, .env only for API creds
- benchmarks/orchestrator.py: asyncio + Docker SDK
- benchmarks/judge.py: holistic security review of the entire diff
- benchmarks/report.py: per-scenario and per-category aggregation
- benchmarks/run.py: CLI with --dry-run, --scenario, --runs, --parallel
- benchmarks/docker/: Dockerfile + entrypoint for container lifecycle
- benchmarks/scenarios/javavulnerablelab.yaml: 20 realistic coding tasks across 10 vulnerability categories

Usage:
    python -m benchmarks.run --dry-run
    python -m benchmarks.run --scenario jvl-feat-001 --runs 1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
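The async orchestration with capped parallelism can be sketched roughly like this (a hypothetical illustration, not the real orchestrator.py: `run_container` stands in for the "launch Docker, run agent, collect diff" step):

```python
import asyncio

MAX_PARALLEL = 10
MODES = ("with_skills", "without_skills")

async def run_container(scenario: str, mode: str) -> dict:
    # Placeholder for the real container lifecycle (docker SDK, agent run).
    await asyncio.sleep(0)
    return {"scenario": scenario, "mode": mode}

async def run_all(scenarios: list[str]) -> list[dict]:
    sem = asyncio.Semaphore(MAX_PARALLEL)

    async def guarded(scenario: str, mode: str) -> dict:
        async with sem:  # at most MAX_PARALLEL containers in flight
            return await run_container(scenario, mode)

    # Every scenario runs twice: once per mode.
    return await asyncio.gather(
        *(guarded(s, m) for s in scenarios for m in MODES)
    )

results = asyncio.run(run_all(["jvl-feat-001", "jvl-feat-005"]))
```

A semaphore keeps the fan-out simple: tasks are all created up front, but only the configured number run concurrently.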
- Use a pre-downloaded opencode binary (COPY) instead of curl in the Dockerfile to work around flaky GitHub access during Docker builds
- Fix model format: openrouter/qwen/qwen3-coder (not qwen3-coder-next)
- Pass the OPENROUTER_API_KEY env var to containers (what opencode expects)
- Add an opencode.json provider config step in the entrypoint
- Use `opencode run -m ... --dangerously-skip-permissions --format json` for non-interactive execution with JSON output

Tested: 2 scenarios × 2 runs = 8 containers, all succeeded. The agent writes code; the judge finds CWE-89/CWE-79 vulnerabilities.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
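Assembled from Python, the non-interactive invocation looks roughly like this (the flags are the ones named in this change; the model id and task prompt are examples):

```python
import subprocess  # the actual call is commented out below

# Non-interactive opencode invocation as described above.
cmd = [
    "opencode", "run",
    "-m", "openrouter/qwen/qwen3-coder",
    "--dangerously-skip-permissions",
    "--format", "json",
    "Add a search endpoint to the product catalog",  # example task prompt
]

# Inside the container this runs with OPENROUTER_API_KEY in the environment:
# subprocess.run(cmd, capture_output=True, text=True)
```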
- entrypoint.sh: captures opencode JSON event stream as trace.json,
separates stdout (JSON events) and stderr, fixes token parsing
to use opencode's format ("input"/"output" not "prompt_tokens")
- orchestrator.py: extracts trace.json from containers, passes
BENCH_DEBUG env var
- run.py: --debug flag saves per-container artifacts to results/debug/:
{scenario}_{mode}_{run}_trace.json (full event stream)
{scenario}_{mode}_{run}_agent.log (raw opencode output)
{scenario}_{mode}_{run}.diff (git diff)
- models.py: ContainerResult.agent_trace field for JSON events
Tested: debug traces show the full agent workflow (glob→read→write→text)
with per-step token counts and cost tracking.
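The artifact naming scheme above can be captured in a small helper (hypothetical, the real run.py may build these paths differently):

```python
from pathlib import Path

def debug_paths(root: Path, scenario: str, mode: str, run: int) -> dict[str, Path]:
    # Mirrors the {scenario}_{mode}_{run} naming used under results/debug/.
    stem = f"{scenario}_{mode}_{run}"
    return {
        "trace": root / f"{stem}_trace.json",     # full event stream
        "agent_log": root / f"{stem}_agent.log",  # raw opencode output
        "diff": root / f"{stem}.diff",            # git diff
    }

paths = debug_paths(Path("results/debug"), "jvl-feat-001", "with_skills", 1)
```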
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
OpenCode does not scan .opencode/skills/ automatically.
The only way to inject security rules is through opencode.json:
{"agent": {"build": {"instructions": "...rules content..."}}}
entrypoint.sh now concatenates SKILL.md + all 23 rule files into
a single instructions blob, JSON-escapes it via Python, and writes
opencode.json. This lands in the agent's system prompt.
The .opencode/skills/ file copy approach is removed.
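The concatenate-and-escape step can be sketched like this (paths and the helper name are illustrative; `json.dumps` performs the escaping the entrypoint delegates to Python):

```python
import json
import tempfile
from pathlib import Path

def build_opencode_config(skill_dir: Path) -> str:
    # Concatenate SKILL.md plus the rule files into one instructions blob
    # and place it under agent.build.instructions, as described above.
    parts = [p.read_text() for p in sorted(skill_dir.glob("*.md"))]
    config = {"agent": {"build": {"instructions": "\n\n".join(parts)}}}
    return json.dumps(config)  # JSON escaping happens here

# Demo with a stand-in skill directory instead of the real 23 rule files:
with tempfile.TemporaryDirectory() as d:
    (Path(d) / "SKILL.md").write_text("Always use parameterized queries.")
    config_json = build_opencode_config(Path(d))
```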
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
In with_skills mode, append "Before writing any code, check if there
are security skills installed and apply them" to the user prompt.
This triggers the agent to call the built-in `skill` tool which
loads CodeGuard rules from .opencode/skills/.
without_skills mode gets the original prompt unchanged — clean baseline.
Results: with hint, agent calls skill("software-security") first,
then uses PreparedStatement for SQL queries. Without hint/skills,
agent uses plain Statement with string concatenation.
Model changed to openrouter/qwen/qwen3.6-plus (generates valid Java).
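The prompt split can be expressed as a tiny helper (the hint text is quoted from this change; the function name is illustrative):

```python
SKILL_HINT = ("Before writing any code, check if there are security skills "
              "installed and apply them")

def build_prompt(task: str, with_skills: bool) -> str:
    # without_skills keeps the original task untouched: a clean baseline.
    return f"{task}\n\n{SKILL_HINT}" if with_skills else task

baseline = build_prompt("Implement the product search query", with_skills=False)
hinted = build_prompt("Implement the product search query", with_skills=True)
```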
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 retries with delays [5s, 15s, 30s] for ConnectError, ReadTimeout, and 5xx errors. 4xx errors (402 etc.) fail fast.

Benchmark result (qwen3.6-plus, 2 scenarios × 2 runs):
    With skills:    4.5 avg score
    Without skills: 0.0 avg score
    Delta:          +4.5

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
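A minimal sketch of that retry policy, assuming a simplified stand-in for the HTTP client's error type (the real code presumably catches httpx/requests exceptions):

```python
import time

RETRY_DELAYS = [5, 15, 30]  # seconds between attempts

class HTTPStatusError(Exception):
    """Stand-in error carrying an HTTP status code."""
    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status

def call_with_retries(fn, sleep=time.sleep):
    last_exc = None
    for attempt in range(len(RETRY_DELAYS) + 1):  # 1 try + 3 retries
        try:
            return fn()
        except (ConnectionError, TimeoutError, HTTPStatusError) as exc:
            if isinstance(exc, HTTPStatusError) and exc.status < 500:
                raise  # 4xx (402 etc.): fail fast, no retry
            last_exc = exc
            if attempt < len(RETRY_DELAYS):
                sleep(RETRY_DELAYS[attempt])
    raise last_exc

# Example: a call that returns 503 twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise HTTPStatusError(503)
    return "ok"

result = call_with_retries(flaky, sleep=lambda s: None)  # skip real sleeps
```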
Judge now uses CVSS v3.1 methodology with reference scores for common CWE types. Each vulnerability is tagged with its CVSS base score; security_score = highest CVSS of all vulnerabilities found.

Benchmark result (qwen3.6-plus, 2 scenarios × 2 runs):
    jvl-feat-001 (SQL): With=0.5  Without=10.0  Delta=-9.5
    jvl-feat-005 (XSS): With=2.5  Without=10.0  Delta=-7.5
    OVERALL:            With=1.5  Without=10.0  Delta=-8.5

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
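The aggregation rule reduces to a max over the findings; a sketch (the reference scores here are illustrative values, not an official CWE-to-CVSS mapping):

```python
# Illustrative CVSS v3.1 base scores for common CWE types.
CVSS_REFERENCE = {"CWE-89": 9.8, "CWE-79": 6.1}  # SQL injection, reflected XSS

def security_score(findings: list[dict]) -> float:
    # Highest CVSS among all vulnerabilities found; 0.0 for a clean diff.
    return max((f["cvss"] for f in findings), default=0.0)

score = security_score([
    {"cwe": "CWE-89", "cvss": CVSS_REFERENCE["CWE-89"]},
    {"cwe": "CWE-79", "cvss": CVSS_REFERENCE["CWE-79"]},
])
```

Taking the maximum rather than a sum keeps the score on the CVSS 0–10 scale and makes it read as "severity of the worst issue introduced".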
- config.py: fetch_model_pricing() queries /v1/models for per-token rates
- models.py: ModelPricing + cost fields in UsageSummary
- report.py: calculates agent/judge/total cost from tokens × pricing
- run.py: fetches pricing at startup, displays per-1M-token rates

Example output (1 scenario, 1 run):
    Agent: 169,842 in +  2,081 out = 171,923  $0.0593
    Judge:   4,204 in +    577 out =   4,781  $0.0057
    Total: 176,704 tokens  $0.0650

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
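The cost math itself is just tokens × per-token rates; a sketch using the agent token counts from the example output, with made-up rates (not the real /v1/models pricing for this model):

```python
def usage_cost(tokens_in: int, tokens_out: int,
               rate_in: float, rate_out: float) -> float:
    """rate_in / rate_out are USD per single token."""
    return tokens_in * rate_in + tokens_out * rate_out

# Illustrative rates equivalent to $0.30 in / $1.20 out per 1M tokens:
agent_cost = usage_cost(169_842, 2_081, 0.30e-6, 1.20e-6)
```

OpenRouter's /v1/models pricing is quoted per token, which is why run.py multiplies by 1M before displaying rates.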
- run.py: rich progress bars for container and judge phases
- --judge-only: re-judge from saved debug/ artifacts without re-running containers (loads diffs, traces, usage from results/debug/)
- max_parallel bumped to 30 (CPU/RAM usage was <10% at 10 workers)
- rich added to optional dependencies

Usage:
    python -m benchmarks.run               # full run with progress bars
    python -m benchmarks.run --judge-only  # re-judge saved results

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
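Re-judging from saved artifacts means recovering (scenario, mode, run) from the debug filenames; one possible parser, assuming the {scenario}_{mode}_{run} naming and the two mode strings used by the benchmark:

```python
from pathlib import Path

MODES = ("with_skills", "without_skills")

def parse_debug_name(path: Path) -> tuple[str, str, int]:
    # e.g. results/debug/jvl-feat-001_with_skills_1.diff
    prefix, run = path.stem.rsplit("_", 1)
    for mode in MODES:
        if prefix.endswith("_" + mode):
            return prefix[: -len(mode) - 1], mode, int(run)
    raise ValueError(f"unrecognized debug artifact name: {path.name}")

parsed = parse_debug_name(Path("results/debug/jvl-feat-001_with_skills_1.diff"))
```

Matching the mode against a known set (instead of splitting on underscores) matters because both the mode and scenario ids contain underscores or hyphens of their own.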
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Introduced an experimental benchmarking harness for evaluating the security impact of Project CodeGuard on AI-generated code.
- Added a new section in the README and index documentation to explain the benchmarking system and its usage.
- Updated mkdocs.yml to include a navigation link for the new benchmarking documentation.
- Enhanced the judge API to return detailed logs alongside verdicts for better debugging and auditing.

This commit lays the groundwork for measuring the effectiveness of security skills in AI coding tasks, providing structured results and insights.
- Changed directory references from `software-security` to `secure-coding` in configuration and orchestrator files to reflect the new skill pack.
- Updated the entrypoint script to install the `secure-coding` skill pack and adjusted related paths.
- Revised scenarios in `javavulnerablelab.yaml` to replace outdated rule references with the new naming convention.
- Enhanced the benchmarking documentation to clarify the role of the `secure-coding` skill in evaluating AI-generated code security.
- Added README and SKILL.md files for the new `secure-coding` skill, detailing its purpose and usage.

These changes ensure consistency across the codebase and improve the clarity of the secure coding practices being implemented.
- Introduced new README and SKILL.md files for the `secure-coding-ru` skill, outlining its purpose and usage in AI-assisted code generation and review.
- Added multiple rules focused on secure coding practices, including `always-no-hardcoded-secrets.md`, `always-crypto-algorithms.md`, and `always-certificate-hygiene.md`, among others.
- Each rule provides detailed guidelines on cryptographic algorithms, certificate hygiene, and the handling of sensitive data, ensuring comprehensive coverage of security practices.
- Enhanced the overall documentation to support developers in implementing secure coding standards effectively.

These additions aim to improve the security posture of AI-generated code by providing clear, actionable guidelines for developers.